total db lockup

Started by Eugeneover 20 years ago11 messagesgeneral
Jump to latest
#1Eugene
eugene1@sympatico.ca

(NOTE: reposting this for the *fifth* time because my previous messages didn't go through).

Hi all,

We have experienced a really weird problem with
postgresql yesterday. When I was called in to take a
look, all the non-superuser connections were used up
and they were all in a waiting state (SELECT waiting,
UPDATE waiting, etc.). I couldn't figure out what the
problem is, but I saved the ps, pg_stat_activity, and
pg_locks state at the time (attached as
condor_db_stats.txt). BTW, when looking at the queries
in pg_stat_activity, postgresql cuts them off so you
can't see the entire string. Is there any way to
prevent that, or at least increase the character
limit?

We restarted postgresql server and the problem
reappeared a few minutes later. I saved that state
also (condor_db_stats2.txt). At that point, after we
restarted postgresql (again), I ran VACUUM on the
entire database and did a few more things that seem to
have solved the problem (see below).

This particular database is essentially just one flat
table (level) with a few small supporting tables. Only
the level table is heavily used. (table definition is
attached as table.txt). I noticed that one of the
indexes (level_owner_index) was a hash index. I
remembered what postgresql manual says about hash
indexes and concurrency
(http://www.postgresql.org/docs/7.4/interactive/locking-indexes.html)
and, after VACUUM finished, replaced the hash index
with a btree. I then did a REINDEX of the level table
and ANALYZE. This seems to have solved the problem --
at least as of this morning we still have not seen any
deadlocks.

My question is, what could have caused this to happen?
Can anyone explain this paragraph from the manual:

"Share/exclusive page-level locks are used for
read/write access. Locks are released after the page
is processed. Page-level locks provide better
concurrency than index-level ones but are liable to
deadlocks."

Any other pointers to help me figure out what went
wrong and how to fix it?

thanks,

Eugene

WTF? My message doesn't appear. Trying again without
attachments or inline text.

#2Eugene
eugene1@sympatico.ca
In reply to: Eugene (#1)
Re: total db lockup

So can anyone offer any insight on this? BTW, I tried reposting it with attachments and it didn't show up.

thanks,

Eugene

Show quoted text

From: <eugene1@sympatico.ca>
Date: 2005/08/18 Thu AM 09:24:30 EST
To: <pgsql-general@postgresql.org>
Subject: [GENERAL] total db lockup

(NOTE: reposting this for the *fifth* time because my previous messages didn't go through).

Hi all,

We have experienced a really weird problem with
postgresql yesterday. When I was called in to take a
look, all the non-superuser connections were used up
and they were all in a waiting state (SELECT waiting,
UPDATE waiting, etc.). I couldn't figure out what the
problem is, but I saved the ps, pg_stat_activity, and
pg_locks state at the time (attached as
condor_db_stats.txt). BTW, when looking at the queries
in pg_stat_activity, postgresql cuts them off so you
can't see the entire string. Is there any way to
prevent that, or at least increase the character
limit?

We restarted postgresql server and the problem
reappeared a few minutes later. I saved that state
also (condor_db_stats2.txt). At that point, after we
restarted postgresql (again), I ran VACUUM on the
entire database and did a few more things that seem to
have solved the problem (see below).

This particular database is essentially just one flat
table (level) with a few small supporting tables. Only
the level table is heavily used. (table definition is
attached as table.txt). I noticed that one of the
indexes (level_owner_index) was a hash index. I
remembered what postgresql manual says about hash
indexes and concurrency
(http://www.postgresql.org/docs/7.4/interactive/locking-indexes.html)
and, after VACUUM finished, replaced the hash index
with a btree. I then did a REINDEX of the level table
and ANALYZE. This seems to have solved the problem --
at least as of this morning we still have not seen any
deadlocks.

My question is, what could have caused this to happen?
Can anyone explain this paragraph from the manual:

"Share/exclusive page-level locks are used for
read/write access. Locks are released after the page
is processed. Page-level locks provide better
concurrency than index-level ones but are liable to
deadlocks."

Any other pointers to help me figure out what went
wrong and how to fix it?

thanks,

Eugene

WTF? My message doesn't appear. Trying again without
attachments or inline text.

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Eugene (#2)
Re: total db lockup

<eugene1@sympatico.ca> writes:

So can anyone offer any insight on this?

Not without information. You haven't even told us what PG version you
are running, much less provided the necessary details like the pg_locks
status.

BTW, I tried reposting it with attachments and it didn't show up.

Fix your mail setup and try again. Or wait a bit --- it's quite
possible the previous messages are just stuck in the moderator's
approval queue.

regards, tom lane

#4Eugene
eugene1@sympatico.ca
In reply to: Tom Lane (#3)
Re: total db lockup

It seems that I can't. For whatever reason, my messages are being blocked.

Eugene

Show quoted text

From: Tom Lane <tgl@sss.pgh.pa.us>
Date: 2005/08/18 Thu PM 02:11:54 EST
To: eugene1@sympatico.ca
CC: pgsql-general@postgresql.org
Subject: Re: [GENERAL] total db lockup

Attachments:

replyAlltext/plain; name=replyAllDownload
#5Eugene
eugene1@sympatico.ca
In reply to: Eugene (#4)
Re: total db lockup

Guys, I really need help on this. Can whoever is in charge of the mailing list change it so that it doesn't delete my messages?

thanks,

Eugene

Show quoted text

From: Eugene <eugene1@sympatico.ca>
Date: 2005/08/18 Thu PM 04:45:58 EST
To: Tom Lane <tgl@sss.pgh.pa.us>
CC: <pgsql-general@postgresql.org>
Subject: Re: [GENERAL] total db lockup

It seems that I can't. For whatever reason, my messages are being blocked.

Eugene

From: Tom Lane <tgl@sss.pgh.pa.us>
Date: 2005/08/18 Thu PM 02:11:54 EST
To: eugene1@sympatico.ca
CC: pgsql-general@postgresql.org
Subject: Re: [GENERAL] total db lockup

Attachments:

replyAlltext/plain; name=replyAllDownload
replyAlltext/plain; name=replyAllDownload
#6Martijn van Oosterhout
kleptog@svana.org
In reply to: Eugene (#5)
Re: total db lockup

Well, perhaps they were blocked for being too large?

Seriously, some of your messages appear to be getting through fine so
if you can't attach them put them on a web or ftp server and post a
link. Much better than copying it to several hundred mailboxes.

On Thu, Aug 18, 2005 at 05:20:55PM -0400, Eugene wrote:

Guys, I really need help on this. Can whoever is in charge of the
mailing list change it so that it doesn't delete my messages?

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.

#7Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Eugene (#5)
Re: total db lockup

On Thu, Aug 18, 2005 at 05:20:55PM -0400, Eugene wrote:

Guys, I really need help on this. Can whoever is in charge of the
mailing list change it so that it doesn't delete my messages?

I am not in charge of the mail server. However: May I suggest you
change to a less broken mail client? Your messages show up featuring
really strange MIME content description. Even the standard gmail.com
accounts seem to work better, which is surprising for a web based email
setup. This brokenness may hint the spam processor on postgresql.org
servers to silently droo your emails.

Alternatively, you may want to "paste" your error messages and stuff on
places like http://rafb.net/paste, and then post the URLs here.

--
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
"El sentido de las cosas no viene de las cosas, sino de
las inteligencias que las aplican a sus problemas diarios
en busca del progreso." (Ernesto Hern�ndez-Novich)

#8Eugene
eugene1@sympatico.ca
In reply to: Alvaro Herrera (#7)
Re: total db lockup

Thanks Alvaro.

Here it is again:

problem description: http://rafb.net/paste/results/bLAtIk26.html
db state before first restart: http://rafb.net/paste/results/D1Bqe125.html
db state before second restart: http://rafb.net/paste/results/D1Bqe125.html
table definition: http://rafb.net/paste/results/W35ccD49.html

thanks,

Eugene

Show quoted text

From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: 2005/08/18 Thu PM 08:52:17 EST
To: Eugene <eugene1@sympatico.ca>
CC: pgsql-general@postgresql.org
Subject: Re: [GENERAL] total db lockup

On Thu, Aug 18, 2005 at 05:20:55PM -0400, Eugene wrote:

Guys, I really need help on this. Can whoever is in charge of the
mailing list change it so that it doesn't delete my messages?

I am not in charge of the mail server. However: May I suggest you
change to a less broken mail client? Your messages show up featuring
really strange MIME content description. Even the standard gmail.com
accounts seem to work better, which is surprising for a web based email
setup. This brokenness may hint the spam processor on postgresql.org
servers to silently droo your emails.

Alternatively, you may want to "paste" your error messages and stuff on
places like http://rafb.net/paste, and then post the URLs here.

--
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
"El sentido de las cosas no viene de las cosas, sino de
las inteligencias que las aplican a sus problemas diarios
en busca del progreso." (Ernesto Hernández-Novich)

#9Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Eugene (#8)
Re: total db lockup

On Fri, Aug 19, 2005 at 10:54:35AM -0400, Eugene wrote:

Thanks Alvaro.

Here it is again:

problem description: http://rafb.net/paste/results/bLAtIk26.html
db state before first restart: http://rafb.net/paste/results/D1Bqe125.html
db state before second restart: http://rafb.net/paste/results/D1Bqe125.html
table definition: http://rafb.net/paste/results/W35ccD49.html

Ok, so it seems the lockup occured only with the hash indexes? Then it
means we still have bugs in the locking code for those. It doesn't
surprise me. There's a reason they are not recommended, you know?

Let us know if you find the problem showing up again with btree indexes.
The hash indexes bugs should be fixed, but they are not high priority ...

--
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
"If it wasn't for my companion, I believe I'd be having
the time of my life" (John Dunbar)

#10Eugene
eugene1@sympatico.ca
In reply to: Alvaro Herrera (#9)
Re: total db lockup

actually, I'm using postgresql 7.3.2. I noticed there were some hash index fixes in 7.4, so that might be it. Would there be any other reason for this deadlock? This database has been in production for almost 2 years and this is the first time we've seen the problem.

thanks,

Eugene

Show quoted text

From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: 2005/08/19 Fri AM 11:22:04 EST
To: Eugene <eugene1@sympatico.ca>
CC: pgsql-general@postgresql.org
Subject: Re: [GENERAL] total db lockup

On Fri, Aug 19, 2005 at 10:54:35AM -0400, Eugene wrote:

Thanks Alvaro.

Here it is again:

problem description: http://rafb.net/paste/results/bLAtIk26.html
db state before first restart: http://rafb.net/paste/results/D1Bqe125.html
db state before second restart: http://rafb.net/paste/results/D1Bqe125.html
table definition: http://rafb.net/paste/results/W35ccD49.html

Ok, so it seems the lockup occured only with the hash indexes? Then it
means we still have bugs in the locking code for those. It doesn't
surprise me. There's a reason they are not recommended, you know?

Let us know if you find the problem showing up again with btree indexes.
The hash indexes bugs should be fixed, but they are not high priority ...

--
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
"If it wasn't for my companion, I believe I'd be having
the time of my life" (John Dunbar)

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#9)
Re: total db lockup

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

Ok, so it seems the lockup occured only with the hash indexes? Then it
means we still have bugs in the locking code for those.

"Still" meaning "in the version he's using", which he hasn't told us
anywhere that I saw. (Internal evidence suggests it's 7.3 something)

It doesn't surprise me.

A deadlock in hash indexes wouldn't be surprising in pre-7.4 code,
since 7.4 was the first version that even pretended to be deadlock free
in hash indexes. But it does seem pretty surprising that the deadlock
checker didn't fire and boot somebody out of the deadlock. We haven't
had any bugs reported in deadlock detection in a very long time (since
7.1 I think). This could mean there's still something wrong in there.

regards, tom lane