PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

Started by Michael Milliganover 17 years ago18 messagesbugs
Jump to latest
#1Michael Milligan
milli@acmeps.com

Uh oh. This is a first (for me). I got this error on a transaction, it
did not crash the server but did abort the transaction (of course).

ERROR: lock AccessShareLock on object 16385/16467/0 is already held

What was I doing? A large transaction where I was pushing about 20
million rows into two tables using prepared transactions. I have been
doing this for some time (over a year) on 8.2.6 without any problems.

I may be able to reproduce, but the transaction takes a long time to
complete... about a day. I'm running it again now.

There was no information in the log. What else can I do to get debug
information for you, provided it's reproducable...

Some db details:

db=> select version();
version
-------------------------------------------------------------------------------------------------
PostgreSQL 8.3.3 on amd64-portbld-freebsd6.3, compiled by GCC cc (GCC)
3.4.6 [FreeBSD] 20060305
(1 row)

This instance was also compiled with the following changes
(pg_config_manual.h):

diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index 354a3d3..20c40c4 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -23,7 +23,7 @@
  *
  * Changing BLCKSZ requires an initdb.
  */
-#define BLCKSZ 8192
+#define BLCKSZ 32768

/*
* RELSEG_SIZE is the maximum number of blocks allowed in one disk
@@ -69,7 +69,7 @@
*
* Changing this requires an initdb.
*/
-#define NAMEDATALEN 64
+#define NAMEDATALEN 128

/*
* Maximum number of arguments to a function.

This goes along with a UFS block size of 32768.

Regards,
Mike

--
Michael Milligan -> milli@acmeps.com

#2Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Milligan (#1)
Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

Michael Milligan wrote:

Uh oh. This is a first (for me). I got this error on a transaction, it
did not crash the server but did abort the transaction (of course).

ERROR: lock AccessShareLock on object 16385/16467/0 is already held

What was I doing? A large transaction where I was pushing about 20
million rows into two tables using prepared transactions. I have been
doing this for some time (over a year) on 8.2.6 without any problems.

Could you provide more details on the transaction? At what point did it
fail, during the insertss, or at commit perhaps? Are you really using
prepared transactions, aka two-phase commit, or did you mean prepared
statements?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Milligan (#1)
Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

Michael Milligan <milli@acmeps.com> writes:

Uh oh. This is a first (for me). I got this error on a transaction, it
did not crash the server but did abort the transaction (of course).

ERROR: lock AccessShareLock on object 16385/16467/0 is already held

Huh, that shouldn't happen. What object is that? The 16385 should be
a database OID, and the 16467 is most likely a table's OID within that
database.

What was I doing? A large transaction where I was pushing about 20
million rows into two tables using prepared transactions. I have been
doing this for some time (over a year) on 8.2.6 without any problems.

I may be able to reproduce, but the transaction takes a long time to
complete... about a day. I'm running it again now.

I wonder if it's possible that you overflowed nLocks in the local lock
table. There are a lot of other things that likely would break first,
though, so that doesn't seem very probable.

Is this transaction longer than any comparable one you ever did before?

regards, tom lane

#4Michael Milligan
milli@acmeps.com
In reply to: Heikki Linnakangas (#2)
Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

Heikki Linnakangas wrote:

Michael Milligan wrote:

Uh oh. This is a first (for me). I got this error on a transaction, it
did not crash the server but did abort the transaction (of course).

ERROR: lock AccessShareLock on object 16385/16467/0 is already held

What was I doing? A large transaction where I was pushing about 20
million rows into two tables using prepared transactions. I have been
doing this for some time (over a year) on 8.2.6 without any problems.

Could you provide more details on the transaction? At what point did it
fail, during the insertss, or at commit perhaps? Are you really using
prepared transactions, aka two-phase commit, or did you mean prepared
statements?

Yes, mistyped there, prepared _statements_. It failed during an
EXECUTE, so during a prepared INSERT statement.

Regards,
Mike

--
Michael Milligan -> milli@acmeps.com

#5Michael Milligan
milli@acmeps.com
In reply to: Tom Lane (#3)
Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

Tom Lane wrote:

Michael Milligan <milli@acmeps.com> writes:

Uh oh. This is a first (for me). I got this error on a transaction, it
did not crash the server but did abort the transaction (of course).

ERROR: lock AccessShareLock on object 16385/16467/0 is already held

Huh, that shouldn't happen. What object is that? The 16385 should be
a database OID, and the 16467 is most likely a table's OID within that
database.

What was I doing? A large transaction where I was pushing about 20
million rows into two tables using prepared transactions. I have been
doing this for some time (over a year) on 8.2.6 without any problems.

I may be able to reproduce, but the transaction takes a long time to
complete... about a day. I'm running it again now.

I wonder if it's possible that you overflowed nLocks in the local lock
table. There are a lot of other things that likely would break first,
though, so that doesn't seem very probable.

Is this transaction longer than any comparable one you ever did before?

For 8.3.3, yes. This is only the second such large transaction I've
done on 8.3.3, the first one I did was around 6.7 million rows and it
worked.

And a correction, the transaction that caused this error was inserting
13.5 million rows and failed near the end.

Regards,
Mike

--
Michael Milligan -> milli@acmeps.com

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Milligan (#5)
Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

Michael Milligan <milli@acmeps.com> writes:

Tom Lane wrote:

Huh, that shouldn't happen. What object is that? The 16385 should be
a database OID, and the 16467 is most likely a table's OID within that
database.

Please answer the above question.

And a correction, the transaction that caused this error was inserting
13.5 million rows and failed near the end.

Once you've determined which table the error message is talking about,
please show us what the transaction does with that table.

regards, tom lane

#7Michael Milligan
milli@acmeps.com
In reply to: Tom Lane (#6)
Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

Tom Lane wrote:

Michael Milligan <milli@acmeps.com> writes:

Tom Lane wrote:

Huh, that shouldn't happen. What object is that? The 16385 should be
a database OID, and the 16467 is most likely a table's OID within that
database.

Please answer the above question.

16385 is the database (db) and 16467 is the table (email) the prepared
inserts are executed against. That table (email) is an inherited table
definition for partitioned tables that are named by date ranges. E.g.,

CREATE TABLE email_2008_week27 ( PRIMARY KEY (id), CHECK ( ts >=
'2008-06-30'::timestamp AND ts < '2008-07-07'::timestamp ) ) INHERITS
(email)

And a correction, the transaction that caused this error was inserting
13.5 million rows and failed near the end.

Once you've determined which table the error message is talking about,
please show us what the transaction does with that table.

You mean like:

BEGIN;
PREPARE msg (...) INSERT INTO email VALUES (...);
EXECUTE msg (...)
EXECUTE msg (...)
EXECUTE msg (...)
EXECUTE msg (...)
... repeated millions of times
COMMIT;

Regards,
Mike

--
Michael Milligan -> milli@acmeps.com

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Milligan (#7)
Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

Michael Milligan <milli@acmeps.com> writes:

Tom Lane wrote:

Once you've determined which table the error message is talking about,
please show us what the transaction does with that table.

You mean like:

BEGIN;
PREPARE msg (...) INSERT INTO email VALUES (...);
EXECUTE msg (...)
EXECUTE msg (...)
EXECUTE msg (...)
EXECUTE msg (...)
... repeated millions of times
COMMIT;

What I'm wondering about is the sequence of operations that are executed
per row. Could it be long enough that the email table is being touched
by more than 2 billion separate SQL operations within the transaction?

regards, tom lane

#9Michael Milligan
milli@acmeps.com
In reply to: Tom Lane (#8)
Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

Tom Lane wrote:

Michael Milligan <milli@acmeps.com> writes:

Tom Lane wrote:

Once you've determined which table the error message is talking about,
please show us what the transaction does with that table.

You mean like:

BEGIN;
PREPARE msg (...) INSERT INTO email VALUES (...);
EXECUTE msg (...)
EXECUTE msg (...)
EXECUTE msg (...)
EXECUTE msg (...)
... repeated millions of times
COMMIT;

What I'm wondering about is the sequence of operations that are executed
per row. Could it be long enough that the email table is being touched
by more than 2 billion separate SQL operations within the transaction?

I can't imagine that, but then again, I'm not a PostgreSQL internals
expert either. The partitioned email tables have indexes on them, but
in this part of the transaction, all I'm doing is EXECUTEs where the
prepared statement is INSERTing a single row. Nothing else fancy with
the email table is going on. There are other things going on with
values being stored to TEMP tables within the same transaction, would
those count toward the 2 billion?

I ran the transaction again and it failed again, so it is reproducable.
I'm going to try it with a much smaller dataset and see what happens.

Regards,
Mike

--
Michael Milligan -> milli@acmeps.com

#10Michael Milligan
milli@acmeps.com
In reply to: Tom Lane (#8)
Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

Tom Lane wrote:

What I'm wondering about is the sequence of operations that are executed
per row. Could it be long enough that the email table is being touched
by more than 2 billion separate SQL operations within the transaction?

FWIW, I've used the exact same code against PG 8.2.6 and have half a
dozen similar transactions that inserted more than 13.5 million rows,
with the largest transaction at a little over 25 million rows inserted
into the email table.

Regards,
Mike

--
Michael Milligan -> milli@acmeps.com

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Milligan (#10)
Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

Michael Milligan <milli@acmeps.com> writes:

FWIW, I've used the exact same code against PG 8.2.6 and have half a
dozen similar transactions that inserted more than 13.5 million rows,
with the largest transaction at a little over 25 million rows inserted
into the email table.

Hmph. That seems to eliminate the overflow theory, because 8.2 has
essentially the same lock-counting code as 8.3. Unless 8.3 is taking
out the lock a heckuva lot more than 8.2 did, but I can't think of a
reason for that to happen.

Now that we know you can reproduce it, we should think about how to get
some information out. Are you in a position to build a locally modified
Postgres? I could send you a patch to make that particular error report
dump out more information about the lock state, but a patch won't do you
any good if you aren't able to build from source.

regards, tom lane

#12Michael Milligan
milli@acmeps.com
In reply to: Tom Lane (#11)
Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

Tom Lane wrote:

Michael Milligan <milli@acmeps.com> writes:

FWIW, I've used the exact same code against PG 8.2.6 and have half a
dozen similar transactions that inserted more than 13.5 million rows,
with the largest transaction at a little over 25 million rows inserted
into the email table.

Hmph. That seems to eliminate the overflow theory, because 8.2 has
essentially the same lock-counting code as 8.3. Unless 8.3 is taking
out the lock a heckuva lot more than 8.2 did, but I can't think of a
reason for that to happen.

Now that we know you can reproduce it, we should think about how to get
some information out. Are you in a position to build a locally modified
Postgres? I could send you a patch to make that particular error report
dump out more information about the lock state, but a patch won't do you
any good if you aren't able to build from source.

I can rebuild with a patch, sure. I've got a spare machine I can mirror
the db over to (same hardware) to debug this.

Regards,
Mike

--
Michael Milligan -> milli@acmeps.com

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Milligan (#1)
Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

[ reincluding the mailing list ]

Michael Milligan <milli@acmeps.com> writes:

Okay, it reproduces and surprise surprise nLocks does overflow...

ERROR: lock AccessShareLock on object 16385/16467/0 is already held
lock(0x87408a028) id(16385,16467,0,0,0,1) grantMask(a) waitMask(0)
req(2,0,1,0,0,0,0,0)=3 grant(1,0,1,0,0,0,0,0)=2 wait(0)
proclock(0x8743a7508) lock(0x87408a028) method(1) proc(0x8743aada8) hold(a)
locallock(0xb29c78) nLocks(-2147483648) nOwners(2) mOwners(8)

Hah. Okay, that shows that we'd never have reproduced it with a small
test case.

So I'm guessing this is not a bug? Just that for some reason 8.3.3 is
grabbing more locks than 8.2.6 does and I have to adjust my batch
processes to break things up into chunks of less than 10 million per
transaction... :-/

Well, I'm wondering exactly why 8.3 is taking more locks than 8.2 does,
because AFAICS it shouldn't. Could you extract a complete example of
what your code does per tuple? I would like to run a small test case
and just see where the LockAcquire calls come from in each version.
(Or, if you prefer, you can get out gdb and do the legwork yourself...)
It's true that breaking up the transaction is likely to be the best
short-term answer for you, but I think there might be a bug here.

In any case, now that we know that nLocks overflow is actually possible
within real-world transaction lengths, it'd behoove us to do something
about that in 8.4 or beyond.

regards, tom lane

#14Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#13)
Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

Tom Lane wrote:

[ reincluding the mailing list ]

Michael Milligan <milli@acmeps.com> writes:

Okay, it reproduces and surprise surprise nLocks does overflow...

ERROR: lock AccessShareLock on object 16385/16467/0 is already held
lock(0x87408a028) id(16385,16467,0,0,0,1) grantMask(a) waitMask(0)
req(2,0,1,0,0,0,0,0)=3 grant(1,0,1,0,0,0,0,0)=2 wait(0)
proclock(0x8743a7508) lock(0x87408a028) method(1) proc(0x8743aada8) hold(a)
locallock(0xb29c78) nLocks(-2147483648) nOwners(2) mOwners(8)

Hah. Okay, that shows that we'd never have reproduced it with a small
test case.

So I'm guessing this is not a bug? Just that for some reason 8.3.3 is
grabbing more locks than 8.2.6 does and I have to adjust my batch
processes to break things up into chunks of less than 10 million per
transaction... :-/

Well, I'm wondering exactly why 8.3 is taking more locks than 8.2 does,
because AFAICS it shouldn't. Could you extract a complete example of
what your code does per tuple? I would like to run a small test case
and just see where the LockAcquire calls come from in each version.
(Or, if you prefer, you can get out gdb and do the legwork yourself...)
It's true that breaking up the transaction is likely to be the best
short-term answer for you, but I think there might be a bug here.

In any case, now that we know that nLocks overflow is actually possible
within real-world transaction lengths, it'd behoove us to do something
about that in 8.4 or beyond.

Is this a TODO?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#14)
Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

Bruce Momjian <bruce@momjian.us> writes:

Tom Lane wrote:

In any case, now that we know that nLocks overflow is actually possible
within real-world transaction lengths, it'd behoove us to do something
about that in 8.4 or beyond.

Is this a TODO?

Yes, although I'm still waiting for more info from Michael.

regards, tom lane

#16Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#13)
Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

Tom Lane wrote:

[ reincluding the mailing list ]

Michael Milligan <milli@acmeps.com> writes:

Okay, it reproduces and surprise surprise nLocks does overflow...

ERROR: lock AccessShareLock on object 16385/16467/0 is already held
lock(0x87408a028) id(16385,16467,0,0,0,1) grantMask(a) waitMask(0)
req(2,0,1,0,0,0,0,0)=3 grant(1,0,1,0,0,0,0,0)=2 wait(0)
proclock(0x8743a7508) lock(0x87408a028) method(1) proc(0x8743aada8) hold(a)
locallock(0xb29c78) nLocks(-2147483648) nOwners(2) mOwners(8)

Hah. Okay, that shows that we'd never have reproduced it with a small
test case.

This hasn't been fixed yet, has it? Do you have any ideas on how to
actually fix the problem? I wonder if it looks like enlarging nLocks,
or you're intending to attempt to reduce the number of locks taken.

A customer of ours started hitting this bug too, last week.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#16)
Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

Alvaro Herrera <alvherre@commandprompt.com> writes:

Tom Lane wrote:

Michael Milligan <milli@acmeps.com> writes:

Okay, it reproduces and surprise surprise nLocks does overflow...

Hah. Okay, that shows that we'd never have reproduced it with a small
test case.

This hasn't been fixed yet, has it?

Well, it has been addressed anyway...

2008-09-15 21:56 tgl

* src/include/storage/: lock.h (REL8_1_STABLE), lock.h
(REL8_3_STABLE), lock.h (REL8_0_STABLE), lock.h (REL8_2_STABLE),
lock.h: Widen the nLocks counts in local lock tables from int to
int64. This forestalls potential overflow when the same table (or
other object, but usually tables) is accessed by very many
successive queries within a single transaction. Per report from
Michael Milligan.

Back-patch to 8.0, which is as far back as the patch conveniently
applies. There have been no reports of overflow in pre-8.3
releases, but clearly the risk existed all along. (Michael's
report suggests that 8.3 may consume lock counts faster than prior
releases, but with no test case to look at it's hard to be sure
about that. Widening the counts seems a good future-proofing
measure in any event.)

A customer of ours started hitting this bug too, last week.

Tut tut, customer not on latest point release?

regards, tom lane

#18Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#17)
Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

Tom Lane wrote:

2008-09-15 21:56 tgl

* src/include/storage/: lock.h (REL8_1_STABLE), lock.h
(REL8_3_STABLE), lock.h (REL8_0_STABLE), lock.h (REL8_2_STABLE),
lock.h: Widen the nLocks counts in local lock tables from int to
int64.

Doh, I didn't see it because I only checked the change log for lock.c,
not lock.h :-(

A customer of ours started hitting this bug too, last week.

Tut tut, customer not on latest point release?

Wouldn't be a first for them (we try hard, they don't necessarily heed
advice).

Thanks!

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.