Error while altering an inheritance hierarchy in mid-query
My database implements partitioning using inheritance and constraint exclusion, as described in the fine manual. Each partition holds data for a single day, and there are well over 300 child partitions.
The exception mentioned below notwithstanding, the technique works exceedingly well and is a real life saver when you have half a billion rows and climbing.
While a query was running on the partition for today a maintenance job was run that summarizes data in a different partition using the technique described in section 5.9.3. - "Managing Partitions" of the manual, with a twist. Basically, it goes like this:
1. A select into query is run which summarizes the data from a partition into a table outside the inheritance hierarchy, which is then indexed.
2. Then
a. a transaction is begun,
b. the original partition is dropped,
c. the new table renamed to the original partition's name,
d. the new table's unique index is renamed,
e. the appropriate check constraint is added,
f. select privilege is granted, and
g. the transaction is committed.
So far so good, the maintenance job works fine and is quite speedy. The problem occurs when a select query is started prior to the sequence of events above, those steps complete, and an error is thrown saying "PGRES_FATAL_ERROR - ERROR: could not open relation with OID 64412". Of course, seconds later the query runs fine, and there is no object with OID 64412 in the database.
It seems to me that removing and inserting partitions into the inheritance hierarchy may not be a consistent and/or isolated action with respect to queries that may be using the inheritance hierarchy. To be fair, this sequence of events doesn't normally happen in daily processing, but the behavior is not what I expected, nor is it mentioned in the manual. Am I missing or misinterpreting something?
Thanks in advance for your advice!
Bob Lunney
Bob Lunney <bob_lunney@yahoo.com> writes:
1. A select into query is run which summarizes the data from a partition into a table outside the inheritance hierarchy, which is then indexed.
2. Then
a. a transaction is begun,
b. the original partition is dropped,
c. the new table renamed to the original partition's name,
d. the new table's unique index is renamed,
e. the appropriate check constraint is added,
f. select privilege is granted, and
g. the transaction is committed.
I'd suggest taking an exclusive lock on the inheritance hierarchy's
parent table between 2a and 2b. The "could not open relation with OID
nnn" error is to be expected when a table is dropped just as a query
is in the act of trying to open it, which is what could happen here if
a query on the parent table runs concurrently with the DROP.
You're also at risk that a concurrent query might see both or neither
of the old and new versions of the partition, leading to bogus answers.
Both of these things would be fixed if incoming queries are blocked
while trying to open the parent table, rather than while iterating
through the list of inheritance children for it.
regards, tom lane
So I did miss something! Thanks for the brilliant explanation and simple solution, Tom.
Bob
--- On Thu, 3/18/10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Show quoted text
From: Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: [BUGS] Error while altering an inheritance hierarchy in mid-query
To: "Bob Lunney" <bob_lunney@yahoo.com>
Cc: pgsql-bugs@postgresql.org
Date: Thursday, March 18, 2010, 3:26 PM
Bob Lunney <bob_lunney@yahoo.com>
writes:1. A select into query is run which summarizes
the data from a partition into a table outside the
inheritance hierarchy, which is then indexed.2. Then
a. a transaction is begun,
b. the original partition isdropped,
c. the new table renamed to the
original partition's name,
d. the new table's unique index
is renamed,
e. the appropriate check
constraint is added,
f. select privilege is granted,
and
g. the transaction is
committed.
I'd suggest taking an exclusive lock on the inheritance
hierarchy's
parent table between 2a and 2b. The "could not open
relation with OID
nnn" error is to be expected when a table is dropped just
as a query
is in the act of trying to open it, which is what could
happen here if
a query on the parent table runs concurrently with the
DROP.
You're also at risk that a concurrent query might see both
or neither
of the old and new versions of the partition, leading to
bogus answers.
Both of these things would be fixed if incoming queries are
blocked
while trying to open the parent table, rather than while
iterating
through the list of inheritance children for it.
regards, tom lane