could not create directory "...": File exists

Started by Stephen Frost, almost 13 years ago. 14 messages.
#1 Stephen Frost
sfrost@snowman.net
1 attachment(s)

Greetings,

We've come across this rather annoying error happening during our
builds:

ERROR: could not create directory "pg_tblspc/25120/PG_9.3_201212081/231253": File exists

It turns out that this is coming from copydir() when called by
createdb() during a CREATE DATABASE .. FROM TEMPLATE where the
template has tables in tablespaces.

It turns out that createdb() currently only takes an AccessShareLock
on pg_tablespace when scanning it with SnapshotNow, making it possible
for a concurrent process to make some uninteresting modification to a
tablespace (such as an ACL change) which will cause the heap scan in
createdb() to see a given tablespace multiple times. copydir() will
then, rightfully, complain that it's being asked to create a directory
which already exists.

Given that this is during createdb(), I'm guessing we don't have any
option but to switch the scan to using ShareLock, since there isn't a
snapshot available to do an MVCC scan with (I'm guessing that there
could be other issues trying to do that anyway).

Attached is a patch which does this and corrects the problem for us
(of course, we're now going to go rework our build system to not
modify tablespace ACLs, since with this patch concurrent builds end up
blocking on each other, which is annoying).

Thanks,

Stephen

Attachments:

fix-concurrent-tablespace-update.patch (text/x-diff; charset=us-ascii)
colordiff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
new file mode 100644
index 1f6e02d..60a5099
*** a/src/backend/commands/dbcommands.c
--- b/src/backend/commands/dbcommands.c
*************** createdb(const CreatedbStmt *stmt)
*** 549,556 ****
  		/*
  		 * Iterate through all tablespaces of the template database, and copy
  		 * each one to the new database.
  		 */
! 		rel = heap_open(TableSpaceRelationId, AccessShareLock);
  		scan = heap_beginscan(rel, SnapshotNow, 0, NULL);
  		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
  		{
--- 549,563 ----
  		/*
  		 * Iterate through all tablespaces of the template database, and copy
  		 * each one to the new database.
+ 		 *
+ 		 * We need to use ShareLock to prevent other processes from updating a
+ 		 * tablespace (such as changing an ACL, for example) or we will end up
+ 		 * running into the same tablespace multiple times during our heap scan
+ 		 * (the prior-to-update tuple and then the new tuple after the update)
+ 		 * and copydir() will, rightfully, complain that the directory already
+ 		 * exists.
  		 */
! 		rel = heap_open(TableSpaceRelationId, ShareLock);
  		scan = heap_beginscan(rel, SnapshotNow, 0, NULL);
  		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
  		{
*************** createdb(const CreatedbStmt *stmt)
*** 607,613 ****
  			}
  		}
  		heap_endscan(scan);
! 		heap_close(rel, AccessShareLock);
  
  		/*
  		 * We force a checkpoint before committing.  This effectively means
--- 614,620 ----
  			}
  		}
  		heap_endscan(scan);
! 		heap_close(rel, ShareLock);
  
  		/*
  		 * We force a checkpoint before committing.  This effectively means
#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#1)
Re: could not create directory "...": File exists

Stephen Frost <sfrost@snowman.net> writes:

> It turns out that createdb() currently only takes an AccessShareLock
> on pg_tablespace when scanning it with SnapshotNow, making it possible
> for a concurrent process to make some uninteresting modification to a
> tablespace (such as an ACL change) which will cause the heap scan in
> createdb() to see a given tablespace multiple times. copydir() will
> then, rightfully, complain that it's being asked to create a directory
> which already exists.

Ugh. Still another problem with non-MVCC catalog scans.

> Given that this is during createdb(), I'm guessing we don't have any
> option but to switch the scan to using ShareLock, since there isn't a
> snapshot available to do an MVCC scan with (I'm guessing that there
> could be other issues trying to do that anyway).

It seems that the only thing we actually use from each tuple is the OID.
So there are other ways to fix it, of which probably the minimum-change
one is to keep a list of already-processed tablespace OIDs and skip a
tuple if we get a match in the list. This would be O(N^2) in the number
of tablespaces, but I rather doubt that's a problem.
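
Just to sketch that idea (purely illustrative -- "seen_oids" is a made-up
local variable, and see below for why I don't think we can go this way
after all), it would be something like this inside createdb()'s existing
scan loop:

    List       *seen_oids = NIL;

    while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
    {
        Oid         srctablespace = HeapTupleGetOid(tuple);

        /* skip a duplicate visit caused by a concurrent update of the row */
        if (list_member_oid(seen_oids, srctablespace))
            continue;
        seen_oids = lappend_oid(seen_oids, srctablespace);

        /* ... existing per-tablespace copydir() logic ... */
    }
    list_free(seen_oids);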

[ thinks for a bit... ] Ugh, no, because the *other* risk you've got
here is not seeing a row at all, which would be really bad.

I'm not sure that transiently increasing the lock here is enough,
either. The concurrent transactions you're worried about probably
aren't holding their locks till commit, so you could get this lock
and still see tuples with unstable committed-good states.

regards, tom lane

#3 Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#2)
Re: could not create directory "...": File exists

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

> Ugh. Still another problem with non-MVCC catalog scans.

Indeed.

> It seems that the only thing we actually use from each tuple is the OID.

Yes, that's true.

> So there are other ways to fix it, of which probably the minimum-change
> one is to keep a list of already-processed tablespace OIDs and skip a
> tuple if we get a match in the list. This would be O(N^2) in the number
> of tablespaces, but I rather doubt that's a problem.

I did consider this option also but it felt like a lot of additional
code to write and I wasn't entirely convinced it'd be worth it. It's
very frustrating that we can't simply get a list of *currently valid*
tablespaces with a guarantee that no one else is going to mess with that
list while we process it. That's what MVCC would give us, but we can't
rely on that yet..

If we agree that keeping a list and skipping over anything already on it
is the way to go, I can go ahead and code that up.

> [ thinks for a bit... ] Ugh, no, because the *other* risk you've got
> here is not seeing a row at all, which would be really bad.

I'm not sure that I see how that could happen..? I agree that it'd be
really bad if it did though. Or are you thinking if we were to take a
snapshot and then walk the table?

> I'm not sure that transiently increasing the lock here is enough,
> either. The concurrent transactions you're worried about probably
> aren't holding their locks till commit, so you could get this lock
> and still see tuples with unstable committed-good states.

If there are other transactions which have open locks against the table,
wouldn't that prevent my being able to acquire ShareLock on it? Or put
another way, how would they not hold their locks till commit or
rollback? We do only look at tablespaces which exist for the database
we're copying, as I recall, so any newly created tablespaces shouldn't
impact this.

Thanks,

Stephen

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#3)
Re: could not create directory "...": File exists

Stephen Frost <sfrost@snowman.net> writes:

> * Tom Lane (tgl@sss.pgh.pa.us) wrote:

>> [ thinks for a bit... ] Ugh, no, because the *other* risk you've got
>> here is not seeing a row at all, which would be really bad.

> I'm not sure that I see how that could happen..? I agree that it'd be
> really bad if it did though. Or are you thinking if we were to take a
> snapshot and then walk the table?

The case where you see a tuple twice is if an update drops a new version
of a row beyond your seqscan, and then commits before you get to the new
version. But if it drops the new version of the row *behind* your
seqscan, and then commits before you get to the original tuple, you'll
not see either row version as committed. Both of these cases are
inherent risks in SnapshotNow scans.
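
To spell the row-missed case out step by step (just an illustration of the
mechanism, assuming a simple forward seqscan):

1. The scan has already passed page P.
2. A concurrent UPDATE of a pg_tablespace row puts the new tuple version on
   page P (behind the scan) and stamps the old version, which lies ahead of
   the scan, as deleted.
3. The updater commits.
4. When the scan reaches the old version, SnapshotNow sees it as deleted by
   a committed transaction and skips it.
5. The new version is already behind the scan, so neither version is
   returned and the row is missed entirely.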

>> I'm not sure that transiently increasing the lock here is enough,
>> either. The concurrent transactions you're worried about probably
>> aren't holding their locks till commit, so you could get this lock
>> and still see tuples with unstable committed-good states.

> If there are other transactions which have open locks against the table,
> wouldn't that prevent my being able to acquire ShareLock on it?

What I'm worried about is that we very commonly release locks on
catalogs as soon as we're done with the immediate operation --- not only
read locks, but update locks as well. So there are lots of cases where
locks are released before commit. It's possible that that never happens
in operations applied to pg_tablespace, but I'd not want to bet on it.

I wonder whether it's practical to look at the on-disk directories
instead of relying on pg_tablespace for this particular purpose.

Or maybe we should just write this off as a case we can't realistically
fix before we have MVCC catalog scans. It seems that any other fix is
going to be hopelessly ugly.

regards, tom lane

#5 Andres Freund
andres@2ndquadrant.com
In reply to: Tom Lane (#4)
Re: could not create directory "...": File exists

On 2013-01-17 13:46:44 -0500, Tom Lane wrote:

> Or maybe we should just write this off as a case we can't realistically
> fix before we have MVCC catalog scans. It seems that any other fix is
> going to be hopelessly ugly.

ISTM we could just use a MVCC snapshot in this specific case?

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#6 Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#4)
Re: could not create directory "...": File exists

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

> The case where you see a tuple twice is if an update drops a new version
> of a row beyond your seqscan, and then commits before you get to the new
> version. But if it drops the new version of the row *behind* your
> seqscan, and then commits before you get to the original tuple, you'll
> not see either row version as committed. Both of these cases are
> inherent risks in SnapshotNow scans.

Ah, I see.

>> If there are other transactions which have open locks against the table,
>> wouldn't that prevent my being able to acquire ShareLock on it?

> What I'm worried about is that we very commonly release locks on
> catalogs as soon as we're done with the immediate operation --- not only
> read locks, but update locks as well. So there are lots of cases where
> locks are released before commit. It's possible that that never happens
> in operations applied to pg_tablespace, but I'd not want to bet on it.

Fair enough.

> I wonder whether it's practical to look at the on-disk directories
> instead of relying on pg_tablespace for this particular purpose.

So we'd traverse all of the tablespace directories and copy any we come
across that have the OID of the template database? Sounds pretty
reasonable, though I'm not sure that we'd want to back-patch a large
change like that.
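
Just so I'm sure I follow, a very rough sketch of what I think you mean --
the path handling and variable names here are my guesses, written as if it
lived inside createdb() where src_dboid is the template database's OID (and
the default tablespace under base/ would still need separate handling):

    DIR        *dir;
    struct dirent *de;

    dir = AllocateDir("pg_tblspc");
    while ((de = ReadDir(dir, "pg_tblspc")) != NULL)
    {
        char        srcpath[MAXPGPATH];
        struct stat st;

        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;

        /* does this tablespace have a directory for the template database? */
        snprintf(srcpath, sizeof(srcpath), "pg_tblspc/%s/%s/%u",
                 de->d_name, TABLESPACE_VERSION_DIRECTORY, src_dboid);

        if (stat(srcpath, &st) == 0 && S_ISDIR(st.st_mode))
        {
            /* ... build the destination path and copydir() it ... */
        }
    }
    FreeDir(dir);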

> Or maybe we should just write this off as a case we can't realistically
> fix before we have MVCC catalog scans. It seems that any other fix is
> going to be hopelessly ugly.

I feel like we should be able to do better than what we have now, at
least. Using ShareLock prevented the specific case that we were
experiencing and is therefore MUCH better (for us, anyway) than
released versions where we ran into the error on a regular basis.

Thanks,

Stephen

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#6)
Re: could not create directory "...": File exists

Stephen Frost <sfrost@snowman.net> writes:

> * Tom Lane (tgl@sss.pgh.pa.us) wrote:

>> Or maybe we should just write this off as a case we can't realistically
>> fix before we have MVCC catalog scans. It seems that any other fix is
>> going to be hopelessly ugly.

> I feel like we should be able to do better than what we have now, at
> least. Using ShareLock prevented the specific case that we were
> experiencing and is therefore MUCH better (for us, anyway) than
> released versions where we ran into the error on a regular basis.

If it actually *reliably* fixed the problem, rather than just reducing
the probabilities, that would mean that the updates your other sessions
were doing didn't release RowExclusiveLock early. Have you dug into the
code to see if that's true? (Or even more to the point, if it's still
true in HEAD? I have no idea if all the recent refactoring has changed
this behavior for GRANT.)

My thought about a localized fix is similar to Andres' --- maybe we
could take a snapshot and use a real MVCC scan.

regards, tom lane

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#7)
Re: could not create directory "...": File exists

I wrote:

> Stephen Frost <sfrost@snowman.net> writes:

>> I feel like we should be able to do better than what we have now, at
>> least. Using ShareLock prevented the specific case that we were
>> experiencing and is therefore MUCH better (for us, anyway) than
>> released versions where we ran into the error on a regular basis.

> My thought about a localized fix is similar to Andres' --- maybe we
> could take a snapshot and use a real MVCC scan.

BTW, it strikes me that the reason this particular SnapshotNow scan is a
big problem for you is that, because we stop and copy a subdirectory
after each tuple, the elapsed time for the seqscan is several orders of
magnitude more than it is for most (perhaps all) other cases where
SnapshotNow is used. So the window for trouble with concurrent updates
is that much bigger. This seems to provide a reasonably principled
argument why we might want to fix this case with a localized use of an
MVCC scan before we have such a fix globally.

Not that I wouldn't still want to mark it with a REVERT_ME_SOMEDAY
kind of annotation. We know we need the SnapshotNow scan fix.

regards, tom lane

#9 Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#8)
Re: could not create directory "...": File exists

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

> This seems to provide a reasonably principled
> argument why we might want to fix this case with a localized use of an
> MVCC scan before we have such a fix globally.

I had discussed that idea a bit with Andres on IRC and my only concern
was if there's some reason that acquiring a snapshot during createdb()
would be problematic. It doesn't appear to be problematic currently, and I wasn't sure
if there'd be any issues.

I'll start working on a patch for that.

> Not that I wouldn't still want to mark it with a REVERT_ME_SOMEDAY
> kind of annotation. We know we need the SnapshotNow scan fix.

Agreed.

Thanks,

Stephen

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#9)
Re: could not create directory "...": File exists

Stephen Frost <sfrost@snowman.net> writes:

> * Tom Lane (tgl@sss.pgh.pa.us) wrote:

>> This seems to provide a reasonably principled
>> argument why we might want to fix this case with a localized use of an
>> MVCC scan before we have such a fix globally.

> I had discussed that idea a bit with Andres on IRC and my only concern
> was if there's some reason that acquiring a snapshot during createdb()
> would be problematic. It doesn't appear to be problematic currently, and
> I wasn't sure if there'd be any issues.

Don't see what. The main reason we've not yet attempted a global fix is
that the most straightforward way (take a new snapshot each time we
start a new SnapshotNow scan) seems too expensive. But CREATE DATABASE
is so expensive that the cost of an extra snapshot there ain't gonna
matter.

regards, tom lane

#11 Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#10)
1 attachment(s)
Re: could not create directory "...": File exists

Tom,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

> Don't see what. The main reason we've not yet attempted a global fix is
> that the most straightforward way (take a new snapshot each time we
> start a new SnapshotNow scan) seems too expensive. But CREATE DATABASE
> is so expensive that the cost of an extra snapshot there ain't gonna
> matter.

Patch attached. Passes the regression tests and our internal testing
shows that it eliminates the error we were seeing (and doesn't cause
blocking, which is even better).

We have a workaround in place for our build system (more-or-less
"don't do that" approach), but it'd really be great if this was
back-patched and in the next round of point releases. Glancing
through the branches, looks like it should apply pretty cleanly.

Thanks,

Stephen

Attachments:

fix-concurrent-tablespace-update2.patch (text/x-diff; charset=us-ascii)
colordiff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
new file mode 100644
index 1f6e02d..94e1a5d
*** a/src/backend/commands/dbcommands.c
--- b/src/backend/commands/dbcommands.c
*************** createdb(const CreatedbStmt *stmt)
*** 131,136 ****
--- 131,137 ----
  	int			notherbackends;
  	int			npreparedxacts;
  	createdb_failure_params fparms;
+ 	Snapshot	snapshot;
  
  	/* Extract options from the statement node tree */
  	foreach(option, stmt->options)
*************** createdb(const CreatedbStmt *stmt)
*** 535,540 ****
--- 536,555 ----
  	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
  
  	/*
+ 	 * Take a snapshot to use while scanning through pg_tablespace.  As we
+ 	 * create directories and copy files around, this process could take a
+ 	 * while, so also register that snapshot.
+ 	 *
+ 	 * Traversing pg_tablespace with an MVCC snapshot is necessary to provide
+ 	 * us with a consistent view of the tablespaces that exist.  Using
+ 	 * SnapshotNow here would risk seeing the same tablespace multiple times
+ 	 * or, worse, not seeing a tablespace at all if its tuple is moved around
+ 	 * by other concurrent operations (eg: due to the ACL on the tablespace
+ 	 * being updated).
+ 	 */
+ 	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ 
+ 	/*
  	 * Once we start copying subdirectories, we need to be able to clean 'em
  	 * up if we fail.  Use an ENSURE block to make sure this happens.  (This
  	 * is not a 100% solution, because of the possibility of failure during
*************** createdb(const CreatedbStmt *stmt)
*** 551,557 ****
  		 * each one to the new database.
  		 */
  		rel = heap_open(TableSpaceRelationId, AccessShareLock);
! 		scan = heap_beginscan(rel, SnapshotNow, 0, NULL);
  		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
  		{
  			Oid			srctablespace = HeapTupleGetOid(tuple);
--- 566,572 ----
  		 * each one to the new database.
  		 */
  		rel = heap_open(TableSpaceRelationId, AccessShareLock);
! 		scan = heap_beginscan(rel, snapshot, 0, NULL);
  		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
  		{
  			Oid			srctablespace = HeapTupleGetOid(tuple);
*************** createdb(const CreatedbStmt *stmt)
*** 656,661 ****
--- 671,679 ----
  	PG_END_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
  								PointerGetDatum(&fparms));
  
+ 	/* Free our snapshot */
+ 	UnregisterSnapshot(snapshot);
+ 
  	return dboid;
  }
  
#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#11)
Re: could not create directory "...": File exists

Stephen Frost <sfrost@snowman.net> writes:

> Patch attached. Passes the regression tests and our internal testing
> shows that it eliminates the error we were seeing (and doesn't cause
> blocking, which is even better).

> We have a workaround in place for our build system (more-or-less
> "don't do that" approach), but it'd really be great if this was
> back-patched and in the next round of point releases. Glancing
> through the branches, looks like it should apply pretty cleanly.

It looks like it will work back to 8.4; before that we didn't have
RegisterSnapshot. The patch could be adjusted for 8.3 if anyone
is sufficiently excited about it, but personally I'm not.

regards, tom lane

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#11)
Re: could not create directory "...": File exists

Stephen Frost <sfrost@snowman.net> writes:

> Tom,
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:

>> Don't see what. The main reason we've not yet attempted a global fix is
>> that the most straightforward way (take a new snapshot each time we
>> start a new SnapshotNow scan) seems too expensive. But CREATE DATABASE
>> is so expensive that the cost of an extra snapshot there ain't gonna
>> matter.

> Patch attached. Passes the regression tests and our internal testing
> shows that it eliminates the error we were seeing (and doesn't cause
> blocking, which is even better).

I committed this with a couple of changes:

* I used GetLatestSnapshot rather than GetTransactionSnapshot. Since
we don't allow these operations inside transaction blocks, there
shouldn't be much difference, but in principle GetTransactionSnapshot
is the wrong thing; in a serializable transaction it could be quite old.

* After reviewing the other uses of SnapshotNow in dbcommands.c, I
decided we'd better make the same type of change in remove_dbtablespaces
and check_db_file_conflict, because those are likewise doing filesystem
operations on the strength of what they find in pg_tablespace.

I also ended up deciding to back-patch to 8.3 as well.

regards, tom lane

#14 Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#13)
Re: could not create directory "...": File exists

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

> I committed this with a couple of changes:

Great, thanks!

> * I used GetLatestSnapshot rather than GetTransactionSnapshot. Since
> we don't allow these operations inside transaction blocks, there
> shouldn't be much difference, but in principle GetTransactionSnapshot
> is the wrong thing; in a serializable transaction it could be quite old.

Makes sense.

> * After reviewing the other uses of SnapshotNow in dbcommands.c, I
> decided we'd better make the same type of change in remove_dbtablespaces
> and check_db_file_conflict, because those are likewise doing filesystem
> operations on the strength of what they find in pg_tablespace.

Thanks for that. I had noticed the other places where we were using
SnapshotNow, but I hadn't run down if they needed to be changed or not.
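
Presumably those now follow the same pattern as the createdb() hunk --
something along these lines (my sketch of the pattern, not the committed
diff):

    Snapshot    snapshot;

    snapshot = RegisterSnapshot(GetLatestSnapshot());

    rel = heap_open(TableSpaceRelationId, AccessShareLock);
    scan = heap_beginscan(rel, snapshot, 0, NULL);
    while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
    {
        /* ... filesystem work driven by this pg_tablespace row ... */
    }
    heap_endscan(scan);
    heap_close(rel, AccessShareLock);

    UnregisterSnapshot(snapshot);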

> I also ended up deciding to back-patch to 8.3 as well.

Excellent, thanks again.

Stephen