including PID or backend ID in relpath of temp rels

Started by Robert Haas over 15 years ago, 17 messages

robertmhaas@gmail.com

over 15 years ago

Time for a new thread specific to this subject. For previous
discussion, see here:

http://archives.postgresql.org/pgsql-hackers/2010-04/msg01140.php
http://archives.postgresql.org/pgsql-hackers/2010-04/msg01152.php

I attempted to implement this by adding an isTemp argument to relpath,
but ran into problems. It turns out that when we create a temporary
relation and then exit the backend, the relation is merely truncated,
and it's the background writer which actually removes the file
following the next checkpoint. Therefore, relpath() for the temprel
must return the same answer in the background writer as it does in the
original backend, so passing isTemp isn't enough - we actually need to
pass whatever identifier we're including in the file name. As far as
I can see, though I'm not 100% sure of this, it looks like we never
actually ask the background writer to fsync any of these files because
we never fsync them at all; but we do ask it to remove them, which is
enough to create a problem. So, what to do about this? Ideas:

1. We could move the responsibility for removing the files associated
with temp rels from the background writer to the owning backend. I
think the reason why we initially truncate the files and only later
remove them is because somebody else might have 'em open, so it
mightn't be necessary for temp rels.

2. Instead of embedding a PID or backend ID in the filename, we could
just embed a boolean: isTemp or not? This seems like cutting
ourselves off from quite a bit of useful information but maybe it
would be OK. We could nuke all the temp stuff on cluster startup, but
we'd have to rely on catalog entries to identify orphaned files that
accumulated during normal running, which isn't ideal since one of our
long-term goals is to eliminate the need for those catalog entries.

3. We could change RelFileNode.relNode from an OID to an unsigned
32-bit integer driven off of a separate counter, and reserve some
portion of the 4 billion available values for temp relations. I doubt
we'd have enough bits to embed something like a PID though, so this
would end up being basically an embedded boolean, along the lines of
#2.

4. We could add an additional 32-bit value to RelFileNode to identify
the backend (or a sentinel value when not temp) and create a separate
structure XLogRelFileNode or PermRelFileNode or somesuch for use in
contexts where no temp rels are allowed.

Either #3 or #4 has some possible advantages for Hot Standby in terms
of perhaps making it feasible to assign relfilenodes on a standby
server without danger of conflicting with one already assigned on the
master.

5. ???

Thoughts?

...Robert
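
To make the constraint above concrete: if the file name embeds an identifier for the owning backend, then any process that later unlinks the file (such as the background writer) has to be handed that identifier, because a boolean alone cannot reproduce the name. The following is a minimal sketch, not code from any patch; `sketch_relpath`, `SKETCH_INVALID_BACKEND`, and the `t<backend>_<relnode>` naming are assumptions made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for this sketch; not the PostgreSQL definitions. */
typedef unsigned int Oid;
#define SKETCH_INVALID_BACKEND (-1)

/*
 * Build a relation file path into buf.  For a permanent rel the name is
 * just the relnode; for a temp rel the owning backend's ID is embedded in
 * the file name.  The "t%d_%u" format is an assumption of this sketch.
 * Returns snprintf's result (the length the full path would need).
 */
int
sketch_relpath(char *buf, size_t buflen, Oid dbnode, Oid relnode, int backend)
{
	assert(buf != NULL && buflen > 0);

	if (backend == SKETCH_INVALID_BACKEND)
		return snprintf(buf, buflen, "base/%u/%u", dbnode, relnode);

	return snprintf(buf, buflen, "base/%u/t%d_%u", dbnode, backend, relnode);
}
```

Two backends that create a temp rel with the same relnode get different paths here, which is why the caller that removes the file must pass the same backend value the creator used.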

Jaime Casanova

jcasanov@systemguards.com.ec

over 15 years ago

In reply to: Robert Haas (#1)

Re: including PID or backend ID in relpath of temp rels

On Sun, Apr 25, 2010 at 8:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:

1. We could move the responsibility for removing the files associated
with temp rels from the background writer to the owning backend. I
think the reason why we initially truncate the files and only later
remove them is because somebody else might have 'em open, so it
mightn't be necessary for temp rels.

what happens if the backend crash and obviously doesn't remove the
file associated with temp rels?

--
Sincerely,
Jaime Casanova
PostgreSQL support and training
Systems consulting and development
Guayaquil - Ecuador
Cel. +59387171157

Robert Haas

robertmhaas@gmail.com

over 15 years ago

In reply to: Jaime Casanova (#2)

Re: including PID or backend ID in relpath of temp rels

On Sun, Apr 25, 2010 at 10:19 PM, Jaime Casanova
<jcasanov@systemguards.com.ec> wrote:

On Sun, Apr 25, 2010 at 8:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:

1. We could move the responsibility for removing the files associated
with temp rels from the background writer to the owning backend. I
think the reason why we initially truncate the files and only later
remove them is because somebody else might have 'em open, so it
mightn't be necessary for temp rels.

what happens if the backend crash and obviously doesn't remove the
file associated with temp rels?

Currently, they just get orphaned. As I understand it, if the catalog
entry survives the crash, autovacuum will remove them 2 BILLION
transactions later (and emit warning messages in the meantime);
otherwise we won't even know they're there.

As I further understand it, the main point of this change is that if
temporary tables have a distinctive name of some kind, then we
can run through the directory and blow away files with those names
without fearing that it's *permanent* table data that somehow got
orphaned.
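
As a rough illustration of that cleanup idea, a directory scan only needs a predicate that recognizes the distinctive name. The sketch below assumes a hypothetical `t<backend>_<relnode>` naming scheme and a made-up helper `sketch_parse_temp_name`; it ignores fork suffixes and segment numbers, which a real scan would have to handle.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Return true if name has the form "t<digits>_<digits>", the hypothetical
 * temp-rel naming assumed by this sketch, storing the two numbers in
 * *backend and *relnode.  On a false return the outputs are unspecified.
 * Names with anything after the relnode (fork suffix, segment number)
 * are rejected here for simplicity.
 */
bool
sketch_parse_temp_name(const char *name,
					   unsigned long *backend, unsigned long *relnode)
{
	char	   *end;

	assert(name != NULL && backend != NULL && relnode != NULL);

	/* Must start with 't' followed by at least one digit. */
	if (name[0] != 't' || !isdigit((unsigned char) name[1]))
		return false;
	*backend = strtoul(name + 1, &end, 10);

	/* Then an underscore followed by at least one digit. */
	if (end[0] != '_' || !isdigit((unsigned char) end[1]))
		return false;
	*relnode = strtoul(end + 1, &end, 10);

	/* Nothing may follow the relnode. */
	return *end == '\0';
}
```

A file named `24576` is left alone by this predicate, so permanent table data cannot be matched by accident.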

...Robert

Robert Haas

robertmhaas@gmail.com

over 15 years ago

In reply to: Robert Haas (#1)

Re: including PID or backend ID in relpath of temp rels

On Sun, Apr 25, 2010 at 9:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:

4. We could add an additional 32-bit value to RelFileNode to identify
the backend (or a sentinel value when not temp) and create a separate
structure XLogRelFileNode or PermRelFileNode or somesuch for use in
contexts where no temp rels are allowed.

I experimented with this approach and created LocalRelFileNode and
GlobalRelFileNode and, for use in the buffer headers,
BufferRelFileNode (same as GlobalRelFileNode, but named differently
for clarity). LocalRelFileNode = GlobalRelFileNode + the ID of the
owning backend for temp rels; or InvalidBackendId if referencing a
non-temporary rel. These might not be the greatest names, but I think
the concept is good, because it really breaks the things that need to
be adjusted quite thoroughly. In the course of repairing the damage I
came across a couple of things I wasn't sure about:

[relcache.c] RelationInitPhysicalAddr can't initialize
relation->rd_node.backend properly for a non-local temporary relation,
because that information isn't available. But I'm not clear on why we
would need to create a relcache entry for a non-local temporary
relation. If we do need to, then we'll probably need to store the
backend ID in pg_class. That seems like something that would be best
avoided, all things being equal, especially since I can't see how to
generalize it to global temporary tables.

[smgr.c,inval.c] Do we need to call CacheInvalidSmgr for temporary
relations? I think the only backend that can have an smgr reference
to a temprel other than the owning backend is bgwriter, and AFAICS
bgwriter will only have such a reference if it's responding to a
request by the owning backend to unlink the associated files, in which
case (I think) the owning backend will have no reference.

[dbsize.c] As with relcache.c, there's a problem if we're asked for
the size of a temporary relation that is not our own: we can't call
relpath() without knowing the ID of the owning backend, and there's no
way to acquire that information from pg_class. I guess we could just
refuse to answer the question in that case, but that doesn't seem real
cool. Or we could physically scan the directory for files that match
a suitably constructed wildcard, I suppose.

[storage.c,xact.c,twophase.c] smgrGetPendingDeletes returns via an out
parameter (its second argument) a list of RelFileNodes pending delete,
which we then write to WAL or to the two-phase state file. Of course,
if the backend ID (or pid, but I picked backend ID somewhat
arbitrarily) is part of the filename, then we need to write that to
WAL, too. It seems somewhat unfortunate to have to WAL-log temprels
here; as best I can tell, this is the only case where it's necessary.
But if we implement a more general mechanism for cleaning up temp
files, then might the need to do this go away? Not sure.

[syncscan.c] It seems we pursue this optimization even for temprels; I
can't think of why that would be useful in practice. If it's useless
overhead, should we skip it? This is really independent of this
project; just a side thought.

...Robert

Alvaro Herrera

alvherre@commandprompt.com

over 15 years ago

In reply to: Robert Haas (#4)

Re: including PID or backend ID in relpath of temp rels

Robert Haas wrote:

[smgr.c,inval.c] Do we need to call CacheInvalidSmgr for temporary
relations? I think the only backend that can have an smgr reference
to a temprel other than the owning backend is bgwriter, and AFAICS
bgwriter will only have such a reference if it's responding to a
request by the owning backend to unlink the associated files, in which
case (I think) the owning backend will have no reference.

Hmm, wasn't there a proposal to have the owning backend delete the files
instead of asking the bgwriter to?

[dbsize.c] As with relcache.c, there's a problem if we're asked for
the size of a temporary relation that is not our own: we can't call
relpath() without knowing the ID of the owning backend, and there's no
way to acquire that information from pg_class. I guess we could just
refuse to answer the question in that case, but that doesn't seem real
cool. Or we could physically scan the directory for files that match
a suitably constructed wildcard, I suppose.

I don't very much like the wildcard idea; but I don't think it's
unreasonable to refuse to provide a file size. If the owning backend
has still got part of the table in local buffers, you'll get a
misleading answer, so perhaps it's best to not give an answer at all.

Maybe this problem could be solved if we could somehow force that
backend to write down its local buffers, in which case it'd be nice to
have a solution to the dbsize problem.

[syncscan.c] It seems we pursue this optimization even for temprels; I
can't think of why that would be useful in practice. If it's useless
overhead, should we skip it? This is really independent of this
project; just a side thought.

Maybe recently used buffers are more likely to be in the OS page cache,
so perhaps it's not good to disable it.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Robert Haas

robertmhaas@gmail.com

over 15 years ago

In reply to: Alvaro Herrera (#5)

Re: including PID or backend ID in relpath of temp rels

On Tue, May 4, 2010 at 2:06 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

Robert Haas wrote:

Hey, thanks for writing back! I just spent the last few hours
thinking about this and beating my head against the wall.

[smgr.c,inval.c] Do we need to call CacheInvalidSmgr for temporary
relations? I think the only backend that can have an smgr reference
to a temprel other than the owning backend is bgwriter, and AFAICS
bgwriter will only have such a reference if it's responding to a
request by the owning backend to unlink the associated files, in which
case (I think) the owning backend will have no reference.

Hmm, wasn't there a proposal to have the owning backend delete the files
instead of asking the bgwriter to?

I did propose that upthread; it may have been proposed previously
also. This might be worth doing independently of the rest of the patch
(which I'm starting to fear is doomed, cue ominous soundtrack) since
it would reduce the chance of orphaning data files and possibly
simplify the logic also.

[dbsize.c] As with relcache.c, there's a problem if we're asked for
the size of a temporary relation that is not our own: we can't call
relpath() without knowing the ID of the owning backend, and there's no
way to acquire that information from pg_class. I guess we could just
refuse to answer the question in that case, but that doesn't seem real
cool. Or we could physically scan the directory for files that match
a suitably constructed wildcard, I suppose.

I don't very much like the wildcard idea; but I don't think it's
unreasonable to refuse to provide a file size. If the owning backend
has still got part of the table in local buffers, you'll get a
misleading answer, so perhaps it's best to not give an answer at all.

Maybe this problem could be solved if we could somehow force that
backend to write down its local buffers, in which case it'd be nice to
have a solution to the dbsize problem.

I'm sure we could add some kind of signaling mechanism that would tell
all backends to flush their local buffers, but I'm not too sure it
would help this case very much, because you likely wouldn't want to
wait for all the backends to complete that process before reporting
results.

[syncscan.c] It seems we pursue this optimization even for temprels; I
can't think of why that would be useful in practice. If it's useless
overhead, should we skip it? This is really independent of this
project; just a side thought.

Maybe recently used buffers are more likely to be in the OS page cache,
so perhaps it's not good to disable it.

I don't get it. If the whole relation fits in the page cache, it
doesn't much matter where you start a seqscan. If it doesn't,
starting where the last one ended is anti-optimal.

...Robert

Alvaro Herrera

alvherre@commandprompt.com

over 15 years ago

In reply to: Robert Haas (#6)

Re: including PID or backend ID in relpath of temp rels

Robert Haas wrote:

On Tue, May 4, 2010 at 2:06 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

Robert Haas wrote:

Hey, thanks for writing back! I just spent the last few hours
thinking about this and beating my head against the wall.

:-)

[smgr.c,inval.c] Do we need to call CacheInvalidSmgr for temporary
relations? I think the only backend that can have an smgr reference
to a temprel other than the owning backend is bgwriter, and AFAICS
bgwriter will only have such a reference if it's responding to a
request by the owning backend to unlink the associated files, in which
case (I think) the owning backend will have no reference.

Hmm, wasn't there a proposal to have the owning backend delete the files
instead of asking the bgwriter to?

I did propose that upthread; it may have been proposed previously
also. This might be worth doing independently of the rest of the patch
(which I'm starting to fear is doomed, cue ominous soundtrack) since
it would reduce the chance of orphaning data files and possibly
simplify the logic also.

+1 for doing it separately, but hopefully that doesn't mean the rest of
this patch is doomed ...

[dbsize.c] As with relcache.c, there's a problem if we're asked for
the size of a temporary relation that is not our own: we can't call
relpath() without knowing the ID of the owning backend, and there's no
way to acquire that information from pg_class. I guess we could just
refuse to answer the question in that case, but that doesn't seem real
cool. Or we could physically scan the directory for files that match
a suitably constructed wildcard, I suppose.

I don't very much like the wildcard idea; but I don't think it's
unreasonable to refuse to provide a file size. If the owning backend
has still got part of the table in local buffers, you'll get a
misleading answer, so perhaps it's best to not give an answer at all.

Maybe this problem could be solved if we could somehow force that
backend to write down its local buffers, in which case it'd be nice to
have a solution to the dbsize problem.

I'm sure we could add some kind of signaling mechanism that would tell
all backends to flush their local buffers, but I'm not too sure it
would help this case very much, because you likely wouldn't want to
wait for all the backends to complete that process before reporting
results.

Hmm, I was thinking of the pg_relation_size function -- given this new
mechanism you could get an accurate size of temp tables for other
backends. I wasn't thinking of the pg_database_size function, and
perhaps it's better to *not* include temp tables in that report at all.

[syncscan.c] It seems we pursue this optimization even for temprels; I
can't think of why that would be useful in practice. If it's useless
overhead, should we skip it? This is really independent of this
project; just a side thought.

Maybe recently used buffers are more likely to be in the OS page cache,
so perhaps it's not good to disable it.

I don't get it. If the whole relation fits in the page cache, it
doesn't much matter where you start a seqscan. If it doesn't,
starting where the last one ended is anti-optimal.

Err, I was thinking that a syncscan started a bunch of pages earlier
than the point where the previous scan ended, but yeah, that's a bit
silly. Maybe we should just ignore syncscan in temp tables altogether,
as you propose.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Robert Haas

robertmhaas@gmail.com

over 15 years ago

In reply to: Alvaro Herrera (#7)

Re: including PID or backend ID in relpath of temp rels

On Tue, May 4, 2010 at 3:03 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

Hmm, wasn't there a proposal to have the owning backend delete the files
instead of asking the bgwriter to?

I did propose that upthread; it may have been proposed previously
also. This might be worth doing independently of the rest of the patch
(which I'm starting to fear is doomed, cue ominous soundtrack) since
it would reduce the chance of orphaning data files and possibly
simplify the logic also.

+1 for doing it separately, but hopefully that doesn't mean the rest of
this patch is doomed ...

I wonder if it would be possible to reject access to temporary
relations at a higher level. Right now, if you create a temporary
relation in one session, you can issue a SELECT statement against it
in another session, and get back 0 rows. If you then insert data
into it and select against it again, you'll get an error saying that
you can't access temporary tables of other sessions. If you try to
truncate somebody else's temporary relation, it fails; but if you try
to drop it, it works. In fact, you can even run ALTER TABLE ... ADD
COLUMN on somebody else's temp table, as long as you don't do anything
that requires a rewrite. CLUSTER fails; VACUUM and VACUUM FULL both
appear to work but apparently actually don't do anything under the
hood, so that database-wide vacuums don't barf. The whole thing seems
pretty leaky. It would be nice if we could find a small set of
control points where we basically reject ALL access to somebody else's
temp relations, period.

One possible thing we might do (bearing in mind that we might need to
wall off access at multiple levels) would be to forbid creating a
relcache entry for a non-local temprel. That would, in turn, forbid
doing pretty much anything to such a relation, although I'm not sure
what else would get broken in the process. But it would eliminate,
for example, all the checks for RELATION_IS_OTHER_TEMP, since that
Just Couldn't Happen. It would eliminate the need to install
specific handling for this case in dbsize.c - we'd just automatically
croak. And it's also probably necessary to do this anyhow if we want
to ever eliminate those CacheInvalidSmgr() calls for temp rels,
because if I can drop your temprel, that implies I can smgropen() it.

...Robert

Tom Lane

tgl@sss.pgh.pa.us

over 15 years ago

In reply to: Robert Haas (#8)

Re: including PID or backend ID in relpath of temp rels

Robert Haas <robertmhaas@gmail.com> writes:

One possible thing we might do (bearing in mind that we might need to
wall off access at multiple levels) would be to forbid creating a
relcache entry for a non-local temprel. That would, in turn, forbid
doing pretty much anything to such a relation, although I'm not sure
what else would get broken in the process.

Dropping temprels left behind by a crashed backend would get broken by
that; which is a deal-breaker, because we have to be able to clean those
up.

regards, tom lane

#10

Tom Lane

tgl@sss.pgh.pa.us

over 15 years ago

In reply to: Alvaro Herrera (#5)

Re: including PID or backend ID in relpath of temp rels

Alvaro Herrera <alvherre@commandprompt.com> writes:

I don't very much like the wildcard idea; but I don't think it's
unreasonable to refuse to provide a file size. If the owning backend
has still got part of the table in local buffers, you'll get a
misleading answer, so perhaps it's best to not give an answer at all.

FWIW, that's not the case, any more than it is for blocks in shared
buffer cache for regular rels. smgrextend() results in an observable
extension of the file EOF immediately, whether or not you can see
up-to-date data for those pages.

Now people have often complained about the extra I/O involved in that,
and it'd be nice to have a solution, but it's not clear to me that
fixing it would be harder for temprels than regular rels.

regards, tom lane

#11

Robert Haas

robertmhaas@gmail.com

over 15 years ago

In reply to: Tom Lane (#9)

Re: including PID or backend ID in relpath of temp rels

On Tue, May 4, 2010 at 5:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

One possible thing we might do (bearing in mind that we might need to
wall off access at multiple levels) would be to forbid creating a
relcache entry for a non-local temprel. That would, in turn, forbid
doing pretty much anything to such a relation, although I'm not sure
what else would get broken in the process.

Dropping temprels left behind by a crashed backend would get broken by
that; which is a deal-breaker, because we have to be able to clean those
up.

Phooey. It was such a good idea in my head.

...Robert

#12

Robert Haas

robertmhaas@gmail.com

over 15 years ago

In reply to: Robert Haas (#4)

Re: including PID or backend ID in relpath of temp rels

On Tue, Apr 27, 2010 at 9:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

[storage.c,xact.c,twophase.c] smgrGetPendingDeletes returns via an out
parameter (its second argument) a list of RelFileNodes pending delete,
which we then write to WAL or to the two-phase state file.

It appears that we are playing a little bit fast and loose with this.
I think that the two-phase code path is solid because we prohibit
PREPARE TRANSACTION if the transaction has referenced any temporary
tables, so when we read the two-phase state file it's safe to assume
that all the tables mentioned are non-temporary. But the ordinary
one-phase commit writes permanent and temporary relfilenodes to WAL
without distinction, and then, in xact_redo_commit() and
xact_redo_abort(), does this:

XLogDropRelation(xlrec->xnodes[i], fork);
smgrdounlink(srel, fork, false, true);

The third argument to smgrdounlink() is "isTemp", which we're here
passing as false, but might really be true. I don't think it
technically matters at present because the only effect of that
parameter right now is that we pass it through to
DropRelFileNodeBuffers(), which will drop shared buffers rather than
local buffers as a result of the incorrect setting. But that won't
matter because the WAL replay process shouldn't have any local buffers
anyway, since temp relations are not otherwise WAL-logged. For the
same reason, I don't think the call to XLogDropRelation() is an issue
because its only purpose is to remove entries from invalid_page_tab,
and there won't be any temporary pages in there anyway.

Of course if we're going to do $SUBJECT this will need to be changed
anyway, but assuming the above analysis is correct I think the
existing coding at least deserves a comment... then again, maybe I'm
all mixed up?

...Robert

#13

Robert Haas

robertmhaas@gmail.com

over 15 years ago

In reply to: Alvaro Herrera (#7)

1 attachment(s)

Re: including PID or backend ID in relpath of temp rels

On Tue, May 4, 2010 at 3:03 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

[smgr.c,inval.c] Do we need to call CacheInvalidSmgr for temporary
relations? I think the only backend that can have an smgr reference
to a temprel other than the owning backend is bgwriter, and AFAICS
bgwriter will only have such a reference if it's responding to a
request by the owning backend to unlink the associated files, in which
case (I think) the owning backend will have no reference.

This turns out to be wrong, I think. It seems that what we do is
prevent backends other than the owning backend from reading pages
from non-local temp rels into private or shared buffers, but we don't
actually prevent them from having smgr references. This allows
autovacuum to drop them, for example, in an anti-wraparound situation.
(Thanks to Tom for helping me get my head around this better.)

Hmm, wasn't there a proposal to have the owning backend delete the files
instead of asking the bgwriter to?

I did propose that upthread; it may have been proposed previously
also. This might be worth doing independently of the rest of the patch
(which I'm starting to fear is doomed, cue ominous soundtrack) since
it would reduce the chance of orphaning data files and possibly
simplify the logic also.

+1 for doing it separately, but hopefully that doesn't mean the rest of
this patch is doomed ...

Doom has been averted. Proposed patch attached. It passes regression
tests and seems to work, but could use additional testing and, of
course, some code-reading also.

Some notes on this patch:

It seems pretty clear that it isn't desirable to simply add backend ID
to RelFileNode, because there are too many places using RelFileNode
already for purposes where the backend ID can be inferred from
context, such as buffer headers and most of xlog. Instead, I
introduced BackendRelFileNode, which consists of an ordinary
RelFileNode augmented with a backend ID, and use that only where
needed. In particular, the smgr layer must use BackendRelFileNode
throughout, since it operates on both permanent and temporary
relations. smgr invalidations must also include the backend ID. xlog
generally happens only for non-temporary relations and can thus
continue to use an ordinary RelFileNode; however, commit/abort records
must use BackendRelFileNode as there may be physical storage
associated with temporary relations that must be unlinked.
Communication with the bgwriter must use BackendRelFileNode for
similar reasons. The relcache now stores rd_backend rather than
rd_islocaltemp so that it remains straightforward to call smgropen()
based on a relcache entry. Some smgr functions no longer require an
isTemp argument, because they can infer the necessary information from
their BackendRelFileNode. smgrwrite() and smgrextend() now take a
skipFsync argument rather than an isTemp argument.

I'm not totally sure whether it makes sense to do what we were talking
about above, viz, transfer unlink responsibility for temp rels from
the bgwriter to the owning backend. I haven't done that here. Nor
have I implemented any kind of improved temporary file cleanup
strategy, though I hope such a thing is possible.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachments:

temprelpath.patch (application/octet-stream)

diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index d1f7bcc..9645c95 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -373,8 +373,7 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
 	}
 
 	/* Truncate the unused VM pages, and send smgr inval message */
-	smgrtruncate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, newnblocks,
-				 rel->rd_istemp);
+	smgrtruncate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, newnblocks);
 
 	/*
 	 * We might as well update the local smgr_vm_nblocks setting. smgrtruncate
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 89ed8a0..06e304e 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -295,9 +295,8 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 	}
 
 	/*
-	 * Now write the page.	We say isTemp = true even if it's not a temp
-	 * index, because there's no need for smgr to schedule an fsync for this
-	 * write; we'll do it ourselves before ending the build.
+	 * Now write the page.	There's no need for smgr to schedule an fsync for
+	 * this write; we'll do it ourselves before ending the build.
 	 */
 	if (blkno == wstate->btws_pages_written)
 	{
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index d432c9d..865f2e1 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -865,8 +865,8 @@ StartPrepare(GlobalTransaction gxact)
 	hdr.prepared_at = gxact->prepared_at;
 	hdr.owner = gxact->owner;
 	hdr.nsubxacts = xactGetCommittedChildren(&children);
-	hdr.ncommitrels = smgrGetPendingDeletes(true, &commitrels, NULL);
-	hdr.nabortrels = smgrGetPendingDeletes(false, &abortrels, NULL);
+	hdr.ncommitrels = smgrGetPendingTwophaseDeletes(true, &commitrels);
+	hdr.nabortrels = smgrGetPendingTwophaseDeletes(false, &abortrels);
 	hdr.ninvalmsgs = xactGetCommittedInvalidationMessages(&invalmsgs,
 														  &hdr.initfileinval);
 	StrNCpy(hdr.gid, gxact->gid, GIDSIZE);
@@ -1320,13 +1320,13 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	}
 	for (i = 0; i < ndelrels; i++)
 	{
-		SMgrRelation srel = smgropen(delrels[i]);
+		SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
 		ForkNumber	fork;
 
 		for (fork = 0; fork <= MAX_FORKNUM; fork++)
 		{
 			if (smgrexists(srel, fork))
-				smgrdounlink(srel, fork, false, false);
+				smgrdounlink(srel, fork, false);
 		}
 		smgrclose(srel);
 	}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b88cff2..9c65bed 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -889,7 +889,7 @@ RecordTransactionCommit(void)
 	bool		markXidCommitted = TransactionIdIsValid(xid);
 	TransactionId latestXid = InvalidTransactionId;
 	int			nrels;
-	RelFileNode *rels;
+	BackendRelFileNode *rels;
 	bool		haveNonTemp;
 	int			nchildren;
 	TransactionId *children;
@@ -988,7 +988,7 @@ RecordTransactionCommit(void)
 		{
 			rdata[0].next = &(rdata[1]);
 			rdata[1].data = (char *) rels;
-			rdata[1].len = nrels * sizeof(RelFileNode);
+			rdata[1].len = nrels * sizeof(BackendRelFileNode);
 			rdata[1].buffer = InvalidBuffer;
 			lastrdata = 1;
 		}
@@ -1269,7 +1269,7 @@ RecordTransactionAbort(bool isSubXact)
 	TransactionId xid = GetCurrentTransactionIdIfAny();
 	TransactionId latestXid;
 	int			nrels;
-	RelFileNode *rels;
+	BackendRelFileNode *rels;
 	int			nchildren;
 	TransactionId *children;
 	XLogRecData rdata[3];
@@ -1330,7 +1330,7 @@ RecordTransactionAbort(bool isSubXact)
 	{
 		rdata[0].next = &(rdata[1]);
 		rdata[1].data = (char *) rels;
-		rdata[1].len = nrels * sizeof(RelFileNode);
+		rdata[1].len = nrels * sizeof(BackendRelFileNode);
 		rdata[1].buffer = InvalidBuffer;
 		lastrdata = 1;
 	}
@@ -4434,15 +4434,18 @@ xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid, XLogRecPtr lsn)
 	/* Make sure files supposed to be dropped are dropped */
 	for (i = 0; i < xlrec->nrels; i++)
 	{
-		SMgrRelation srel = smgropen(xlrec->xnodes[i]);
+		SMgrRelation srel;
 		ForkNumber	fork;
 
+		srel = smgropen(xlrec->xnodes[i].node, xlrec->xnodes[i].backend);
 		for (fork = 0; fork <= MAX_FORKNUM; fork++)
 		{
 			if (smgrexists(srel, fork))
 			{
-				XLogDropRelation(xlrec->xnodes[i], fork);
-				smgrdounlink(srel, fork, false, true);
+				/* temprel pages are not xlog'd, so can skip XLogDropRelation */
+				if (xlrec->xnodes[i].backend == InvalidBackendId)
+					XLogDropRelation(xlrec->xnodes[i].node, fork);
+				smgrdounlink(srel, fork, true);
 			}
 		}
 		smgrclose(srel);
@@ -4537,15 +4540,18 @@ xact_redo_abort(xl_xact_abort *xlrec, TransactionId xid)
 	/* Make sure files supposed to be dropped are dropped */
 	for (i = 0; i < xlrec->nrels; i++)
 	{
-		SMgrRelation srel = smgropen(xlrec->xnodes[i]);
+		SMgrRelation srel;
 		ForkNumber	fork;
 
+		srel = smgropen(xlrec->xnodes[i].node, xlrec->xnodes[i].backend);
 		for (fork = 0; fork <= MAX_FORKNUM; fork++)
 		{
 			if (smgrexists(srel, fork))
 			{
-				XLogDropRelation(xlrec->xnodes[i], fork);
-				smgrdounlink(srel, fork, false, true);
+				/* temprel pages are not xlog'd, so can skip XLogDropRelation */
+				if (xlrec->xnodes[i].backend == InvalidBackendId)
+					XLogDropRelation(xlrec->xnodes[i].node, fork);
+				smgrdounlink(srel, fork, true);
 			}
 		}
 		smgrclose(srel);
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 9ee2036..3675fa8 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -68,7 +68,7 @@ log_invalid_page(RelFileNode node, ForkNumber forkno, BlockNumber blkno,
 	 */
 	if (log_min_messages <= DEBUG1 || client_min_messages <= DEBUG1)
 	{
-		char	   *path = relpath(node, forkno);
+		char	   *path = relpathperm(node, forkno);
 
 		if (present)
 			elog(DEBUG1, "page %u of relation %s is uninitialized",
@@ -133,7 +133,7 @@ forget_invalid_pages(RelFileNode node, ForkNumber forkno, BlockNumber minblkno)
 		{
 			if (log_min_messages <= DEBUG2 || client_min_messages <= DEBUG2)
 			{
-				char	   *path = relpath(hentry->key.node, forkno);
+				char	   *path = relpathperm(hentry->key.node, forkno);
 
 				elog(DEBUG2, "page %u of relation %s has been dropped",
 					 hentry->key.blkno, path);
@@ -166,7 +166,7 @@ forget_invalid_pages_db(Oid dbid)
 		{
 			if (log_min_messages <= DEBUG2 || client_min_messages <= DEBUG2)
 			{
-				char	   *path = relpath(hentry->key.node, hentry->key.forkno);
+				char	   *path = relpathperm(hentry->key.node, hentry->key.forkno);
 
 				elog(DEBUG2, "page %u of relation %s has been dropped",
 					 hentry->key.blkno, path);
@@ -200,7 +200,7 @@ XLogCheckInvalidPages(void)
 	 */
 	while ((hentry = (xl_invalid_page *) hash_seq_search(&status)) != NULL)
 	{
-		char	   *path = relpath(hentry->key.node, hentry->key.forkno);
+		char	   *path = relpathperm(hentry->key.node, hentry->key.forkno);
 
 		if (hentry->present)
 			elog(WARNING, "page %u of relation %s was uninitialized",
@@ -278,7 +278,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	Assert(blkno != P_NEW);
 
 	/* Open the relation at smgr level */
-	smgr = smgropen(rnode);
+	smgr = smgropen(rnode, InvalidBackendId);
 
 	/*
 	 * Create the target file if it doesn't already exist.  This lets us cope
@@ -295,7 +295,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	if (blkno < lastblock)
 	{
 		/* page exists in file */
-		buffer = ReadBufferWithoutRelcache(rnode, false, forknum, blkno,
+		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
 										   mode, NULL);
 	}
 	else
@@ -314,7 +314,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 		{
 			if (buffer != InvalidBuffer)
 				ReleaseBuffer(buffer);
-			buffer = ReadBufferWithoutRelcache(rnode, false, forknum,
+			buffer = ReadBufferWithoutRelcache(rnode, forknum,
 											   P_NEW, mode, NULL);
 			lastblock++;
 		}
diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index 3edfc23..394c24f 100644
--- a/src/backend/catalog/catalog.c
+++ b/src/backend/catalog/catalog.c
@@ -78,12 +78,12 @@ forkname_to_number(char *forkName)
 }
 
 /*
- * relpath			- construct path to a relation's file
+ * relpathbackend - construct path to a relation's file
  *
  * Result is a palloc'd string.
  */
 char *
-relpath(RelFileNode rnode, ForkNumber forknum)
+relpathbackend(RelFileNode rnode, BackendId backend, ForkNumber forknum)
 {
 	int			pathlen;
 	char	   *path;
@@ -92,6 +92,7 @@ relpath(RelFileNode rnode, ForkNumber forknum)
 	{
 		/* Shared system relations live in {datadir}/global */
 		Assert(rnode.dbNode == 0);
+		Assert(backend == InvalidBackendId);
 		pathlen = 7 + OIDCHARS + 1 + FORKNAMECHARS + 1;
 		path = (char *) palloc(pathlen);
 		if (forknum != MAIN_FORKNUM)
@@ -103,29 +104,69 @@ relpath(RelFileNode rnode, ForkNumber forknum)
 	else if (rnode.spcNode == DEFAULTTABLESPACE_OID)
 	{
 		/* The default tablespace is {datadir}/base */
-		pathlen = 5 + OIDCHARS + 1 + OIDCHARS + 1 + FORKNAMECHARS + 1;
-		path = (char *) palloc(pathlen);
-		if (forknum != MAIN_FORKNUM)
-			snprintf(path, pathlen, "base/%u/%u_%s",
-					 rnode.dbNode, rnode.relNode, forkNames[forknum]);
+		if (backend == InvalidBackendId)
+		{
+			pathlen = 5 + OIDCHARS + 1 + OIDCHARS + 1 + FORKNAMECHARS + 1;
+			path = (char *) palloc(pathlen);
+			if (forknum != MAIN_FORKNUM)
+				snprintf(path, pathlen, "base/%u/%u_%s",
+						 rnode.dbNode, rnode.relNode,
+						 forkNames[forknum]);
+			else
+				snprintf(path, pathlen, "base/%u/%u",
+						 rnode.dbNode, rnode.relNode);
+		}
 		else
-			snprintf(path, pathlen, "base/%u/%u",
-					 rnode.dbNode, rnode.relNode);
+		{
+			/* OIDCHARS will suffice for an integer, too */
+			pathlen = 5 + OIDCHARS + 2 + OIDCHARS + 1 + OIDCHARS + 1
+					+ FORKNAMECHARS + 1;
+			path = (char *) palloc(pathlen);
+			if (forknum != MAIN_FORKNUM)
+				snprintf(path, pathlen, "base/%u/t%d_%u_%s",
+						 rnode.dbNode, backend, rnode.relNode,
+						 forkNames[forknum]);
+			else
+				snprintf(path, pathlen, "base/%u/t%d_%u",
+						 rnode.dbNode, backend, rnode.relNode);
+		}
 	}
 	else
 	{
 		/* All other tablespaces are accessed via symlinks */
-		pathlen = 9 + 1 + OIDCHARS + 1 + strlen(TABLESPACE_VERSION_DIRECTORY) +
-			1 + OIDCHARS + 1 + OIDCHARS + 1 + FORKNAMECHARS + 1;
-		path = (char *) palloc(pathlen);
-		if (forknum != MAIN_FORKNUM)
-			snprintf(path, pathlen, "pg_tblspc/%u/%s/%u/%u_%s",
-					 rnode.spcNode, TABLESPACE_VERSION_DIRECTORY,
-					 rnode.dbNode, rnode.relNode, forkNames[forknum]);
+		if (backend == InvalidBackendId)
+		{
+			pathlen = 9 + 1 + OIDCHARS + 1
+					+ strlen(TABLESPACE_VERSION_DIRECTORY) + 1 + OIDCHARS + 1
+					+ OIDCHARS + 1 + FORKNAMECHARS + 1;
+			path = (char *) palloc(pathlen);
+			if (forknum != MAIN_FORKNUM)
+				snprintf(path, pathlen, "pg_tblspc/%u/%s/%u/%u_%s",
+						 rnode.spcNode, TABLESPACE_VERSION_DIRECTORY,
+						 rnode.dbNode, rnode.relNode,
+						 forkNames[forknum]);
+			else
+				snprintf(path, pathlen, "pg_tblspc/%u/%s/%u/%u",
+						 rnode.spcNode, TABLESPACE_VERSION_DIRECTORY,
+						 rnode.dbNode, rnode.relNode);
+		}
 		else
-			snprintf(path, pathlen, "pg_tblspc/%u/%s/%u/%u",
-					 rnode.spcNode, TABLESPACE_VERSION_DIRECTORY,
-					 rnode.dbNode, rnode.relNode);
+		{
+			/* OIDCHARS will suffice for an integer, too */
+			pathlen = 9 + 1 + OIDCHARS + 1
+					+ strlen(TABLESPACE_VERSION_DIRECTORY) + 1 + OIDCHARS + 2
+					+ OIDCHARS + 1 + OIDCHARS + 1 + FORKNAMECHARS + 1;
+			path = (char *) palloc(pathlen);
+			if (forknum != MAIN_FORKNUM)
+				snprintf(path, pathlen, "pg_tblspc/%u/%s/%u/t%d_%u_%s",
+						 rnode.spcNode, TABLESPACE_VERSION_DIRECTORY,
+						 rnode.dbNode, backend, rnode.relNode,
+						 forkNames[forknum]);
+			else
+				snprintf(path, pathlen, "pg_tblspc/%u/%s/%u/t%d_%u",
+						 rnode.spcNode, TABLESPACE_VERSION_DIRECTORY,
+						 rnode.dbNode, backend, rnode.relNode);
+		}
 	}
 	return path;
 }
@@ -458,16 +499,17 @@ GetNewOidWithIndex(Relation relation, Oid indexId, AttrNumber oidcolumn)
  * created by bootstrap have preassigned OIDs, so there's no need.
  */
 Oid
-GetNewRelFileNode(Oid reltablespace, Relation pg_class)
+GetNewRelFileNode(Oid reltablespace, Relation pg_class, BackendId backend)
 {
-	RelFileNode rnode;
+	BackendRelFileNode rnode;
 	char	   *rpath;
 	int			fd;
 	bool		collides;
 
 	/* This logic should match RelationInitPhysicalAddr */
-	rnode.spcNode = reltablespace ? reltablespace : MyDatabaseTableSpace;
-	rnode.dbNode = (rnode.spcNode == GLOBALTABLESPACE_OID) ? InvalidOid : MyDatabaseId;
+	rnode.node.spcNode = reltablespace ? reltablespace : MyDatabaseTableSpace;
+	rnode.node.dbNode = (rnode.node.spcNode == GLOBALTABLESPACE_OID) ? InvalidOid : MyDatabaseId;
+	rnode.backend = backend;
 
 	do
 	{
@@ -475,9 +517,9 @@ GetNewRelFileNode(Oid reltablespace, Relation pg_class)
 
 		/* Generate the OID */
 		if (pg_class)
-			rnode.relNode = GetNewOid(pg_class);
+			rnode.node.relNode = GetNewOid(pg_class);
 		else
-			rnode.relNode = GetNewObjectId();
+			rnode.node.relNode = GetNewObjectId();
 
 		/* Check for existing file of same name */
 		rpath = relpath(rnode, MAIN_FORKNUM);
@@ -508,5 +550,5 @@ GetNewRelFileNode(Oid reltablespace, Relation pg_class)
 		pfree(rpath);
 	} while (collides);
 
-	return rnode.relNode;
+	return rnode.node.relNode;
 }
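
[Note on the catalog.c hunks above] The file naming that relpathbackend()
introduces can be sketched in isolation. The following is only an
illustration for the default tablespace, not the patch's code: the names
sketch_relpath and SKETCH_INVALID_BACKEND_ID and the local typedefs are
hypothetical stand-ins, and the value -1 for the invalid backend ID is an
assumption of this sketch. Permanent relations keep the existing
"base/<db>/<relfilenode>" form, while temp relations get a "t<backend>_"
prefix on the file name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-ins for the PostgreSQL typedefs; assumptions for this sketch only. */
typedef unsigned int Oid;
typedef int BackendId;
#define SKETCH_INVALID_BACKEND_ID (-1)

/*
 * Format the default-tablespace path of a relation fork into buf.
 * forkname is NULL for the main fork, otherwise a suffix such as "fsm".
 * Returns the number of characters the path needs (snprintf semantics).
 */
static int
sketch_relpath(char *buf, size_t buflen, Oid dbNode, Oid relNode,
               BackendId backend, const char *forkname)
{
    int n;

    assert(buf != NULL && buflen > 0);

    if (backend == SKETCH_INVALID_BACKEND_ID)
        n = snprintf(buf, buflen, "base/%u/%u", dbNode, relNode);
    else
        n = snprintf(buf, buflen, "base/%u/t%d_%u", dbNode, backend, relNode);

    /* Non-main forks append "_<forkname>", as in the patch's format strings. */
    if (forkname != NULL && n >= 0 && (size_t) n < buflen)
        n += snprintf(buf + n, buflen - n, "_%s", forkname);
    return n;
}
```

With these assumptions, backend 3 creating relfilenode 16385 in database
16384 would get "base/16384/t3_16385" for the main fork, so any process that
knows the backend ID can reconstruct the same path.
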
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index d848ef0..6e852b4 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -39,6 +39,7 @@
 #include "catalog/heap.h"
 #include "catalog/index.h"
 #include "catalog/indexing.h"
+#include "catalog/namespace.h"
 #include "catalog/pg_attrdef.h"
 #include "catalog/pg_constraint.h"
 #include "catalog/pg_inherits.h"
@@ -975,7 +976,9 @@ heap_create_with_catalog(const char *relname,
 			binary_upgrade_next_toast_relfilenode = InvalidOid;
 		}
 		else
-			relid = GetNewRelFileNode(reltablespace, pg_class_desc);
+			relid = GetNewRelFileNode(reltablespace, pg_class_desc,
+									  isTempOrToastNamespace(relnamespace) ?
+										  MyBackendId : InvalidBackendId);
 	}
 
 	/*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 69946fe..3408a10 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -645,7 +645,12 @@ index_create(Oid heapRelationId,
 			binary_upgrade_next_index_relfilenode = InvalidOid;
 		}
 		else
-			indexRelationId = GetNewRelFileNode(tableSpaceId, pg_class);
+		{
+			indexRelationId =
+				GetNewRelFileNode(tableSpaceId, pg_class,
+								  heapRelation->rd_istemp ?
+									MyBackendId : InvalidBackendId);
+		}
 	}
 
 	/*
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index ad376a1..2e729e5 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -52,7 +52,7 @@
 typedef struct PendingRelDelete
 {
 	RelFileNode relnode;		/* relation that may need to be deleted */
-	bool		isTemp;			/* is it a temporary relation? */
+	BackendId	backend;		/* InvalidBackendId if not a temp rel */
 	bool		atCommit;		/* T=delete at commit; F=delete at abort */
 	int			nestLevel;		/* xact nesting level of request */
 	struct PendingRelDelete *next;		/* linked-list link */
@@ -102,8 +102,9 @@ RelationCreateStorage(RelFileNode rnode, bool istemp)
 	XLogRecData rdata;
 	xl_smgr_create xlrec;
 	SMgrRelation srel;
+	BackendId	backend = istemp ? MyBackendId : InvalidBackendId;
 
-	srel = smgropen(rnode);
+	srel = smgropen(rnode, backend);
 	smgrcreate(srel, MAIN_FORKNUM, false);
 
 	if (!istemp)
@@ -125,7 +126,7 @@ RelationCreateStorage(RelFileNode rnode, bool istemp)
 	pending = (PendingRelDelete *)
 		MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
 	pending->relnode = rnode;
-	pending->isTemp = istemp;
+	pending->backend = backend;
 	pending->atCommit = false;	/* delete if abort */
 	pending->nestLevel = GetCurrentTransactionNestLevel();
 	pending->next = pendingDeletes;
@@ -145,7 +146,7 @@ RelationDropStorage(Relation rel)
 	pending = (PendingRelDelete *)
 		MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
 	pending->relnode = rel->rd_node;
-	pending->isTemp = rel->rd_istemp;
+	pending->backend = rel->rd_backend;
 	pending->atCommit = true;	/* delete if commit */
 	pending->nestLevel = GetCurrentTransactionNestLevel();
 	pending->next = pendingDeletes;
@@ -283,7 +284,7 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
 	}
 
 	/* Do the real work */
-	smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks, rel->rd_istemp);
+	smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
 }
 
 /*
@@ -322,14 +323,11 @@ smgrDoPendingDeletes(bool isCommit)
 				SMgrRelation srel;
 				int			i;
 
-				srel = smgropen(pending->relnode);
+				srel = smgropen(pending->relnode, pending->backend);
 				for (i = 0; i <= MAX_FORKNUM; i++)
 				{
 					if (smgrexists(srel, i))
-						smgrdounlink(srel,
-									 i,
-									 pending->isTemp,
-									 false);
+						smgrdounlink(srel, i, false);
 				}
 				smgrclose(srel);
 			}
@@ -344,7 +342,7 @@ smgrDoPendingDeletes(bool isCommit)
  * smgrGetPendingDeletes() -- Get a list of relations to be deleted.
  *
  * The return value is the number of relations scheduled for termination.
- * *ptr is set to point to a freshly-palloc'd array of RelFileNodes.
+ * *ptr is set to point to a freshly-palloc'd array of BackendRelFileNodes.
  * If there are no relations to be deleted, *ptr is set to NULL.
  *
  * If haveNonTemp isn't NULL, the bool it points to gets set to true if
@@ -354,11 +352,12 @@ smgrDoPendingDeletes(bool isCommit)
  * by upper-level transactions.
  */
 int
-smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr, bool *haveNonTemp)
+smgrGetPendingDeletes(bool forCommit, BackendRelFileNode **ptr,
+					  bool *haveNonTemp)
 {
 	int			nestLevel = GetCurrentTransactionNestLevel();
 	int			nrels;
-	RelFileNode *rptr;
+	BackendRelFileNode *rptr;
 	PendingRelDelete *pending;
 
 	nrels = 0;
@@ -374,6 +373,57 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr, bool *haveNonTemp)
 		*ptr = NULL;
 		return 0;
 	}
+	rptr = (BackendRelFileNode *) palloc(nrels * sizeof(BackendRelFileNode));
+	*ptr = rptr;
+	for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+	{
+		if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit)
+		{
+			rptr->node = pending->relnode;
+			rptr->backend = pending->backend;
+			rptr++;
+		}
+		if (haveNonTemp && pending->backend == InvalidBackendId)
+			*haveNonTemp = true;
+	}
+	return nrels;
+}
+
+/*
+ * smgrGetPendingTwophaseDeletes() -- Get a list of relations to be deleted.
+ *
+ * The return value is the number of relations scheduled for termination.
+ * *ptr is set to point to a freshly-palloc'd array of RelFileNodes.
+ * If there are no relations to be deleted, *ptr is set to NULL.
+ *
+ * This function serves essentially the same purpose as smgrGetPendingDeletes,
+ * but is intended specifically to handle prepared transactions, which are not
+ * permitted to touch temporary files.  Therefore, we can write just the
+ * RelFileNode to the two-phase state file; the backend can be assumed to be
+ * InvalidBackendId.
+ *
+ * Note that the list does not include anything scheduled for termination
+ * by upper-level transactions.
+ */
+int
+smgrGetPendingTwophaseDeletes(bool forCommit, RelFileNode **ptr)
+{
+	int			nestLevel = GetCurrentTransactionNestLevel();
+	int			nrels;
+	RelFileNode *rptr;
+	PendingRelDelete *pending;
+
+	nrels = 0;
+	for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+	{
+		if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit)
+			nrels++;
+	}
+	if (nrels == 0)
+	{
+		*ptr = NULL;
+		return 0;
+	}
 	rptr = (RelFileNode *) palloc(nrels * sizeof(RelFileNode));
 	*ptr = rptr;
 	for (pending = pendingDeletes; pending != NULL; pending = pending->next)
@@ -381,10 +431,9 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr, bool *haveNonTemp)
 		if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit)
 		{
 			*rptr = pending->relnode;
+			Assert(pending->backend == InvalidBackendId);
 			rptr++;
 		}
-		if (haveNonTemp && !pending->isTemp)
-			*haveNonTemp = true;
 	}
 	return nrels;
 }
@@ -456,7 +505,7 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
 		xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(record);
 		SMgrRelation reln;
 
-		reln = smgropen(xlrec->rnode);
+		reln = smgropen(xlrec->rnode, InvalidBackendId);
 		smgrcreate(reln, MAIN_FORKNUM, true);
 	}
 	else if (info == XLOG_SMGR_TRUNCATE)
@@ -465,7 +514,7 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
 		SMgrRelation reln;
 		Relation	rel;
 
-		reln = smgropen(xlrec->rnode);
+		reln = smgropen(xlrec->rnode, InvalidBackendId);
 
 		/*
 		 * Forcibly create relation if it doesn't exist (which suggests that
@@ -475,7 +524,7 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
 		 */
 		smgrcreate(reln, MAIN_FORKNUM, true);
 
-		smgrtruncate(reln, MAIN_FORKNUM, xlrec->blkno, false);
+		smgrtruncate(reln, MAIN_FORKNUM, xlrec->blkno);
 
 		/* Also tell xlogutils.c about it */
 		XLogTruncateRelation(xlrec->rnode, MAIN_FORKNUM, xlrec->blkno);
@@ -502,7 +551,7 @@ smgr_desc(StringInfo buf, uint8 xl_info, char *rec)
 	if (info == XLOG_SMGR_CREATE)
 	{
 		xl_smgr_create *xlrec = (xl_smgr_create *) rec;
-		char	   *path = relpath(xlrec->rnode, MAIN_FORKNUM);
+		char	   *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
 
 		appendStringInfo(buf, "file create: %s", path);
 		pfree(path);
@@ -510,7 +559,7 @@ smgr_desc(StringInfo buf, uint8 xl_info, char *rec)
 	else if (info == XLOG_SMGR_TRUNCATE)
 	{
 		xl_smgr_truncate *xlrec = (xl_smgr_truncate *) rec;
-		char	   *path = relpath(xlrec->rnode, MAIN_FORKNUM);
+		char	   *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
 
 		appendStringInfo(buf, "file truncate: %s to %u blocks", path,
 						 xlrec->blkno);
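
[Note on the storage.c hunks above] The reworked smgrGetPendingDeletes()
pairs each relfilenode with its owning backend before handing the array to
the commit/abort record. A simplified sketch of that collection step is
below. All Sketch* names are hypothetical, the struct layouts are inferred
from the .node/.backend usage in the diff, -1 as the invalid backend ID is
an assumption, and the nesting-level filter is left out to keep it short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in types; assumptions for this sketch only. */
typedef unsigned int Oid;
typedef int BackendId;
#define SKETCH_INVALID_BACKEND_ID (-1)

typedef struct { Oid spcNode; Oid dbNode; Oid relNode; } SketchRelFileNode;
typedef struct { SketchRelFileNode node; BackendId backend; } SketchBackendRelFileNode;

typedef struct SketchPending
{
    SketchRelFileNode relnode;  /* relation that may need to be deleted */
    BackendId   backend;        /* invalid ID if not a temp rel */
    bool        atCommit;       /* true = delete at commit */
    struct SketchPending *next;
} SketchPending;

/*
 * Copy entries matching forCommit into out[] (at most cap of them) and
 * return how many were copied.  *haveNonTemp is set when any list entry
 * belongs to no backend, i.e. is a permanent relation.
 */
static size_t
sketch_collect(const SketchPending *list, bool forCommit,
               SketchBackendRelFileNode *out, size_t cap, bool *haveNonTemp)
{
    size_t n = 0;

    assert(out != NULL || cap == 0);
    for (const SketchPending *p = list; p != NULL; p = p->next)
    {
        if (p->atCommit == forCommit && n < cap)
        {
            out[n].node = p->relnode;
            out[n].backend = p->backend;
            n++;
        }
        if (haveNonTemp != NULL && p->backend == SKETCH_INVALID_BACKEND_ID)
            *haveNonTemp = true;
    }
    return n;
}
```

Carrying the backend ID in each array element is what lets redo call
smgropen() with the right identity, at the cost of a larger per-relation
entry in the WAL record.
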
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 435dfdd..54cdc4a 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -195,7 +195,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid, Datum reloptio
 	 * Toast tables for regular relations go in pg_toast; those for temp
 	 * relations go into the per-backend temp-toast-table namespace.
 	 */
-	if (rel->rd_islocaltemp)
+	if (rel->rd_backend == MyBackendId)
 		namespaceid = GetTempToastNamespace();
 	else
 		namespaceid = PG_TOAST_NAMESPACE;
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 9d46e47..3d61116 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -1022,7 +1022,7 @@ DoCopy(const CopyStmt *stmt, const char *queryString)
 		}
 
 		/* check read-only transaction */
-		if (XactReadOnly && is_from && !cstate->rel->rd_islocaltemp)
+		if (XactReadOnly && is_from && cstate->rel->rd_backend != MyBackendId)
 			PreventCommandIfReadOnly("COPY FROM");
 
 		/* Don't allow COPY w/ OIDs to or from a table without them */
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index f52e1d8..f6867f4 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -470,7 +470,7 @@ nextval_internal(Oid relid)
 						RelationGetRelationName(seqrel))));
 
 	/* read-only transactions may only modify temp sequences */
-	if (!seqrel->rd_islocaltemp)
+	if (seqrel->rd_backend != MyBackendId)
 		PreventCommandIfReadOnly("nextval()");
 
 	if (elm->last != elm->cached)		/* some numbers were cached */
@@ -747,7 +747,7 @@ do_setval(Oid relid, int64 next, bool iscalled)
 						RelationGetRelationName(seqrel))));
 
 	/* read-only transactions may only modify temp sequences */
-	if (!seqrel->rd_islocaltemp)
+	if (seqrel->rd_backend != MyBackendId)
 		PreventCommandIfReadOnly("setval()");
 
 	/* lock page' buffer and read tuple */
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 9b5ce65..2515e6c 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -6968,13 +6968,13 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace)
 	 * Relfilenodes are not unique across tablespaces, so we need to allocate
 	 * a new one in the new tablespace.
 	 */
-	newrelfilenode = GetNewRelFileNode(newTableSpace, NULL);
+	newrelfilenode = GetNewRelFileNode(newTableSpace, NULL, rel->rd_backend);
 
 	/* Open old and new relation */
 	newrnode = rel->rd_node;
 	newrnode.relNode = newrelfilenode;
 	newrnode.spcNode = newTableSpace;
-	dstrel = smgropen(newrnode);
+	dstrel = smgropen(newrnode, rel->rd_backend);
 
 	RelationOpenSmgr(rel);
 
@@ -7053,7 +7053,7 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
 
 		/* XLOG stuff */
 		if (use_wal)
-			log_newpage(&dst->smgr_rnode, forkNum, blkno, page);
+			log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page);
 
 		/*
 		 * Now write the page.	We say isTemp = true even if it's not a temp
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 95e9d37..cf79de3 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -113,7 +113,7 @@
  */
 typedef struct
 {
-	RelFileNode rnode;
+	BackendRelFileNode rnode;
 	ForkNumber	forknum;
 	BlockNumber segno;			/* see md.c for special values */
 	/* might add a real request-type field later; not needed yet */
@@ -1071,7 +1071,8 @@ RequestCheckpoint(int flags)
  * than we have to here.
  */
 bool
-ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+ForwardFsyncRequest(BackendRelFileNode rnode, ForkNumber forknum,
+					BlockNumber segno)
 {
 	BgWriterRequest *request;
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index caae936..4dd894b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -95,7 +95,8 @@ static void WaitIO(volatile BufferDesc *buf);
 static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
 static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
 				  int set_flag_bits);
-static void buffer_write_error_callback(void *arg);
+static void shared_buffer_write_error_callback(void *arg);
+static void local_buffer_write_error_callback(void *arg);
 static volatile BufferDesc *BufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
 			BlockNumber blockNum,
 			BufferAccessStrategy strategy,
@@ -141,7 +142,8 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
 		int			buf_id;
 
 		/* create a tag so we can lookup the buffer */
-		INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode, forkNum, blockNum);
+		INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
+					   forkNum, blockNum);
 
 		/* determine its hash code and partition lock ID */
 		newHash = BufTableHashCode(&newTag);
@@ -251,18 +253,21 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
  *
- * NB: caller is assumed to know what it's doing if isTemp is true.
+ * NB: At present, this function may not be used on temporary relations, which
+ * is OK, because we only use it during XLOG replay.  If in the future we
+ * want to use it on temporary relations, we could pass the backend ID as an
+ * additional parameter.
  */
 Buffer
-ReadBufferWithoutRelcache(RelFileNode rnode, bool isTemp,
-						  ForkNumber forkNum, BlockNumber blockNum,
-						  ReadBufferMode mode, BufferAccessStrategy strategy)
+ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
+						  BlockNumber blockNum, ReadBufferMode mode,
+						  BufferAccessStrategy strategy)
 {
 	bool		hit;
 
-	SMgrRelation smgr = smgropen(rnode);
+	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
 
-	return ReadBuffer_common(smgr, isTemp, forkNum, blockNum, mode, strategy,
+	return ReadBuffer_common(smgr, false, forkNum, blockNum, mode, strategy,
 							 &hit);
 }
 
@@ -414,7 +419,7 @@ ReadBuffer_common(SMgrRelation smgr, bool isLocalBuf, ForkNumber forkNum,
 	{
 		/* new buffers are zero-filled */
 		MemSet((char *) bufBlock, 0, BLCKSZ);
-		smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, isLocalBuf);
+		smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
 	}
 	else
 	{
@@ -465,10 +470,10 @@ ReadBuffer_common(SMgrRelation smgr, bool isLocalBuf, ForkNumber forkNum,
 		VacuumCostBalance += VacuumCostPageMiss;
 
 	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rnode.spcNode,
-									  smgr->smgr_rnode.dbNode,
-									  smgr->smgr_rnode.relNode,
-									  isLocalBuf,
+									  smgr->smgr_rnode.node.spcNode,
+									  smgr->smgr_rnode.node.dbNode,
+									  smgr->smgr_rnode.node.relNode,
+									  smgr->smgr_rnode.backend,
 									  isExtend,
 									  found);
 
@@ -512,7 +517,7 @@ BufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
 	bool		valid;
 
 	/* create a tag so we can lookup the buffer */
-	INIT_BUFFERTAG(newTag, smgr->smgr_rnode, forkNum, blockNum);
+	INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
 
 	/* determine its hash code and partition lock ID */
 	newHash = BufTableHashCode(&newTag);
@@ -1693,21 +1698,24 @@ PrintBufferLeakWarning(Buffer buffer)
 	volatile BufferDesc *buf;
 	int32		loccount;
 	char	   *path;
+	BackendId	backend;
 
 	Assert(BufferIsValid(buffer));
 	if (BufferIsLocal(buffer))
 	{
 		buf = &LocalBufferDescriptors[-buffer - 1];
 		loccount = LocalRefCount[-buffer - 1];
+		backend = MyBackendId;
 	}
 	else
 	{
 		buf = &BufferDescriptors[buffer - 1];
 		loccount = PrivateRefCount[buffer - 1];
+		backend = InvalidBackendId;
 	}
 
 	/* theoretically we should lock the bufhdr here */
-	path = relpath(buf->tag.rnode, buf->tag.forkNum);
+	path = relpathbackend(buf->tag.rnode, backend, buf->tag.forkNum);
 	elog(WARNING,
 		 "buffer refcount leak: [%03d] "
 		 "(rel=%s, blockNum=%u, flags=0x%x, refcount=%u %d)",
@@ -1831,14 +1839,14 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
 		return;
 
 	/* Setup error traceback support for ereport() */
-	errcontext.callback = buffer_write_error_callback;
+	errcontext.callback = shared_buffer_write_error_callback;
 	errcontext.arg = (void *) buf;
 	errcontext.previous = error_context_stack;
 	error_context_stack = &errcontext;
 
 	/* Find smgr relation for buffer */
 	if (reln == NULL)
-		reln = smgropen(buf->tag.rnode);
+		reln = smgropen(buf->tag.rnode, InvalidBackendId);
 
 	TRACE_POSTGRESQL_BUFFER_FLUSH_START(buf->tag.forkNum,
 										buf->tag.blockNum,
@@ -1929,14 +1937,15 @@ RelationGetNumberOfBlocks(Relation relation)
  * --------------------------------------------------------------------
  */
 void
-DropRelFileNodeBuffers(RelFileNode rnode, ForkNumber forkNum, bool istemp,
+DropRelFileNodeBuffers(BackendRelFileNode rnode, ForkNumber forkNum,
 					   BlockNumber firstDelBlock)
 {
 	int			i;
 
-	if (istemp)
+	if (rnode.backend != InvalidBackendId)
 	{
-		DropRelFileNodeLocalBuffers(rnode, forkNum, firstDelBlock);
+		if (rnode.backend == MyBackendId)
+			DropRelFileNodeLocalBuffers(rnode.node, forkNum, firstDelBlock);
 		return;
 	}
 
@@ -1945,7 +1954,7 @@ DropRelFileNodeBuffers(RelFileNode rnode, ForkNumber forkNum, bool istemp,
 		volatile BufferDesc *bufHdr = &BufferDescriptors[i];
 
 		LockBufHdr(bufHdr);
-		if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
+		if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
 			bufHdr->tag.forkNum == forkNum &&
 			bufHdr->tag.blockNum >= firstDelBlock)
 			InvalidateBuffer(bufHdr);	/* releases spinlock */
@@ -2008,7 +2017,7 @@ PrintBufferDescs(void)
 			 "[%02d] (freeNext=%d, rel=%s, "
 			 "blockNum=%u, flags=0x%x, refcount=%u %d)",
 			 i, buf->freeNext,
-			 relpath(buf->tag.rnode, buf->tag.forkNum),
+			 relpathbackend(buf->tag.rnode, InvalidBackendId, buf->tag.forkNum),
 			 buf->tag.blockNum, buf->flags,
 			 buf->refcount, PrivateRefCount[i]);
 	}
@@ -2078,7 +2087,7 @@ FlushRelationBuffers(Relation rel)
 				ErrorContextCallback errcontext;
 
 				/* Setup error traceback support for ereport() */
-				errcontext.callback = buffer_write_error_callback;
+				errcontext.callback = local_buffer_write_error_callback;
 				errcontext.arg = (void *) bufHdr;
 				errcontext.previous = error_context_stack;
 				error_context_stack = &errcontext;
@@ -2087,7 +2096,7 @@ FlushRelationBuffers(Relation rel)
 						  bufHdr->tag.forkNum,
 						  bufHdr->tag.blockNum,
 						  (char *) LocalBufHdrGetBlock(bufHdr),
-						  true);
+						  false);
 
 				bufHdr->flags &= ~(BM_DIRTY | BM_JUST_DIRTIED);
 
@@ -2699,8 +2708,9 @@ AbortBufferIO(void)
 			if (sv_flags & BM_IO_ERROR)
 			{
 				/* Buffer is pinned, so we can read tag without spinlock */
-				char	   *path = relpath(buf->tag.rnode, buf->tag.forkNum);
+				char	   *path;
 
+				path = relpathperm(buf->tag.rnode, buf->tag.forkNum);
 				ereport(WARNING,
 						(errcode(ERRCODE_IO_ERROR),
 						 errmsg("could not write block %u of %s",
@@ -2714,17 +2724,36 @@ AbortBufferIO(void)
 }
 
 /*
- * Error context callback for errors occurring during buffer writes.
+ * Error context callback for errors occurring during shared buffer writes.
  */
 static void
-buffer_write_error_callback(void *arg)
+shared_buffer_write_error_callback(void *arg)
 {
 	volatile BufferDesc *bufHdr = (volatile BufferDesc *) arg;
 
 	/* Buffer is pinned, so we can read the tag without locking the spinlock */
 	if (bufHdr != NULL)
 	{
-		char	   *path = relpath(bufHdr->tag.rnode, bufHdr->tag.forkNum);
+		char	   *path = relpathperm(bufHdr->tag.rnode, bufHdr->tag.forkNum);
+
+		errcontext("writing block %u of relation %s",
+				   bufHdr->tag.blockNum, path);
+		pfree(path);
+	}
+}
+
+/*
+ * Error context callback for errors occurring during local buffer writes.
+ */
+static void
+local_buffer_write_error_callback(void *arg)
+{
+	volatile BufferDesc *bufHdr = (volatile BufferDesc *) arg;
+
+	if (bufHdr != NULL)
+	{
+		char	   *path = relpathbackend(bufHdr->tag.rnode, MyBackendId,
+										 bufHdr->tag.forkNum);
 
 		errcontext("writing block %u of relation %s",
 				   bufHdr->tag.blockNum, path);
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index c5b6a2c..bbf0a01 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -68,7 +68,7 @@ LocalPrefetchBuffer(SMgrRelation smgr, ForkNumber forkNum,
 	BufferTag	newTag;			/* identity of requested block */
 	LocalBufferLookupEnt *hresult;
 
-	INIT_BUFFERTAG(newTag, smgr->smgr_rnode, forkNum, blockNum);
+	INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
 
 	/* Initialize local buffers if first request in this session */
 	if (LocalBufHash == NULL)
@@ -110,7 +110,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 	int			trycounter;
 	bool		found;
 
-	INIT_BUFFERTAG(newTag, smgr->smgr_rnode, forkNum, blockNum);
+	INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
 
 	/* Initialize local buffers if first request in this session */
 	if (LocalBufHash == NULL)
@@ -127,7 +127,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 		Assert(BUFFERTAGS_EQUAL(bufHdr->tag, newTag));
 #ifdef LBDEBUG
 		fprintf(stderr, "LB ALLOC (%u,%d,%d) %d\n",
-				smgr->smgr_rnode.relNode, forkNum, blockNum, -b - 1);
+				smgr->smgr_rnode.node.relNode, forkNum, blockNum, -b - 1);
 #endif
 		/* this part is equivalent to PinBuffer for a shared buffer */
 		if (LocalRefCount[b] == 0)
@@ -150,7 +150,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 
 #ifdef LBDEBUG
 	fprintf(stderr, "LB ALLOC (%u,%d,%d) %d\n",
-		 smgr->smgr_rnode.relNode, forkNum, blockNum, -nextFreeLocalBuf - 1);
+		 smgr->smgr_rnode.node.relNode, forkNum, blockNum,
+		 -nextFreeLocalBuf - 1);
 #endif
 
 	/*
@@ -198,14 +199,14 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 		SMgrRelation oreln;
 
 		/* Find smgr relation for buffer */
-		oreln = smgropen(bufHdr->tag.rnode);
+		oreln = smgropen(bufHdr->tag.rnode, MyBackendId);
 
 		/* And write... */
 		smgrwrite(oreln,
 				  bufHdr->tag.forkNum,
 				  bufHdr->tag.blockNum,
 				  (char *) LocalBufHdrGetBlock(bufHdr),
-				  true);
+				  false);
 
 		/* Mark not-dirty now in case we error out below */
 		bufHdr->flags &= ~BM_DIRTY;
@@ -309,7 +310,8 @@ DropRelFileNodeLocalBuffers(RelFileNode rnode, ForkNumber forkNum,
 			if (LocalRefCount[i] != 0)
 				elog(ERROR, "block %u of %s is still referenced (local %u)",
 					 bufHdr->tag.blockNum,
-					 relpath(bufHdr->tag.rnode, bufHdr->tag.forkNum),
+					 relpathbackend(bufHdr->tag.rnode, MyBackendId,
+								   bufHdr->tag.forkNum),
 					 LocalRefCount[i]);
 			/* Remove entry from hashtable */
 			hresult = (LocalBufferLookupEnt *)
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 579572f..b43a2ce 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -303,7 +303,7 @@ FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks)
 	}
 
 	/* Truncate the unused FSM pages, and send smgr inval message */
-	smgrtruncate(rel->rd_smgr, FSM_FORKNUM, new_nfsmblocks, rel->rd_istemp);
+	smgrtruncate(rel->rd_smgr, FSM_FORKNUM, new_nfsmblocks);
 
 	/*
 	 * We might as well update the local smgr_fsm_nblocks setting.
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 4163ca0..79805b4 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -119,7 +119,7 @@ static MemoryContext MdCxt;		/* context for all md.c allocations */
  */
 typedef struct
 {
-	RelFileNode rnode;			/* the targeted relation */
+	BackendRelFileNode rnode;	/* the targeted relation */
 	ForkNumber	forknum;
 	BlockNumber segno;			/* which segment */
 } PendingOperationTag;
@@ -135,7 +135,7 @@ typedef struct
 
 typedef struct
 {
-	RelFileNode rnode;			/* the dead relation to delete */
+	BackendRelFileNode rnode;	/* the dead relation to delete */
 	CycleCtr	cycle_ctr;		/* mdckpt_cycle_ctr when request was made */
 } PendingUnlinkEntry;
 
@@ -158,14 +158,14 @@ static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum,
 	   ExtensionBehavior behavior);
 static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
 					   MdfdVec *seg);
-static void register_unlink(RelFileNode rnode);
+static void register_unlink(BackendRelFileNode rnode);
 static MdfdVec *_fdvec_alloc(void);
 static char *_mdfd_segpath(SMgrRelation reln, ForkNumber forknum,
 			  BlockNumber segno);
 static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forkno,
 			  BlockNumber segno, int oflags);
 static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
-			 BlockNumber blkno, bool isTemp, ExtensionBehavior behavior);
+			 BlockNumber blkno, bool skipFsync, ExtensionBehavior behavior);
 static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
 		   MdfdVec *seg);
 
@@ -321,7 +321,7 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
  * we are usually not in a transaction anymore when this is called.
  */
 void
-mdunlink(RelFileNode rnode, ForkNumber forkNum, bool isRedo)
+mdunlink(BackendRelFileNode rnode, ForkNumber forkNum, bool isRedo)
 {
 	char	   *path;
 	int			ret;
@@ -417,7 +417,7 @@ mdunlink(RelFileNode rnode, ForkNumber forkNum, bool isRedo)
  */
 void
 mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		 char *buffer, bool isTemp)
+		 char *buffer, bool skipFsync)
 {
 	off_t		seekpos;
 	int			nbytes;
@@ -440,7 +440,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 						relpath(reln->smgr_rnode, forknum),
 						InvalidBlockNumber)));
 
-	v = _mdfd_getseg(reln, forknum, blocknum, isTemp, EXTENSION_CREATE);
+	v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_CREATE);
 
 	seekpos = (off_t) BLCKSZ *(blocknum % ((BlockNumber) RELSEG_SIZE));
 
@@ -478,7 +478,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 				 errhint("Check free disk space.")));
 	}
 
-	if (!isTemp)
+	if (!skipFsync && !SmgrIsTemp(reln))
 		register_dirty_segment(reln, forknum, v);
 
 	Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
@@ -605,9 +605,10 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	MdfdVec    *v;
 
 	TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
-										reln->smgr_rnode.spcNode,
-										reln->smgr_rnode.dbNode,
-										reln->smgr_rnode.relNode);
+										reln->smgr_rnode.node.spcNode,
+										reln->smgr_rnode.node.dbNode,
+										reln->smgr_rnode.node.relNode,
+										reln->smgr_rnode.backend);
 
 	v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL);
 
@@ -624,9 +625,10 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ);
 
 	TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
-									   reln->smgr_rnode.spcNode,
-									   reln->smgr_rnode.dbNode,
-									   reln->smgr_rnode.relNode,
+									   reln->smgr_rnode.node.spcNode,
+									   reln->smgr_rnode.node.dbNode,
+									   reln->smgr_rnode.node.relNode,
+									   reln->smgr_rnode.backend,
 									   nbytes,
 									   BLCKSZ);
 
@@ -666,7 +668,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
  */
 void
 mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		char *buffer, bool isTemp)
+		char *buffer, bool skipFsync)
 {
 	off_t		seekpos;
 	int			nbytes;
@@ -678,11 +680,12 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 #endif
 
 	TRACE_POSTGRESQL_SMGR_MD_WRITE_START(forknum, blocknum,
-										 reln->smgr_rnode.spcNode,
-										 reln->smgr_rnode.dbNode,
-										 reln->smgr_rnode.relNode);
+										 reln->smgr_rnode.node.spcNode,
+										 reln->smgr_rnode.node.dbNode,
+										 reln->smgr_rnode.node.relNode,
+										 reln->smgr_rnode.backend);
 
-	v = _mdfd_getseg(reln, forknum, blocknum, isTemp, EXTENSION_FAIL);
+	v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_FAIL);
 
 	seekpos = (off_t) BLCKSZ *(blocknum % ((BlockNumber) RELSEG_SIZE));
 
@@ -697,9 +700,10 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
 
 	TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
-										reln->smgr_rnode.spcNode,
-										reln->smgr_rnode.dbNode,
-										reln->smgr_rnode.relNode,
+										reln->smgr_rnode.node.spcNode,
+										reln->smgr_rnode.node.dbNode,
+										reln->smgr_rnode.node.relNode,
+										reln->smgr_rnode.backend,
 										nbytes,
 										BLCKSZ);
 
@@ -720,7 +724,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 				 errhint("Check free disk space.")));
 	}
 
-	if (!isTemp)
+	if (!skipFsync && !SmgrIsTemp(reln))
 		register_dirty_segment(reln, forknum, v);
 }
 
@@ -794,8 +798,7 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
  *	mdtruncate() -- Truncate relation to specified number of blocks.
  */
 void
-mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks,
-		   bool isTemp)
+mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 {
 	MdfdVec    *v;
 	BlockNumber curnblk;
@@ -839,7 +842,7 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks,
 						 errmsg("could not truncate file \"%s\": %m",
 								FilePathName(v->mdfd_vfd))));
 
-			if (!isTemp)
+			if (!SmgrIsTemp(reln))
 				register_dirty_segment(reln, forknum, v);
 			v = v->mdfd_chain;
 			Assert(ov != reln->md_fd[forknum]); /* we never drop the 1st
@@ -864,7 +867,7 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks,
 					errmsg("could not truncate file \"%s\" to %u blocks: %m",
 						   FilePathName(v->mdfd_vfd),
 						   nblocks)));
-			if (!isTemp)
+			if (!SmgrIsTemp(reln))
 				register_dirty_segment(reln, forknum, v);
 			v = v->mdfd_chain;
 			ov->mdfd_chain = NULL;
@@ -1052,7 +1055,8 @@ mdsync(void)
 				 * the relation will have been dirtied through this same smgr
 				 * relation, and so we can save a file open/close cycle.
 				 */
-				reln = smgropen(entry->tag.rnode);
+				reln = smgropen(entry->tag.rnode.node,
+								entry->tag.rnode.backend);
 
 				/*
 				 * It is possible that the relation has been dropped or
@@ -1235,7 +1239,7 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
  * a remote pending-ops table.
  */
 static void
-register_unlink(RelFileNode rnode)
+register_unlink(BackendRelFileNode rnode)
 {
 	if (pendingOpsTable)
 	{
@@ -1278,7 +1282,8 @@ register_unlink(RelFileNode rnode)
  * structure for them.)
  */
 void
-RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+RememberFsyncRequest(BackendRelFileNode rnode, ForkNumber forknum,
+					 BlockNumber segno)
 {
 	Assert(pendingOpsTable);
 
@@ -1291,7 +1296,7 @@ RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 		hash_seq_init(&hstat, pendingOpsTable);
 		while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
 		{
-			if (RelFileNodeEquals(entry->tag.rnode, rnode) &&
+			if (BackendRelFileNodeEquals(entry->tag.rnode, rnode) &&
 				entry->tag.forknum == forknum)
 			{
 				/* Okay, cancel this entry */
@@ -1312,7 +1317,7 @@ RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 		hash_seq_init(&hstat, pendingOpsTable);
 		while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
 		{
-			if (entry->tag.rnode.dbNode == rnode.dbNode)
+			if (entry->tag.rnode.node.dbNode == rnode.node.dbNode)
 			{
 				/* Okay, cancel this entry */
 				entry->canceled = true;
@@ -1326,7 +1331,7 @@ RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
 			PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
 
 			next = lnext(cell);
-			if (entry->rnode.dbNode == rnode.dbNode)
+			if (entry->rnode.node.dbNode == rnode.node.dbNode)
 			{
 				pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
 				pfree(entry);
@@ -1393,7 +1398,7 @@ RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
  * ForgetRelationFsyncRequests -- forget any fsyncs for a rel
  */
 void
-ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
+ForgetRelationFsyncRequests(BackendRelFileNode rnode, ForkNumber forknum)
 {
 	if (pendingOpsTable)
 	{
@@ -1428,11 +1433,12 @@ ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
 void
 ForgetDatabaseFsyncRequests(Oid dbid)
 {
-	RelFileNode rnode;
+	BackendRelFileNode rnode;
 
-	rnode.dbNode = dbid;
-	rnode.spcNode = 0;
-	rnode.relNode = 0;
+	rnode.node.dbNode = dbid;
+	rnode.node.spcNode = 0;
+	rnode.node.relNode = 0;
+	rnode.backend = InvalidBackendId;
 
 	if (pendingOpsTable)
 	{
@@ -1523,12 +1529,12 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
  *		specified block.
  *
  * If the segment doesn't exist, we ereport, return NULL, or create the
- * segment, according to "behavior".  Note: isTemp need only be correct
- * in the EXTENSION_CREATE case.
+ * segment, according to "behavior".  Note: skipFsync is only used in the
+ * EXTENSION_CREATE case.
  */
 static MdfdVec *
 _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
-			 bool isTemp, ExtensionBehavior behavior)
+			 bool skipFsync, ExtensionBehavior behavior)
 {
 	MdfdVec    *v = mdopen(reln, forknum, behavior);
 	BlockNumber targetseg;
@@ -1566,7 +1572,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
 
 					mdextend(reln, forknum,
 							 nextsegno * ((BlockNumber) RELSEG_SIZE) - 1,
-							 zerobuf, isTemp);
+							 zerobuf, skipFsync);
 					pfree(zerobuf);
 				}
 				v->mdfd_chain = _mdfd_openseg(reln, forknum, +nextsegno, O_CREAT);
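
The md.c hunks above replace the old isTemp test with `!skipFsync && !SmgrIsTemp(reln)`. A minimal, self-contained sketch of that decision follows; it is not the patch's code. The types are simplified stand-ins, `needs_fsync_registration` is a hypothetical helper, and defining `SmgrIsTemp` as "backend differs from InvalidBackendId" is an assumption, since the macro itself does not appear in this part of the patch.

```c
#include <stdbool.h>

/* Simplified stand-ins for the PostgreSQL types; illustration only. */
typedef int BackendId;
#define InvalidBackendId (-1)

typedef struct SMgrRelationData
{
	BackendId	backend;		/* owning backend, or InvalidBackendId */
} SMgrRelationData;

/* Assumed definition: a relation is temp iff it has an owning backend. */
#define SmgrIsTemp(reln) ((reln)->backend != InvalidBackendId)

/*
 * Hypothetical helper mirroring the test used in mdextend() and mdwrite():
 * queue an fsync request only when the caller has not promised to fsync
 * the relation itself and the relation is not temporary.
 */
bool
needs_fsync_registration(const SMgrRelationData *reln, bool skipFsync)
{
	return !skipFsync && !SmgrIsTemp(reln);
}
```

Under these assumptions only a permanent relation written with skipFsync = false gets a pending fsync; mdtruncate() has no skipFsync argument, so it checks SmgrIsTemp alone.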
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 3b12cb3..ecf238d 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -45,19 +45,19 @@ typedef struct f_smgr
 	void		(*smgr_create) (SMgrRelation reln, ForkNumber forknum,
 											bool isRedo);
 	bool		(*smgr_exists) (SMgrRelation reln, ForkNumber forknum);
-	void		(*smgr_unlink) (RelFileNode rnode, ForkNumber forknum,
+	void		(*smgr_unlink) (BackendRelFileNode rnode, ForkNumber forknum,
 											bool isRedo);
 	void		(*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
-							BlockNumber blocknum, char *buffer, bool isTemp);
+							BlockNumber blocknum, char *buffer, bool skipFsync);
 	void		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
 											  BlockNumber blocknum);
 	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
 										  BlockNumber blocknum, char *buffer);
 	void		(*smgr_write) (SMgrRelation reln, ForkNumber forknum,
-							BlockNumber blocknum, char *buffer, bool isTemp);
+							BlockNumber blocknum, char *buffer, bool skipFsync);
 	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
-										   BlockNumber nblocks, bool isTemp);
+										   BlockNumber nblocks);
 	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_pre_ckpt) (void);		/* may be NULL */
 	void		(*smgr_sync) (void);	/* may be NULL */
@@ -83,8 +83,6 @@ static HTAB *SMgrRelationHash = NULL;
 
 /* local function prototypes */
 static void smgrshutdown(int code, Datum arg);
-static void smgr_internal_unlink(RelFileNode rnode, ForkNumber forknum,
-					 int which, bool isTemp, bool isRedo);
 
 
 /*
@@ -131,8 +129,9 @@ smgrshutdown(int code, Datum arg)
  *		This does not attempt to actually open the object.
  */
 SMgrRelation
-smgropen(RelFileNode rnode)
+smgropen(RelFileNode rnode, BackendId backend)
 {
+	BackendRelFileNode brnode;
 	SMgrRelation reln;
 	bool		found;
 
@@ -142,7 +141,7 @@ smgropen(RelFileNode rnode)
 		HASHCTL		ctl;
 
 		MemSet(&ctl, 0, sizeof(ctl));
-		ctl.keysize = sizeof(RelFileNode);
+		ctl.keysize = sizeof(BackendRelFileNode);
 		ctl.entrysize = sizeof(SMgrRelationData);
 		ctl.hash = tag_hash;
 		SMgrRelationHash = hash_create("smgr relation table", 400,
@@ -150,8 +149,10 @@ smgropen(RelFileNode rnode)
 	}
 
 	/* Look up or create an entry */
+	brnode.node = rnode;
+	brnode.backend = backend;
 	reln = (SMgrRelation) hash_search(SMgrRelationHash,
-									  (void *) &rnode,
+									  (void *) &brnode,
 									  HASH_ENTER, &found);
 
 	/* Initialize it if not present before */
@@ -261,7 +262,7 @@ smgrcloseall(void)
  * such entry exists already.
  */
 void
-smgrclosenode(RelFileNode rnode)
+smgrclosenode(BackendRelFileNode rnode)
 {
 	SMgrRelation reln;
 
@@ -305,8 +306,8 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 	 * should be here and not in commands/tablespace.c?  But that would imply
 	 * importing a lot of stuff that smgr.c oughtn't know, either.
 	 */
-	TablespaceCreateDbspace(reln->smgr_rnode.spcNode,
-							reln->smgr_rnode.dbNode,
+	TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode,
+							reln->smgr_rnode.node.dbNode,
 							isRedo);
 
 	(*(smgrsw[reln->smgr_which].smgr_create)) (reln, forknum, isRedo);
@@ -323,29 +324,19 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
  *		already.
  */
 void
-smgrdounlink(SMgrRelation reln, ForkNumber forknum, bool isTemp, bool isRedo)
+smgrdounlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 {
-	RelFileNode rnode = reln->smgr_rnode;
+	BackendRelFileNode rnode = reln->smgr_rnode;
 	int			which = reln->smgr_which;
 
 	/* Close the fork */
 	(*(smgrsw[which].smgr_close)) (reln, forknum);
 
-	smgr_internal_unlink(rnode, forknum, which, isTemp, isRedo);
-}
-
-/*
- * Shared subroutine that actually does the unlink ...
- */
-static void
-smgr_internal_unlink(RelFileNode rnode, ForkNumber forknum,
-					 int which, bool isTemp, bool isRedo)
-{
 	/*
 	 * Get rid of any remaining buffers for the relation.  bufmgr will just
 	 * drop them without bothering to write the contents.
 	 */
-	DropRelFileNodeBuffers(rnode, forknum, isTemp, 0);
+	DropRelFileNodeBuffers(rnode, forknum, 0);
 
 	/*
 	 * It'd be nice to tell the stats collector to forget it immediately, too.
@@ -385,10 +376,10 @@ smgr_internal_unlink(RelFileNode rnode, ForkNumber forknum,
  */
 void
 smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		   char *buffer, bool isTemp)
+		   char *buffer, bool skipFsync)
 {
 	(*(smgrsw[reln->smgr_which].smgr_extend)) (reln, forknum, blocknum,
-											   buffer, isTemp);
+											   buffer, skipFsync);
 }
 
 /*
@@ -426,16 +417,16 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
  *		on disk at return, only dumped out to the kernel.  However,
  *		provisions will be made to fsync the write before the next checkpoint.
  *
- *		isTemp indicates that the relation is a temp table (ie, is managed
- *		by the local-buffer manager).  In this case no provisions need be
- *		made to fsync the write before checkpointing.
+ *		skipFsync indicates that the caller will make other provisions to
+ *		fsync the relation, so we needn't bother.  Temporary relations also
+ *		do not require fsync.
  */
 void
 smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		  char *buffer, bool isTemp)
+		  char *buffer, bool skipFsync)
 {
 	(*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
-											  buffer, isTemp);
+											  buffer, skipFsync);
 }
 
 /*
@@ -455,14 +446,13 @@ smgrnblocks(SMgrRelation reln, ForkNumber forknum)
  * The truncation is done immediately, so this can't be rolled back.
  */
 void
-smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks,
-			 bool isTemp)
+smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 {
 	/*
 	 * Get rid of any buffers for the about-to-be-deleted blocks. bufmgr will
 	 * just drop them without bothering to write the contents.
 	 */
-	DropRelFileNodeBuffers(reln->smgr_rnode, forknum, isTemp, nblocks);
+	DropRelFileNodeBuffers(reln->smgr_rnode, forknum, nblocks);
 
 	/*
 	 * Send a shared-inval message to force other backends to close any smgr
@@ -479,8 +469,7 @@ smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks,
 	/*
 	 * Do the truncation.
 	 */
-	(*(smgrsw[reln->smgr_which].smgr_truncate)) (reln, forknum, nblocks,
-												 isTemp);
+	(*(smgrsw[reln->smgr_which].smgr_truncate)) (reln, forknum, nblocks);
 }
 
 /*
@@ -499,7 +488,7 @@ smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks,
  *		to use the WAL log for PITR or replication purposes: in that case
  *		we have to make WAL entries as well.)
  *
- *		The preceding writes should specify isTemp = true to avoid
+ *		The preceding writes should specify skipFsync = true to avoid
  *		duplicative fsyncs.
  *
  *		Note that you need to do FlushRelationBuffers() first if there is
diff --git a/src/backend/utils/adt/dbsize.c b/src/backend/utils/adt/dbsize.c
index a4e0252..dac2d72 100644
--- a/src/backend/utils/adt/dbsize.c
+++ b/src/backend/utils/adt/dbsize.c
@@ -256,14 +256,14 @@ pg_tablespace_size_name(PG_FUNCTION_ARGS)
  * calculate size of (one fork of) a relation
  */
 static int64
-calculate_relation_size(RelFileNode *rfn, ForkNumber forknum)
+calculate_relation_size(RelFileNode *rfn, BackendId backend, ForkNumber forknum)
 {
 	int64		totalsize = 0;
 	char	   *relationpath;
 	char		pathname[MAXPGPATH];
 	unsigned int segcount = 0;
 
-	relationpath = relpath(*rfn, forknum);
+	relationpath = relpathbackend(*rfn, backend, forknum);
 
 	for (segcount = 0;; segcount++)
 	{
@@ -303,7 +303,7 @@ pg_relation_size(PG_FUNCTION_ARGS)
 
 	rel = relation_open(relOid, AccessShareLock);
 
-	size = calculate_relation_size(&(rel->rd_node),
+	size = calculate_relation_size(&(rel->rd_node), rel->rd_backend,
 							  forkname_to_number(text_to_cstring(forkName)));
 
 	relation_close(rel, AccessShareLock);
@@ -327,12 +327,14 @@ calculate_toast_table_size(Oid toastrelid)
 
 	/* toast heap size, including FSM and VM size */
 	for (forkNum = 0; forkNum <= MAX_FORKNUM; forkNum++)
-		size += calculate_relation_size(&(toastRel->rd_node), forkNum);
+		size += calculate_relation_size(&(toastRel->rd_node),
+										toastRel->rd_backend, forkNum);
 
 	/* toast index size, including FSM and VM size */
 	toastIdxRel = relation_open(toastRel->rd_rel->reltoastidxid, AccessShareLock);
 	for (forkNum = 0; forkNum <= MAX_FORKNUM; forkNum++)
-		size += calculate_relation_size(&(toastIdxRel->rd_node), forkNum);
+		size += calculate_relation_size(&(toastIdxRel->rd_node),
+										toastIdxRel->rd_backend, forkNum);
 
 	relation_close(toastIdxRel, AccessShareLock);
 	relation_close(toastRel, AccessShareLock);
@@ -361,7 +363,8 @@ calculate_table_size(Oid relOid)
 	 * heap size, including FSM and VM
 	 */
 	for (forkNum = 0; forkNum <= MAX_FORKNUM; forkNum++)
-		size += calculate_relation_size(&(rel->rd_node), forkNum);
+		size += calculate_relation_size(&(rel->rd_node), rel->rd_backend,
+										forkNum);
 
 	/*
 	 * Size of toast relation
@@ -404,7 +407,9 @@ calculate_indexes_size(Oid relOid)
 			idxRel = relation_open(idxOid, AccessShareLock);
 
 			for (forkNum = 0; forkNum <= MAX_FORKNUM; forkNum++)
-				size += calculate_relation_size(&(idxRel->rd_node), forkNum);
+				size += calculate_relation_size(&(idxRel->rd_node),
+												idxRel->rd_backend,
+												forkNum);
 
 			relation_close(idxRel, AccessShareLock);
 		}
@@ -575,6 +580,7 @@ pg_relation_filepath(PG_FUNCTION_ARGS)
 	HeapTuple	tuple;
 	Form_pg_class relform;
 	RelFileNode rnode;
+	BackendId	backend;
 	char	   *path;
 
 	tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
@@ -612,12 +618,27 @@ pg_relation_filepath(PG_FUNCTION_ARGS)
 			break;
 	}
 
-	ReleaseSysCache(tuple);
-
 	if (!OidIsValid(rnode.relNode))
+	{
+		ReleaseSysCache(tuple);
 		PG_RETURN_NULL();
+	}
+
+	/* If temporary, determine owning backend. */
+	if (!relform->relistemp)
+		backend = InvalidBackendId;
+	else if (isTempOrToastNamespace(relform->relnamespace))
+		backend = MyBackendId;
+	else
+	{
+		/* Do it the hard way. */
+		backend = GetTempNamespaceBackendId(relform->relnamespace);
+		Assert(backend != InvalidBackendId);
+	}
+
+	ReleaseSysCache(tuple);
 
-	path = relpath(rnode, MAIN_FORKNUM);
+	path = relpathbackend(rnode, backend, MAIN_FORKNUM);
 
 	PG_RETURN_TEXT_P(cstring_to_text(path));
 }
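
The three-way choice of owning backend in pg_relation_filepath() above (and again in RelationBuildDesc() below) can be sketched in isolation. This is only an illustration: `resolve_owning_backend` is a hypothetical helper, `in_my_temp_namespace` stands in for the result of isTempOrToastNamespace(), and `namespace_owner` stands in for the result of GetTempNamespaceBackendId().

```c
#include <stdbool.h>

/* Simplified stand-ins for the PostgreSQL definitions; illustration only. */
typedef int BackendId;
#define InvalidBackendId (-1)

/*
 * Hypothetical helper: pick the backend ID to use when building the path
 * of a relation.  Permanent relations have no owning backend; our own temp
 * relations belong to us; anything else temporary belongs to whichever
 * backend owns the temp namespace.
 */
BackendId
resolve_owning_backend(bool relistemp, bool in_my_temp_namespace,
					   BackendId my_backend, BackendId namespace_owner)
{
	if (!relistemp)
		return InvalidBackendId;
	if (in_my_temp_namespace)
		return my_backend;
	return namespace_owner;		/* the slow lookup in the real code */
}
```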
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 3b15d85..91de76d 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -1165,7 +1165,7 @@ CacheInvalidateRelcacheByRelid(Oid relid)
  * replaying WAL as well as when creating it.
  */
 void
-CacheInvalidateSmgr(RelFileNode rnode)
+CacheInvalidateSmgr(BackendRelFileNode rnode)
 {
 	SharedInvalidationMessage msg;
 
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index d462510..91c653f 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -858,10 +858,20 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
 	relation->rd_createSubid = InvalidSubTransactionId;
 	relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
 	relation->rd_istemp = relation->rd_rel->relistemp;
-	if (relation->rd_istemp)
-		relation->rd_islocaltemp = isTempOrToastNamespace(relation->rd_rel->relnamespace);
+	if (!relation->rd_istemp)
+		relation->rd_backend = InvalidBackendId;
+	else if (isTempOrToastNamespace(relation->rd_rel->relnamespace))
+		relation->rd_backend = MyBackendId;
 	else
-		relation->rd_islocaltemp = false;
+	{
+		/*
+		 * If it's a temporary table, but not one of ours, we have to use
+		 * the slow, grotty method to figure out the owning backend.
+		 */
+		relation->rd_backend =
+			GetTempNamespaceBackendId(relation->rd_rel->relnamespace);
+		Assert(relation->rd_backend != InvalidBackendId);
+	}
 
 	/*
 	 * initialize the tuple descriptor (relation->rd_att).
@@ -1424,7 +1434,7 @@ formrdesc(const char *relationName, Oid relationReltype,
 	relation->rd_createSubid = InvalidSubTransactionId;
 	relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
 	relation->rd_istemp = false;
-	relation->rd_islocaltemp = false;
+	relation->rd_backend = InvalidBackendId;
 
 	/*
 	 * initialize relation tuple form
@@ -2515,7 +2525,7 @@ RelationBuildLocalRelation(const char *relname,
 
 	/* it is temporary if and only if it is in my temp-table namespace */
 	rel->rd_istemp = isTempOrToastNamespace(relnamespace);
-	rel->rd_islocaltemp = rel->rd_istemp;
+	rel->rd_backend = rel->rd_istemp ? MyBackendId : InvalidBackendId;
 
 	/*
 	 * create a new tuple descriptor from the one passed in.  We do this
@@ -2629,7 +2639,7 @@ void
 RelationSetNewRelfilenode(Relation relation, TransactionId freezeXid)
 {
 	Oid			newrelfilenode;
-	RelFileNode newrnode;
+	BackendRelFileNode newrnode;
 	Relation	pg_class;
 	HeapTuple	tuple;
 	Form_pg_class classform;
@@ -2640,7 +2650,8 @@ RelationSetNewRelfilenode(Relation relation, TransactionId freezeXid)
 		   TransactionIdIsNormal(freezeXid));
 
 	/* Allocate a new relfilenode */
-	newrelfilenode = GetNewRelFileNode(relation->rd_rel->reltablespace, NULL);
+	newrelfilenode = GetNewRelFileNode(relation->rd_rel->reltablespace, NULL,
+									   relation->rd_backend);
 
 	/*
 	 * Get a writable copy of the pg_class tuple for the given relation.
@@ -2660,9 +2671,10 @@ RelationSetNewRelfilenode(Relation relation, TransactionId freezeXid)
 	 * NOTE: any conflict in relfilenode value will be caught here, if
 	 * GetNewRelFileNode messes up for any reason.
 	 */
-	newrnode = relation->rd_node;
-	newrnode.relNode = newrelfilenode;
-	RelationCreateStorage(newrnode, relation->rd_istemp);
+	newrnode.node = relation->rd_node;
+	newrnode.node.relNode = newrelfilenode;
+	newrnode.backend = relation->rd_backend;
+	RelationCreateStorage(newrnode.node, relation->rd_istemp);
 	smgrclosenode(newrnode);
 
 	/*
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 8ccb948..b06f8ff 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -55,7 +55,7 @@ provider postgresql {
 	probe sort__done(bool, long);
 
 	probe buffer__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, bool, bool);
-	probe buffer__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, bool, bool, bool);
+	probe buffer__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool, bool);
 	probe buffer__flush__start(ForkNumber, BlockNumber, Oid, Oid, Oid);
 	probe buffer__flush__done(ForkNumber, BlockNumber, Oid, Oid, Oid);
 
@@ -81,10 +81,10 @@ provider postgresql {
 	probe twophase__checkpoint__start();
 	probe twophase__checkpoint__done();
 
-	probe smgr__md__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid);
-	probe smgr__md__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int);
-	probe smgr__md__write__start(ForkNumber, BlockNumber, Oid, Oid, Oid);
-	probe smgr__md__write__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int);
+	probe smgr__md__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int);
+	probe smgr__md__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int);
+	probe smgr__md__write__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int);
+	probe smgr__md__write__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int);
 
 	probe xlog__insert(unsigned char, unsigned char);
 	probe xlog__switch();
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 3bce72f..efec1f4 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -104,8 +104,8 @@ typedef struct xl_xact_commit
 	int			nmsgs;			/* number of shared inval msgs */
 	Oid			dbId;			/* MyDatabaseId */
 	Oid			tsId;			/* MyDatabaseTableSpace */
-	/* Array of RelFileNode(s) to drop at commit */
-	RelFileNode xnodes[1];		/* VARIABLE LENGTH ARRAY */
+	/* Array of BackendRelFileNode(s) to drop at commit */
+	BackendRelFileNode xnodes[1];	/* VARIABLE LENGTH ARRAY */
 	/* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
 	/* ARRAY OF SHARED INVALIDATION MESSAGES FOLLOWS */
 } xl_xact_commit;
@@ -133,7 +133,7 @@ typedef struct xl_xact_abort
 	int			nrels;			/* number of RelFileNodes */
 	int			nsubxacts;		/* number of subtransaction XIDs */
 	/* Array of RelFileNode(s) to drop at abort */
-	RelFileNode xnodes[1];		/* VARIABLE LENGTH ARRAY */
+	BackendRelFileNode xnodes[1];	/* VARIABLE LENGTH ARRAY */
 	/* ARRAY OF ABORTED SUBTRANSACTION XIDs FOLLOWS */
 } xl_xact_abort;
 
diff --git a/src/include/catalog/catalog.h b/src/include/catalog/catalog.h
index bd430cb..de96b8d 100644
--- a/src/include/catalog/catalog.h
+++ b/src/include/catalog/catalog.h
@@ -26,9 +26,15 @@
 extern const char *forkNames[];
 extern ForkNumber forkname_to_number(char *forkName);
 
-extern char *relpath(RelFileNode rnode, ForkNumber forknum);
+extern char *relpathbackend(RelFileNode rnode, BackendId backend,
+			  ForkNumber forknum);
 extern char *GetDatabasePath(Oid dbNode, Oid spcNode);
 
+#define relpath(rnode, forknum) \
+		relpathbackend((rnode).node, (rnode).backend, (forknum))
+#define relpathperm(rnode, forknum) \
+		relpathbackend((rnode), InvalidBackendId, (forknum))
+
 extern bool IsSystemRelation(Relation relation);
 extern bool IsToastRelation(Relation relation);
 
@@ -45,6 +51,7 @@ extern bool IsSharedRelation(Oid relationId);
 extern Oid	GetNewOid(Relation relation);
 extern Oid GetNewOidWithIndex(Relation relation, Oid indexId,
 				   AttrNumber oidcolumn);
-extern Oid	GetNewRelFileNode(Oid reltablespace, Relation pg_class);
+extern Oid	GetNewRelFileNode(Oid reltablespace, Relation pg_class,
+				  BackendId backend);
 
 #endif   /* CATALOG_H */
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 609f1c6..04a36a4 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -30,8 +30,9 @@ extern void RelationTruncate(Relation rel, BlockNumber nblocks);
  * naming
  */
 extern void smgrDoPendingDeletes(bool isCommit);
-extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr,
+extern int smgrGetPendingDeletes(bool forCommit, BackendRelFileNode **ptr,
 					  bool *haveNonTemp);
+extern int smgrGetPendingTwophaseDeletes(bool forCommit, RelFileNode **ptr);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
 extern void PostPrepare_smgr(void);
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 06a4e37..a22fbee 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -27,7 +27,7 @@ extern void BackgroundWriterMain(void);
 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
 
-extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
+extern bool ForwardFsyncRequest(BackendRelFileNode rnode, ForkNumber forknum,
 					BlockNumber segno);
 extern void AbsorbFsyncRequests(void);
 
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index f38f545..d4bf341 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -160,7 +160,7 @@ extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
 extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 				   BlockNumber blockNum, ReadBufferMode mode,
 				   BufferAccessStrategy strategy);
-extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode, bool isTemp,
+extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 						  ForkNumber forkNum, BlockNumber blockNum,
 						  ReadBufferMode mode, BufferAccessStrategy strategy);
 extern void ReleaseBuffer(Buffer buffer);
@@ -180,8 +180,8 @@ extern BlockNumber BufferGetBlockNumber(Buffer buffer);
 extern BlockNumber RelationGetNumberOfBlocks(Relation relation);
 extern void FlushRelationBuffers(Relation rel);
 extern void FlushDatabaseBuffers(Oid dbid);
-extern void DropRelFileNodeBuffers(RelFileNode rnode, ForkNumber forkNum,
-					   bool istemp, BlockNumber firstDelBlock);
+extern void DropRelFileNodeBuffers(BackendRelFileNode rnode,
+					   ForkNumber forkNum, BlockNumber firstDelBlock);
 extern void DropDatabaseBuffers(Oid dbid);
 
 #ifdef NOT_USED
diff --git a/src/include/storage/relfilenode.h b/src/include/storage/relfilenode.h
index 6f0c0ad..9bb0fa2 100644
--- a/src/include/storage/relfilenode.h
+++ b/src/include/storage/relfilenode.h
@@ -14,6 +14,8 @@
 #ifndef RELFILENODE_H
 #define RELFILENODE_H
 
+#include "storage/backendid.h"
+
 /*
  * The physical storage of a relation consists of one or more forks. The
  * main fork is always created, but in addition to that there can be
@@ -37,7 +39,8 @@ typedef enum ForkNumber
 
 /*
  * RelFileNode must provide all that we need to know to physically access
- * a relation. Note, however, that a "physical" relation is comprised of
+ * a relation, with the exception of the backend ID, which can be provided
+ * separately. Note, however, that a "physical" relation is comprised of
  * multiple files on the filesystem, as each fork is stored as a separate
  * file, and each fork can be divided into multiple segments. See md.c.
  *
@@ -74,14 +77,30 @@ typedef struct RelFileNode
 } RelFileNode;
 
 /*
- * Note: RelFileNodeEquals compares relNode first since that is most likely
- * to be different in two unequal RelFileNodes.  It is probably redundant
- * to compare spcNode if the other two fields are found equal, but do it
- * anyway to be sure.
+ * Augmenting a relfilenode with the backend ID provides all the information
+ * we need to locate the physical storage.
+ */
+typedef struct BackendRelFileNode
+{
+	RelFileNode	node;
+	BackendId	backend;
+} BackendRelFileNode;
+
+/*
+ * Note: RelFileNodeEquals and BackendRelFileNodeEquals compare relNode first
+ * since that is most likely to be different in two unequal RelFileNodes.  It
+ * is probably redundant to compare spcNode if the other fields are found equal,
+ * but do it anyway to be sure.
  */
 #define RelFileNodeEquals(node1, node2) \
 	((node1).relNode == (node2).relNode && \
 	 (node1).dbNode == (node2).dbNode && \
 	 (node1).spcNode == (node2).spcNode)
 
+#define BackendRelFileNodeEquals(node1, node2) \
+	((node1).node.relNode == (node2).node.relNode && \
+	 (node1).node.dbNode == (node2).node.dbNode && \
+	 (node1).backend == (node2).backend && \
+	 (node1).node.spcNode == (node2).node.spcNode)
+
 #endif   /* RELFILENODE_H */
diff --git a/src/include/storage/sinval.h b/src/include/storage/sinval.h
index bbdde81..0168d17 100644
--- a/src/include/storage/sinval.h
+++ b/src/include/storage/sinval.h
@@ -92,7 +92,7 @@ typedef struct
 typedef struct
 {
 	int16		id;				/* type field --- must be first */
-	RelFileNode rnode;			/* physical file ID */
+	BackendRelFileNode rnode;	/* physical file ID */
 } SharedInvalSmgrMsg;
 
 #define SHAREDINVALRELMAP_ID	(-4)
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index cf248b8..7ae78ec 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -16,6 +16,7 @@
 
 #include "access/xlog.h"
 #include "fmgr.h"
+#include "storage/backendid.h"
 #include "storage/block.h"
 #include "storage/relfilenode.h"
 
@@ -38,7 +39,7 @@
 typedef struct SMgrRelationData
 {
 	/* rnode is the hashtable lookup key, so it must be first! */
-	RelFileNode smgr_rnode;		/* relation physical identifier */
+	BackendRelFileNode smgr_rnode;		/* relation physical identifier */
 
 	/* pointer to owning pointer, or NULL if none */
 	struct SMgrRelationData **smgr_owner;
@@ -68,28 +69,30 @@ typedef struct SMgrRelationData
 
 typedef SMgrRelationData *SMgrRelation;
 
+#define SmgrIsTemp(smgr) \
+	((smgr)->smgr_rnode.backend != InvalidBackendId)
 
 extern void smgrinit(void);
-extern SMgrRelation smgropen(RelFileNode rnode);
+extern SMgrRelation smgropen(RelFileNode rnode, BackendId backend);
 extern bool smgrexists(SMgrRelation reln, ForkNumber forknum);
 extern void smgrsetowner(SMgrRelation *owner, SMgrRelation reln);
 extern void smgrclose(SMgrRelation reln);
 extern void smgrcloseall(void);
-extern void smgrclosenode(RelFileNode rnode);
+extern void smgrclosenode(BackendRelFileNode rnode);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdounlink(SMgrRelation reln, ForkNumber forknum,
-			 bool isTemp, bool isRedo);
+			 bool isRedo);
 extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
-		   BlockNumber blocknum, char *buffer, bool isTemp);
+		   BlockNumber blocknum, char *buffer, bool skipFsync);
 extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
 			 BlockNumber blocknum);
 extern void smgrread(SMgrRelation reln, ForkNumber forknum,
 		 BlockNumber blocknum, char *buffer);
 extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
-		  BlockNumber blocknum, char *buffer, bool isTemp);
+		  BlockNumber blocknum, char *buffer, bool skipFsync);
 extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
 extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
-			 BlockNumber nblocks, bool isTemp);
+			 BlockNumber nblocks);
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void smgrpreckpt(void);
 extern void smgrsync(void);
@@ -103,27 +106,28 @@ extern void mdinit(void);
 extern void mdclose(SMgrRelation reln, ForkNumber forknum);
 extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
-extern void mdunlink(RelFileNode rnode, ForkNumber forknum, bool isRedo);
+extern void mdunlink(BackendRelFileNode rnode, ForkNumber forknum, bool isRedo);
 extern void mdextend(SMgrRelation reln, ForkNumber forknum,
-		 BlockNumber blocknum, char *buffer, bool isTemp);
+		 BlockNumber blocknum, char *buffer, bool skipFsync);
 extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
 		   BlockNumber blocknum);
 extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	   char *buffer);
 extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
-		BlockNumber blocknum, char *buffer, bool isTemp);
+		BlockNumber blocknum, char *buffer, bool skipFsync);
 extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
 extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
-		   BlockNumber nblocks, bool isTemp);
+		   BlockNumber nblocks);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void mdpreckpt(void);
 extern void mdsync(void);
 extern void mdpostckpt(void);
 
 extern void SetForwardFsyncRequests(void);
-extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
+extern void RememberFsyncRequest(BackendRelFileNode rnode, ForkNumber forknum,
 					 BlockNumber segno);
-extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
+extern void ForgetRelationFsyncRequests(BackendRelFileNode rnode,
+							ForkNumber forknum);
 extern void ForgetDatabaseFsyncRequests(Oid dbid);
 
 /* smgrtype.c */
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index a86a17c..7fab919 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -49,7 +49,7 @@ extern void CacheInvalidateRelcacheByTuple(HeapTuple classTuple);
 
 extern void CacheInvalidateRelcacheByRelid(Oid relid);
 
-extern void CacheInvalidateSmgr(RelFileNode rnode);
+extern void CacheInvalidateSmgr(BackendRelFileNode rnode);
 
 extern void CacheInvalidateRelmap(Oid databaseId);
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 444e892..296c651 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -127,7 +127,7 @@ typedef struct RelationData
 	struct SMgrRelationData *rd_smgr;	/* cached file handle, or NULL */
 	int			rd_refcnt;		/* reference count */
 	bool		rd_istemp;		/* rel is a temporary relation */
-	bool		rd_islocaltemp; /* rel is a temp rel of this session */
+	BackendId	rd_backend;		/* owning backend id, if temporary relation */
 	bool		rd_isnailed;	/* rel is nailed in cache */
 	bool		rd_isvalid;		/* relcache entry is valid */
 	char		rd_indexvalid;	/* state of rd_indexlist: 0 = not valid, 1 =
@@ -347,7 +347,7 @@ typedef struct StdRdOptions
 #define RelationOpenSmgr(relation) \
 	do { \
 		if ((relation)->rd_smgr == NULL) \
-			smgrsetowner(&((relation)->rd_smgr), smgropen((relation)->rd_node)); \
+			smgrsetowner(&((relation)->rd_smgr), smgropen((relation)->rd_node, (relation)->rd_backend)); \
 	} while (0)
 
 /*
@@ -393,7 +393,7 @@ typedef struct StdRdOptions
  * Beware of multiple eval of argument
  */
 #define RELATION_IS_LOCAL(relation) \
-	((relation)->rd_islocaltemp || \
+	((relation)->rd_backend == MyBackendId || \
 	 (relation)->rd_createSubid != InvalidSubTransactionId)
 
 /*
@@ -403,7 +403,7 @@ typedef struct StdRdOptions
  * Beware of multiple eval of argument
  */
 #define RELATION_IS_OTHER_TEMP(relation) \
-	((relation)->rd_istemp && !(relation)->rd_islocaltemp)
+	((relation)->rd_istemp && (relation)->rd_backend != MyBackendId)
 
 /* routines in utils/cache/relcache.c */
 extern void RelationIncrementReferenceCount(Relation rel);

#14

Jim Nasby

decibel@decibel.org

over 15 years ago

In reply to: Robert Haas (#13)

Re: including PID or backend ID in relpath of temp rels

On May 6, 2010, at 10:24 PM, Robert Haas wrote:

On Tue, May 4, 2010 at 3:03 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

[smgr.c,inval.c] Do we need to call CacheInvalidateSmgr for temporary
relations? I think the only backend that can have an smgr reference
to a temprel other than the owning backend is bgwriter, and AFAICS
bgwriter will only have such a reference if it's responding to a
request by the owning backend to unlink the associated files, in which
case (I think) the owning backend will have no reference.

This turns out to be wrong, I think. It seems that what we do is
prevent backends other than the opening backend from reading pages
from non-local temp rels into private or shared buffers, but we don't
actually prevent them from having smgr references. This allows
autovacuum to drop them, for example, in an anti-wraparound situation.
(Thanks to Tom for helping me get my head around this better.)

Hmm, wasn't there a proposal to have the owning backend delete the files
instead of asking the bgwriter to?

I did propose that upthread; it may have been proposed previously
also. This might be worth doing independently of the rest of the patch
(which I'm starting to fear is doomed, cue ominous soundtrack) since
it would reduce the chance of orphaning data files and possibly
simplify the logic also.

+1 for doing it separately, but hopefully that doesn't mean the rest of
this patch is doomed ...

Doom has been averted. Proposed patch attached. It passes regression
tests and seems to work, but could use additional testing and, of
course, some code-reading also.

Some notes on this patch:

It seems pretty clear that it isn't desirable to simply add backend ID
to RelFileNode, because there are too many places using RelFileNode
already for purposes where the backend ID can be inferred from
context, such as buffer headers and most of xlog. Instead, I
introduced BackendRelFileNode, which consists of an ordinary
RelFileNode augmented with a backend ID, and use that only where
needed. In particular, the smgr layer must use BackendRelFileNode
throughout, since it operates on both permanent and temporary
relations. smgr invalidations must also include the backend ID. xlog
generally happens only for non-temporary relations and can thus
continue to use an ordinary RelFileNode; however, commit/abort records
must use BackendRelFileNode as there may be physical storage
associated with temporary relations that must be unlinked.
Communication with the bgwriter must use BackendRelFileNode for
similar reasons. The relcache now stores rd_backend rather than
rd_islocaltemp so that it remains straightforward to call smgropen()
based on a relcache entry. Some smgr functions no longer require an
isTemp argument, because they can infer the necessary information from
their BackendRelFileNode. smgrwrite() and smgrextend() now take a
skipFsync argument rather than an isTemp argument.
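
To make the layering described above concrete, here is a minimal standalone sketch. The two structs mirror the definitions in the patch; the typedefs and the InvalidBackendId value are local stand-ins so the fragment compiles outside the tree, and make_brnode(), brnode_is_temp(), and brnode_equals() are hypothetical helper names used only for illustration (the patch itself uses the SmgrIsTemp and BackendRelFileNodeEquals macros).

```c
#include <stdbool.h>

/* Local stand-ins for the PostgreSQL typedefs (assumptions for this sketch). */
typedef unsigned int Oid;
typedef int BackendId;
#define InvalidBackendId (-1)

typedef struct RelFileNode
{
    Oid spcNode;    /* tablespace */
    Oid dbNode;     /* database */
    Oid relNode;    /* relation */
} RelFileNode;

/* As in the patch: an ordinary RelFileNode augmented with a backend ID. */
typedef struct BackendRelFileNode
{
    RelFileNode node;
    BackendId   backend;
} BackendRelFileNode;

/* Hypothetical helper: pair a RelFileNode with its owning backend. */
static BackendRelFileNode
make_brnode(RelFileNode rnode, BackendId backend)
{
    BackendRelFileNode brnode;

    brnode.node = rnode;
    brnode.backend = backend;
    return brnode;
}

/* Same test as the patch's SmgrIsTemp: a valid backend ID means "temporary". */
static bool
brnode_is_temp(BackendRelFileNode brnode)
{
    return brnode.backend != InvalidBackendId;
}

/* Same comparison order as BackendRelFileNodeEquals: relNode first. */
static bool
brnode_equals(BackendRelFileNode a, BackendRelFileNode b)
{
    return a.node.relNode == b.node.relNode &&
           a.node.dbNode == b.node.dbNode &&
           a.backend == b.backend &&
           a.node.spcNode == b.node.spcNode;
}
```

With this shape, two temporary relations that share a relfilenode but belong to different backends compare as unequal, which is why callers that handle both permanent and temporary relations need the augmented key.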

I'm not totally sure whether it makes sense to do what we were talking
about above, viz, transfer unlink responsibility for temp rels from
the bgwriter to the owning backend. I haven't done that here. Nor
have I implemented any kind of improved temporary file cleanup
strategy, though I hope such a thing is possible.

Any particular reason not to use directories to help organize things? IE:

base/database_oid/pg_temp_rels/backend_pid/relfilenode

Perhaps relfilenode should be something else.

This seems to have several advantages:

1: It's more organized. If you want to see all the files for a single backend you have just one place to look. Finding everything is still easy via filesystem find.
2: Cleanup becomes easier. When a backend exits, its entire directory goes away. On server start, everything under pg_temp_rels goes away. Unfortunately we still have a race condition with cleaning up if a backend dies and can't run its own cleanup, though I think that any time that happens we're going to restart everything anyway.
3: It separates all the temporary stuff away from real files.

The only downside I see is some extra code to create the backend_pid directory.
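
For illustration, the proposed layout could be built with something like the sketch below. The function name temp_rel_path() and its parameters are hypothetical, and the Oid typedef is a local stand-in; nothing like this exists in the patch.

```c
#include <stdio.h>

typedef unsigned int Oid;   /* local stand-in for the PostgreSQL typedef */

/*
 * Hypothetical helper: format
 *   base/<database_oid>/pg_temp_rels/<backend_pid>/<relfilenode>
 * into buf. Returns what snprintf returns, so a result >= buflen means
 * the path was truncated.
 */
static int
temp_rel_path(char *buf, size_t buflen, Oid database_oid,
              int backend_pid, Oid relfilenode)
{
    return snprintf(buf, buflen, "base/%u/pg_temp_rels/%d/%u",
                    database_oid, backend_pid, relfilenode);
}
```

The per-backend directory would have to be created before the first temporary relation file, which is the extra code mentioned above.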
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net

#15

Robert Haas

robertmhaas@gmail.com

over 15 years ago

In reply to: Jim Nasby (#14)

Re: including PID or backend ID in relpath of temp rels

On Mon, May 17, 2010 at 2:10 PM, Jim Nasby <decibel@decibel.org> wrote:

Any particular reason not to use directories to help organize things? IE:

base/database_oid/pg_temp_rels/backend_pid/relfilenode

Perhaps relfilenode should be something else.

This seems to have several advantages:

1: It's more organized. If you want to see all the files for a single backend you have just one place to look. Finding everything is still easy via filesystem find.
2: Cleanup becomes easier. When a backend exits, its entire directory goes away. On server start, everything under pg_temp_rels goes away. Unfortunately we still have a race condition with cleaning up if a backend dies and can't run its own cleanup, though I think that any time that happens we're going to restart everything anyway.
3: It separates all the temporary stuff away from real files.

The only downside I see is some extra code to create the backend_pid directory.

I like the idea of using directories to organize things better and I
completely agree with points #1 and #3. Point #2 is a little more
complicated, I think, and something I've been struggling with. We
need to make sure that we clean up not only the temporary files but
also the catalog entries that point to them, if any. The current code
only blows away temporary tables that are "old" in terms of
transaction IDs, is driven off the catalog entries, and essentially
does DROP TABLE <whatever>. So it can't clean up orphaned temporary
files that don't have any catalog entries associated with them (which
is what we want to fix) but on the other hand whatever it does clean
up is cleaned up completely.

We might be able to do something like this:

1. Scan pg_temp_rels. For each subdirectory found (whose name looks
like a backend ID), if the corresponding backend is not currently
running, add the backend ID to a list of backend IDs needing cleanup.
2. For each backend ID derived in step 1:
2A. Scan the subdirectory and add all the files you find (whose names
are in the right format) to a list of files to be deleted.
2B. Check again whether the backend in question is running. If it is,
don't do anything further for this backend and go on to the next
backend ID (i.e. continue with step 2).
2C. For each file found in step 2A, look for a pg_class entry in the
temp tablespace for that backend ID with a matching relfilenode
number. If one is found, drop the rel. If not, unlink the file if it
still exists.
2D. Attempt to remove the directory, ignoring failures.
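
The per-file decision in the steps above can be sketched as a small pure function. This is only a sketch of the control flow: the enum, the function name, and the three boolean inputs are hypothetical, and the actual directory scan, backend liveness check, and pg_class lookup are assumed to happen elsewhere.

```c
#include <stdbool.h>

/* Hypothetical outcome for one file found in a backend's temp directory. */
typedef enum TempFileAction
{
    TEMP_FILE_KEEP,         /* owning backend is (or became) live: hands off */
    TEMP_FILE_DROP_REL,     /* matching pg_class entry found: drop the rel */
    TEMP_FILE_UNLINK        /* no catalog entry: unlink the stray file */
} TempFileAction;

static TempFileAction
decide_temp_file_action(bool running_at_scan,
                        bool running_at_recheck,
                        bool has_pg_class_entry)
{
    /* Step 1: only backends that are not running are cleanup candidates. */
    if (running_at_scan)
        return TEMP_FILE_KEEP;

    /*
     * Step 2B: recheck after listing the files. If a backend with this ID
     * has started in the meantime, the files may be newly created ones.
     */
    if (running_at_recheck)
        return TEMP_FILE_KEEP;

    /* Step 2C: prefer a full drop when the catalog still knows the rel. */
    return has_pg_class_entry ? TEMP_FILE_DROP_REL : TEMP_FILE_UNLINK;
}
```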

I think step 2B is sufficient to prevent a race condition where we end
up mistaking a newly created file for an orphaned one. Am I right?

One possible problem with this is that it would need to be repeated
for every database/tablespace combination, but maybe that wouldn't be
too expensive. autovacuum already needs to process every database,
but I don't know how you'd decide how often to check for stray temp
files. Certainly you'd want to check after a restart... after that I
get fuzzy.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

#16

Greg Stark

gsstark@mit.edu

over 15 years ago

In reply to: Tom Lane (#10)

Re: including PID or backend ID in relpath of temp rels

On Tue, May 4, 2010 at 5:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

FWIW, that's not the case, any more than it is for blocks in shared
buffer cache for regular rels. smgrextend() results in an observable
extension of the file EOF immediately, whether or not you can see
up-to-date data for those pages.

Now people have often complained about the extra I/O involved in that,
and it'd be nice to have a solution, but it's not clear to me that
fixing it would be harder for temprels than regular rels.

For systems that have it, and filesystems that optimize it, I think
posix_fallocate() handles this case. We can extend files without
actually doing any I/O, but we get the guarantee from the filesystem
that it has the space available and that writing to those blocks won't
fail due to lack of space. Only metadata I/O is triggered, allocating
the blocks and marking them as virtually filled with nulls, and it's
not synced unless there's an fsync, so there's no extra physical I/O.

This should be the case for ext4 but I'm not sure what other
filesystems implement this.
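
As a rough illustration of the call, here is a minimal sketch. posix_fallocate() is the real POSIX interface; extend_file_blocks() and the BLOCK_SIZE value are assumptions made for this example and are not taken from md.c.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>

#define BLOCK_SIZE 8192     /* assumed block size, for illustration only */

/*
 * Hypothetical helper: reserve space for nblocks additional blocks in a
 * file that currently holds current_blocks blocks.
 *
 * Returns 0 on success. On failure it returns an error number directly;
 * posix_fallocate() does not set errno.
 */
static int
extend_file_blocks(int fd, off_t current_blocks, off_t nblocks)
{
    return posix_fallocate(fd,
                           current_blocks * BLOCK_SIZE,
                           nblocks * BLOCK_SIZE);
}
```

After a successful call the file size is at least the end of the requested range, so the new EOF is observable immediately. Whether any data blocks are physically written depends on the filesystem.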

--
greg

#17

Robert Haas

robertmhaas@gmail.com

over 15 years ago

In reply to: Jim Nasby (#14)

Re: including PID or backend ID in relpath of temp rels

On Mon, May 17, 2010 at 2:10 PM, Jim Nasby <decibel@decibel.org> wrote:

Any particular reason not to use directories to help organize things? IE:

base/database_oid/pg_temp_rels/backend_pid/relfilenode

Perhaps relfilenode should be something else.

This seems to have several advantages:

1: It's more organized. If you want to see all the files for a single backend you have just one place to look. Finding everything is still easy via filesystem find.
2: Cleanup becomes easier. When a backend exits, its entire directory goes away. On server start, everything under pg_temp_rels goes away. Unfortunately we still have a race condition with cleaning up if a backend dies and can't run its own cleanup, though I think that any time that happens we're going to restart everything anyway.
3: It separates all the temporary stuff away from real files.

The only downside I see is some extra code to create the backend_pid directory.

I thought this was a good idea when you first proposed it, but on
further review I've changed my mind. There are several places in the
code that rely on checking whether the database directory within any
given tablespace is empty to determine whether that database is using
that tablespace. While I could rewrite all of that logic to do the
right thing, I think it's unnecessary pain.

I talked with Tom Lane about this a little bit at PGCon and opined
that we probably only need to clean out stray temporary files at
startup. So what I'm tempted to do is just write a function that goes
through all tablespace/database combinations and scans each directory
for files with a name like t<digits>_<digits> and blows them away.
This will leave the catalog entries pointing at nothing, but we
already have working code in autovacuum.c to clean up the catalog
entries, and I believe that will work just fine even if the underlying
files have been removed earlier.
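
The name test for such a startup scan might look like the sketch below. The function name is hypothetical, and it matches exactly t<digits>_<digits> with nothing after it; any suffixes the real file names carry would need extra handling that is not shown here.

```c
#include <ctype.h>
#include <stdbool.h>

/*
 * Hypothetical helper: return true if name has the form t<digits>_<digits>,
 * with at least one digit in each group and no trailing characters.
 */
static bool
looks_like_temp_rel_file(const char *name)
{
    const char *p = name;

    if (*p++ != 't')
        return false;

    /* first digit group */
    if (!isdigit((unsigned char) *p))
        return false;
    while (isdigit((unsigned char) *p))
        p++;

    if (*p++ != '_')
        return false;

    /* second digit group */
    if (!isdigit((unsigned char) *p))
        return false;
    while (isdigit((unsigned char) *p))
        p++;

    return *p == '\0';
}
```

A scan over each tablespace/database directory would call this on every directory entry and unlink the matches, leaving the catalog entries for autovacuum to clean up as described above.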

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company