[Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Currently, CREATE DATABASE forces a checkpoint, then copies all the
files, then forces another checkpoint. The comments in the createdb()
function explain the reasons for this. The attached patch fixes this
problem by making CREATE DATABASE completely WAL-logged so that now we
can avoid checkpoints. The patch modifies both CREATE DATABASE and
ALTER DATABASE..SET TABLESPACE to be fully WAL-logged.
One main advantage of this change is that it will be cheaper. Forcing
checkpoints on an idle system is no big deal, but when the system is
under heavy write load, it's very expensive. Another advantage is that
it makes things better for features like TDE, which might want the
pages in the source database to be encrypted using a different key or
nonce than the pages in the target database.
Design Idea:
-----------------
First, create the target database directory along with the version
file and WAL-log this operation. Create the "relation map file" in
the target database and copy the content from the source database. For
this, we can use a modified version of write_relmap_file() and
WAL-log the relmap create operation along with the file content. Now,
read the relmap file to find the relfilenode for pg_class and then we
read pg_class block by block and decode the tuples. For reading the
pg_class blocks, we can use ReadBufferWithoutRelCache() so that we
don't need the relcache. Nothing prevents us from checking visibility
for tuples in another database because CLOG is global to the cluster.
And nothing prevents us from deforming those tuples because the column
definitions for pg_class have to be the same in every database. Then
we can get the relfilenode of every file we need to copy, and prepare
a list of all such relfilenodes. Next, for each relfilenode in the
source database, create a respective relfilenode in the target
database (for all forks) using smgrcreate, which is already a
WAL-logged operation. Now read the source relfilenode block by block
using ReadBufferWithoutRelCache() and copy the block to the target
relfilenode using smgrextend() and WAL-log them using log_newpage().
For the source database, we cannot directly use smgrread(), because
there could be dirty buffers, so we have to read the blocks through
the buffer manager interface; otherwise we would have to flush all the
dirty buffers first.
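The per-relation copy loop described above might look roughly like this
(a C-flavored sketch, not the actual patch; ReadBufferWithoutRelCache(),
log_newpage() and smgrextend() are the existing backend APIs, but exact
signatures, locking, and error handling are glossed over):

```c
/* Sketch of the copy loop for one fork of one relfilenode.
 * Reads go through the buffer manager so dirty buffers are seen;
 * writes bypass it via smgrextend(), with log_newpage() providing
 * WAL coverage as one full-page image per block. */
nblocks = smgrnblocks(src_smgr, forknum);
for (blkno = 0; blkno < nblocks; blkno++)
{
    Buffer  buf;
    Page    page;

    buf = ReadBufferWithoutRelCache(src_rnode, forknum, blkno,
                                    RBM_NORMAL, NULL);
    LockBuffer(buf, BUFFER_LOCK_SHARE);
    page = BufferGetPage(buf);

    /* WAL-log the page contents for the target relfilenode ... */
    log_newpage(&dst_rnode, forknum, blkno, page, true);

    /* ... and write it into the target file. */
    smgrextend(dst_smgr, forknum, blkno, (char *) page, false);

    UnlockReleaseBuffer(buf);
}
```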
WAL sequence using pg_waldump
----------------------------------------------------
1. (new wal to create db dir and write PG_VERSION file)
rmgr: Database desc: CREATE create dir 1663/16394
2. (new wal to create and write relmap file)
rmgr: RelMap desc: CREATE database 16394 tablespace 1663 size 512
3. (create relfilenode)
rmgr: Storage desc: CREATE base/16394/16384
rmgr: Storage desc: CREATE base/16394/2619
4. (write page data)
rmgr: XLOG desc: FPI , blkref #0: rel 1663/16394/2619 blk 0 FPW
rmgr: XLOG desc: FPI , blkref #0: rel 1663/16394/2619 blk 1 FPW
............
5. (create other forks)
rmgr: Storage desc: CREATE base/16394/2619_fsm
rmgr: Storage desc: CREATE base/16394/2619_vm
.............
I have attached a POC patch that demonstrates this idea; with it, all
basic sanity testing and "check-world" pass.
Open points:
-------------------
- This is a POC patch so needs more refactoring/cleanup and testing.
- Might need to revisit the SMGR-level API usage.
Credits:
-----------
Thanks to Robert Haas, for suggesting this idea and the high-level design.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
POC-0001-WAL-logged-CREATE-DATABASE.patch (text/x-patch, +613/-322)
On Tue, Jun 15, 2021 at 04:50:24PM +0530, Dilip Kumar wrote:
Currently, CREATE DATABASE forces a checkpoint, then copies all the
files, then forces another checkpoint. The comments in the createdb()
function explain the reasons for this. The attached patch fixes this
problem by making CREATE DATABASE completely WAL-logged so that now we
can avoid checkpoints. The patch modifies both CREATE DATABASE and
ALTER DATABASE..SET TABLESPACE to be fully WAL-logged.
One main advantage of this change is that it will be cheaper. Forcing
checkpoints on an idle system is no big deal, but when the system is
under heavy write load, it's very expensive. Another advantage is that
it makes things better for features like TDE, which might want the
pages in the source database to be encrypted using a different key or
nonce than the pages in the target database.
I only had a quick look at the patch but AFAICS your patch makes the new
behavior mandatory. Wouldn't it make sense to have a way to use the previous
approach? People wanting to copy a somewhat big database over a slow
replication link may prefer to pay 2 checkpoints rather than stream everything.
Same for people who have an otherwise idle system (I often use that to make
temporary backups and/or prepare multiple datasets, and most of the time the
checkpoint is basically free).
On 15/06/2021 14:20, Dilip Kumar wrote:
Design Idea:
-----------------
First, create the target database directory along with the version
file and WAL-log this operation. Create the "relation map file" in
the target database and copy the content from the source database. For
this, we can use some modified versions of the write_relmap_file() and
WAL-log the relmap create operation along with the file content. Now,
read the relmap file to find the relfilenode for pg_class and then we
read pg_class block by block and decode the tuples. For reading the
pg_class blocks, we can use ReadBufferWithoutRelCache() so that we
don't need the relcache. Nothing prevents us from checking visibility
for tuples in another database because CLOG is global to the cluster.
And nothing prevents us from deforming those tuples because the column
definitions for pg_class have to be the same in every database. Then
we can get the relfilenode of every file we need to copy, and prepare
a list of all such relfilenodes.
I guess that would work, but you could also walk the database directory
like copydir() does. How you find the relations to copy is orthogonal to
whether you WAL-log them or use checkpoints. And whether you use the
buffer cache is also orthogonal to the rest of the proposal; you could
issue FlushDatabaseBuffers() instead of a checkpoint.
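The alternative being suggested here could be sketched as follows
(C-flavored sketch; FlushDatabaseBuffers(), AllocateDir() and ReadDir()
are real backend functions, while the filter and copy helpers are
hypothetical names standing in for the actual logic):

```c
/* Sketch: flush the source database's dirty buffers once, instead of
 * forcing a cluster-wide checkpoint, then drive the copy from the
 * filesystem contents rather than from pg_class. */
FlushDatabaseBuffers(src_dboid);

dir = AllocateDir(src_dbpath);
while ((de = ReadDir(dir, src_dbpath)) != NULL)
{
    if (!looks_like_relfilenode(de->d_name))        /* hypothetical filter */
        continue;
    copy_relfilenode_block_by_block(de->d_name);    /* hypothetical helper */
}
FreeDir(dir);
```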
Next, for each relfilenode in the
source database, create a respective relfilenode in the target
database (for all forks) using smgrcreate, which is already a
WAL-logged operation. Now read the source relfilenode block by block
using ReadBufferWithoutRelCache() and copy the block to the target
relfilenode using smgrextend() and WAL-log them using log_newpage().
For the source database, we cannot directly use smgrread(), because
there could be dirty buffers, so we have to read the blocks through
the buffer manager interface; otherwise we would have to flush all the
dirty buffers first.
Yeah, WAL-logging the contents of the source database would certainly be
less weird than the current system. As Julien also pointed out, the
question is, are there people relying on "CREATE DATABASE foo TEMPLATE
bar" to copy a large source database, on the premise that it's fast
because it skips WAL-logging?
In principle, we could have both mechanisms, and use the new WAL-logged
system if the database is small, and the old system with checkpoints if
it's large. But I don't like the idea of having to maintain both.
- Heikki
On Tue, Jun 15, 2021 at 5:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 15/06/2021 14:20, Dilip Kumar wrote:
Design Idea:
... Then we can get the relfilenode of every file we need to copy, and
prepare a list of all such relfilenodes.
I guess that would work, but you could also walk the database directory
like copydir() does. How you find the relations to copy is orthogonal to
whether you WAL-log them or use checkpoints. And whether you use the
buffer cache is also orthogonal to the rest of the proposal; you could
issue FlushDatabaseBuffers() instead of a checkpoint.
Yeah, that would also work, but I thought since we are already avoiding
the checkpoint, let's also avoid FlushDatabaseBuffers() and directly use
the lower-level buffer manager API, which doesn't need the relcache. And
I am using pg_class to identify the useful relfilenodes so that we can
avoid processing some unwanted relfilenodes, but yeah, I agree that this
is orthogonal to whether we use a checkpoint or not.
Yeah, WAL-logging the contents of the source database would certainly be
less weird than the current system. As Julien also pointed out, the
question is, are there people relying on "CREATE DATABASE foo TEMPLATE
bar" to copy a large source database, on the premise that it's fast
because it skips WAL-logging?
In principle, we could have both mechanisms, and use the new WAL-logged
system if the database is small, and the old system with checkpoints if
it's large. But I don't like the idea of having to maintain both.
Yeah, I agree that in some cases, where we don't have many dirty
buffers, checkpointing can be faster. I think code-wise, maintaining two
approaches will not be very difficult because the old approach just
calls copydir(), but I am thinking about how we can decide which
approach is better in which scenario. I don't think we can take that
call just based on the database size; it would also depend on many other
factors, e.g. how busy your system is and how many total dirty buffers
there are in the cluster, right? Because a checkpoint will affect the
performance of operations going on in other databases in the cluster.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Am I mistaken in thinking that this would allow CREATE DATABASE to run
inside a transaction block now, further reducing the DDL commands that are
non-transactional?
On 6/15/21 8:04 AM, Heikki Linnakangas wrote:
Yeah, WAL-logging the contents of the source database would certainly
be less weird than the current system. As Julien also pointed out, the
question is, are there people relying on "CREATE DATABASE foo TEMPLATE
bar" to copy a large source database, on the premise that it's fast
because it skips WAL-logging?
I'm 100% certain there are. It's not even a niche case.
In principle, we could have both mechanisms, and use the new
WAL-logged system if the database is small, and the old system with
checkpoints if it's large. But I don't like the idea of having to
maintain both.
Rather than use size, I'd be inclined to say use this if the source
database is marked as a template, and use the copydir approach for
anything that isn't.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Tue, Jun 15, 2021 at 9:31 PM Andrew Dunstan <andrew@dunslane.net> wrote:
Rather than use size, I'd be inclined to say use this if the source
database is marked as a template, and use the copydir approach for
anything that isn't.
Looks like a good approach.
At Tue, 15 Jun 2021 22:07:32 +0800, Julien Rouhaud <rjuju123@gmail.com> wrote in
On Tue, Jun 15, 2021 at 9:31 PM Andrew Dunstan <andrew@dunslane.net> wrote:
Rather than use size, I'd be inclined to say use this if the source
database is marked as a template, and use the copydir approach for
anything that isn't.
Looks like a good approach.
If we are willing to maintain the two methods.
Couldn't we just skip the checkpoints if the database is known to be
"clean", which means no page has been loaded for the database since
startup? We can use the "template" mark to reject connections to the
database. (I'm afraid that we would also need to prevent vacuum from
visiting the template databases, but...)
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Jun 16, 2021 at 03:27:21PM +0900, Kyotaro Horiguchi wrote:
If we are willing to maintain the two methods.
Couldn't we just skip the checkpoints if the database is known to be
"clean", which means no page has been loaded for the database since
startup? We can use the "template" mark to reject connections to the
database. (I'm afraid that we would also need to prevent vacuum from
visiting the template databases, but...)
There's already datallowconn for that purpose. Modifying template databases
is a common practice and we shouldn't prevent that.
But a database currently not accepting connections doesn't mean that
there are no dirty buffers and/or pending unlinks, so it doesn't look like
something that could be optimized, at least for the majority of use cases.
On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan <andrew@dunslane.net> wrote:
Rather than use size, I'd be inclined to say use this if the source
database is marked as a template, and use the copydir approach for
anything that isn't.
Yeah, that is possible. On second thought, wouldn't it be good to give
the user control by providing two different commands, e.g. COPY DATABASE
for the existing method (copydir) and CREATE DATABASE for the new method
(fully WAL-logged)?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi,
On 2021-06-15 16:50:24 +0530, Dilip Kumar wrote:
The patch modifies both CREATE DATABASE and ALTER DATABASE..SET
TABLESPACE to be fully WAL-logged.
Generally quite a bit in favor of this - the current approach is very
heavyweight, slow and I think we have a few open corner bugs related to
it.
Design Idea:
-----------------
First, create the target database directory along with the version
file and WAL-log this operation.
What happens if you crash / promote at this point?
Create the "relation map file" in the target database and copy the
content from the source database. For this, we can use some modified
versions of the write_relmap_file() and WAL-log the relmap create
operation along with the file content. Now, read the relmap file to
find the relfilenode for pg_class and then we read pg_class block by
block and decode the tuples.
This doesn't seem like a great approach - you're not going to be able to
use much of the normal infrastructure around processing tuples. So it
seems like it'd end up with quite a bit of special-case code that needs
to be maintained in parallel.
Now read the source relfilenode block by block using
ReadBufferWithoutRelCache() and copy the block to the target
relfilenode using smgrextend() and WAL-log them using log_newpage().
For the source database, we can not directly use the smgrread(),
because there could be some dirty buffers so we will have to read them
through the buffer manager interface, otherwise, we will have to flush
all the dirty buffers.
I think we might need a bit more batching for the WAL logging. There are
cases of template databases considerably bigger than the default, and the
overhead of logging each write separately seems likely to be noticeable.
Greetings,
Andres Freund
Hi,
On 2021-06-15 18:11:23 +0530, Dilip Kumar wrote:
On Tue, Jun 15, 2021 at 5:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 15/06/2021 14:20, Dilip Kumar wrote:
Design Idea:
... Then we can get the relfilenode of every file we need to copy, and
prepare a list of all such relfilenodes.
I guess that would work, but you could also walk the database directory
like copydir() does. How you find the relations to copy is orthogonal to
whether you WAL-log them or use checkpoints. And whether you use the
buffer cache is also orthogonal to the rest of the proposal; you could
issue FlushDatabaseBuffers() instead of a checkpoint.
Yeah, that would also work, but I thought since we are already avoiding
the checkpoint, let's also avoid FlushDatabaseBuffers() and directly use
the lower-level buffer manager API, which doesn't need the relcache. And
I am using pg_class to identify the useful relfilenodes so that we can
avoid processing some unwanted relfilenodes, but yeah, I agree that this
is orthogonal to whether we use a checkpoint or not.
It's not entirely obvious to me that it's important to avoid
FlushDatabaseBuffers() on its own. Forcing a checkpoint is problematic because
it unnecessarily writes out dirty buffers in other databases, triggers FPWs
etc. Normally a database used as a template won't have a meaningful amount of
dirty buffers itself, so the FlushDatabaseBuffers() shouldn't trigger a lot of
writes. Of course, there is the matter of FlushDatabaseBuffers() not being
cheap with a large shared_buffers - but I suspect that's not a huge factor
compared to the rest of the database creation cost.
I think the better argument for going through shared buffers is that it might
be worth doing so for the *target* database. A common use of frequently
creating databases, in particular with a non-default template database, is to
run regression tests with pre-created schema / data - writing out all that data
just to have it then dropped a few seconds later after the regression test
completed is wasteful.
In principle, we could have both mechanisms, and use the new WAL-logged
system if the database is small, and the old system with checkpoints if
it's large. But I don't like the idea of having to maintain both.
Yeah, I agree that in some cases, where we don't have many dirty
buffers, checkpointing can be faster.
I don't think the main issue is the speed of checkpointing itself? The reason
to maintain the old paths is that the "new approach" is bloating WAL volume,
no? Right now cloning a 1TB database costs a few hundred bytes of WAL and about
1TB of write IO. With the proposed approach, the write volume approximately
doubles, because there'll also be about 1TB in WAL.
Greetings,
Andres Freund
On 6/15/21 3:31 PM, Andrew Dunstan wrote:
On 6/15/21 8:04 AM, Heikki Linnakangas wrote:
Yeah, WAL-logging the contents of the source database would certainly
be less weird than the current system. As Julien also pointed out, the
question is, are there people relying on "CREATE DATABASE foo TEMPLATE
bar" to copy a large source database, on the premise that it's fast
because it skips WAL-logging?
I'm 100% certain there are. It's not even a niche case.
In principle, we could have both mechanisms, and use the new
WAL-logged system if the database is small, and the old system with
checkpoints if it's large. But I don't like the idea of having to
maintain both.
Rather than use size, I'd be inclined to say use this if the source
database is marked as a template, and use the copydir approach for
anything that isn't.
I think we should be asking what the benefit of that use case is, and
perhaps try addressing that without having to maintain two entirely
different ways to do CREATE DATABASE. It's not like we're sure the
current code is 100% reliable in various corner cases, and I doubt
having two separate approaches will improve the situation :-/
I can see three reasons why people want to skip the WAL logging:
1) it's faster, because there's no CPU and I/O for building the WAL
I wonder if some optimization / batching could help with (1), as
suggested by Andres elsewhere in this thread.
2) it saves the amount of WAL (could matter with large template
databases and WAL archiving, etc.)
We can't really do much about this - we need to log all the data. But
the batching from (1) might help a bit too, I guess.
3) saves the amount of WAL that needs to be copied to standby, so that
there's no increase of replication lag, etc. particularly when the
network link has limited bandwidth
I think this is a more general issue - some operations may
generate a lot of WAL, and we generally assume it's better to do
that rather than hold exclusive locks for a long time. But maybe we
could have some throttling, to limit the amount of WAL per second,
similar to what we have for plain vacuum.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jun 17, 2021 at 3:28 AM Andres Freund
<andres@anarazel.de> wrote:
Hi,
On 2021-06-15 16:50:24 +0530, Dilip Kumar wrote:
The patch modifies both CREATE DATABASE and ALTER DATABASE..SET
TABLESPACE to be fully WAL-logged.
Generally quite a bit in favor of this - the current approach is very
heavyweight, slow and I think we have a few open corner bugs related to
it.
Great!
Design Idea:
-----------------
First, create the target database directory along with the version
file and WAL-log this operation.
What happens if you crash / promote at this point?
I will check this.
Create the "relation map file" in the target database and copy the
content from the source database. For this, we can use some modified
versions of the write_relmap_file() and WAL-log the relmap create
operation along with the file content. Now, read the relmap file to
find the relfilenode for pg_class and then we read pg_class block by
block and decode the tuples.
This doesn't seem like a great approach - you're not going to be able to
use much of the normal infrastructure around processing tuples. So it
seems like it'd end up with quite a bit of special-case code that needs
to be maintained in parallel.
Yeah, this needs some special-purpose code, but it is not too much
code. I agree that instead of scanning pg_class we can scan all
the tablespaces and under them identify the source database directory,
as we do now, and from there copy each relfilenode block by block
with WAL logging. Honestly, both of these seem like special-purpose
code. Another problem with directly scanning the directory is, how are
we supposed to get the "relpersistence", which is stored in the pg_class
tuple, right?
Now read the source relfilenode block by block using
ReadBufferWithoutRelCache() and copy the block to the target
relfilenode using smgrextend() and WAL-log them using log_newpage().
For the source database, we cannot directly use smgrread(), because
there could be dirty buffers, so we have to read the blocks through
the buffer manager interface; otherwise we would have to flush all the
dirty buffers first.
I think we might need a bit more batching for the WAL logging. There are
cases of template databases considerably bigger than the default, and the
overhead of logging each write separately seems likely to be noticeable.
Yeah, we can do that; instead of using log_newpage() we can use
log_newpages() to log multiple pages at once.
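For instance, the batched loop could look roughly like this (a hedged
C-flavored sketch, not the patch itself; log_newpages() exists in recent
versions and a single WAL record can carry at most XLR_MAX_BLOCK_ID,
i.e. 32, block references, but the surrounding variables and buffer
handling are assumed):

```c
#define COPY_BUF_BATCH 32              /* <= XLR_MAX_BLOCK_ID */

BlockNumber blknos[COPY_BUF_BATCH];
Page        pages[COPY_BUF_BATCH];
Buffer      bufs[COPY_BUF_BATCH];
int         n = 0;

for (blkno = 0; blkno < nblocks; blkno++)
{
    bufs[n] = ReadBufferWithoutRelCache(src_rnode, forknum, blkno,
                                        RBM_NORMAL, NULL);
    blknos[n] = blkno;
    pages[n] = BufferGetPage(bufs[n]);

    if (++n == COPY_BUF_BATCH || blkno == nblocks - 1)
    {
        int i;

        /* One WAL record covering up to 32 full-page images,
         * instead of one record per page. */
        log_newpages(&dst_rnode, forknum, n, blknos, pages, true);

        for (i = 0; i < n; i++)
        {
            smgrextend(dst_smgr, forknum, blknos[i],
                       (char *) pages[i], false);
            ReleaseBuffer(bufs[i]);
        }
        n = 0;
    }
}
```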
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 17, 2021 at 3:43 AM Andres Freund <andres@anarazel.de> wrote:
Yeah, that would also work, but I thought since we are already avoiding
the checkpoint, let's also avoid FlushDatabaseBuffers() and directly use
the lower-level buffer manager API, which doesn't need the relcache. And
I am using pg_class to identify the useful relfilenodes so that we can
avoid processing some unwanted relfilenodes, but yeah, I agree that this
is orthogonal to whether we use a checkpoint or not.
It's not entirely obvious to me that it's important to avoid
FlushDatabaseBuffers() on its own. Forcing a checkpoint is problematic because
it unnecessarily writes out dirty buffers in other databases, triggers FPWs
etc. Normally a database used as a template won't have a meaningful amount of
dirty buffers itself, so the FlushDatabaseBuffers() shouldn't trigger a lot of
writes. Of course, there is the matter of FlushDatabaseBuffers() not being
cheap with a large shared_buffers - but I suspect that's not a huge factor
compared to the rest of the database creation cost.
Okay, so if I think from that POV, then maybe we can just do
FlushDatabaseBuffers() and then directly use smgrread() calls.
I think the better argument for going through shared buffers is that it might
be worth doing so for the *target* database. A common use of frequently
creating databases, in particular with a non-default template database, is to
run regression tests with pre-created schema / data - writing out all that data
just to have it then dropped a few seconds later after the regression test
completed is wasteful.
Okay, I am not sure how common this use case is, but for it, using the
bufmgr for the target database makes sense.
In principle, we could have both mechanisms, and use the new WAL-logged
system if the database is small, and the old system with checkpoints if
it's large. But I don't like the idea of having to maintain both.
Yeah, I agree that in some cases, where we don't have many dirty
buffers, checkpointing can be faster.
I don't think the main issue is the speed of checkpointing itself? The reason
to maintain the old paths is that the "new approach" is bloating WAL volume,
no? Right now cloning a 1TB database costs a few hundred bytes of WAL and about
1TB of write IO. With the proposed approach, the write volume approximately
doubles, because there'll also be about 1TB in WAL.
Make sense.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On 17/06/2021 08:45, Dilip Kumar wrote:
Another problem with directly scanning the directory is, how are we
supposed to get the "relpersistence", which is stored in the pg_class
tuple, right?
You only need relpersistence if you want to use the buffer cache, right?
I think that's a good argument for not using it.
- Heikki
On Thu, Jun 17, 2021 at 2:50 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 17/06/2021 08:45, Dilip Kumar wrote:
Another problem with directly scanning the directory is, how are we
supposed to get the "relpersistence", which is stored in the pg_class
tuple, right?
You only need relpersistence if you want to use the buffer cache, right?
I think that's a good argument for not using it.
Yeah, that is one place; another place I am using it is to decide
whether to WAL-log the new page while writing into the target
relfilenode: if it is an unlogged relation then I am not WAL-logging it.
But now I think that is not the right idea; while creating the database
we should WAL-log all the pages irrespective of whether the table is
logged or unlogged.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 16, 2021 at 6:13 PM Andres Freund <andres@anarazel.de> wrote:
I don't think the main issue is the speed of checkpointing itself? The reason
to maintain the old paths is that the "new approach" is bloating WAL volume,
no? Right now cloning a 1TB database costs a few hundred bytes of WAL and about
1TB of write IO. With the proposed approach, the write volume approximately
doubles, because there'll also be about 1TB in WAL.
This is a good point, but on the other hand, I think this smells a lot
like the wal_level=minimal optimization where we don't need to log
data being bulk-loaded into a table created in the same transaction if
wal_level=minimal. In theory, that optimization has a lot of value,
but in practice it gets a lot of bad press on this list, because (1)
sometimes doing the fsync is more expensive than writing the extra WAL
would have been and (2) most people want to run with
wal_level=replica/logical so it ends up being a code path that isn't
used much and is therefore more likely than average to have bugs
nobody's terribly interested in fixing (except Noah ... thanks Noah!).
If we add features in the future, like TDE or perhaps incremental
backup, that rely on new pages getting new LSNs instead of recycled
ones, this may turn into the same kind of wart. And as with that
optimization, you're probably not even better off unless the database
is pretty big, and you might be worse off if you have to do fsyncs or
flush buffers synchronously. I'm not severely opposed to keeping both
methods around, so if that's really what people want to do, OK, but I
guess I wonder whether we're really going to be happy with that
decision down the road.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Jun 17, 2021 at 5:20 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
You only need relpersistence if you want to use the buffer cache, right?
I think that's a good argument for not using it.
I think the root of the problem with this feature is that it doesn't
go through shared_buffers, so in my opinion, it would be better if we
can make it all go through shared_buffers. It seems like you're
advocating a middle ground where half of the operation goes through
shared_buffers and the other half doesn't, but that sounds like
getting rid of half of the hack when we could have gotten rid of all
of it. I think things that don't go through shared_buffers are bad,
and we should be making an effort to get rid of them where we can
reasonably do so. I believe I've both introduced and fixed my share of
bugs that were caused by such cases, and I think the behavior of the
whole system would be a lot easier to reason about if we had fewer of
those, or none.
I can also think of at least one significant advantage of driving this
off the remote database's pg_class rather than the filesystem
contents. It's a known defect of PostgreSQL that if you create a table
and then crash, you leave behind a dead file that never gets removed.
If you now copy the database that contains that orphaned file, you
would ideally prefer not to copy that file, but if you do a copy based
on the filesystem contents, then you will. If you drive the copy off
of pg_class, you won't.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2021-06-17 13:53:38 -0400, Robert Haas wrote:
On Thu, Jun 17, 2021 at 5:20 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
You only need relpersistence if you want to use the buffer cache, right?
I think that's a good argument for not using it.
Do we really need pg_class to figure this out? Can't we just check if
the relation has an init fork?
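If that's sufficient, the check could be as small as this (hedged
C-flavored sketch; smgropen() and smgrexists() are the real smgr APIs,
the surrounding variables are assumed):

```c
/* Sketch: treat a relation as unlogged if its init fork exists on
 * disk, instead of consulting pg_class.relpersistence. */
SMgrRelation srel = smgropen(src_rnode, InvalidBackendId);
bool         is_unlogged = smgrexists(srel, INIT_FORKNUM);

smgrclose(srel);
```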
I can also think of at least one significant advantage of driving this
off the remote database's pg_class rather than the filesystem
contents. It's a known defect of PostgreSQL that if you create a table
and then crash, you leave behind a dead file that never gets removed.
If you now copy the database that contains that orphaned file, you
would ideally prefer not to copy that file, but if you do a copy based
on the filesystem contents, then you will. If you drive the copy off
of pg_class, you won't.
I'm very unconvinced this is the place to tackle the issue of orphan
relfilenodes. It'd be one thing if it were doable by existing code,
e.g. because we supported cross-database relation accesses fully, but we
don't.
Adding a hacky special case implementation for cross-database relation
accesses that violates all kinds of assumptions (like holding a lock on
a relation when accessing it / pinning pages, processing relcache
invals, ...) doesn't seem like a good plan.
I don't think this is an academic concern: You need to read from shared
buffers to read the "remote" pg_class, otherwise you'll potentially miss
changes. But it's not correct to read in pages or to pin pages without
holding a lock, and there's code that relies on that (see
e.g. InvalidateBuffer()).
Greetings,
Andres Freund