proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Hi all,
currently, CREATE DATABASE forces an immediate checkpoint (actually, it
forces two, but the time interval is usually rather small). For
traditional deployments this is not a big deal, because creating a
database is a rare event, and can be planned for off-peak times.
However, for shared (cloud-like) deployments, this is not the case. E.g.
we're hosting hundreds (or even thousands) of customer databases on some
clusters, and creating a new database is quite common.
This turns the checkpoints into a significant pain point for us, because
it forces a write of all the dirty buffers from all the databases. No
matter how well we tune the spread checkpoints, this makes it
inefficient and causes significant I/O spikes (especially with larger
shared_buffers values). It also leads to a long duration of the CREATE
DATABASE command, making it rather impractical for 'interactive' use (a
user hitting a button in a UI or something).
Based on the talks from pgconf.eu, where I've seen this mentioned in at
least two talks (and in the hallway track), we're not alone. I'd like to
address this, if possible.
The usual workaround for this is "create the databases in advance" but
that's not always possible (e.g. when having more than a handful of
templates, or when the template evolves over time).
After eyeballing the code for an hour or two, I think CREATE DATABASE
should be fine with performing only a 'partial checkpoint' on the
template database - calling FlushDatabaseBuffers and processing unlink
requests, as suggested by the comment in createdb().
It's not exactly a trivial change, but it does not seem frighteningly
difficult coding either.
The templates are usually static, so this would minimize both the CREATE
DATABASE duration and disruption to the cluster it causes.
My fear, however, is that while the code will work, it will break
the recovery in some subtle way (as illustrated by the comments about 8.0
PITR bugs in createdb).
Am I missing something that makes this dead in the water?
regards
Tomas
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 10/26/2014 11:47 PM, Tomas Vondra wrote:
After eyeballing the code for an hour or two, I think CREATE DATABASE
should be fine with performing only a 'partial checkpoint' on the
template database - calling FlushDatabaseBuffers and processing unlink
requests, as suggested by the comment in createdb().
Hmm. You could replace the first checkpoint with that, but I don't think
that's enough for the second. To get any significant performance
benefit, you need to get rid of both checkpoints, because doing two
checkpoints one after another is almost as fast as doing a single
checkpoint; the second checkpoint has very little work to do because the
first checkpoint already flushed out everything.
The second checkpoint, after copying but before commit, is done because
(from the comments in createdb function):
* #1: When PITR is off, we don't XLOG the contents of newly created
* indexes; therefore the drop-and-recreate-whole-directory behavior
* of DBASE_CREATE replay would lose such indexes.
* #2: Since we have to recopy the source database during DBASE_CREATE
* replay, we run the risk of copying changes in it that were
* committed after the original CREATE DATABASE command but before the
* system crash that led to the replay. This is at least unexpected
* and at worst could lead to inconsistencies, eg duplicate table
* names.
Doing only FlushDatabaseBuffers would not prevent these issues - you
need a full checkpoint. These issues are better explained here:
/messages/by-id/28884.1119727671@sss.pgh.pa.us
To solve #1, we could redesign CREATE DATABASE so that replaying the
DBASE_CREATE record doesn't zap the old directory, and also doesn't copy
any files. We could instead just assume that if the transaction commits,
all the files have been copied and fsync'd already, like we assume that
if a CREATE INDEX commits in wal_level=minimal, the underlying file was
fsync'd before the commit.
That would also solve #2, when doing crash recovery. But it would remain
when doing archive recovery. I guess we could still redo the copy when
in archive recovery mode. I believe it would be the first time we have a
WAL record that's replayed differently in crash recovery than in archive
recovery, so here be dragons...
It's not exactly trivial change, but it does not seem frighteningly
difficult coding either.
The templates are usually static, so this would minimize both the CREATE
DATABASE duration and disruption to the cluster it causes.
I wonder if we should bite the bullet and start WAL-logging all the
files that are copied from the template database to the new database.
When the template database is small (template0 is 6.4MB currently), that
wouldn't generate too much WAL. We could perhaps do that only if the
template database is small, and do the checkpoints otherwise, although I
wouldn't like to have subtly different behavior depending on database
size like that.
- Heikki
To solve #1, we could redesign CREATE DATABASE so that replaying the
DBASE_CREATE record doesn't zap the old directory, and also doesn't copy
any files. We could instead just assume that if the transaction commits,
all the files have been copied and fsync'd already, like we assume that if
a CREATE INDEX commits in wal_level=minimal, the underlying file was
fsync'd before the commit.
Do you mean that during a recovery, we just let the database directory be
and assume that it is in good shape since the transaction committed
originally?
I wonder if we should bite the bullet and start WAL-logging all the files
that are copied from the template database to the new database. When the
template database is small (template0 is 6.4MB currently), that wouldn't
generate too much WAL. We could perhaps do that only if the template
database is small, and do the checkpoints otherwise, although I wouldn't
like to have subtly different behavior depending on database size like that.
For the sort of workload Tomas described above (creating a lot of databases
on the fly), we may end up with a lot of WAL eventually if we do this.
Regards,
Atri
On 10/27/2014 01:06 PM, Atri Sharma wrote:
To solve #1, we could redesign CREATE DATABASE so that replaying the
DBASE_CREATE record doesn't zap the old directory, and also doesn't copy
any files. We could instead just assume that if the transaction commits,
all the files have been copied and fsync'd already, like we assume that if
a CREATE INDEX commits in wal_level=minimal, the underlying file was
fsync'd before the commit.

Do you mean that during a recovery, we just let the database directory be
and assume that it is in good shape since the transaction committed
originally?
Right.
I wonder if we should bite the bullet and start WAL-logging all the files
that are copied from the template database to the new database. When the
template database is small (template0 is 6.4MB currently), that wouldn't
generate too much WAL. We could perhaps do that only if the template
database is small, and do the checkpoints otherwise, although I wouldn't
like to have subtly different behavior depending on database size like
that.

For the sort of workload Tomas described above (creating a lot of databases
on the fly), we may end up with a lot of WAL eventually if we do this.
I think writing 6 MB for each CREATE DATABASE would be OK. For
comparison, a pg_switch_xlog() call will waste on average half a
segment, i.e. 8MB. It's a lot cheaper than a checkpoint on a busy
system, for sure. But of course, if the template database is larger,
then you will generate more WAL.
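For reference, the arithmetic behind that comparison (assuming the default 16 MB WAL segment size):

```python
# Default WAL segment size: 16 MB; pg_switch_xlog wastes on average half
# of a segment. Compare that to WAL-logging a copy of template0 (~6.4 MB).
wal_segment = 16 * 1024 * 1024
avg_switch_waste = wal_segment // 2       # ~8 MB wasted per forced switch
template0_size = int(6.4 * 1024 * 1024)   # approximate size of template0

# WAL-logging a template0 copy writes less than an average segment switch wastes.
assert template0_size < avg_switch_waste
```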
- Heikki
On Mon, Oct 27, 2014 at 4:44 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 10/27/2014 01:06 PM, Atri Sharma wrote:
To solve #1, we could redesign CREATE DATABASE so that replaying the
DBASE_CREATE record doesn't zap the old directory, and also doesn't copy
any files. We could instead just assume that if the transaction commits,
all the files have been copied and fsync'd already, like we assume that
if
a CREATE INDEX commits in wal_level=minimal, the underlying file was
fsync'd before the commit.

Do you mean that during a recovery, we just let the database directory be
and assume that it is in good shape since the transaction committed
originally?

Right.
It does make sense. However, with the checkpoint after creating the files
gone, the window between the creation of the files and the actual commit
grows, increasing the possibility of a crash during that period causing
an orphaned database. That said, my understanding of the consequences of
removing the checkpoint might be incorrect, so my fears might be wrong.
Regards,
Atri
On 27 October 2014, 10:47, Heikki Linnakangas wrote:
On 10/26/2014 11:47 PM, Tomas Vondra wrote:
After eyeballing the code for an hour or two, I think CREATE DATABASE
should be fine with performing only a 'partial checkpoint' on the
template database - calling FlushDatabaseBuffers and processing unlink
requests, as suggested by the comment in createdb().

Hmm. You could replace the first checkpoint with that, but I don't think
that's enough for the second. To get any significant performance
benefit, you need to get rid of both checkpoints, because doing two
checkpoints one after another is almost as fast as doing a single
checkpoint; the second checkpoint has very little work to do because the
first checkpoint already flushed out everything.
Yes, that's why I wrote that two checkpoints not too far away are
effectively a single checkpoint. OTOH if the template database is not
small, it may take a while to copy the data, increasing the distance
between the checkpoints.
While our template databases are small (in the order of ~10-100MB), there
are probably people using this to clone much larger databases.
The second checkpoint, after copying but before commit, is done because
(from the comments in createdb function):
* #1: When PITR is off, we don't XLOG the contents of newly created
* indexes; therefore the drop-and-recreate-whole-directory behavior
* of DBASE_CREATE replay would lose such indexes.
* #2: Since we have to recopy the source database during DBASE_CREATE
* replay, we run the risk of copying changes in it that were
* committed after the original CREATE DATABASE command but before the
* system crash that led to the replay. This is at least unexpected
* and at worst could lead to inconsistencies, eg duplicate table
* names.

Doing only FlushDatabaseBuffers would not prevent these issues - you
need a full checkpoint. These issues are better explained here:
/messages/by-id/28884.1119727671@sss.pgh.pa.us

To solve #1, we could redesign CREATE DATABASE so that replaying the
DBASE_CREATE record doesn't zap the old directory, and also doesn't copy
any files. We could instead just assume that if the transaction commits,
all the files have been copied and fsync'd already, like we assume that
if a CREATE INDEX commits in wal_level=minimal, the underlying file was
fsync'd before the commit.

That would also solve #2, when doing crash recovery. But it would remain
when doing archive recovery. I guess we could still redo the copy when
in archive recovery mode. I believe it would be the first time we have a
WAL record that's replayed differently in crash recovery than in archive
recovery, so here be dragons...
Yeah ...
It's not exactly trivial change, but it does not seem frighteningly
difficult coding either.
The templates are usually static, so this would minimize both the CREATE
DATABASE duration and disruption to the cluster it causes.

I wonder if we should bite the bullet and start WAL-logging all the
files that are copied from the template database to the new database.
When the template database is small (template0 is 6.4MB currently), that
wouldn't generate too much WAL. We could perhaps do that only if the
template database is small, and do the checkpoints otherwise, although I
wouldn't like to have subtly different behavior depending on database
size like that.
IMHO writing all the data into the WAL would be the cleanest solution.
Also, what is a small database? I don't think a static value will work,
because the sweet spot between the current approach (forcing two
checkpoints) and writing everything in WAL depends on the amount of dirty
buffers that need to be checkpointed. Which is mostly driven by the size
of shared buffers and write activity - for small shared buffers and/or
mostly-read workload, checkpoints are cheap, so the 'small database'
threshold (when choosing the WAL approach) is much higher than for large
shared buffers or write-heavy workloads.
So maybe if we could determine the amount of data to be checkpointed, and
then base the decision on that, that'd work better? This would also have
to take into account that writing into WAL is sequential, while
checkpoints usually cause random writes all over the datafiles (which is
more expensive).
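The heuristic sketched above could be prototyped along these lines (a toy cost model; the cost constants, function name, and thresholds are all made-up assumptions, and a real implementation would have to measure the dirty fraction of shared_buffers):

```python
# Toy decision heuristic (illustrative only): compare the estimated cost of
# a full immediate checkpoint against WAL-logging the whole template copy.

RANDOM_WRITE_COST = 4.0   # assumed relative cost of a random datafile write
SEQ_WRITE_COST = 1.0      # assumed relative cost of a sequential WAL write

def prefer_wal_copy(dirty_buffer_bytes: int, template_bytes: int) -> bool:
    """True if WAL-logging the copy looks cheaper than a full checkpoint."""
    checkpoint_cost = dirty_buffer_bytes * RANDOM_WRITE_COST
    # The WAL approach writes the template twice: once to WAL, once to datafiles.
    wal_cost = 2 * template_bytes * SEQ_WRITE_COST
    return wal_cost < checkpoint_cost

# Busy cluster, small template: WAL-logging wins.
busy = prefer_wal_copy(dirty_buffer_bytes=2 * 1024**3, template_bytes=6 * 1024**2)
# Mostly-read cluster, huge template: the checkpoint is cheaper.
idle = prefer_wal_copy(dirty_buffer_bytes=16 * 1024**2, template_bytes=10 * 1024**3)
```

This captures the point that the "small database" threshold moves with write activity rather than being a fixed size.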
Another option might be forcing just a "spread" checkpoint, not the
immediate one (which is what we do now). That would not fix the CREATE
DATABASE duration (actually, it would make it longer), but it would lower
the impact on other activity on the machine.
regards
Tomas
IMHO writing all the data into a WAL would be the cleanest solution.
Also, what is a small database? I don't think a static value will work,
because the sweet spot between the current approach (forcing two
checkpoints) and writing everything in WAL depends on the amount of dirty
buffers that need to be checkpointed. Which is mostly driven by the size
of shared buffers and write activity - for small shared buffers and/or
mostly-read workload, checkpoints are cheap, so the 'small database'
threshold (when choosing the WAL approach) is much higher than for large
shared buffers or write-heavy workloads.
So are you proposing having a heuristic based on the amount of data in
shared buffers and write activity? Do you have something in mind that works
for general workloads as well?
So maybe if we could determine the amount of data to be checkpointed, and
then base the decision on that, that'd work better? This would also have
to take into account that writing into WAL is sequential, while
checkpoints usually cause random writes all over the datafiles (which is
more expensive).

Another option might be forcing just a "spread" checkpoint, not the
immediate one (which is what we do now). That would not fix the CREATE
DATABASE duration (actually, it would make it longer), but it would lower
the impact on other activity on the machine.
I believe this to be the cleanest way to reduce the amount of I/O
generated. If I understand correctly, the original problem you mentioned
was not the time CREATE DATABASE is taking but rather the amount of I/O
each one is generating.
This also leads me to wonder whether it makes sense to explore group
commits around the creation of files for a new database (within the same
backend, of course). This might be opt-in, if the user knows he/she is
going to create a lot of databases in the near future and is fine with a
large I/O spike in one go. Again, it might be even more broken than the
current scenario, but it depends on what the user wants...
Regards,
Atri
--
Regards,
Atri
*l'apprenant*
On 27 October 2014, 13:50, Atri Sharma wrote:
IMHO writing all the data into a WAL would be the cleanest solution.
Also, what is a small database? I don't think a static value will work,
because the sweet spot between the current approach (forcing two
checkpoints) and writing everything in WAL depends on the amount of
dirty
buffers that need to be checkpointed. Which is mostly driven by the size
of shared buffers and write activity - for small shared buffers and/or
mostly-read workload, checkpoints are cheap, so the 'small database'
threshold (when choosing the WAL approach) is much higher than for large
shared buffers or write-heavy workloads.

So are you proposing having a heuristic based on the amount of data in
shared buffers and write activity? Do you have something in mind that
works
for general workloads as well?

So maybe if we could determine the amount of data to be checkpointed,
and
then base the decision on that, that'd work better? This would also have
to take into account that writing into WAL is sequential, while
checkpoints usually cause random writes all over the datafiles (which is
more expensive).

Another option might be forcing just a "spread" checkpoint, not the
immediate one (which is what we do now). That would not fix the CREATE
DATABASE duration (actually, it would make it longer), but it would
lower
the impact on other activity on the machine.

I believe this to be the cleanest way to reduce the amount of I/O
generated. If I understand correctly, the original problem you mentioned
was not the time CREATE DATABASE is taking but rather the amount of I/O
each one is generating.
Not exactly. There are two related issues, both caused by the I/O activity
from the CHECKPOINT.
(a) I/O spike, because of checkpointing everything
(b) long CREATE DATABASE durations
This "spread CREATE DATABASE" only fixes (a), and makes (b) a bit worse.
It does, however, make the 'create databases in advance' workaround a bit
more flexible, because it allows more frequent runs of the cron job that
actually creates them. Right now we're forced to schedule it rather rarely
(say, once a day) to minimize the impact, which has downsides (higher
probability of running out of spare databases, slow response to an updated
template, etc.).
If we can fix both (a) and (b), that'd be great. If we can fix only (a),
so be it.
This also leads me to think if it makes sense to explore group commits
around the creation of files for a new database (for a same backend, of
course). This might be on call, if the user knows he/she is going to
create
a lot of databases in the near future and is fine with a large spike in
I/O
at one go. Again, might be even more broken than the current scenario, but
depends on what the user wants...
That seems much more complex than what I proposed to do, but maybe I'm
wrong. If we could get rid of the "checkpoint everything" altogether, we
don't really need the group commit, no?
regards
Tomas
On 27 October 2014, 10:47, Heikki Linnakangas wrote:
On 10/26/2014 11:47 PM, Tomas Vondra wrote:
After eyeballing the code for an hour or two, I think CREATE DATABASE
should be fine with performing only a 'partial checkpoint' on the
template database - calling FlushDatabaseBuffers and processing unlink
requests, as suggested by the comment in createdb().

Hmm. You could replace the first checkpoint with that, but I don't think
that's enough for the second. To get any significant performance
benefit, you need to get rid of both checkpoints, because doing two
checkpoints one after another is almost as fast as doing a single
checkpoint; the second checkpoint has very little work to do because the
first checkpoint already flushed out everything.

The second checkpoint, after copying but before commit, is done because
(from the comments in createdb function):
* #1: When PITR is off, we don't XLOG the contents of newly created
* indexes; therefore the drop-and-recreate-whole-directory behavior
* of DBASE_CREATE replay would lose such indexes.
* #2: Since we have to recopy the source database during DBASE_CREATE
* replay, we run the risk of copying changes in it that were
* committed after the original CREATE DATABASE command but before the
* system crash that led to the replay. This is at least unexpected
* and at worst could lead to inconsistencies, eg duplicate table
* names.

Doing only FlushDatabaseBuffers would not prevent these issues - you
need a full checkpoint. These issues are better explained here:
/messages/by-id/28884.1119727671@sss.pgh.pa.us
Thinking about this a bit more, do we really need a full checkpoint, that
is, a checkpoint of all the databases in the cluster? Why is checkpointing
the source database not enough?
I mean, when we use database A as a template, why do we need to checkpoint
B, C, D and F too? (Apologies if this is somehow obvious, I'm way out of
my comfort zone in this part of the code.)
regards
Tomas
On 10/27/2014 03:21 PM, Tomas Vondra wrote:
On 27 October 2014, 10:47, Heikki Linnakangas wrote:
On 10/26/2014 11:47 PM, Tomas Vondra wrote:
After eyeballing the code for an hour or two, I think CREATE DATABASE
should be fine with performing only a 'partial checkpoint' on the
template database - calling FlushDatabaseBuffers and processing unlink
requests, as suggested by the comment in createdb().

Hmm. You could replace the first checkpoint with that, but I don't think
that's enough for the second. To get any significant performance
benefit, you need to get rid of both checkpoints, because doing two
checkpoints one after another is almost as fast as doing a single
checkpoint; the second checkpoint has very little work to do because the
first checkpoint already flushed out everything.

The second checkpoint, after copying but before commit, is done because
(from the comments in createdb function):
* #1: When PITR is off, we don't XLOG the contents of newly created
* indexes; therefore the drop-and-recreate-whole-directory behavior
* of DBASE_CREATE replay would lose such indexes.
* #2: Since we have to recopy the source database during DBASE_CREATE
* replay, we run the risk of copying changes in it that were
* committed after the original CREATE DATABASE command but before the
* system crash that led to the replay. This is at least unexpected
* and at worst could lead to inconsistencies, eg duplicate table
* names.

Doing only FlushDatabaseBuffers would not prevent these issues - you
need a full checkpoint. These issues are better explained here:
/messages/by-id/28884.1119727671@sss.pgh.pa.us

Thinking about this a bit more, do we really need a full checkpoint? That
is a checkpoint of all the databases in the cluster? Why checkpointing the
source database is not enough?

I mean, when we use database A as a template, why do we need to checkpoint
B, C, D and F too? (Apologies if this is somehow obvious, I'm way out of
my comfort zone in this part of the code.)
A full checkpoint ensures that you always begin recovery *after* the
DBASE_CREATE record. I.e. you never replay a DBASE_CREATE record during
crash recovery (except when you crash before the transaction commits, in
which case it doesn't matter if the new database's directory is borked).
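The invariant described here can be stated concretely with a simplified LSN model (illustrative only, not actual PostgreSQL code): crash recovery replays only records at or after the redo pointer of the last completed checkpoint, so a checkpoint taken after the DBASE_CREATE record guarantees the record is never replayed.

```python
# Simplified LSN model (illustrative, not PostgreSQL code): crash recovery
# starts replaying WAL at the redo pointer of the last completed checkpoint.

def replayed_records(records, redo_lsn):
    """Records with LSN >= redo pointer are replayed after a crash."""
    return [name for lsn, name in records if lsn >= redo_lsn]

wal = [(100, "DBASE_CREATE"), (200, "CHECKPOINT"), (300, "COMMIT")]

# With the forced checkpoint, the redo pointer is past DBASE_CREATE,
# so the directory copy is never redone during crash recovery.
assert "DBASE_CREATE" not in replayed_records(wal, redo_lsn=200)

# Without it, recovery starts earlier and replays the record, zapping
# and recopying the directory - the hazard being discussed.
assert "DBASE_CREATE" in replayed_records(wal, redo_lsn=50)
```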
- Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
On 10/27/2014 03:21 PM, Tomas Vondra wrote:
Thinking about this a bit more, do we really need a full checkpoint? That
is a checkpoint of all the databases in the cluster? Why checkpointing the
source database is not enough?
A full checkpoint ensures that you always begin recovery *after* the
DBASE_CREATE record. I.e. you never replay a DBASE_CREATE record during
crash recovery (except when you crash before the transaction commits, in
which case it doesn't matter if the new database's directory is borked).
Yeah. After re-reading the 2005 thread, I wonder if we shouldn't just
bite the bullet and redesign CREATE DATABASE as you suggest, ie, WAL-log
all the copied files instead of doing a "cp -r"-equivalent directory copy.
That would fix a number of existing replay hazards as well as making it
safe to do what Tomas wants. In the small scale this would cause more I/O
(2 copies of the template database's data) but in production situations
we might well come out ahead by avoiding a forced checkpoint of the rest
of the cluster. Also I guess we could skip WAL-logging if WAL archiving
is off, similarly to the existing optimization for CREATE INDEX etc.
regards, tom lane
On 2014-10-27 09:46:41 -0400, Tom Lane wrote:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
On 10/27/2014 03:21 PM, Tomas Vondra wrote:
Thinking about this a bit more, do we really need a full checkpoint? That
is a checkpoint of all the databases in the cluster? Why checkpointing the
source database is not enough?

A full checkpoint ensures that you always begin recovery *after* the
DBASE_CREATE record. I.e. you never replay a DBASE_CREATE record during
crash recovery (except when you crash before the transaction commits, in
which case it doesn't matter if the new database's directory is borked).

Yeah. After re-reading the 2005 thread, I wonder if we shouldn't just
bite the bullet and redesign CREATE DATABASE as you suggest, ie, WAL-log
all the copied files instead of doing a "cp -r"-equivalent directory copy.
That would fix a number of existing replay hazards as well as making it
safe to do what Tomas wants. In the small scale this would cause more I/O
(2 copies of the template database's data) but in production situations
we might well come out ahead by avoiding a forced checkpoint of the rest
of the cluster. Also I guess we could skip WAL-logging if WAL archiving
is off, similarly to the existing optimization for CREATE INDEX etc.
+1.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 10/27/2014 03:46 PM, Tom Lane wrote:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
On 10/27/2014 03:21 PM, Tomas Vondra wrote:
Thinking about this a bit more, do we really need a full checkpoint? That
is a checkpoint of all the databases in the cluster? Why checkpointing the
source database is not enough?

A full checkpoint ensures that you always begin recovery *after* the
DBASE_CREATE record. I.e. you never replay a DBASE_CREATE record during
crash recovery (except when you crash before the transaction commits, in
which case it doesn't matter if the new database's directory is borked).

Yeah. After re-reading the 2005 thread, I wonder if we shouldn't just
bite the bullet and redesign CREATE DATABASE as you suggest, ie, WAL-log
all the copied files instead of doing a "cp -r"-equivalent directory copy.
That would fix a number of existing replay hazards as well as making it
safe to do what Tomas wants. In the small scale this would cause more I/O
(2 copies of the template database's data) but in production situations
we might well come out ahead by avoiding a forced checkpoint of the rest
of the cluster. Also I guess we could skip WAL-logging if WAL archiving
is off, similarly to the existing optimization for CREATE INDEX etc.
That would be a nasty surprise for anyone who's using CREATE DATABASE as
a fast way to clone a large database. But I would be OK with that, at
least if we can skip the WAL-logging with wal_level=minimal.
- Heikki
On 27.10.2014 17:24, Heikki Linnakangas wrote:
On 10/27/2014 03:46 PM, Tom Lane wrote:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
On 10/27/2014 03:21 PM, Tomas Vondra wrote:
Thinking about this a bit more, do we really need a full checkpoint?
That
is a checkpoint of all the databases in the cluster? Why
checkpointing the
source database is not enough?

A full checkpoint ensures that you always begin recovery *after* the
DBASE_CREATE record. I.e. you never replay a DBASE_CREATE record during
crash recovery (except when you crash before the transaction commits, in
which case it doesn't matter if the new database's directory is borked).

Yeah. After re-reading the 2005 thread, I wonder if we shouldn't just
bite the bullet and redesign CREATE DATABASE as you suggest, ie, WAL-log
all the copied files instead of doing a "cp -r"-equivalent directory
copy.
That would fix a number of existing replay hazards as well as making it
safe to do what Tomas wants. In the small scale this would cause more
I/O
(2 copies of the template database's data) but in production situations
we might well come out ahead by avoiding a forced checkpoint of the rest
of the cluster. Also I guess we could skip WAL-logging if WAL archiving
is off, similarly to the existing optimization for CREATE INDEX etc.

That would be a nasty surprise for anyone who's using CREATE DATABASE
as a fast way to clone a large database. But I would be OK with that,
at least if we can skip the WAL-logging with wal_level=minimal.
That's true. Sadly, I can't think of a solution that would address both
use cases at the same time :-(
The only thing I can think of is having two CREATE DATABASE "flavors".
One keeping the current approach (suitable for fast cloning) and one
with the WAL logging (minimizing both the CREATE DATABASE duration and
the impact on other backends).
It will probably make the code significantly more complex, which is not
exactly desirable, I guess. Also, if we keep the current code (even if
only as a special case) it won't eliminate the existing replay hazards
(which was one of Tom's arguments for biting the bullet).
I'm also thinking that for wal_level=archive and large databases, this
won't really eliminate the checkpoint as it will likely generate enough
WAL to hit checkpoint_segments and trigger a checkpoint anyway. No?
That being said, our CREATE DATABASE docs currently say this
Although it is possible to copy a database other than template1 by
specifying its name as the template, this is not (yet) intended as
a general-purpose "COPY DATABASE" facility. The principal
limitation is that no other sessions can be connected to the
template database while it is being copied. CREATE DATABASE will
fail if any other connection exists when it starts; otherwise, new
connections to the template database are locked out until CREATE
DATABASE completes. See Section 21.3 for more information.
I think that this limitation pretty much means no one should use CREATE
DATABASE for cloning live databases in a production environment (because
of the locking).
It also seems to me the "general-purpose COPY DATABASE" described in the
docs is what we're describing in this thread.
regards
Tomas
Tomas Vondra <tv@fuzzy.cz> writes:
That being said, our CREATE DATABASE docs currently say this
Although it is possible to copy a database other than template1 by
specifying its name as the template, this is not (yet) intended as
a general-purpose "COPY DATABASE" facility. The principal
limitation is that no other sessions can be connected to the
template database while it is being copied. CREATE DATABASE will
fail if any other connection exists when it starts; otherwise, new
connections to the template database are locked out until CREATE
DATABASE completes. See Section 21.3 for more information.
I think that this limitation pretty much means no one should use CREATE
DATABASE for cloning live databases in a production environment (because
of the locking).
Good point. But the other side of that coin is that if somebody *was*
doing this, the locking would mean that slowing it down would be even
more painful than you might think. Still, personally I'm willing to
accept that downside, given that we've pretty much always had the above
caveat in the docs.
It also seems to me the "general-purpose COPY DATABASE" described in the
docs is what we're describing in this thread.
Well, no, because the restriction "nobody can be connected to the source
database" would still apply; relaxing that would be enormously more
complicated, and probably fragile, than what we're talking about here.
regards, tom lane
On 10/27/2014 05:58 PM, Tomas Vondra wrote:
On 27.10.2014 17:24, Heikki Linnakangas wrote:
On 10/27/2014 03:46 PM, Tom Lane wrote:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
On 10/27/2014 03:21 PM, Tomas Vondra wrote:
Thinking about this a bit more, do we really need a full checkpoint? That
is, a checkpoint of all the databases in the cluster? Why is checkpointing
the source database not enough?

A full checkpoint ensures that you always begin recovery *after* the
DBASE_CREATE record. I.e. you never replay a DBASE_CREATE record during
crash recovery (except when you crash before the transaction commits, in
which case it doesn't matter if the new database's directory is borked).

Yeah. After re-reading the 2005 thread, I wonder if we shouldn't just
bite the bullet and redesign CREATE DATABASE as you suggest, ie, WAL-log
all the copied files instead of doing a "cp -r"-equivalent directory
copy.

That would fix a number of existing replay hazards as well as making it
safe to do what Tomas wants. In the small scale this would cause more I/O
(2 copies of the template database's data) but in production situations
we might well come out ahead by avoiding a forced checkpoint of the rest
of the cluster. Also I guess we could skip WAL-logging if WAL archiving
is off, similarly to the existing optimization for CREATE INDEX etc.

That would be a nasty surprise for anyone who's using CREATE DATABASE
as a fast way to clone a large database. But I would be OK with that,
at least if we can skip the WAL-logging with wal_level=minimal.

That's true. Sadly, I can't think of a solution that would address both
use cases at the same time :-(

The only thing I can think of is having two CREATE DATABASE "flavors".
One keeping the current approach (suitable for fast cloning) and one
with the WAL logging (minimizing the CREATE DATABASE duration and the
impact on other backends).

It will probably make the code significantly more complex, which is not
exactly desirable, I guess. Also, if we keep the current code (even if
only as a special case) it won't eliminate the existing replay hazards
(which was one of Tom's arguments for biting the bullet).

I'm also thinking that for wal_level=archive and large databases, this
won't really eliminate the checkpoint, as it will likely generate enough
WAL to hit checkpoint_segments and trigger a checkpoint anyway. No?

That being said, our CREATE DATABASE docs currently say this

Although it is possible to copy a database other than template1 by
specifying its name as the template, this is not (yet) intended as
a general-purpose "COPY DATABASE" facility. The principal
limitation is that no other sessions can be connected to the
template database while it is being copied. CREATE DATABASE will
fail if any other connection exists when it starts; otherwise, new
connections to the template database are locked out until CREATE
DATABASE completes. See Section 21.3 for more information.

I think that this limitation pretty much means no one should use CREATE
DATABASE for cloning live databases in a production environment (because
of the locking).

It also seems to me the "general-purpose COPY DATABASE" described in the
docs is what we're describing in this thread.

Notwithstanding what the docs say, I have seen CREATE DATABASE used
plenty of times, and quite effectively, to clone databases. I don't
think making it do twice the IO in the general case is going to go down
well.
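For illustration only, the two strategies being weighed above (directory copy vs. WAL-logging every block, with the wal_level=minimal skip Tom mentions) could be sketched roughly like this. This is not PostgreSQL code; the function and the "wal" list standing in for XLogInsert are made up for the sketch:

```python
# Loose sketch (NOT PostgreSQL code) of the proposed copy strategy:
# WAL-log each copied block unless wal_level is 'minimal', mirroring
# the existing CREATE INDEX optimization. All names are illustrative.

BLOCK_SIZE = 8192

def copy_database(src_blocks, wal_level):
    """Copy a template's blocks, WAL-logging them unless skipped."""
    wal = []          # stands in for records written via XLogInsert
    dst_blocks = []
    for block in src_blocks:
        dst_blocks.append(block)      # write into the new database
        if wal_level != 'minimal':
            wal.append(block)         # and log it so replay can rebuild it
    # With wal_level='minimal' the new files must be fsync'ed directly
    # instead; otherwise crash recovery rebuilds them from WAL alone.
    return dst_blocks, wal

blocks = [b'x' * BLOCK_SIZE] * 4
_, wal = copy_database(blocks, 'archive')
print(len(wal))    # every block logged, i.e. 2x the write volume
_, wal = copy_database(blocks, 'minimal')
print(len(wal))    # no WAL; the fast-clone path
```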
cheers
andrew
On 2014-10-27 18:57:27 -0400, Andrew Dunstan wrote:
On 10/27/2014 05:58 PM, Tomas Vondra wrote:
On 27.10.2014 17:24, Heikki Linnakangas wrote:
I'm also thinking that for wal_level=archive and large databases, this
won't really eliminate the checkpoint, as it will likely generate enough
WAL to hit checkpoint_segments and trigger a checkpoint anyway. No?

That being said, our CREATE DATABASE docs currently say this

Although it is possible to copy a database other than template1 by
specifying its name as the template, this is not (yet) intended as
a general-purpose "COPY DATABASE" facility. The principal
limitation is that no other sessions can be connected to the
template database while it is being copied. CREATE DATABASE will
fail if any other connection exists when it starts; otherwise, new
connections to the template database are locked out until CREATE
DATABASE completes. See Section 21.3 for more information.

I think that this limitation pretty much means no one should use CREATE
DATABASE for cloning live databases in a production environment (because
of the locking).

It also seems to me the "general-purpose COPY DATABASE" described in the
docs is what we're describing in this thread.

Notwithstanding what the docs say, I have seen CREATE DATABASE used
plenty of times, and quite effectively, to clone databases. I don't
think making it do twice the IO in the general case is going to go down
well.

I think they're actually more likely to be happy that we wouldn't need
to do an immediate checkpoint anymore. The performance penalty from that
is likely to be much more severe than the actual IO.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 10/27/2014 07:01 PM, Andres Freund wrote:
On 2014-10-27 18:57:27 -0400, Andrew Dunstan wrote:
On 10/27/2014 05:58 PM, Tomas Vondra wrote:
On 27.10.2014 17:24, Heikki Linnakangas wrote:
I'm also thinking that for wal_level=archive and large databases, this
won't really eliminate the checkpoint, as it will likely generate enough
WAL to hit checkpoint_segments and trigger a checkpoint anyway. No?

That being said, our CREATE DATABASE docs currently say this

Although it is possible to copy a database other than template1 by
specifying its name as the template, this is not (yet) intended as
a general-purpose "COPY DATABASE" facility. The principal
limitation is that no other sessions can be connected to the
template database while it is being copied. CREATE DATABASE will
fail if any other connection exists when it starts; otherwise, new
connections to the template database are locked out until CREATE
DATABASE completes. See Section 21.3 for more information.

I think that this limitation pretty much means no one should use CREATE
DATABASE for cloning live databases in a production environment (because
of the locking).

It also seems to me the "general-purpose COPY DATABASE" described in the
docs is what we're describing in this thread.

Notwithstanding what the docs say, I have seen CREATE DATABASE used
plenty of times, and quite effectively, to clone databases. I don't
think making it do twice the IO in the general case is going to go down
well.

I think they're actually more likely to be happy that we wouldn't need
to do an immediate checkpoint anymore. The performance penalty from that
is likely to be much more severe than the actual IO.
At the very least that needs to be benchmarked.
cheers
andrew
Andres Freund <andres@2ndquadrant.com> writes:
On 2014-10-27 18:57:27 -0400, Andrew Dunstan wrote:
Notwithstanding what the docs say, I have seen CREATE DATABASE used plenty
of times, and quite effectively, to clone databases. I don't think making it
do twice the IO in the general case is going to go down well.
I think they're actually more likely to be happy that we wouldn't need
to do an immediate checkpoint anymore. The performance penalty from that
is likely to be much more severe than the actual IO.
Note that currently, CREATE DATABASE requires fsync'ing each file written
into the new database. With the proposed new implementation, we'd write
out that data to the kernel *but not have to fsync it*. Instead, we'd
fsync just the WAL. At least on spinning rust, that could be a
considerable win, for exactly the same reasons that we don't fsync
anything but WAL for ordinary transaction commits (ie, way fewer seeks).
Maybe not by enough to counteract doubling the write volume, but I think
we'd need some benchmarks before concluding that it's completely horrid.
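Tom's seek-count argument can be put into a toy cost model. All the constants and file counts below are illustrative assumptions (not measurements), but they show why per-file fsyncs can dominate on spinning rust:

```python
# Toy cost model for the two approaches on a spinning disk.
# Illustrative assumptions, not measurements:
SEEK_MS = 8.0            # cost of one random seek / per-file fsync
SEQ_MB_PER_S = 100.0     # sequential write throughput

def copy_then_fsync_each(n_files, total_mb):
    """Current approach: write the data once, fsync every file."""
    return total_mb / SEQ_MB_PER_S * 1000 + n_files * SEEK_MS

def wal_logged_copy(total_mb):
    """Proposed approach: write data + WAL (2x the bytes), one WAL fsync."""
    return 2 * total_mb / SEQ_MB_PER_S * 1000 + SEEK_MS

# A small template with many relation files (say ~300 files, ~7 MB,
# roughly a fresh template1) is dominated by the per-file fsyncs:
print(copy_then_fsync_each(300, 7))   # ~2470 ms
print(wal_logged_copy(7))             # ~148 ms
```

Under these assumptions the doubled write volume is easily repaid by replacing hundreds of fsyncs with one, though for a large template with few big files the balance shifts back, which is exactly why benchmarks are needed.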
regards, tom lane
On 28.10.2014 00:06, Andrew Dunstan wrote:
On 10/27/2014 07:01 PM, Andres Freund wrote:
On 2014-10-27 18:57:27 -0400, Andrew Dunstan wrote:
On 10/27/2014 05:58 PM, Tomas Vondra wrote:
On 27.10.2014 17:24, Heikki Linnakangas wrote:
I'm also thinking that for wal_level=archive and large databases, this
won't really eliminate the checkpoint, as it will likely generate enough
WAL to hit checkpoint_segments and trigger a checkpoint anyway. No?

That being said, our CREATE DATABASE docs currently say this

Although it is possible to copy a database other than template1 by
specifying its name as the template, this is not (yet) intended as
a general-purpose "COPY DATABASE" facility. The principal
limitation is that no other sessions can be connected to the
template database while it is being copied. CREATE DATABASE will
fail if any other connection exists when it starts; otherwise, new
connections to the template database are locked out until CREATE
DATABASE completes. See Section 21.3 for more information.

I think that this limitation pretty much means no one should use CREATE
DATABASE for cloning live databases in a production environment (because
of the locking).

It also seems to me the "general-purpose COPY DATABASE" described in the
docs is what we're describing in this thread.

Notwithstanding what the docs say, I have seen CREATE DATABASE used
plenty of times, and quite effectively, to clone databases. I don't
think making it do twice the IO in the general case is going to go down
well.

I think they're actually more likely to be happy that we wouldn't need
to do an immediate checkpoint anymore. The performance penalty from that
is likely to be much more severe than the actual IO.

At the very least that needs to be benchmarked.
The question is what workload we are going to benchmark. It's unlikely
that either approach will be a clear winner in all cases, so we'll have
to decide which ones are more common / important, or somehow combine
both approaches (thus losing some of the WAL-only benefits).
I'm pretty sure we'll see about three main cases:

(1) read-mostly workloads

    The current approach wins, because the checkpoint is cheap and the
    WAL-based approach results in 2x the I/O.

    The difference is proportional to the template database size. For
    small databases it's negligible; for large databases it's more
    significant.

(2) write-heavy workloads / small template database

    The WAL-based approach wins, because it does not require an explicit
    checkpoint, and for small databases the I/O generated by WAL-logging
    everything is lower than that of a checkpoint (which is more random).

    This is the case for our PostgreSQL clusters.

(3) write-heavy workloads / large template database

    The current approach wins, for two reasons: (a) for large databases
    the WAL-logging overhead may generate much more I/O than a
    checkpoint, and (b) it may generate so many WAL segments that it
    eventually triggers a checkpoint anyway (even repeatedly).
The exact boundary between these cases depends on multiple things:

(a) shared_buffers size (the larger it is, the more expensive the
    checkpoint)

(b) read-write activity (more writes => more expensive checkpoint)

(c) hardware (especially how well it handles random I/O)
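The trade-off between the cases above can be sketched with a toy cost model. All constants are purely illustrative assumptions; the point is only the shape of the crossover, not the numbers:

```python
# Toy model of the crossover Tomas describes: the immediate-checkpoint
# cost grows with the dirty data in shared_buffers, while the
# WAL-logging cost grows with template size. Illustrative constants:
SEQ_MB_PER_S = 100.0     # sequential write throughput
RAND_MB_PER_S = 20.0     # random write throughput (checkpoints are more random)

def checkpoint_approach_ms(template_mb, dirty_buffers_mb):
    """Current: flush all dirty buffers (random I/O), then copy once."""
    return (dirty_buffers_mb / RAND_MB_PER_S
            + template_mb / SEQ_MB_PER_S) * 1000

def wal_approach_ms(template_mb):
    """Proposed: no cluster-wide checkpoint, but 2x the template I/O."""
    return 2 * template_mb / SEQ_MB_PER_S * 1000

# Case (2): write-heavy, small template -> WAL-logging wins easily.
print(checkpoint_approach_ms(10, 4000) > wal_approach_ms(10))        # True
# Case (3): large template, few dirty buffers -> current approach wins.
print(checkpoint_approach_ms(10_000, 100) < wal_approach_ms(10_000)) # True
```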
Not sure how to decide which case is more important, and I agree that
there are people using CREATE DATABASE to clone databases - maybe not in
production, but e.g. for testing purposes (still, it'd be rather
unfortunate to make it considerably slower for them). Not sure how to
balance this :-/
So maybe we shouldn't cling to the WAL-logging approach too much. Maybe
Heikki's earlier idea, to abandon the full checkpoint and instead assume
that once the transaction commits, all the files were fsynced OK, would
work. Of course, this will do nothing about the replay hazards.
regards
Tomas