Streaming base backups
Attached is an updated streaming base backup patch, based off the work that
Heikki started. It includes support for tablespaces, permissions, progress
reporting and some actual documentation of the protocol changes (user
interface documentation is going to depend on exactly what the frontend
client will look like, so I'm waiting with that one a while).
The basic implementation is: Add a new command to the replication mode called
BASE_BACKUP, that will initiate a base backup, stream the contents (in tar
compatible format) of the data directory and all tablespaces, and then end
the base backup in a single operation.
Other than the basic implementation, there is a small refactoring done of
pg_start_backup() and pg_stop_backup() splitting them into a "backend function"
that is easier to call internally and a "user facing function" that remains
identical to the previous one, and I've also added a pg_abort_backup()
internal-only function to get out of crashes while in backup mode in a safer
way (so it can be called from error handlers). Also, the walsender needs a
resource owner in order to call pg_start_backup().
I've implemented a frontend for this in pg_streamrecv, based on the assumption
that we wanted to include this in bin/ for 9.1 - and that it seems like a
reasonable place to put it. This can obviously be moved elsewhere if we want to.
That code needs a lot more cleanup, but I wanted to make sure I got the backend
patch out for review quickly. You can find the current WIP branch for
pg_streamrecv on my github page at https://github.com/mhagander/pg_streamrecv,
in the branch "baserecv". I'll be posting that as a separate patch once it's
been a bit more cleaned up (it does work now if you want to test it, though).
Some remaining thoughts and must-dos:
* Compression: Do we want to be able to compress the backups server-side? Or
defer that to whenever we get compression in libpq? (you can still tunnel it
through for example SSH to get compression if you want to) My thinking is
defer it.
* Compression: We could still implement compression of the tar files in
pg_streamrecv (probably easier, possibly more useful?)
* Windows support (need to implement readlink)
* Tar code is copied from pg_dump and modified. Should we try to factor it out
into port/? There are changes in the middle of it so it can't be done with
the current calling points, it would need a refactor. I think it's not worth
it, given how simple it is.
Improvements I want to add, but that aren't required for basic operation:
* Stefan mentioned it might be useful to put some
posix_fadvise(POSIX_FADV_DONTNEED)
in the process that streams all the files out. Seems useful, as long as that
doesn't kick them out of the cache *completely*, for other backends as well.
Do we know if that is the case?
* include all the necessary WAL files in the backup. This way we could generate
a tar file that would work on its own - right now, you still need to set up
log archiving (or use streaming repl) to get the remaining logfiles from the
master. This is fine for replication setups, but not for backups.
This would also require us to block recycling of WAL files during the backup,
of course.
* Suggestion from Heikki: don't put backup_label in $PGDATA during the backup.
Rather, include it just in the tar file. That way if you crash during the
backup, the master doesn't start recovery from the backup_label, leading
to failure to start up in the worst case.
* Suggestion from Heikki: perhaps at some point we're going to need a full
bison grammar for walsender commands.
* Relocation of tablespaces (can at least partially be done client-side)
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Attachments:
basebackup.patch (text/x-patch; charset=US-ASCII), +676 -40
On 01/05/2011 02:54 PM, Magnus Hagander wrote:
[..]
Some remaining thoughts and must-dos:
* Compression: Do we want to be able to compress the backups server-side? Or
defer that to whenever we get compression in libpq? (you can still tunnel it
through for example SSH to get compression if you want to) My thinking is
defer it.
* Compression: We could still implement compression of the tar files in
pg_streamrecv (probably easier, possibly more useful?)
Hmm, compression would be nice, but I don't think it is required for this
initial implementation.
* Windows support (need to implement readlink)
* Tar code is copied from pg_dump and modified. Should we try to factor it out
into port/? There are changes in the middle of it so it can't be done with
the current calling points, it would need a refactor. I think it's not worth
it, given how simple it is.
Improvements I want to add, but that aren't required for basic operation:
* Stefan mentioned it might be useful to put some
posix_fadvise(POSIX_FADV_DONTNEED)
in the process that streams all the files out. Seems useful, as long as that
doesn't kick them out of the cache *completely*, for other backends as well.
Do we know if that is the case?
Well, my main concern is that a basebackup done that way might blow up
the OS buffer cache, causing temporary performance issues.
This might be more serious with an in-core solution than with what
people use now because a number of backup software and tools (like some
of the commercial backup solutions) employ various tricks to avoid that.
One interesting tidbit i found was:
http://insights.oetiker.ch/linux/fadvise/
which is very Linux specific but interesting nevertheless...
Stefan
Magnus Hagander <magnus@hagander.net> writes:
Attached is an updated streaming base backup patch, based off the work
Thanks! :)
* Compression: Do we want to be able to compress the backups server-side? Or
defer that to whenever we get compression in libpq? (you can still tunnel it
through for example SSH to get compression if you want to) My thinking is
defer it.
Compression in libpq would be a nice way to solve it, later.
* Compression: We could still implement compression of the tar files in
pg_streamrecv (probably easier, possibly more useful?)
What about pg_streamrecv | gzip > …, which has the big advantage of
being friendly to *any* compression command line tool, whatever patents
and licenses?
* Stefan mentioned it might be useful to put some
posix_fadvise(POSIX_FADV_DONTNEED)
in the process that streams all the files out. Seems useful, as long as that
doesn't kick them out of the cache *completely*, for other backends as well.
Do we know if that is the case?
Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
not already in SHM?
* include all the necessary WAL files in the backup. This way we could generate
a tar file that would work on its own - right now, you still need to set up
log archiving (or use streaming repl) to get the remaining logfiles from the
master. This is fine for replication setups, but not for backups.
This would also require us to block recycling of WAL files during the backup,
of course.
Well, I would guess that if you're streaming the WAL files in parallel
while the base backup is taken, then you're able to have it all without
an archiving setup, and the server could still recycle them.
Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
Magnus Hagander <magnus@hagander.net> writes:
Attached is an updated streaming base backup patch, based off the work
Thanks! :)
* Compression: Do we want to be able to compress the backups server-side? Or
defer that to whenever we get compression in libpq? (you can still tunnel it
through for example SSH to get compression if you want to) My thinking is
defer it.
Compression in libpq would be a nice way to solve it, later.
Yeah, I'm pretty much set on postponing that one.
* Compression: We could still implement compression of the tar files in
pg_streamrecv (probably easier, possibly more useful?)
What about pg_streamrecv | gzip > …, which has the big advantage of
being friendly to *any* compression command line tool, whatever patents
and licenses?
That's part of what I meant with "easier and more useful".
Right now though, pg_streamrecv will output one tar file for each
tablespace, so you can't get it on stdout. But that can be changed of
course. The easiest step 1 is to just use gzopen() from zlib on the
files and use the same code as now :-)
* Stefan mentioned it might be useful to put some
posix_fadvise(POSIX_FADV_DONTNEED)
in the process that streams all the files out. Seems useful, as long as that
doesn't kick them out of the cache *completely*, for other backends as well.
Do we know if that is the case?
Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
not already in SHM?
I think that's way more complex than we want to go here.
* include all the necessary WAL files in the backup. This way we could generate
a tar file that would work on it's own - right now, you still need to set up
log archiving (or use streaming repl) to get the remaining logfiles from the
master. This is fine for replication setups, but not for backups.
This would also require us to block recycling of WAL files during the backup,
of course.
Well, I would guess that if you're streaming the WAL files in parallel
while the base backup is taken, then you're able to have it all without
an archiving setup, and the server could still recycle them.
Yes, this was mostly for the use-case of "getting a single tarfile
that you can actually use to restore from without needing the log
archive at all".
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Magnus Hagander <magnus@hagander.net> writes:
Compression in libpq would be a nice way to solve it, later.
Yeah, I'm pretty much set on postponing that one.
+1, in case it was not clear for whoever's counting the votes :)
What about pg_streamrecv | gzip > …, which has the big advantage of
That's part of what I meant with "easier and more useful".
Well…
Right now though, pg_streamrecv will output one tar file for each
tablespace, so you can't get it on stdout. But that can be changed of
course. The easiest step 1 is to just use gzopen() from zlib on the
files and use the same code as now :-)
Oh if integrating it is easier :)
Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
not already in SHM?
I think that's way more complex than we want to go here.
Yeah.
Well, I would guess that if you're streaming the WAL files in parallel
while the base backup is taken, then you're able to have it all without
an archiving setup, and the server could still recycle them.
Yes, this was mostly for the use-case of "getting a single tarfile
that you can actually use to restore from without needing the log
archive at all".
It also allows for a simpler kick-start procedure for preparing a
standby, and lets you stop worrying too much about wal_keep_segments
and archive servers.
When does the standby launch its walreceiver? It would be extra-nice for
the base backup tool to optionally continue streaming WALs until the
standby starts doing it itself, so that wal_keep_segments is really
deprecated. No idea how feasible that is, though.
Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On 06.01.2011 00:27, Dimitri Fontaine wrote:
Magnus Hagander <magnus@hagander.net> writes:
What about pg_streamrecv | gzip > …, which has the big advantage of
That's part of what I meant with "easier and more useful".
Well…
One thing to keep in mind is that if you do compression in libpq for the
transfer, and gzip the tar file in the client, that's quite inefficient.
You compress the data once in the server, decompress in the client, then
compress it again in the client. If you're going to write the backup to
a compressed file, and you want to transfer it compressed to save
bandwidth, you want to gzip it in the server to begin with.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Wed, Jan 5, 2011 at 23:58, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
* Stefan mentioned it might be useful to put some
posix_fadvise(POSIX_FADV_DONTNEED)
in the process that streams all the files out. Seems useful, as long as that
doesn't kick them out of the cache *completely*, for other backends as well.
Do we know if that is the case?
Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
not already in SHM?
It's not much of an improvement. For pages that we already have in
shared memory, OS cache is mostly useless. OS cache matters for pages
that *aren't* in shared memory.
Regards,
Marti
On Wed, Jan 5, 2011 at 23:27, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
Magnus Hagander <magnus@hagander.net> writes:
Well, I would guess that if you're streaming the WAL files in parallel
while the base backup is taken, then you're able to have it all without
an archiving setup, and the server could still recycle them.
Yes, this was mostly for the use-case of "getting a single tarfile
that you can actually use to restore from without needing the log
archive at all".
It also allows for a simpler kick-start procedure for preparing a
standby, and lets you stop worrying too much about wal_keep_segments
and archive servers.
When does the standby launch its walreceiver? It would be extra-nice for
the base backup tool to optionally continue streaming WALs until the
standby starts doing it itself, so that wal_keep_segments is really
deprecated. No idea how feasible that is, though.
I think we're inventing a whole lot of complexity here that may not
be necessary at all. Let's do it the simple way and see how far we can
get with that - we can always improve this for 9.2.
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On 05.01.2011 15:54, Magnus Hagander wrote:
Attached is an updated streaming base backup patch, based off the work
that Heikki started.
...
I've implemented a frontend for this in pg_streamrecv, based on the assumption
that we wanted to include this in bin/ for 9.1 - and that it seems like a
reasonable place to put it. This can obviously be moved elsewhere if we want to.
Hmm, is there any point in keeping the two functionalities in the same
binary, taking the base backup and streaming WAL to an archive
directory? Looks like the only common option between the two modes is
passing the connection string, and the verbose flag. A separate
pg_basebackup binary would probably make more sense.
That code needs a lot more cleanup, but I wanted to make sure I got the backend
patch out for review quickly. You can find the current WIP branch for
pg_streamrecv on my github page at https://github.com/mhagander/pg_streamrecv,
in the branch "baserecv". I'll be posting that as a separate patch once it's
been a bit more cleaned up (it does work now if you want to test it, though).
Looks like pg_streamrecv creates the pg_xlog and pg_tblspc directories,
because they're not included in the streamed tar. Wouldn't it be better
to include them in the tar as empty directories at the server-side?
Otherwise if you write the tar file to disk and untar it later, you have
to manually create them.
It would be nice to have an option in pg_streamrecv to specify the
backup label to use.
An option to stream the tar to stdout instead of a file would be very
handy too, so that you could pipe it directly to gzip for example. I
realize you get multiple tar files if tablespaces are used, but even if
you just throw an error in that case, it would be handy.
* Suggestion from Heikki: perhaps at some point we're going to need a full
bison grammar for walsender commands.
Maybe we should at least start using the lexer; we're not quite there to
need a full-blown grammar yet, but even a lexer might help.
BTW, looking at the WAL-streaming side of pg_streamrecv, if you start it
from scratch with an empty target directory, it needs to connect to
"postgres" database, to run pg_current_xlog_location(), and then
reconnect in replication mode. That's a bit awkward, there might not be
a "postgres" database, and even if there is, you might not have the
permission to connect to it. It would be much better to have a variant
of the START_REPLICATION command at the server-side that begins
streaming from the current location. Maybe just by leaving out the
start-location parameter.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, Jan 6, 2011 at 23:57, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 05.01.2011 15:54, Magnus Hagander wrote:
Attached is an updated streaming base backup patch, based off the work
that Heikki started.
...
I've implemented a frontend for this in pg_streamrecv, based on the
assumption that we wanted to include this in bin/ for 9.1 - and that it
seems like a reasonable place to put it. This can obviously be moved
elsewhere if we want to.
Hmm, is there any point in keeping the two functionalities in the same
binary, taking the base backup and streaming WAL to an archive directory?
Looks like the only common option between the two modes is passing the
connection string, and the verbose flag. A separate pg_basebackup binary
would probably make more sense.
Yeah, once I broke things apart for better readability, I started
leaning in that direction as well.
However, if you consider the things that Dimitri mentioned about
streaming at the same time as downloading, having them in the same one
would make more sense. I don't think that's something for now,
though.
That code needs a lot more cleanup, but I wanted to make sure I got the
backend patch out for review quickly. You can find the current WIP branch
for pg_streamrecv on my github page at
https://github.com/mhagander/pg_streamrecv, in the branch "baserecv". I'll
be posting that as a separate patch once it's been a bit more cleaned up
(it does work now if you want to test it, though).
Looks like pg_streamrecv creates the pg_xlog and pg_tblspc directories,
because they're not included in the streamed tar. Wouldn't it be better to
include them in the tar as empty directories at the server-side? Otherwise
if you write the tar file to disk and untar it later, you have to manually
create them.
Yeah, good point. Originally, the tar code (your tar code, btw :P)
didn't create *any* directories, so I stuck it in there. I agree it
should be moved to the backend patch now.
It would be nice to have an option in pg_streamrecv to specify the backup
label to use.
Agreed.
An option to stream the tar to stdout instead of a file would be very handy
too, so that you could pipe it directly to gzip for example. I realize you
get multiple tar files if tablespaces are used, but even if you just throw
an error in that case, it would be handy.
Makes sense.
* Suggestion from Heikki: perhaps at some point we're going to need a full
bison grammar for walsender commands.
Maybe we should at least start using the lexer; we're not quite there to
need a full-blown grammar yet, but even a lexer might help.
Might. I don't speak flex very well, so I'm not really sure what that
would mean.
BTW, looking at the WAL-streaming side of pg_streamrecv, if you start it
from scratch with an empty target directory, it needs to connect to
"postgres" database, to run pg_current_xlog_location(), and then reconnect
in replication mode. That's a bit awkward, there might not be a "postgres"
database, and even if there is, you might not have the permission to connect
to it. It would be much better to have a variant of the START_REPLICATION
command at the server-side that begins streaming from the current location.
Maybe just by leaving out the start-location parameter.
Agreed. That part is unchanged from the one that runs against 9.0
though, where that wasn't a possibility. But adding something like
that to the walsender in 9.1 would be good.
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
2011/1/5 Magnus Hagander <magnus@hagander.net>:
On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
Magnus Hagander <magnus@hagander.net> writes:
* Stefan mentioned it might be useful to put some
posix_fadvise(POSIX_FADV_DONTNEED)
in the process that streams all the files out. Seems useful, as long as that
doesn't kick them out of the cache *completely*, for other backends as well.
Do we know if that is the case?
Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
not already in SHM?
I think that's way more complex than we want to go here.
DONTNEED will remove the block from the OS buffer cache every time.
It should not be that hard to implement a snapshot (it needs mincore())
and to restore the previous state. I don't know how basebackup is
performed exactly... so perhaps I am wrong.
posix_fadvise support is already in postgresql core... we can start by
just doing a snapshot of the files before starting, or at some point
in the basebackup; it will need only 256kB per GB of data...
--
Cédric Villemain 2ndQuadrant
http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
On Wed, 2011-01-05 at 14:54 +0100, Magnus Hagander wrote:
The basic implementation is: Add a new command to the replication mode called
BASE_BACKUP, that will initiate a base backup, stream the contents (in tar
compatible format) of the data directory and all tablespaces, and then end
the base backup in a single operation.
I'm a little dubious of the performance of that approach for some users,
though it does seem a popular idea.
One very useful feature will be some way of confirming the number and
size of files to transfer, so that the base backup client can find out
the progress.
It would also be good to avoid writing a backup_label file at all on the
master, so there was no reason why multiple concurrent backups could not
be taken. The current coding allows for the idea that the start and stop
might be in different sessions, whereas here we know we are in one
session.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Fri, Jan 7, 2011 at 02:15, Simon Riggs <simon@2ndquadrant.com> wrote:
On Wed, 2011-01-05 at 14:54 +0100, Magnus Hagander wrote:
The basic implementation is: Add a new command to the replication mode called
BASE_BACKUP, that will initiate a base backup, stream the contents (in tar
compatible format) of the data directory and all tablespaces, and then end
the base backup in a single operation.I'm a little dubious of the performance of that approach for some users,
though it does seem a popular idea.
Well, it's of course only going to be an *option*. We should keep our
flexibility and allow the current ways as well.
One very useful feature will be some way of confirming the number and
size of files to transfer, so that the base backup client can find out
the progress.
The patch already does this. Or rather, as it's coded it does this
once per tablespace.
It'll give you an approximation only of course, that can change, but
it should be enough for the purposes of a progress indication.
It would also be good to avoid writing a backup_label file at all on the
master, so there was no reason why multiple concurrent backups could not
be taken. The current coding allows for the idea that the start and stop
might be in different sessions, whereas here we know we are in one
session.
Yeah, I have that on the todo list suggested by Heikki. I consider it
a later phase though.
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:
2011/1/5 Magnus Hagander <magnus@hagander.net>:
On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
Magnus Hagander <magnus@hagander.net> writes:
* Stefan mentioned it might be useful to put some
posix_fadvise(POSIX_FADV_DONTNEED)
in the process that streams all the files out. Seems useful, as long as that
doesn't kick them out of the cache *completely*, for other backends as well.
Do we know if that is the case?
Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
not already in SHM?
I think that's way more complex than we want to go here.
DONTNEED will remove the block from the OS buffer cache every time.
Then we definitely don't want to use it - because some other backend
might well want the file. Better leave it up to the standard logic in
the kernel.
It should not be that hard to implement a snapshot (it needs mincore())
and to restore previous state. I don't know how basebackup is
performed exactly...so perhaps I am wrong.
Uh, it just reads the files out of the filesystem. Just like you'd do
today, except it's now integrated and streams the data across a
regular libpq connection.
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On Thu, Jan 06, 2011 at 07:47:39PM -0500, Cédric Villemain wrote:
2011/1/5 Magnus Hagander <magnus@hagander.net>:
On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
Magnus Hagander <magnus@hagander.net> writes:
* Stefan mentioned it might be useful to put some
posix_fadvise(POSIX_FADV_DONTNEED)
in the process that streams all the files out. Seems useful, as long as that
doesn't kick them out of the cache *completely*, for other backends as well.
Do we know if that is the case?
Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
not already in SHM?
I think that's way more complex than we want to go here.
DONTNEED will remove the block from the OS buffer cache every time.
It should not be that hard to implement a snapshot (it needs mincore())
and to restore the previous state. I don't know how basebackup is
performed exactly... so perhaps I am wrong.
posix_fadvise support is already in postgresql core... we can start by
just doing a snapshot of the files before starting, or at some point
in the basebackup, it will need only 256kB per GB of data...
It is actually possible to be more scalable than the simple solution you
outline here (although that solution works pretty well).
I've written a program that synchronizes the OS cache state using
mmap()/mincore() between two computers. I haven't actually tested its
impact on performance yet, but I was surprised by how fast it actually runs
and how compact the cache maps can be.
If one encodes the data so one remembers the number of zeros between 1s
one, storage scale by the amount of memory in each size rather than the
dataset size. I actually played with doing that, then doing huffman
encoding of that. I get around 1.2-1.3 bits / page of _physical memory_
on my tests.
I don't have my notes handy, but here are some numbers from memory...
The obvious worst cases are 1 bit per page of _dataset_ or 19 bits per page
of physical memory in the machine. The latter limit gets better, however,
since there are < 1024 symbols possible for the encoder (since in this
case symbols are spans of zeros that need to fit in a file that is 1 GB in
size). So the actual worst case is much closer to 1 bit per page of
the dataset or ~10 bits per page of physical memory. The real performance
I see with huffman is more like 1.3 bits per page of physical memory. All the
encoding/decoding is actually very fast. zlib would actually compress even
better than huffman, but the huffman encoder/decoder is actually pretty good
and very straightforward code.
I would like to integrate something like this into PG or perhaps even into
something like rsync, but it was written as a proof of concept and I haven't
had time to work on it recently.
Garick
--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Jan 07, 2011 at 10:26:29AM -0500, Garick Hamlin wrote:
On Thu, Jan 06, 2011 at 07:47:39PM -0500, Cédric Villemain wrote:
2011/1/5 Magnus Hagander <magnus@hagander.net>:
On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
Magnus Hagander <magnus@hagander.net> writes:
* Stefan mentioned it might be useful to put some
posix_fadvise(POSIX_FADV_DONTNEED)
in the process that streams all the files out. Seems useful, as long as that
doesn't kick them out of the cache *completely*, for other backends as well.
Do we know if that is the case?
Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
not already in SHM?
I think that's way more complex than we want to go here.
DONTNEED will remove the block from the OS buffer cache every time.
It should not be that hard to implement a snapshot (it needs mincore())
and to restore the previous state. I don't know how basebackup is
performed exactly... so perhaps I am wrong.
posix_fadvise support is already in postgresql core... we can start by
just doing a snapshot of the files before starting, or at some point
in the basebackup; it will need only 256kB per GB of data...
It is actually possible to be more scalable than the simple solution you
outline here (although that solution works pretty well).
I've written a program that synchronizes the OS cache state using
mmap()/mincore() between two computers. I haven't actually tested its
impact on performance yet, but I was surprised by how fast it actually runs
and how compact cache maps can be.
If one encodes the data so one remembers the number of zeros between 1s
one, storage scale by the amount of memory in each size rather than the
Sorry for the typos, that should read:
the storage scales by the number of pages resident in memory rather than the
total dataset size.
dataset size. I actually played with doing that, then doing huffman
encoding of that. I get around 1.2-1.3 bits / page of _physical memory_
on my tests.
I don't have my notes handy, but here are some numbers from memory...
The obvious worst cases are 1 bit per page of _dataset_ or 19 bits per page
of physical memory in the machine. The latter limit gets better, however,
since there are < 1024 symbols possible for the encoder (since in this
case symbols are spans of zeros that need to fit in a file that is 1 GB in
size). So the actual worst case is much closer to 1 bit per page of
the dataset or ~10 bits per page of physical memory. The real performance
I see with huffman is more like 1.3 bits per page of physical memory. All the
encoding/decoding is actually very fast. zlib would actually compress even
better than huffman, but the huffman encoder/decoder is actually pretty good
and very straightforward code.
I would like to integrate something like this into PG or perhaps even into
something like rsync, but it was written as a proof of concept and I haven't
had time to work on it recently.
Garick
--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
On 05.01.2011 15:54, Magnus Hagander wrote:
* Suggestion from Heikki: perhaps at some point we're going to need a full
bison grammar for walsender commands.
Here's a patch for this (Also available at
git@github.com:hlinnaka/postgres.git, branch "streaming_base"). I
thought I knew our bison/flex magic pretty well by now, but it turned
out to take much longer than I thought. But here it is.
I'm not 100% sure if this is worth the trouble quite yet. It adds quite
a lot of boilerplate code.. OTOH, having a bison grammar file makes it
easier to see what exactly the grammar is, and I like that. It's not too
bad with three commands yet, but if it expands much further a bison
grammar is a must.
At first I tried using the backend lexer for this, but it couldn't parse
the xlog-start location in the "START_REPLICATION 0/47000000" command.
In hindsight that may have been a badly chosen syntax. But as you
pointed out on IM, the lexer needed to handle this limited set of
commands is very small, so I wrote a dedicated flex lexer instead that
can handle it.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachments:
replication-grammar-1.patch (text/x-diff), +591 -130
On 05.01.2011 15:54, Magnus Hagander wrote:
I've implemented a frontend for this in pg_streamrecv, based on the assumption
that we wanted to include this in bin/ for 9.1 - and that it seems like a
reasonable place to put it. This can obviously be moved elsewhere if we want to.
That code needs a lot more cleanup, but I wanted to make sure I got the backend
patch out for review quickly. You can find the current WIP branch for
pg_streamrecv on my github page at https://github.com/mhagander/pg_streamrecv,
in the branch "baserecv". I'll be posting that as a separate patch once it's
been a bit more cleaned up (it does work now if you want to test it, though).
One more thing, now that I've played a bit with pg_streamrecv:
I find it strange that the data directory must exist when you call
pg_streamrecv in base-backup mode. I would expect it to work like
initdb, and create the directory if it doesn't exist.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, Jan 6, 2011 at 23:57, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
Looks like pg_streamrecv creates the pg_xlog and pg_tblspc directories,
because they're not included in the streamed tar. Wouldn't it be better to
include them in the tar as empty directories at the server-side? Otherwise
if you write the tar file to disk and untar it later, you have to manually
create them.
Attached is an updated patch that does this.
It also collects all the header records as a single resultset at the
beginning. This made for cleaner code, but more importantly makes it
possible to get the total size of the backup even if there are
multiple tablespaces.
It also changes the tar members to use relative paths instead of
absolute ones - since we send the root of the directory in the header
anyway. That also takes away the "./" portion in all tar members.
git branch on github updated as well, of course.
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Attachments:
basebackup.patch (text/x-patch; charset=US-ASCII), +727 -40
On 7.1.2011 15:45, Magnus Hagander wrote:
On Fri, Jan 7, 2011 at 02:15, Simon Riggs <simon@2ndquadrant.com> wrote:
One very useful feature will be some way of confirming the number and
size of files to transfer, so that the base backup client can find out
the progress.
The patch already does this. Or rather, as it's coded it does this
once per tablespace. It'll give you an approximation only of course,
that can change, but it should be enough for the purposes of a progress
indication.
In this case you actually could send exact numbers, as you need to only
transfer the files up to the size they were when starting the base
backup. The rest will be taken care of by WAL replay.
It would also be good to avoid writing a backup_label file at all on the
master, so there was no reason why multiple concurrent backups could not
be taken. The current coding allows for the idea that the start and stop
might be in different sessions, whereas here we know we are in one
session.
Yeah, I have that on the todo list suggested by Heikki. I consider it
a later phase though.
--
--------------------------------------------
Hannu Krosing
Senior Consultant,
Infinite Scalability & Performance
http://www.2ndQuadrant.com/books/