Using pg_start_backup() and pg_stop_backup()
Good afternoon all,
I'm trying to use pg_start_backup() and pg_stop_backup() to create
point-in-time backups. More specifically, I'm trying to use filesystem
tools (notably rsync or an rsync-like tool) since the production machine
is on the other end of a (narrow, expensive) pipe. pg_dump is too
expensive (both in time and bandwidth); the gzip-compressed database
dump is about 30GB.
These backups might be maintained/used by others who are only somewhat
familiar with Linux and PostgreSQL, so I'm trying to keep them as simple
as possible.
Now if I read it right (and I'm concerned I'm not), then according to
section 24.3 of the documentation (Continuous Archiving and
Point-in-Time Recovery (PITR)), the backup procedure needs to be as
follows:
1. Issue pg_start_backup('label')
2. Perform rsync of cluster directory
3. Issue pg_stop_backup()
4. Copy all logs from start of pg_start_backup() through to when
pg_stop_backup() finished (using the backup history file, I
guess, which I haven't actually been able to find yet :)
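The four steps above could be scripted along these lines. This is a dry-run sketch, not the poster's actual script: the host name, paths, and label are assumptions, and run() only prints each command (swap in `run() { "$@"; }` to execute for real).

```shell
#!/bin/sh
# Dry-run sketch of the four-step base backup procedure.
run() { echo "+ $*"; }                # print instead of executing

PGDATA=/var/lib/postgresql/9.2/main   # assumed cluster directory
DEST=backup-host:/srv/pg-backup       # assumed rsync destination

run psql -U postgres -c "SELECT pg_start_backup('nightly')"   # step 1
run rsync -a "$PGDATA/" "$DEST/base/"                         # step 2
run psql -U postgres -c "SELECT pg_stop_backup()"             # step 3
# step 4: copy the WAL segments written between steps 1 and 3,
# starting from the segment named in the backup history file
```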
So fair enough. Before I really grasped that, though, I was testing with
just steps #1 through #3. And everything always seemed to work fine.
Ultimately I tested it dozens of times. With various loads on the
production server (certainly at times with more than enough writes to
max out the number of allowed log segments). And the restore never
failed (no errors at least, and spot-checking the data indicated that
everything appeared to be in place).
Am I on drugs? Just crazy lucky? Is #4 actually necessary? (I can
imagine ways of writing to the cluster files which might make it
unnecessary, maybe somebody implemented that and didn't update the
documentation?)
Thanks very much in advance,
David
--
Arguing with an engineer is like wrestling with a pig in mud.
After a while, you realise the pig is enjoying it.
OpenPGP v4 key ID: 4096R/59DDCB9F
Fingerprint: CC53 F124 35C0 7BC2 58FE 7A3C 157D DFD9 59DD CB9F
Retrieve from subkeys.pgp.net
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Tue, Jul 16, 2013 at 11:10 PM, David B Harris <dbharris@eelf.ddts.net> wrote:
> Now if I read it right (and I'm concerned I'm not), then according to
> section 24.3 of the documentation (Continuous Archiving and
> Point-in-Time Recovery (PITR)), the backup procedure needs to be as
> follows:
>
> 1. Issue pg_start_backup('label')
> 2. Perform rsync of cluster directory
> 3. Issue pg_stop_backup()
> 4. Copy all logs from start of pg_start_backup() through to when
>    pg_stop_backup() finished (using the backup history file, I
>    guess, which I haven't actually been able to find yet :)
I assume you are using WAL archiving, so why do that manually, as step
#4 seems to suggest? pg_stop_backup() even ensures that all the
necessary WAL files are ready to be archived. You still need the files
to actually be archived before starting the recovery, though.
> So fair enough. Before I really grasped that, though, I was testing with
> just steps #1 through #3. And everything always seemed to work fine.
> Ultimately I tested it dozens of times. With various loads on the
> production server (certainly at times with more than enough writes to
> max out the number of allowed log segments). And the restore never
> failed (no errors at least, and spot-checking the data indicated that
> everything appeared to be in place).
>
> Am I on drugs? Just crazy lucky?
I don't believe so. Or we both are and we live together in an imaginary world.
> Is #4 actually necessary? (I can imagine ways of writing to the
> cluster files which might make it unnecessary, maybe somebody
> implemented that and didn't update the documentation?)
Not really, archiving would do the rest for you, and it is managed by
the server. Just be sure to set up restore_command in recovery.conf
when starting a node from the backup you took.
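The recovery.conf Michael mentions might look like this. A minimal sketch: the archive path is an assumption, but %f and %p are the placeholders PostgreSQL substitutes (WAL file name and target path, respectively).

```ini
# recovery.conf, placed in the restored cluster directory
restore_command = 'cp /var/backups/pg-wal/%f "%p"'
```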
--
Michael
On Wed Jul 17, 08:12am +0900, Michael Paquier wrote:
On Tue, Jul 16, 2013 at 11:10 PM, David B Harris <dbharris@eelf.ddts.net> wrote:
>> 4. Copy all logs from start of pg_start_backup() through to when
>>    pg_stop_backup() finished (using the backup history file, I
>>    guess, which I haven't actually been able to find yet :)
>
> I assume you are using WAL archiving, so why do that manually, as step
> #4 seems to suggest? pg_stop_backup() even ensures that all the
> necessary WAL files are ready to be archived. You still need the files
> to actually be archived before starting the recovery, though.
I'm not using WAL archiving actually (these installations [there are a
few of them] have never used log-shipping). Not yet anyway. Even if I
were though, a step #4 would still be necessary I believe - if I read
the documentation right, the last log segment that's supposed to be
transferred isn't ready until after pg_stop_backup().
>> Is #4 actually necessary? (I can imagine ways of writing to the
>> cluster files which might make it unnecessary, maybe somebody
>> implemented that and didn't update the documentation?)
>
> Not really, archiving would do the rest for you, and it is managed by
> the server. Just be sure to set up restore_command in recovery.conf
> when starting a node from the backup you took.
Yeah, so if I understand right, you're using WAL archiving for step #4,
and your process looks like this:
1. Issue pg_start_backup('label')
2. Perform rsync of cluster directory
3. Issue pg_stop_backup()
4. In parallel, WAL archiving has copied all the logs from the
start of pg_start_backup() through to the end of pg_stop_backup()
A configuration like this is described in the documentation (again
section 24.3), though some assembly is required. I'm hoping to
avoid it since it'll be introducing a second channel to the backup which
I can almost guarantee will be forgotten in time. (BTW, we tend to
consider the lifespan of our installations in terms of decades.)
I mean, if it's necessary, then so be it of course, but the question
still remains: I've tested a fair bit under fairly adverse conditions
and not had a single failure. Luck?
--
David
On Wed, Jul 17, 2013 at 8:24 AM, David B Harris <dbharris@eelf.ddts.net> wrote:
On Wed Jul 17, 08:12am +0900, Michael Paquier wrote:
On Tue, Jul 16, 2013 at 11:10 PM, David B Harris <dbharris@eelf.ddts.net> wrote:
>>> 4. Copy all logs from start of pg_start_backup() through to when
>>>    pg_stop_backup() finished (using the backup history file, I
>>>    guess, which I haven't actually been able to find yet :)
>>
>> I assume you are using WAL archiving, so why do that manually, as step
>> #4 seems to suggest? pg_stop_backup() even ensures that all the
>> necessary WAL files are ready to be archived. You still need the files
>> to actually be archived before starting the recovery, though.
>
> I'm not using WAL archiving actually (these installations [there are a
> few of them] have never used log-shipping). Not yet anyway. Even if I
> were though, a step #4 would still be necessary I believe - if I read
> the documentation right, the last log segment that's supposed to be
> transferred isn't ready until after pg_stop_backup().
Yep.
>>> Is #4 actually necessary? (I can imagine ways of writing to the
>>> cluster files which might make it unnecessary, maybe somebody
>>> implemented that and didn't update the documentation?)
>>
>> Not really, archiving would do the rest for you, and it is managed by
>> the server. Just be sure to set up restore_command in recovery.conf
>> when starting a node from the backup you took.
>
> Yeah, so if I understand right, you're using WAL archiving for step #4,
> and your process looks like this:
>
> 1. Issue pg_start_backup('label')
> 2. Perform rsync of cluster directory
> 3. Issue pg_stop_backup()
> 4. In parallel, WAL archiving has copied all the logs from the
>    start of pg_start_backup() through to the end of pg_stop_backup()
Yes, exactly. This avoids having to maintain your own scripts by
relying on the server's features instead. What do you actually do to
copy the necessary WAL files? Do you fetch them directly from the
master's pg_xlog folder?
> A configuration like this is described in the documentation (again
> section 24.3), though some assembly is required. I'm hoping to
> avoid it since it'll be introducing a second channel to the backup which
> I can almost guarantee will be forgotten in time. (BTW, we tend to
> consider the lifespan of our installations in terms of decades.)
>
> I mean, if it's necessary, then so be it of course, but the question
> still remains: I've tested a fair bit under fairly adverse conditions
> and not had a single failure. Luck?
It looks like it: wal_keep_segments was presumably set to a high enough
value on the master side that all the necessary WAL files were kept
intact in pg_xlog even though they were not archived.
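For reference, that retention knob is a single postgresql.conf setting. The value below is an arbitrary example, not from this thread; it would need to be sized to cover WAL generation over the longest backup window.

```ini
# postgresql.conf: retain extra segments in pg_xlog even without archiving
wal_keep_segments = 128    # 128 x 16MB = 2GB of WAL kept around
```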
--
Michael
On Wed Jul 17, 09:16am +0900, Michael Paquier wrote:
On Wed, Jul 17, 2013 at 8:24 AM, David B Harris <dbharris@eelf.ddts.net> wrote:
On Wed Jul 17, 08:12am +0900, Michael Paquier wrote:
>>> 4. In parallel, WAL archiving has copied all the logs from the
>>>    start of pg_start_backup() through to the end of pg_stop_backup()
>>
>> Yes, exactly. This avoids having to maintain your own scripts by
>> relying on the server's features instead. What do you actually do to
>> copy the necessary WAL files? Do you fetch them directly from the
>> master's pg_xlog folder?
Due to bandwidth constraints, separate scripts are going to need to be
maintained anyways. I'd love to use pg_basebackup, but since it
transfers the entire cluster each time, it's not suitable - something
like rsync is going to be used instead. (The use of pg_basebackup would
require a separate script anyway, so it makes no difference.) If the
WAL files do need to be copied, I'll be copying them separately, yeah.
(I'll use an archive_command that only runs when backup_label is in
place [i.e. pg_start_backup() has been called], put the segments
somewhere safe during the cluster rsync [probably outside of the
cluster directory], transfer them after pg_stop_backup(), and then
delete them on the source/master server.)
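That conditional archive_command could be sketched like this. It is a hypothetical wrapper, not the poster's script: the script name, paths, and spool directory are assumptions. The server would invoke it with the segment path (%p) and file name (%f).

```shell
#!/bin/sh
# Hypothetical archive_command wrapper: spool WAL segments aside only
# while a base backup is in progress, i.e. while backup_label exists.
# postgresql.conf would carry something like:
#   archive_command = '/usr/local/bin/archive-during-backup.sh %p %f'
PGDATA=${PGDATA:-/var/lib/postgresql/9.2/main}   # assumed cluster path
SPOOL=${SPOOL:-/var/backups/pg-wal}              # outside the cluster dir

archive_wal() {
    # $1 = full path to the segment (%p), $2 = its file name (%f)
    if [ -f "$PGDATA/backup_label" ]; then
        cp "$1" "$SPOOL/$2" || return 1   # fail so the server retries
    fi
    return 0   # success either way lets the server recycle the segment
}

archive_wal "$@"
```

Returning success when no backup is running lets the server recycle segments normally; returning failure during a backup makes it retry the copy rather than lose the segment.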
>> A configuration like this is described in the documentation (again
>> section 24.3), though some assembly is required. I'm hoping to
>> avoid it since it'll be introducing a second channel to the backup which
>> I can almost guarantee will be forgotten in time. (BTW, we tend to
>> consider the lifespan of our installations in terms of decades.)
>>
>> I mean, if it's necessary, then so be it of course, but the question
>> still remains: I've tested a fair bit under fairly adverse conditions
>> and not had a single failure. Luck?
>
> It looks like it: wal_keep_segments was presumably set to a high enough
> value on the master side that all the necessary WAL files were kept
> intact in pg_xlog even though they were not archived.
I'm going to re-run my tests again, but here's what I did:
1. Set wal_keep_segments to a small value (I think I used 8)
2. Create a new database with a small canary table
3. pg_start_backup()
4. Run a bunch of database drops/loads and table drops/loads (with
   pg_restore and COPY .. FROM, respectively) ... for hours and
   hours (writing tens of GBs of data, without question cycling
   through wal_keep_segments)
5. Take a tarball of the cluster (including pg_xlog/)
6. pg_stop_backup()
7. Restore from tarball, start cluster
8. Check for canary table (successfully)
Separately, I also prototyped the backup and I've run it dozens of times
occasionally with a heavy write load on the source/master server without
errors. Now the write load was likely never enough to saturate all
allocatable WAL segments, but given the documentation (which says that
the _last_ WAL segment which is synced by pg_stop_backup() must be
copied as well), I'd have expected a consistency error or similar.
Since somebody has now said "no you're crazy" (paraphrased :), I'm going
to double-check again.
I didn't mention in the original email, but of course one possibility is
that the documentation might be incomplete simply in that if
wal_keep_segments is set high enough such that the earliest segment
isn't overwritten while the rsync is running, everything would be fine.
But (again if I'm reading the documentation correctly), I think there
should be consistency errors (again unless I'm just lucky). Maybe if
there's an incomplete WAL transaction/segment/whatever, it just silently
ignores it. (This would of course make sense.)
I'll run the tests again and this time not use a canary table. It occurs
to me that what I might be seeing is data files in the cluster not being
deleted. If I'm dropping databases and/or tables, my base backup may be
getting the old ones but the new ones might be inconsistent/broken.
(Though I'd still expect a "cannot replay log" error of some sort at
cluster startup.)
--
David
Yep, my tests were broken :)
Thanks :)
Though it's still possible that the backup would be fine if
wal_keep_segments is high enough - might be good if section 24.3
explicitly mentioned as much.
Actually though (if any PostgreSQL developers are paying attention), it
might be useful to have a new WAL segment-managing behaviour. With the
advent of the replication functionality (which is amazing stuff, thanks
so much), I'd expect fewer and fewer installations to use WAL archiving.
If WAL archiving is disabled, it might make sense for pg_start_backup()
to postpone the deletion of WAL segments until pg_stop_backup().
(Hm I might file that as a feature request. Won't really help me today,
but it would have made this backup stuff I'm doing about half as complex
as it's going to have to be.)
Thanks again,
David
On 7/16/2013 6:21 PM, David B Harris wrote:
> Actually though (if any PostgreSQL developers are paying attention), it
> might be useful to have a new WAL segment-managing behaviour. With the
> advent of the replication functionality (which is amazing stuff, thanks
> so much), I'd expect fewer and fewer installations to use WAL archiving.
> If WAL archiving is disabled, it might make sense for pg_start_backup()
> to postpone the deletion of WAL segments until pg_stop_backup().
WAL archiving has another completely different use case, which is PITR,
Point In Time Recovery.
--
john r pierce 37N 122W
somewhere on the middle of the left coast
On Wed, Jul 17, 2013 at 10:49 AM, John R Pierce <pierce@hogranch.com> wrote:
On 7/16/2013 6:21 PM, David B Harris wrote:
>> Actually though (if any PostgreSQL developers are paying attention), it
>> might be useful to have a new WAL segment-managing behaviour. With the
>> advent of the replication functionality (which is amazing stuff, thanks
>> so much), I'd expect fewer and fewer installations to use WAL archiving.
>> If WAL archiving is disabled, it might make sense for pg_start_backup()
>> to postpone the deletion of WAL segments until pg_stop_backup().
>
> WAL archiving has another completely different use case, which is PITR,
> Point In Time Recovery.
Yep, and it is generally considered safer to *always* keep a WAL
archive working in parallel with replication. Just for safety.
--
Michael