: PostgreSQL Online Backup

Started by Venkat Balajiover 14 years ago11 messagesgeneral

venkat.balaji@verse.in

over 14 years ago

Hello Everyone,

We have had situations where-in "rsync" was executed without executing
"pg_start_backup()" on the production data directory and on the next runs,
"pg_start_backup()" has been executed with "rsync". This was to avoid high
IO load on production. We ended up getting unmatched files (especially in
pg_clog) and not sure about "base" directory.

Postgres is asking for WAL Archive files dated sometime around 15 days ago.
We are absolutely not sure whats going on.

Is this dangerous for production (like corruption) ? or just the backup will
be invalid ? Please help us know if we have to perform any precautionary
checks on the production cluster.

Apart from firing a checkpoint, does "pg_start_backup()" updates any
dictionary tables or views ? or it updates anything in "pg_xlog"

Looking forward for your help !

Thanks
VB

Venkat Balaji

venkat.balaji@verse.in

over 14 years ago

In reply to: Venkat Balaji (#1)

Re: : PostgreSQL Online Backup

I tried restoring the backup, after taking the full backup.

Below is what i see in the "archive destination".

Postgres was asking for "00000001000001930000006F" and i tried to find the
same and below is what i find...

-rw------- 1 postgres postgres 3.3M Sep 26 02:06 00000001000001930000006F.gz
-rw------- 1 postgres postgres 219 Sep 26 02:53
00000001000001930000006F.00328508.backup.gz

Why is PG (9.0) putting an extension for the WAL Archive file as "backup.gz"
??

Please help !

Thanks
VB

On Mon, Sep 26, 2011 at 5:11 PM, Venkat Balaji <venkat.balaji@verse.in>wrote:

Show quoted text

Hello Everyone,

We have had situations where-in "rsync" was executed without executing
"pg_start_backup()" on the production data directory and on the next runs,
"pg_start_backup()" has been executed with "rsync". This was to avoid high
IO load on production. We ended up getting unmatched files (especially in
pg_clog) and not sure about "base" directory.

Postgres is asking for WAL Archive files dated sometime around 15 days ago.
We are absolutely not sure whats going on.

Is this dangerous for production (like corruption) ? or just the backup
will be invalid ? Please help us know if we have to perform any
precautionary checks on the production cluster.

Apart from firing a checkpoint, does "pg_start_backup()" updates any
dictionary tables or views ? or it updates anything in "pg_xlog"

Looking forward for your help !

Thanks
VB

Laurenz Albe

laurenz.albe@cybertec.at

over 14 years ago

In reply to: Venkat Balaji (#1)

Re: : PostgreSQL Online Backup

Venkat Balaji wrote:

We have had situations where-in "rsync" was executed without executing

"pg_start_backup()" on the

production data directory and on the next runs, "pg_start_backup()"

has been executed with "rsync".

This was to avoid high IO load on production. We ended up getting

unmatched files (especially in

pg_clog) and not sure about "base" directory.

Postgres is asking for WAL Archive files dated sometime around 15 days

ago. We are absolutely not sure

whats going on.

Is this dangerous for production (like corruption) ? or just the

backup will be invalid ? Please help

us know if we have to perform any precautionary checks on the

production cluster.

Apart from firing a checkpoint, does "pg_start_backup()" updates any

dictionary tables or views ? or

it updates anything in "pg_xlog"

Looking forward for your help !

I am not sure what the problem is.

Do you have problems starting the original PostgreSQL cluster,
or do you have problems restoring a backup?

Running pg_start_backup() will not harm the cluster.
End online backup mode by running pg_stop_backup() or removing
the backup_label file in the cluster directory.

Yours,
Laurenz Albe

Alan Hodgson

ahodgson@simkin.ca

over 14 years ago

In reply to: Venkat Balaji (#2)

Re: : PostgreSQL Online Backup

On September 26, 2011 05:49:50 AM Venkat Balaji wrote:

I tried restoring the backup, after taking the full backup.

Below is what i see in the "archive destination".

Postgres was asking for "00000001000001930000006F" and i tried to find the
same and below is what i find...

-rw------- 1 postgres postgres 3.3M Sep 26 02:06
00000001000001930000006F.gz -rw------- 1 postgres postgres 219 Sep 26
02:53
00000001000001930000006F.00328508.backup.gz

Why is PG (9.0) putting an extension for the WAL Archive file as

"backup.gz" ??

The archive files are created by your archive_command, as specified in
postgresql.conf. My guess would be that your archive command runs the files
through gzip as part of archiving (which is fine).

However, the restore_command you specify in recovery.conf must undo this
compression. So instead of (for example) 'cp -f "%f" "%p"', it might instead
need to look like 'zcat "%f" > "%p"'.

Hope this helps.

Venkat Balaji

venkat.balaji@verse.in

over 14 years ago

In reply to: Alan Hodgson (#4)

Re: : PostgreSQL Online Backup

Thanks for all your inputs !

Our problem is -

We had mistakenly executed "rsync" on the running PostgreSQL data directory
(production) and we did not run "pg_start_backup()".

Will this harm production ? can this lead to corruption ?

Thanks -

On Mon, Sep 26, 2011 at 10:29 PM, Alan Hodgson <ahodgson@simkin.ca> wrote:

Show quoted text

On September 26, 2011 05:49:50 AM Venkat Balaji wrote:

I tried restoring the backup, after taking the full backup.

Below is what i see in the "archive destination".

Postgres was asking for "00000001000001930000006F" and i tried to find

the

same and below is what i find...

-rw------- 1 postgres postgres 3.3M Sep 26 02:06
00000001000001930000006F.gz -rw------- 1 postgres postgres 219 Sep 26
02:53
00000001000001930000006F.00328508.backup.gz

Why is PG (9.0) putting an extension for the WAL Archive file as

"backup.gz" ??

The archive files are created by your archive_command, as specified in
postgresql.conf. My guess would be that your archive command runs the files
through gzip as part of archiving (which is fine).

However, the restore_command you specify in recovery.conf must undo this
compression. So instead of (for example) 'cp -f "%f" "%p"', it might
instead
need to look like 'zcat "%f" > "%p"'.

Hope this helps.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Laurenz Albe

laurenz.albe@cybertec.at

over 14 years ago

In reply to: Venkat Balaji (#5)

Re: : PostgreSQL Online Backup

Venkat Balaji wrote:

Our problem is -

We had mistakenly executed "rsync" on the running PostgreSQL data

directory (production) and we did

not run "pg_start_backup()".

Will this harm production ? can this lead to corruption ?

I assume that you used rsync to copy *from* the data directory.

This cannot lead to data corruption.
Only performance might suffer temporarily due to the additional I/O.

The backup made with rsync will be unusable without pg_start_backup().

Yours,
Laurenz Albe

Venkat Balaji

venkat.balaji@verse.in

over 14 years ago

In reply to: Laurenz Albe (#6)

Re: : PostgreSQL Online Backup

Another problem in recovery (probably because of "rsync") -

As said earlier, we are taking a production backup everyday incrementally
using "rsync".

But, Postgres some how misses to sync few files in between and keeps on
asking the back dated archive files (more than 1 week ago).

I restored October 2nd backup and PG is asking for September 26th archive
file with the last known time as 26th Sep, 2011.

2011-10-03 07:17:12 CDT [12705]: [1-1] LOG: database system was
interrupted; last known up at 2011-09-26 09:01:36 CDT
2011-10-03 07:17:12 CDT [12705]: [2-1] LOG: starting archive recovery
cp: cannot stat
`/usr/local/pgsql9.0.1/obtdata/data/pg_xlog/000000010000053900000076': No
such file or directory
2011-10-03 07:17:12 CDT [12705]: [3-1] LOG: could not open file
"pg_xlog/000000010000053900000076" (log file 1337, segment 118): No such
file or directory
2011-10-03 07:17:12 CDT [12705]: [4-1] LOG: invalid checkpoint record
2011-10-03 07:17:12 CDT [12705]: [5-1] PANIC: could not locate required
checkpoint record
2011-10-03 07:17:12 CDT [12705]: [6-1] HINT: If you are not restoring from
a backup, try removing the file
"/usr/local/pgsql9.0.1/obtdata/data/backup_label".
2011-10-03 07:17:12 CDT [12702]: [1-1] LOG: startup process (PID 12705) was
terminated by signal 6: Aborted
2011-10-03 07:17:12 CDT [12702]: [2-1] LOG: aborting startup due to startup
process failure

I always see pg_clog files and some base files not getting synced.

Below is what we are doing -

pg_start_backup()
rsync the data directory
pg_stop_backup()

The first time "rsync" is fine, but, the subsequent runs are generating
in-consistency.

We do the same every day to backup the data directory incrementally.

What i observed is PG records the TXN id when ever backup starts and stops +
backup label. The next day when PG records the start backup time and TXN id,
i think some of the TXN ids and pg_clog files generated between last stop
time and the next start time are missed.

Did anyone observe this behavior ?? Please help !

This is critical for us. I want to recommend not to use "rsync" (use cp or
scp instead) for production backup.

Thanks
VB

On Tue, Sep 27, 2011 at 2:36 PM, Albe Laurenz <laurenz.albe@wien.gv.at>wrote:

Show quoted text

Venkat Balaji wrote:

Our problem is -

We had mistakenly executed "rsync" on the running PostgreSQL data

directory (production) and we did

not run "pg_start_backup()".

Will this harm production ? can this lead to corruption ?

I assume that you used rsync to copy *from* the data directory.

This cannot lead to data corruption.
Only performance might suffer temporarily due to the additional I/O.

The backup made with rsync will be unusable without pg_start_backup().

Yours,
Laurenz Albe

Alan Hodgson

ahodgson@simkin.ca

over 14 years ago

In reply to: Venkat Balaji (#7)

Re: : PostgreSQL Online Backup

On October 3, 2011 05:33:35 AM Venkat Balaji wrote:

Did anyone observe this behavior ?? Please help !

This is critical for us. I want to recommend not to use "rsync" (use cp or
scp instead) for production backup.

rsync works fine. Why exactly can't the recovery find the backed up copy of
000000010000053900000076? Please post your archive_command settings, the
contents of any script(s) called by that, and the recovery.conf file you're
using that's having problems, as well as the complete process you followed to
initiate recovery. I strongly suspect you're missing part of the process of
actually saving the WAL files needed for recovery.

Venkat Balaji

venkat.balaji@verse.in

over 14 years ago

In reply to: Alan Hodgson (#8)

Re: : PostgreSQL Online Backup

The recovery is unable to find the WAL archive because, it was generated on
26th September.

Whereas the backup is as taken on Oct 2nd, 2011. We deleted all the files.

I do not have that WAL archive copy.

The problem area -

I found that a pg_clog file dated 26th Sep, 2011 is not synced (its not
256K).

Thanks
VB

2011/10/3 Alan Hodgson <ahodgson@simkin.ca>

Show quoted text

On October 3, 2011 05:33:35 AM Venkat Balaji wrote:

Did anyone observe this behavior ?? Please help !

This is critical for us. I want to recommend not to use "rsync" (use cp

or

scp instead) for production backup.

rsync works fine. Why exactly can't the recovery find the backed up copy of
000000010000053900000076? Please post your archive_command settings, the
contents of any script(s) called by that, and the recovery.conf file you're
using that's having problems, as well as the complete process you followed
to
initiate recovery. I strongly suspect you're missing part of the process of
actually saving the WAL files needed for recovery.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#10

Alan Hodgson

ahodgson@simkin.ca

over 14 years ago

In reply to: Venkat Balaji (#9)

Re: : PostgreSQL Online Backup

rsync works fine. Why exactly can't the recovery find the backed up copy
of 000000010000053900000076? Please post your archive_command settings,
the contents of any script(s) called by that, and the recovery.conf file
you're using that's having problems, as well as the complete process you
followed to
initiate recovery. I strongly suspect you're missing part of the process
of actually saving the WAL files needed for recovery.

The recovery is unable to find the WAL archive because, it was generated on
26th September.

Whereas the backup is as taken on Oct 2nd, 2011. We deleted all the files.

I do not have that WAL archive copy.

The problem area -

I found that a pg_clog file dated 26th Sep, 2011 is not synced (its not
256K).

I'm going to need the rest of what I asked for to offer any further suggestions
- especially the full and exact steps you took to initiate recovery and the
contents of recovery.conf. Also, please don't top-post.

#11

Venkat Balaji

venkat.balaji@verse.in

over 14 years ago

In reply to: Alan Hodgson (#10)

Re: : PostgreSQL Online Backup

Sorry for not responding to this email for so long.

Alan,

We had mentioned the following line in recovery.conf file (we had given
pg_xlog location since we did not have WAL archives) -

restore_command = 'cp <data dir>/pg_xlog/%f %p'

We found where the problem was -

Here is what i did -

1. We had taken a full backup using pg_start_backup() and pg_stop_backup()
on Day 1
2. Rest of the days (from Day 2 - Day 15), we had incrementally backed-up (
this is also using pg_start_backup() and pg_stop_backup())
3. On Day-16th, when i started the recovery, PG was asking Day 1's WAL
archive file, which we did not have.

A fresh complete backup with change in our backup strategy resolved the
issue.

Thanks a lot for all your inputs and help on this !!

Regards,
VB

2011/10/4 Alan Hodgson <ahodgson@simkin.ca>

Show quoted text

rsync works fine. Why exactly can't the recovery find the backed up

copy

of 000000010000053900000076? Please post your archive_command settings,
the contents of any script(s) called by that, and the recovery.conf

file

you're using that's having problems, as well as the complete process

you

followed to
initiate recovery. I strongly suspect you're missing part of the

process

of actually saving the WAL files needed for recovery.

The recovery is unable to find the WAL archive because, it was generated

on

26th September.

Whereas the backup is as taken on Oct 2nd, 2011. We deleted all the

files.

I do not have that WAL archive copy.

The problem area -

I found that a pg_clog file dated 26th Sep, 2011 is not synced (its not
256K).

I'm going to need the rest of what I asked for to offer any further
suggestions
- especially the full and exact steps you took to initiate recovery and the
contents of recovery.conf. Also, please don't top-post.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general