For what should pg_stop_backup wait?

Started by Fujii Masaoover 17 years ago11 messages
#1Fujii Masao
masao.fujii@gmail.com

Hi

In HEAD, pg_stop_backup waits until the history file has been archived.
But, in order to ensure that the last wal file was archived, pg_stop_backup
should wait until not only the history file but also the backup stopping wal
file has been archived, I think.

Because the alphabetic order of the history file and the backup stopping
wal file is not constant.

If the backup starting wal file is the same as the stopping one (that is,
any wal files were not switched during backup), the history file whose
name consists of the starting one is behind the stopping one.
Otherwise, the backup stopping wal file is behind the history file.

Is this worth fixing?

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#2Simon Riggs
simon@2ndquadrant.com
In reply to: Fujii Masao (#1)
Re: For what should pg_stop_backup wait?

On Thu, 2008-08-07 at 21:11 +0900, Fujii Masao wrote:

In HEAD, pg_stop_backup waits until the history file has been archived.
But, in order to ensure that the last wal file was archived, pg_stop_backup
should wait until not only the history file but also the backup stopping wal
file has been archived, I think.

Because the alphabetic order of the history file and the backup stopping
wal file is not constant.

If the backup starting wal file is the same as the stopping one (that is,
any wal files were not switched during backup), the history file whose
name consists of the starting one is behind the stopping one.
Otherwise, the backup stopping wal file is behind the history file.

pg_stop_backup explicitly requests a log switch prior to writing the
history file, so this occurs in all cases. The assumption is wrong.

So yes, we should wait for stopxlogfilename, not histfilepath.

I'll do a patch. Thanks for your input.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#3Simon Riggs
simon@2ndquadrant.com
In reply to: Simon Riggs (#2)
1 attachment(s)
Re: For what should pg_stop_backup wait?

On Thu, 2008-08-07 at 14:59 +0100, Simon Riggs wrote:

I'll do a patch. Thanks for your input.

Please review attached patch.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

Attachments:

stop_backup_wait_bug.v1.patchtext/x-patch; charset=utf-8; name=stop_backup_wait_bug.v1.patchDownload
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.316
diff -c -r1.316 xlog.c
*** src/backend/access/transam/xlog.c	13 Jul 2008 20:45:47 -0000	1.316
--- src/backend/access/transam/xlog.c	7 Aug 2008 14:31:18 -0000
***************
*** 6721,6729 ****
  	CleanupBackupHistory();
  
  	/*
! 	 * Wait until the history file has been archived. We assume that the 
! 	 * alphabetic sorting property of the WAL files ensures the last WAL
! 	 * file is guaranteed archived by the time the history file is archived.
  	 *
  	 * We wait forever, since archive_command is supposed to work and
  	 * we assume the admin wanted his backup to work completely. If you 
--- 6721,6732 ----
  	CleanupBackupHistory();
  
  	/*
! 	 * Wait until the last WAL file has been archived. We assume that the 
! 	 * alphabetic sorting property of the WAL files ensures the history
! 	 * file is guaranteed archived by the time the last WAL file is archived.
! 	 * The history file name depends upon the startpoint, whereas the last
! 	 * file depends upon the stoppoint. They are always different because we
! 	 * make an explicit xlog switch earlier in this function.
  	 *
  	 * We wait forever, since archive_command is supposed to work and
  	 * we assume the admin wanted his backup to work completely. If you 
***************
*** 6733,6745 ****
  	 * set to .ready before we slept, then while asleep it has been set
  	 * to .done and then removed by a concurrent checkpoint.
  	 */
- 	BackupHistoryFileName(histfilepath, ThisTimeLineID, _logId, _logSeg,
- 						  startpoint.xrecoff % XLogSegSize);
- 
  	seconds_before_warning = 60;
  	waits = 0;
  
! 	while (!XLogArchiveCheckDone(histfilepath, false))
  	{
  		CHECK_FOR_INTERRUPTS();
  
--- 6736,6745 ----
  	 * set to .ready before we slept, then while asleep it has been set
  	 * to .done and then removed by a concurrent checkpoint.
  	 */
  	seconds_before_warning = 60;
  	waits = 0;
  
! 	while (!XLogArchiveCheckDone(stopxlogfilename, false))
  	{
  		CHECK_FOR_INTERRUPTS();
  
#4Fujii Masao
masao.fujii@gmail.com
In reply to: Simon Riggs (#3)
Re: For what should pg_stop_backup wait?

On Thu, Aug 7, 2008 at 11:34 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Thu, 2008-08-07 at 14:59 +0100, Simon Riggs wrote:

I'll do a patch. Thanks for your input.

Please review attached patch.

Thank you for your patch!

But, there are two problems in this patch, I think.

! * Wait until the last WAL file has been archived. We assume that the
! * alphabetic sorting property of the WAL files ensures the history
! * file is guaranteed archived by the time the last WAL file is archived.
! * The history file name depends upon the startpoint, whereas the last
! * file depends upon the stoppoint. They are always different because we
! * make an explicit xlog switch earlier in this function.

If there are a few transactions during backup, startpoint may be the same as
the stoppoint.

postgres=# SELECT pg_xlogfile_name(pg_start_backup('test')) AS startpoint;
startpoint
--------------------------
000000010000000000000004
(1 row)

*** A few transaction occurs ***

postgres=# SELECT pg_xlogfile_name(pg_stop_backup()) AS stoppoint;
stoppoint
--------------------------
000000010000000000000004
(1 row)

In this situation, the history file (000000010000000000000004.00000020.backup)
is behind the stoppoint (000000010000000000000004) in the alphabetic order.
So, pg_stop_backup should wait for both the stoppoint and the history
file, I think.

! while (!XLogArchiveCheckDone(stopxlogfilename, false))

If a concurrent checkpoint removes the status file before XLogArchiveCheckDone,
pg_stop_backup continues waiting forever. This is undesirable behavior.
Yes, statement_timeout may help. But, I don't want to use it, because the
*successful* backup is canceled.

How about checking whether the stoppoint was archived by comparing with
the last WAL archived. The archiver process can tell the last WAL archived.
Or, we can calculate it from the status file.

On the other hand, pg_stop_backup doesn't continue waiting for the history file
forever. Because, only pg_stop_backup removes the status file of it,
and a concurrent
pg_stop_backup never happen.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#5Simon Riggs
simon@2ndquadrant.com
In reply to: Fujii Masao (#4)
Re: For what should pg_stop_backup wait?

On Fri, 2008-08-08 at 11:47 +0900, Fujii Masao wrote:

On Thu, Aug 7, 2008 at 11:34 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

In this situation, the history file (000000010000000000000004.00000020.backup)
is behind the stoppoint (000000010000000000000004) in the alphabetic order.
So, pg_stop_backup should wait for both the stoppoint and the history
file, I think.

OK, I see that now.

! while (!XLogArchiveCheckDone(stopxlogfilename, false))

If a concurrent checkpoint removes the status file before XLogArchiveCheckDone,
pg_stop_backup continues waiting forever. This is undesirable behavior.

I think it will only get removed by the second checkpoint, not the
first. So the risk of that happening seems almost certainly impossible.
But we'll put in a check just in case.

Yes, statement_timeout may help. But, I don't want to use it, because the
*successful* backup is canceled.

How about checking whether the stoppoint was archived by comparing with
the last WAL archived. The archiver process can tell the last WAL archived.
Or, we can calculate it from the status file.

I think its easier to test whether the stopxlogfilename still exists in
pg_xlog. If not, we know it has been archived away. We can add that as
an extra condition inside the loop.

So thinking we should test XLogArchiveCheckDone() for both
stopxlogfilename and history file and then stat for the stop WAL file:

BackupHistoryFileName(histfilepath, ThisTimeLineID, _logId, _logSeg,
startpoint.xrecoff % XLogSegSize);

seconds_before_warning = 60;
waits = 0;

while (!XLogArchiveCheckDone(histfilepath, false) ||
!XLogArchiveCheckDone(stopxlogfilename, false))
{
struct stat stat_buf;
char xlogpath[MAXPGPATH];

/*
* Check to see if file has already been archived and WAL file
* removed by a concurrent checkpoint
*/
snprintf(xlogpath, MAXPGPATH, XLOGDIR "/%s", stopxlogfilename);
if (XLogArchiveCheckDone(histfilepath, false) &&
stat(xlogpath, &stat_buf) != 0)
break;

CHECK_FOR_INTERRUPTS();

pg_usleep(1000000L);

if (++waits >= seconds_before_warning)
{
seconds_before_warning *= 2; /* This wraps in >10 years... */
elog(WARNING, "pg_stop_backup() waiting for archive to complete "
"(%d seconds delay)", waits);
}
}

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#6Simon Riggs
simon@2ndquadrant.com
In reply to: Simon Riggs (#5)
1 attachment(s)
Re: For what should pg_stop_backup wait?

On Fri, 2008-08-08 at 12:57 +0100, Simon Riggs wrote:

Yes, statement_timeout may help. But, I don't want to use it, because the
*successful* backup is canceled.

How about checking whether the stoppoint was archived by comparing with
the last WAL archived. The archiver process can tell the last WAL archived.
Or, we can calculate it from the status file.

I think its easier to test whether the stopxlogfilename still exists in
pg_xlog. If not, we know it has been archived away. We can add that as
an extra condition inside the loop.

So thinking we should test XLogArchiveCheckDone() for both
stopxlogfilename and history file and then stat for the stop WAL file:

This seems better.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

Attachments:

stop_backup_wait_bug.v2.patchtext/x-patch; charset=utf-8; name=stop_backup_wait_bug.v2.patchDownload
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.316
diff -c -r1.316 xlog.c
*** src/backend/access/transam/xlog.c	13 Jul 2008 20:45:47 -0000	1.316
--- src/backend/access/transam/xlog.c	8 Aug 2008 13:56:40 -0000
***************
*** 1165,1170 ****
--- 1165,1184 ----
  	/* Retry creation of the .ready file */
  	if (create_if_missing)
  		XLogArchiveNotify(xlog);
+ 	else
+ 	{
+ 		char xlogpath[MAXPGPATH];
+ 
+ 		snprintf(xlogpath, MAXPGPATH, XLOGDIR "/%s", xlog);
+ 
+ 		/*
+ 		 * Check to see if the WAL file has been removed by checkpoint, 
+ 		 * which implies it has already been archived, and explains why we
+ 		 * can't see a status file for it.
+ 		 */
+ 		if (stat(xlogpath, &stat_buf) != 0)
+ 			return true;
+ 	}
  
  	return false;
  }
***************
*** 6721,6735 ****
  	CleanupBackupHistory();
  
  	/*
! 	 * Wait until the history file has been archived. We assume that the 
! 	 * alphabetic sorting property of the WAL files ensures the last WAL
! 	 * file is guaranteed archived by the time the history file is archived.
  	 *
  	 * We wait forever, since archive_command is supposed to work and
  	 * we assume the admin wanted his backup to work completely. If you 
  	 * don't wish to wait, you can SET statement_timeout = xx;
  	 *
! 	 * If the status file is missing, we assume that is because it was
  	 * set to .ready before we slept, then while asleep it has been set
  	 * to .done and then removed by a concurrent checkpoint.
  	 */
--- 6735,6748 ----
  	CleanupBackupHistory();
  
  	/*
! 	 * Wait until both the last WAL file filled during backup and the
! 	 * history file have been archived.
  	 *
  	 * We wait forever, since archive_command is supposed to work and
  	 * we assume the admin wanted his backup to work completely. If you 
  	 * don't wish to wait, you can SET statement_timeout = xx;
  	 *
! 	 * If the status files are missing, we assume that is because it was
  	 * set to .ready before we slept, then while asleep it has been set
  	 * to .done and then removed by a concurrent checkpoint.
  	 */
***************
*** 6739,6745 ****
  	seconds_before_warning = 60;
  	waits = 0;
  
! 	while (!XLogArchiveCheckDone(histfilepath, false))
  	{
  		CHECK_FOR_INTERRUPTS();
  
--- 6752,6759 ----
  	seconds_before_warning = 60;
  	waits = 0;
  
! 	while (!XLogArchiveCheckDone(histfilepath, false) || 
! 		   !XLogArchiveCheckDone(stopxlogfilename, false))
  	{
  		CHECK_FOR_INTERRUPTS();
  
#7Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#4)
Re: For what should pg_stop_backup wait?

On Fri, 2008-08-08 at 11:47 +0900, Fujii Masao wrote:

On Thu, Aug 7, 2008 at 11:34 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Thu, 2008-08-07 at 14:59 +0100, Simon Riggs wrote:

I'll do a patch. Thanks for your input.

Please review attached patch.

Thank you for your patch!

Would you mind if I asked you to be the official reviewer of this patch
for the latest CommitFest? It would help greatly, since you understand
the issue and clearly the code also. Thanks, if possible.

http://wiki.postgresql.org/wiki/CommitFest:2008-09

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#8Fujii Masao
masao.fujii@gmail.com
In reply to: Simon Riggs (#7)
Re: For what should pg_stop_backup wait?

On Fri, Sep 5, 2008 at 7:32 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Would you mind if I asked you to be the official reviewer of this patch
for the latest CommitFest? It would help greatly, since you understand
the issue and clearly the code also. Thanks, if possible.

OK. I'll try to review your patch.
Do I need to inform Josh of reviewing it in advance?

regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#6)
Re: For what should pg_stop_backup wait?

Simon Riggs <simon@2ndquadrant.com> writes:

So thinking we should test XLogArchiveCheckDone() for both
stopxlogfilename and history file and then stat for the stop WAL file:

This seems better.

Somehow I missed the 5-Apr patch that introduced this bogosity.
Next time you make a fundamental change in the behavior of a function,
how about updating its comment to match?

After studying it for a minute, I think that XLogArchiveCheckDone no
longer even has a clearly explainable purpose; it needs to be split
into two functions with two different behaviors. I plan to revert
XLogArchiveCheckDone to its previous behavior and introduce
XLogArchiveIsBusy (flipping the sense of the result) to use in
pg_stop_backup.

regards, tom lane

#10Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#9)
Re: For what should pg_stop_backup wait?

On Mon, 2008-09-08 at 11:57 -0400, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

So thinking we should test XLogArchiveCheckDone() for both
stopxlogfilename and history file and then stat for the stop WAL file:

This seems better.

Somehow I missed the 5-Apr patch that introduced this bogosity.
Next time you make a fundamental change in the behavior of a function,
how about updating its comment to match?

It felt OK when I did it, because of the clearly named boolean. But I
accept your point and will look to improve my code comments in future.

After studying it for a minute, I think that XLogArchiveCheckDone no
longer even has a clearly explainable purpose; it needs to be split
into two functions with two different behaviors. I plan to revert
XLogArchiveCheckDone to its previous behavior and introduce
XLogArchiveIsBusy (flipping the sense of the result) to use in
pg_stop_backup.

You sound like you're in the middle of doing this yourself. Or would you
like me to do that?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#10)
Re: For what should pg_stop_backup wait?

Simon Riggs <simon@2ndQuadrant.com> writes:

You sound like you're in the middle of doing this yourself. Or would you
like me to do that?

Yeah, done and committed.

regards, tom lane