pgsql: Make standby server continuously retry restoring the next WAL

Started by Heikki Linnakangas about 16 years ago, 77 messages (hackers, docs)
#1Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
hackersdocs

Log Message:
-----------
Make standby server continuously retry restoring the next WAL segment with
restore_command, if the connection to the primary server is lost. This
ensures that the standby can recover automatically, if the connection is
lost for a long time and standby falls behind so much that the required
WAL segments have been archived and deleted in the master.

This also makes standby_mode useful without streaming replication; the
server will keep retrying restore_command every few seconds until the
trigger file is found. That's the same basic functionality pg_standby
offers, but without the bells and whistles.

To implement that, refactor the ReadRecord/FetchRecord functions. The
FetchRecord() function introduced in the original streaming replication
patch is removed, and all the retry logic is now in a new function called
XLogReadPage(). XLogReadPage() is now responsible for executing
restore_command, launching walreceiver, and waiting for new WAL to arrive
from primary, as required.

This also changes the life cycle of walreceiver. When launched, it now only
tries to connect to the master once, and exits if the connection fails, or
is lost during streaming for any reason. The startup process detects the
death, and re-launches walreceiver if necessary.
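
As a rough illustration of the retry loop this log message describes, here is a minimal Python sketch. It is not the C code in xlog.c: the function name, the callback parameters, the order of the checks and the 5-second interval are all assumptions made for the example (the log message only says "every few seconds"). The point it shows is that each attempt must return promptly, so the loop can alternate between the archive and a fresh connection attempt.

```python
from typing import Callable


def read_next_segment(
    restore_from_archive: Callable[[], bool],
    stream_from_primary: Callable[[], bool],
    trigger_found: Callable[[], bool],
    sleep: Callable[[float], None],
    retry_interval: float = 5.0,  # assumed value, "every few seconds"
) -> str:
    """Hypothetical stand-in for the retry logic; returns the source that worked."""
    while True:
        # Standby mode ends once the trigger file shows up.
        if trigger_found():
            return "promoted"
        # restore_command must fail quickly if the file is not there yet.
        if restore_from_archive():
            return "archive"
        # walreceiver tries to connect once and exits on failure.
        if stream_from_primary():
            return "stream"
        # Nothing available: wait a little, then try both sources again.
        sleep(retry_interval)
```

Passing the sleep function in as a parameter is only there to make the sketch easy to exercise without real delays.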

Modified Files:
--------------
pgsql/src/backend/access/transam:
xlog.c (r1.361 -> r1.362)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/access/transam/xlog.c?r1=1.361&r2=1.362)
pgsql/src/backend/postmaster:
postmaster.c (r1.601 -> r1.602)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/postmaster/postmaster.c?r1=1.601&r2=1.602)
pgsql/src/backend/replication:
walreceiver.c (r1.1 -> r1.2)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/replication/walreceiver.c?r1=1.1&r2=1.2)
walreceiverfuncs.c (r1.2 -> r1.3)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/replication/walreceiverfuncs.c?r1=1.2&r2=1.3)
pgsql/src/include/replication:
walreceiver.h (r1.4 -> r1.5)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/include/replication/walreceiver.h?r1=1.4&r2=1.5)
pgsql/src/include/storage:
pmsignal.h (r1.28 -> r1.29)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/include/storage/pmsignal.h?r1=1.28&r2=1.29)

#2Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#1)
Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

On Thu, Jan 28, 2010 at 12:27 AM, Heikki Linnakangas
<heikki@postgresql.org> wrote:

Log Message:
-----------
Make standby server continuously retry restoring the next WAL segment with
restore_command, if the connection to the primary server is lost. This
ensures that the standby can recover automatically, if the connection is
lost for a long time and standby falls behind so much that the required
WAL segments have been archived and deleted in the master.

This also makes standby_mode useful without streaming replication; the
server will keep retrying restore_command every few seconds until the
trigger file is found. That's the same basic functionality pg_standby
offers, but without the bells and whistles.

http://archives.postgresql.org/pgsql-hackers/2010-01/msg01520.php
http://archives.postgresql.org/pgsql-hackers/2010-01/msg02589.php

As I pointed out previously, the standby might restore a partially-filled
WAL file that is being archived by the primary, and cause a FATAL error.
This happened on my box while I was testing SR:

sby [20088] FATAL: archive file "000000010000000000000087" has
wrong size: 14139392 instead of 16777216
sby [20076] LOG: startup process (PID 20088) exited with exit code 1
sby [20076] LOG: terminating any other active server processes
act [18164] LOG: received immediate shutdown request

If the startup process is in standby mode, I think that it should retry
starting replication instead of emitting an error when it finds a
partially-filled file in the archive. Then, if replication has been
terminated, it only has to restore the archived file again. Thoughts?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#3Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#2)
Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

Fujii Masao wrote:

As I pointed out previously, the standby might restore a partially-filled
WAL file that is being archived by the primary, and cause a FATAL error.
This happened on my box while I was testing SR:

sby [20088] FATAL: archive file "000000010000000000000087" has
wrong size: 14139392 instead of 16777216
sby [20076] LOG: startup process (PID 20088) exited with exit code 1
sby [20076] LOG: terminating any other active server processes
act [18164] LOG: received immediate shutdown request

If the startup process is in standby mode, I think that it should retry
starting replication instead of emitting an error when it finds a
partially-filled file in the archive. Then, if replication has been
terminated, it only has to restore the archived file again. Thoughts?

Hmm, so after running restore_command, check the file size and if it's
too short, treat it the same as if restore_command returned non-zero?
And it will be retried on the next iteration. Works for me, though OTOH
it will then fail to complain about a WAL file that's genuinely
truncated for some reason. I guess there's no way around that, even if
you have a script as restore_command that does the file size check, it
will have the same problem.
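
To make the idea concrete, here is a small Python sketch of the check, not the backend code. The function name and its parameters are hypothetical; the 16777216-byte segment size is taken from the log above. In standby mode a short file is treated like a failed restore_command so that it gets retried, while plain archive recovery keeps raising an error. Removing the short file before returning is an extra assumption of the sketch.

```python
import os
import subprocess
from typing import Sequence

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # 16777216 bytes, as in the log above


def restore_segment(command: Sequence[str], target_path: str, standby_mode: bool) -> bool:
    """Run a restore command and report whether a full-size segment arrived."""
    result = subprocess.run(command)
    if result.returncode != 0:
        return False  # restore_command failed; the caller retries later
    if not os.path.exists(target_path):
        return False  # command claimed success but produced nothing
    size = os.path.getsize(target_path)
    if size == WAL_SEGMENT_SIZE:
        return True
    if standby_mode:
        # Possibly still being copied into the archive: discard and retry.
        os.remove(target_path)
        return False
    raise RuntimeError(
        f"archive file has wrong size: {size} instead of {WAL_SEGMENT_SIZE}"
    )
```

As noted above, the standby-mode branch cannot tell a file that is still being copied from one that is truncated for good; both simply get retried.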

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#4Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#3)
Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

On Wed, Feb 10, 2010 at 4:32 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Hmm, so after running restore_command, check the file size and if it's
too short, treat it the same as if restore_command returned non-zero?

Yes, but only in the standby mode case. OTOH I think that normal archive
recovery should still treat it as a FATAL error.

And it will be retried on the next iteration. Works for me, though OTOH
it will then fail to complain about a WAL file that's genuinely
truncated for some reason. I guess there's no way around that, even if
you have a script as restore_command that does the file size check, it
will have the same problem.

Right. But does the server in standby mode also need to complain about that?
We might be able to read such a seemingly truncated WAL file in full
from the primary via SR, or from the archive after a few seconds. So it
seems odd to me to give up on the standby just because it found a WAL file
whose size is short. I believe that warm standby (+ pg_standby)
is based on the same thinking.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#5Aidan Van Dyk
aidan@highrise.ca
In reply to: Heikki Linnakangas (#3)
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

* Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [100210 02:33]:

Hmm, so after running restore_command, check the file size and if it's
too short, treat it the same as if restore_command returned non-zero?
And it will be retried on the next iteration. Works for me, though OTOH
it will then fail to complain about a WAL file that's genuinely
truncated for some reason. I guess there's no way around that, even if
you have a script as restore_command that does the file size check, it
will have the same problem.

But isn't this something every current PITR archive already "works
around"... Everybody doing PITR archives already knows the importance of
making the *appearance* of the WAL filename in the archive atomic.

Don't the docs warn about plain cp not being atomic and allowing "short"
files to appear in the archive...

I'm not sure why "streaming recovery" suddenly changes the requirements...

a.

--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.

#6Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Aidan Van Dyk (#5)
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

Aidan Van Dyk wrote:

* Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [100210 02:33]:

Hmm, so after running restore_command, check the file size and if it's
too short, treat it the same as if restore_command returned non-zero?
And it will be retried on the next iteration. Works for me, though OTOH
it will then fail to complain about a WAL file that's genuinely
truncated for some reason. I guess there's no way around that, even if
you have a script as restore_command that does the file size check, it
will have the same problem.

But isn't this something every current PITR archive already "works
around"... Everybody doing PITR archives already knows the importance of
making the *appearance* of the WAL filename in the archive atomic.

Well, pg_standby does defend against that, but you don't use pg_standby
with the built-in standby mode anymore. It would be reasonable to have
the same level of defenses built-in. It's essentially a one-line change,
and saves a lot of trouble and risk of subtle misconfiguration for admins.

Don't the docs warn about plain cp not being atomic and allowing "short"
files to appear in the archive...

Hmm, I don't see anything about that at a quick glance. Besides, normal
PITR doesn't have a problem with that, because it will stop when it
reaches the end of archived WAL anyway.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#7Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#3)
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

On Wed, 2010-02-10 at 09:32 +0200, Heikki Linnakangas wrote:

Fujii Masao wrote:

As I pointed out previously, the standby might restore a partially-filled
WAL file that is being archived by the primary, and cause a FATAL error.
This happened on my box while I was testing SR:

sby [20088] FATAL: archive file "000000010000000000000087" has
wrong size: 14139392 instead of 16777216
sby [20076] LOG: startup process (PID 20088) exited with exit code 1
sby [20076] LOG: terminating any other active server processes
act [18164] LOG: received immediate shutdown request

If the startup process is in standby mode, I think that it should retry
starting replication instead of emitting an error when it finds a
partially-filled file in the archive. Then, if replication has been
terminated, it only has to restore the archived file again. Thoughts?

Hmm, so after running restore_command, check the file size and if it's
too short, treat it the same as if restore_command returned non-zero?
And it will be retried on the next iteration. Works for me, though OTOH
it will then fail to complain about a WAL file that's genuinely
truncated for some reason. I guess there's no way around that, even if
you have a script as restore_command that does the file size check, it
will have the same problem.

Are we trying to re-invent pg_standby here?

--
Simon Riggs www.2ndQuadrant.com

#8Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#7)
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

Simon Riggs wrote:

On Wed, 2010-02-10 at 09:32 +0200, Heikki Linnakangas wrote:

Hmm, so after running restore_command, check the file size and if it's
too short, treat it the same as if restore_command returned non-zero?
And it will be retried on the next iteration. Works for me, though OTOH
it will then fail to complain about a WAL file that's genuinely
truncated for some reason. I guess there's no way around that, even if
you have a script as restore_command that does the file size check, it
will have the same problem.

Are we trying to re-invent pg_standby here?

That's not the goal, but we seem to need some of the same functionality
in the backend now.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#9Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#8)
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

On Thu, 2010-02-11 at 14:22 +0200, Heikki Linnakangas wrote:

Simon Riggs wrote:

On Wed, 2010-02-10 at 09:32 +0200, Heikki Linnakangas wrote:

Hmm, so after running restore_command, check the file size and if it's
too short, treat it the same as if restore_command returned non-zero?
And it will be retried on the next iteration. Works for me, though OTOH
it will then fail to complain about a WAL file that's genuinely
truncated for some reason. I guess there's no way around that, even if
you have a script as restore_command that does the file size check, it
will have the same problem.

Are we trying to re-invent pg_standby here?

That's not the goal, but we seem to need some of the same functionality
in the backend now.

I think you need to say why...

--
Simon Riggs www.2ndQuadrant.com

#10Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#9)
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

Simon Riggs wrote:

On Thu, 2010-02-11 at 14:22 +0200, Heikki Linnakangas wrote:

Simon Riggs wrote:

On Wed, 2010-02-10 at 09:32 +0200, Heikki Linnakangas wrote:

Hmm, so after running restore_command, check the file size and if it's
too short, treat it the same as if restore_command returned non-zero?
And it will be retried on the next iteration. Works for me, though OTOH
it will then fail to complain about a WAL file that's genuinely
truncated for some reason. I guess there's no way around that, even if
you have a script as restore_command that does the file size check, it
will have the same problem.

Are we trying to re-invent pg_standby here?

That's not the goal, but we seem to need some of the same functionality
in the backend now.

I think you need to say why...

See the quoted paragraph above. We should check the file size, so that
we will not fail if the WAL file is just being copied into the archive
directory.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#11Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#10)
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

On Thu, 2010-02-11 at 14:44 +0200, Heikki Linnakangas wrote:

Simon Riggs wrote:

On Thu, 2010-02-11 at 14:22 +0200, Heikki Linnakangas wrote:

Simon Riggs wrote:

On Wed, 2010-02-10 at 09:32 +0200, Heikki Linnakangas wrote:

Hmm, so after running restore_command, check the file size and if it's
too short, treat it the same as if restore_command returned non-zero?
And it will be retried on the next iteration. Works for me, though OTOH
it will then fail to complain about a WAL file that's genuinely
truncated for some reason. I guess there's no way around that, even if
you have a script as restore_command that does the file size check, it
will have the same problem.

Are we trying to re-invent pg_standby here?

That's not the goal, but we seem to need some of the same functionality
in the backend now.

I think you need to say why...

See the quoted paragraph above. We should check the file size, so that
we will not fail if the WAL file is just being copied into the archive
directory.

We can read, but that's not an explanation. By giving terse answers in
that way you are giving the impression that you don't want discussion on
these points.

If you were running pg_standby as the restore_command then this error
wouldn't happen. So you need to explain why running pg_standby cannot
solve your problem and why we must fix it by replicating code that has
previously existed elsewhere.

--
Simon Riggs www.2ndQuadrant.com

#12Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#11)
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

Simon Riggs wrote:

If you were running pg_standby as the restore_command then this error
wouldn't happen. So you need to explain why running pg_standby cannot
solve your problem and why we must fix it by replicating code that has
previously existed elsewhere.

pg_standby cannot be used with streaming replication.

I guess your next question is: why not?

The startup process alternates between streaming, and restoring files
from archive using restore_command. It will progress using streaming as
long as it can, but if the connection is lost, it will try to poll the
archive until the connection is established again. The startup process
expects the restore_command to try to restore the file and fail if it's
not found. If the restore_command goes into sleep, waiting for the file
to arrive, that will defeat the retry logic in the server because the
startup process won't get control again to retry establishing the
connection.

That's the essence of my proposal here:
http://archives.postgresql.org/message-id/4B50AFB4.4060902@enterprisedb.com
which is what has now been implemented.

To support a restore_command that does the sleeping itself, like
pg_standby, would require a major rearchitecting of the retry logic. And
I don't see why that'd be desirable anyway. It's easier for the admin to
set up using simple commands like 'cp' or 'scp', than require him/her to
write scripts that handle the sleeping and retry logic.

The real problem we have right now is missing documentation. It's
starting to hurt us more and more every day, as more people start to
test this. As shown by this thread and some other recent posts.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#13Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Simon Riggs (#11)
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

Simon Riggs <simon@2ndQuadrant.com> writes:

If you were running pg_standby as the restore_command then this error
wouldn't happen. So you need to explain why running pg_standby cannot
solve your problem and why we must fix it by replicating code that has
previously existed elsewhere.

Let me try.

pg_standby will not let the server get back to streaming replication
mode once it's done with driving the replay of all the WAL files
available in the archive, but will have the server sit there waiting
for the next file.

The way it is implemented now is to have the server switch
back and forth between replaying from the archive and streaming from the
master. So we want the server to restore from the archive the same way
pg_standby used to, except that if the archive does not contain the next
WAL file, we want to get back to streaming.

And the archive reading will resume at the next network glitch.

I think that's the reasoning; I hope it explains what you see happening.
--
dim

#14Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#12)
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

On Thu, 2010-02-11 at 15:28 +0200, Heikki Linnakangas wrote:

Simon Riggs wrote:

If you were running pg_standby as the restore_command then this error
wouldn't happen. So you need to explain why running pg_standby cannot
solve your problem and why we must fix it by replicating code that has
previously existed elsewhere.

pg_standby cannot be used with streaming replication.

I guess your next question is: why not?

The startup process alternates between streaming, and restoring files
from archive using restore_command. It will progress using streaming as
long as it can, but if the connection is lost, it will try to poll the
archive until the connection is established again. The startup process
expects the restore_command to try to restore the file and fail if it's
not found. If the restore_command goes into sleep, waiting for the file
to arrive, that will defeat the retry logic in the server because the
startup process won't get control again to retry establishing the
connection.

Why does the startup process need to regain control? Why not just let it
sit and wait? Have you seen that if someone does use pg_standby or
similar scripts in the restore_command, the server will never regain
control in the way you hope? Would that cause a sporadic hang?

The overall design was previously that the solution implementor was in
charge of the archive and only they knew its characteristics.

It seems strange that we will be forced to explicitly ban people from
using a utility that they are used to and that is still included
with the distro. Then we implement in the server the very things the
utility did. Only this time the solution implementor will not be in
control.

I would not be against implementing all aspects of pg_standby into the
server. It would make life easier in some ways. I am against
implementing only a *few* of the aspects because that leaves solution
architects in a difficult position to know what to do.

Please lay out some options here for discussion by the community. This
seems like a difficult area and not one to be patched up quickly.

That's the essence of my proposal here:
http://archives.postgresql.org/message-id/4B50AFB4.4060902@enterprisedb.com
which is what has now been implemented.

To support a restore_command that does the sleeping itself, like
pg_standby, would require a major rearchitecting of the retry logic. And
I don't see why that'd be desirable anyway. It's easier for the admin to
set up using simple commands like 'cp' or 'scp', than require him/her to
write scripts that handle the sleeping and retry logic.

The real problem we have right now is missing documentation. It's
starting to hurt us more and more every day, as more people start to
test this. As shown by this thread and some other recent posts.

--
Simon Riggs www.2ndQuadrant.com

#15Simon Riggs
simon@2ndQuadrant.com
In reply to: Dimitri Fontaine (#13)
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

On Thu, 2010-02-11 at 14:41 +0100, Dimitri Fontaine wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

If you were running pg_standby as the restore_command then this error
wouldn't happen. So you need to explain why running pg_standby cannot
solve your problem and why we must fix it by replicating code that has
previously existed elsewhere.

Let me try.

pg_standby will not let the server get back to streaming replication
mode once it's done with driving the replay of all the WAL files
available in the archive, but will have the server sit there waiting
for the next file.

The way it is implemented now is to have the server switch
back and forth between replaying from the archive and streaming from the
master. So we want the server to restore from the archive the same way
pg_standby used to, except that if the archive does not contain the next
WAL file, we want to get back to streaming.

And the archive reading will resume at the next network glitch.

I think that's the reasoning; I hope it explains what you see happening.

OK, thanks.

One question then: how do we ensure that the archive does not grow too
big? pg_standby cleans down the archive using %R. That function appears
to not exist anymore.

--
Simon Riggs www.2ndQuadrant.com

#16Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#15)
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

Simon Riggs wrote:

One question then: how do we ensure that the archive does not grow too
big? pg_standby cleans down the archive using %R. That function appears
to not exist anymore.

You can still use %R. Of course, plain 'cp' won't know what to do with
it, so a script will then be required. We should probably provide a
sample of that in the docs, or even a ready-made tool similar to
pg_standby but without the waiting.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#17Aidan Van Dyk
aidan@highrise.ca
In reply to: Heikki Linnakangas (#12)
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

* Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> [100211 08:29]:

To support a restore_command that does the sleeping itself, like
pg_standby, would require a major rearchitecting of the retry logic. And
I don't see why that'd be desirable anyway. It's easier for the admin to
set up using simple commands like 'cp' or 'scp', than require him/her to
write scripts that handle the sleeping and retry logic.

But colour me confused, I'm still not understanding why this is any
different than with normal PITR recovery.

So even with a plain "cp" in your recovery command instead of a
sleep+copy (a la pg_standby, or PITR tools, or all the home-grown
solutions out there), I'm not seeing how it's going to get "half files".
The only way I can see that is if you're out of disk space in your
recovering pg_xlog.

It's well known in PostgreSQL WAL archiving - you don't just "shove" files
into the archive, you make sure they appear there with the right name
atomically. And if the master is only running the archive command on
whole WAL files, I just don't understand this whole short WAL problem.

And don't try and tell me you're just "poaching" files from a running
cluster's pg_xlog directory, because I'm going to cry...

a.

--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.

#18Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#16)
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

On Thu, 2010-02-11 at 15:55 +0200, Heikki Linnakangas wrote:

Simon Riggs wrote:

One question then: how do we ensure that the archive does not grow too
big? pg_standby cleans down the archive using %R. That function appears
to not exist anymore.

You can still use %R. Of course, plain 'cp' won't know what to do with
it, so a script will then be required. We should probably provide a
sample of that in the docs, or even a ready-made tool similar to
pg_standby but without the waiting.

So we still need a script but it can't be pg_standby? Hmmm, OK...

Might it not be simpler to add a parameter onto pg_standby?
We send %s to tell pg_standby the standby_mode of the server which is
calling it so it can decide how to act in each case.

--
Simon Riggs www.2ndQuadrant.com

#19Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Aidan Van Dyk (#17)
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

Aidan Van Dyk wrote:

But colour me confused, I'm still not understanding why this is any
different than with normal PITR recovery.

So even with a plain "cp" in your recovery command instead of a
sleep+copy (a la pg_standby, or PITR tools, or all the home-grown
solutions out there), I'm not seeing how it's going to get "half files".

If the file is just being copied to the archive when restore_command
('cp', say) is launched, it will copy a half file. That's not a problem
for PITR, because PITR will end at the end of valid WAL anyway, but
returning a half WAL file in standby mode is a problem.

It's well known in PostgreSQL WAL archiving - you don't just "shove" files
into the archive, you make sure they appear there with the right name
atomically. And if the master is only running the archive command on
whole WAL files, I just don't understand this whole short WAL problem.

Yeah, if you're careful about that, then this change isn't required. But
pg_standby protects against that, so I think it'd be reasonable to have
the same level of protection built-in. It's not a lot of code.

We could well just document that you should do that, ie. make sure the
file appears in the archive atomically with the right size.
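
For illustration, one common way to get that atomic appearance is to copy under a temporary name and rename once the copy is complete. The Python sketch below assumes the archive is a local or mounted directory on a single filesystem, where rename is atomic on POSIX systems; the function name and the ".tmp" suffix are made up for the example.

```python
import os
import shutil


def archive_atomically(source_path: str, archive_dir: str) -> str:
    """Copy a finished WAL file so it appears in the archive all at once."""
    final_path = os.path.join(archive_dir, os.path.basename(source_path))
    if os.path.exists(final_path):
        # Never overwrite a file that is already archived.
        raise FileExistsError(final_path)
    temp_path = final_path + ".tmp"  # hypothetical temporary name
    with open(source_path, "rb") as src, open(temp_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
        dst.flush()
        os.fsync(dst.fileno())  # make sure the data is on disk first
    # Readers looking for the final name see either nothing or the whole file.
    os.rename(temp_path, final_path)
    return final_path
```

With this in place a plain 'cp' as restore_command never sees a half-copied file under the final name, at the price of a slightly more involved archive_command.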

And don't try and tell me you're just "poaching" files from a running
cluster's pg_xlog directory, because I'm going to cry...

No :-).

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#20Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#18)
Re: Re: [COMMITTERS] pgsql: Make standby server continuously retry restoring the next WAL

Simon Riggs wrote:

Might it not be simpler to add a parameter onto pg_standby?
We send %s to tell pg_standby the standby_mode of the server which is
calling it so it can decide how to act in each case.

That would work too, but it doesn't seem any simpler to me. On the contrary.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
