Remaining Streaming Replication Open Items

Started by Heikki Linnakangasalmost 16 years ago30 messages
#1Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

I triaged the list of open items on the Streaming Replication wiki page.
I propose that we drop the ones I've marked as Drop below, and move the
remaining items to the main Open Items page for better visibility. And
of course try to resolve them as quickly as possible.

* Walsender and dblink are not interruptible on win32. - related thread

I'd actually be happy to just leave it for 9.0, but it seems like
consensus has been reached on how to fix it, and Fujii is working on a
patch, so let's follow that through.

* Add the GUC parameter to specify the maximum number of log file segments held in pg_xlog directory to send to the standby server. Which is useful to avoid disk full in the primary.

Not only to avoid disk full in primary but also to make it feasible to
use streaming replication without archiving. It's a small change, we
should do it.

* pg_xlogfile_name(pg_last_xlog_receive/replay_location()) might report the wrong name. Because a backend cannot know the actual timeline which is related to the location.

Drop. It's not clear which timeline those functions should return in
boundary cases, when replaying records from a log file where the
timeline-switch occurs.

* The documentation needs to be improved.

I've done as much as I can on my own, what we need now is feedback on
what needs to be improved. So I'd like to drop this, but let's add new
more specific items about what needs to be improved, as people speak up.

* Redefine smart shutdown in standby mode?

Drop. Too big a change at this point.

* Quotes can't be escaped in recovery.conf

Under discussion. Not specific to streaming replication, and it's a
pre-existing issue, but should be fixed IMHO.

* Change the "standby mode" name.

Bikeshedding without consensus. I like the "standby mode" the best as
discussed on that thread, better than any of the proposed alternatives.
Drop this item.

* Fix things so that any such variables inherited from the server environment are intentionally *NOT* used for making SR connections.

Drop. Besides, we have the same problem with dblink, and I don't recall
anyone complaining.

* If standby_mode is enabled, and neither primary_conninfo nor restore_command are set, the standby would get stuck.

It's not really stuck, it will replay any WAL files you drop into
pg_xlog. I concur with Robert Haas though that it shouldn't print the
message to the log every few seconds. It should print a message the
first time it hits the end of WAL, but subsequent messages should be
suppressed until some progress has been made.

* Remove the unnecessary section about HS from recovery.conf.sample

Yeah, let's do it.

* The replication connections consume superuser_reserved_connections slots.

I'd still like to change this slightly, per my suggestion on that
thread, but I don't feel strongly about it. It doesn't seem like a very
big change to me, but Tom felt otherwise.

* Add missing description about WAL-logging.

Small documentation change. Needs to be done I guess.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#2Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#1)
Re: Remaining Streaming Replication Open Items

On Tue, Apr 6, 2010 at 4:09 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

I triaged the list of open items on the Streaming Replication wiki page.
I propose that we drop the ones I've marked as Drop below, and move the
remaining items to the main Open Items page for better visibility. And
of course try to resolve them as quickly as possible.

Thanks so much!!

    *  Walsender and dblink are not interruptible on win32. - related thread

I'd actually be happy to just leave it for 9.0, but it seems like
consensus has been reached on how to fix it, and Fujii is working on a
patch, so let's follow that through.

Yeah, I'm reworking the patch, but I'd like to take aim at only walreceiver
because the change for dblink might become too big at this point. Since no
one has complained about the long-term problem of dblink, I'm no sure it
really should be fixed right now.

    * Add the GUC parameter to specify the maximum number of log file segments held in pg_xlog directory to send to the standby server. Which is useful to avoid disk full in the primary.

Not only to avoid disk full in primary but also to make it feasible to
use streaming replication without archiving. It's a small change, we
should do it.

Yep.

    * pg_xlogfile_name(pg_last_xlog_receive/replay_location()) might report the wrong name. Because a backend cannot know the actual timeline which is related to the location.

Drop. It's not clear which timeline those functions should return in
boundary cases, when replaying records from a log file where the
timeline-switch occurs.

OK, but we need to add the note about that confusing behavior.
How about?:

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 57163da..da3253f 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -13206,6 +13206,8 @@ postgres=# SELECT * FROM
pg_xlogfile_name_offset(pg_stop_backup());
     This is usually the desired behavior for managing transaction log archiving
     behavior, since the preceding file is the last one that currently
     needs to be archived.
+    Note that <function>pg_xlogfile_name</> and
<function>pg_xlogfile_name_offset</>
+    always return an inaccurate result during recovery.
    </para>

<para>
@@ -13279,6 +13281,11 @@ postgres=# SELECT * FROM
pg_xlogfile_name_offset(pg_stop_backup());
</table>

    <para>
+    Note that <function>pg_xlogfile_name</> and
<function>pg_xlogfile_name_offset</>
+    always return an inaccurate result from any of the above locations.
+   </para>
+
+   <para>
     The functions shown in <xref linkend="functions-admin-dbsize"> calculate
     the disk space usage of database objects.
    </para>

    * The documentation needs to be improved.

I've done as much as I can on my own, what we need now is feedback on
what needs to be improved. So I'd like to drop this, but let's add new
more specific items about what needs to be improved, as people speak up.

Yep.

    * Redefine smart shutdown in standby mode?

Drop. Too big a change at this point.

I don't think that it's too big, but OK. And, ISTM we need to add the note
about the longstanding confusing behavior if it's dropped. How about?:

diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index 594bd7d..f8899e4 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -1339,6 +1339,7 @@ echo -17 > /proc/self/oom_adj
        active, new connections will still be allowed, but only to superusers
        (this exception allows a superuser to connect to terminate
        online backup mode).
+       If the server is in recovery, it additionally waits for recovery to end.
       </para>
      </listitem>
     </varlistentry>

    * Quotes can't be escaped in recovery.conf

Under discussion. Not specific to streaming replication, and it's a
pre-existing issue, but should be fixed IMHO.

Yep.

    * Change the "standby mode" name.

Bikeshedding without consensus. I like the "standby mode" the best as
discussed on that thread, better than any of the proposed alternatives.
Drop this item.

Yep.

    * Fix things so that any such variables inherited from the server environment are intentionally *NOT* used for making SR connections.

Drop. Besides, we have the same problem with dblink, and I don't recall
anyone complaining.

Yep, but I don't think that dblink has the same issue because it's often
used to connect to another database on the same postgres instance, which
seems proper method. The problem is that walreceiver might wrongly connect
to *its* server and get stuck because no WAL records arrive for ever.
Since currently we don't allow the standby to accept the replication
connection, the problem will not happen in 9.0, and ISTM we don't need
to address it right now. So I agree to drop.

    * If standby_mode is enabled, and neither primary_conninfo nor restore_command are set, the standby would get stuck.

It's not really stuck, it will replay any WAL files you drop into
pg_xlog. I concur with Robert Haas though that it shouldn't print the
message to the log every few seconds. It should print a message the
first time it hits the end of WAL, but subsequent messages should be
suppressed until some progress has been made.

Yep, but it seems difficult to implement that. So I'd drop the suppression.

    * Remove the unnecessary section about HS from recovery.conf.sample

Yeah, let's do it.

Yep.

    * The replication connections consume superuser_reserved_connections slots.

I'd still like to change this slightly, per my suggestion on that
thread, but I don't feel strongly about it. It doesn't seem like a very
big change to me, but Tom felt otherwise.

I feel the same.

    * Add missing description about WAL-logging.

Small documentation change. Needs to be done I guess.

Yep.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#3Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#1)
Re: Remaining Streaming Replication Open Items

I wrote my previous email before reading this.

On Tue, Apr 6, 2010 at 3:09 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

I triaged the list of open items on the Streaming Replication wiki page.
I propose that we drop the ones I've marked as Drop below, and move the
remaining items to the main Open Items page for better visibility. And
of course try to resolve them as quickly as possible.

    *  Walsender and dblink are not interruptible on win32. - related thread

I'd actually be happy to just leave it for 9.0, but it seems like
consensus has been reached on how to fix it, and Fujii is working on a
patch, so let's follow that through.

Agree.

    * Add the GUC parameter to specify the maximum number of log file segments held in pg_xlog directory to send to the standby server. Which is useful to avoid disk full in the primary.

Not only to avoid disk full in primary but also to make it feasible to
use streaming replication without archiving. It's a small change, we
should do it.

Do we have a working patch?

    * pg_xlogfile_name(pg_last_xlog_receive/replay_location()) might report the wrong name. Because a backend cannot know the actual timeline which is related to the location.

Drop. It's not clear which timeline those functions should return in
boundary cases, when replaying records from a log file where the
timeline-switch occurs.

Agree.

    * The documentation needs to be improved.

I've done as much as I can on my own, what we need now is feedback on
what needs to be improved. So I'd like to drop this, but let's add new
more specific items about what needs to be improved, as people speak up.

Agree. It's hard to think of this as a beta-blocker without more
specific feedback.

    * Redefine smart shutdown in standby mode?

Drop. Too big a change at this point.

We have a working patch for this - I want to commit it. I don't think
it's a big change, and the current behavior is extremely pathological.

    * Quotes can't be escaped in recovery.conf

Under discussion. Not specific to streaming replication, and it's a
pre-existing issue, but should be fixed IMHO.

Fine with me.

    * Change the "standby mode" name.

Bikeshedding without consensus. I like the "standby mode" the best as
discussed on that thread, better than any of the proposed alternatives.
Drop this item.

OK.

    * Fix things so that any such variables inherited from the server environment are intentionally *NOT* used for making SR connections.

Drop. Besides, we have the same problem with dblink, and I don't recall
anyone complaining.

Agree. I think that whole issue is bikeshedding.

    * If standby_mode is enabled, and neither primary_conninfo nor restore_command are set, the standby would get stuck.

It's not really stuck, it will replay any WAL files you drop into
pg_xlog. I concur with Robert Haas though that it shouldn't print the
message to the log every few seconds. It should print a message the
first time it hits the end of WAL, but subsequent messages should be
suppressed until some progress has been made.

Any idea how to implement this?

    * Remove the unnecessary section about HS from recovery.conf.sample

Yeah, let's do it.

Don't care.

    * The replication connections consume superuser_reserved_connections slots.

I'd still like to change this slightly, per my suggestion on that
thread, but I don't feel strongly about it. It doesn't seem like a very
big change to me, but Tom felt otherwise.

Agree, we should fix it.

    * Add missing description about WAL-logging.

Small documentation change. Needs to be done I guess.

No strong feelings.

...Robert

#4Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#3)
Re: Remaining Streaming Replication Open Items

Robert Haas wrote:

On Tue, Apr 6, 2010 at 3:09 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

* Add the GUC parameter to specify the maximum number of log file segments held in pg_xlog directory to send to the standby server. Which is useful to avoid disk full in the primary.

Not only to avoid disk full in primary but also to make it feasible to
use streaming replication without archiving. It's a small change, we
should do it.

Do we have a working patch?

No.

* Redefine smart shutdown in standby mode?

Drop. Too big a change at this point.

We have a working patch for this - I want to commit it. I don't think
it's a big change, and the current behavior is extremely pathological.

Oh, ok. I didn't look at the latest patch, if it looks good to you, fine
with me.

* If standby_mode is enabled, and neither primary_conninfo nor restore_command are set, the standby would get stuck.

It's not really stuck, it will replay any WAL files you drop into
pg_xlog. I concur with Robert Haas though that it shouldn't print the
message to the log every few seconds. It should print a message the
first time it hits the end of WAL, but subsequent messages should be
suppressed until some progress has been made.

Any idea how to implement this?

I'll take a look. It shouldn't be too hard.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#5Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#4)
Re: Remaining Streaming Replication Open Items

On Tue, Apr 6, 2010 at 10:36 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Robert Haas wrote:

On Tue, Apr 6, 2010 at 3:09 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

    * Add the GUC parameter to specify the maximum number of log file segments held in pg_xlog directory to send to the standby server. Which is useful to avoid disk full in the primary.

Not only to avoid disk full in primary but also to make it feasible to
use streaming replication without archiving. It's a small change, we
should do it.

Do we have a working patch?

No.

:-(

    * Redefine smart shutdown in standby mode?

Drop. Too big a change at this point.

We have a working patch for this - I want to commit it.  I don't think
it's a big change, and the current behavior is extremely pathological.

Oh, ok. I didn't look at the latest patch, if it looks good to you, fine
with me.

I'll commit it tonight.

    * If standby_mode is enabled, and neither primary_conninfo nor restore_command are set, the standby would get stuck.

It's not really stuck, it will replay any WAL files you drop into
pg_xlog. I concur with Robert Haas though that it shouldn't print the
message to the log every few seconds. It should print a message the
first time it hits the end of WAL, but subsequent messages should be
suppressed until some progress has been made.

Any idea how to implement this?

I'll take a look. It shouldn't be too hard.

The tricky part, I believe, is that there's more than one message that
can potentially be emitted, and you don't want ANY of them to repeat
every 2 s, so some thought needs to be given to where to hook in the
logic.

...Robert

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#1)
Re: Remaining Streaming Replication Open Items

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

I triaged the list of open items on the Streaming Replication wiki page.
I propose that we drop the ones I've marked as Drop below, and move the
remaining items to the main Open Items page for better visibility.

By "drop" do you mean "move to TODO"? At least some of these issues
should be addressed in 9.1 or later. Perhaps some can really be
dropped, but it's not clear which.

regards, tom lane

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fujii Masao (#2)
Re: Remaining Streaming Replication Open Items

Fujii Masao <masao.fujii@gmail.com> writes:

On Tue, Apr 6, 2010 at 4:09 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

� � * Fix things so that any such variables inherited from the server environment are intentionally *NOT* used for making SR connections.

Drop. Besides, we have the same problem with dblink, and I don't recall
anyone complaining.

Yep, but I don't think that dblink has the same issue because it's often
used to connect to another database on the same postgres instance, which
seems proper method.

Yes, dblink is a poor precedent to cite because self-connections are a sane
behavior in its case.

The problem is that walreceiver might wrongly connect
to *its* server and get stuck because no WAL records arrive for ever.
Since currently we don't allow the standby to accept the replication
connection, the problem will not happen in 9.0, and ISTM we don't need
to address it right now. So I agree to drop.

Agreed, this can be put off until we support relay replication. I think
it will be an issue then, however.

regards, tom lane

#8Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#6)
Re: Remaining Streaming Replication Open Items

Tom Lane wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

I triaged the list of open items on the Streaming Replication wiki page.
I propose that we drop the ones I've marked as Drop below, and move the
remaining items to the main Open Items page for better visibility.

By "drop" do you mean "move to TODO"? At least some of these issues
should be addressed in 9.1 or later. Perhaps some can really be
dropped, but it's not clear which.

Umm, yes, honestly speaking I hadn't even thought about that.

I've added the ones that should be addressed in the future to the TODO
list. I added a new subsection for standby server and streaming
replication related items:
http://wiki.postgresql.org/wiki/Todo#Standby_server_mode

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#9Erik Rijkers
er@xs4all.nl
In reply to: Heikki Linnakangas (#8)
Re: Remaining Streaming Replication Open Items

On Tue, April 6, 2010 19:29, Heikki Linnakangas wrote:
[...]

I've added the ones that should be addressed in the future to the TODO
list. I added a new subsection for standby server and streaming
replication related items:
http://wiki.postgresql.org/wiki/Todo#Standby_server_mode

I reported "Assertion failure twophase.c" a few times; see:

http://search.postgresql.org/search?m=1&amp;q=Assertion+failure+twophase.c&amp;l=&amp;d=&amp;s=

Btw, it has now also happened once without the postbio package installed - (which was unlikely to
be the cause anyway, I think).

I don't see it mentioned in the TODO, but maybe it's just deemed too elusive to be assigned a todo
entry.

Was the issue eventually found/solved?

thanks,

Erik Rijkers

#10Simon Riggs
simon@2ndQuadrant.com
In reply to: Erik Rijkers (#9)
Re: Remaining Streaming Replication Open Items

On Tue, 2010-04-06 at 20:27 +0200, Erik Rijkers wrote:

Was the issue eventually found/solved?

We think so, but the event was not conclusively traceable.

--
Simon Riggs www.2ndQuadrant.com

#11Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#5)
Re: Remaining Streaming Replication Open Items

On Tue, 2010-04-06 at 11:06 -0400, Robert Haas wrote:

* Redefine smart shutdown in standby mode?

Drop. Too big a change at this point.

We have a working patch for this - I want to commit it. I don't think
it's a big change, and the current behavior is extremely pathological.

Oh, ok. I didn't look at the latest patch, if it looks good to you, fine
with me.

I'll commit it tonight.

I don't see this on hackers. Have you posted it? I'd like to see what
you do before it gets committed. Thanks.

--
Simon Riggs www.2ndQuadrant.com

#12Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#1)
Re: Remaining Streaming Replication Open Items

On Tue, 2010-04-06 at 10:09 +0300, Heikki Linnakangas wrote:

* Walsender and dblink are not interruptible on win32. - related thread

I'd actually be happy to just leave it for 9.0, but it seems like
consensus has been reached on how to fix it, and Fujii is working on a
patch, so let's follow that through.

That one is a must, for me.

I would put relaying easily above any of the other stuff. That is a
truly useful feature that we are very close to being able to have in
this release. Adding things like quotes is not moving us forwards in any
important sense.

--
Simon Riggs www.2ndQuadrant.com

#13Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#11)
1 attachment(s)
Re: Remaining Streaming Replication Open Items

On Tue, Apr 6, 2010 at 3:14 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Tue, 2010-04-06 at 11:06 -0400, Robert Haas wrote:

    * Redefine smart shutdown in standby mode?

Drop. Too big a change at this point.

We have a working patch for this - I want to commit it.  I don't think
it's a big change, and the current behavior is extremely pathological.

Oh, ok. I didn't look at the latest patch, if it looks good to you, fine
with me.

I'll commit it tonight.

I don't see this on hackers. Have you posted it? I'd like to see what
you do before it gets committed. Thanks.

It's the same patch Fujii Masao posted previously, for which I
previously said I would fix up the comments and docs and commit. But
here is the adjusted version, which is hopefully more clear about what
we're doing at why we're doing it.

...Robert

Attachments:

smart-shutdown.patchapplication/octet-stream; name=smart-shutdown.patchDownload
diff --git a/doc/src/sgml/ref/pg_ctl-ref.sgml b/doc/src/sgml/ref/pg_ctl-ref.sgml
index eda022c..aa55714 100644
--- a/doc/src/sgml/ref/pg_ctl-ref.sgml
+++ b/doc/src/sgml/ref/pg_ctl-ref.sgml
@@ -152,6 +152,8 @@ PostgreSQL documentation
    shutdown methods can be selected with the <option>-m</option>
    option: <quote>Smart</quote> mode waits for online backup mode
    to finish and all the clients to disconnect.  This is the default.
+   If the server is in recovery, recovery and streaming replication
+   will be terminated once all clients have disconnected.
    <quote>Fast</quote> mode does not wait for clients to disconnect and
    will terminate an online backup in progress.  All active transactions are
    rolled back and clients are forcibly disconnected, then the
diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index 594bd7d..8db7751 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -1338,7 +1338,9 @@ echo -17 > /proc/self/oom_adj
        until online backup mode is no longer active.  While backup mode is
        active, new connections will still be allowed, but only to superusers
        (this exception allows a superuser to connect to terminate
-       online backup mode).
+       online backup mode).  If the server is in recovery when a smart
+       shutdown is requested, recovery and streaming replication will be
+       stopped only after all regular sessions have terminated.
       </para>
      </listitem>
     </varlistentry>
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 98ab484..47f71bd 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -278,6 +278,7 @@ typedef enum
 	PM_RECOVERY_CONSISTENT,		/* consistent recovery mode */
 	PM_RUN,						/* normal "database is alive" state */
 	PM_WAIT_BACKUP,				/* waiting for online backup mode to end */
+	PM_WAIT_READONLY,			/* waiting for read only backends to exit */
 	PM_WAIT_BACKENDS,			/* waiting for live backends to exit */
 	PM_SHUTDOWN,				/* waiting for bgwriter to do shutdown ckpt */
 	PM_SHUTDOWN_2,				/* waiting for archiver and walsenders to
@@ -2173,7 +2174,17 @@ pmdie(SIGNAL_ARGS)
 				/* and the walwriter too */
 				if (WalWriterPID != 0)
 					signal_child(WalWriterPID, SIGTERM);
-				pmState = PM_WAIT_BACKUP;
+				/*
+				 * If we're in recovery, we can't kill the startup process
+				 * right away, because at present doing so does not release
+				 * its locks.  We might want to change this in a future
+				 * release.  For the time being, the PM_WAIT_READONLY state
+				 * indicates that we're waiting for the regular (read only)
+				 * backends to die off; once they do, we'll kill the startup
+				 * and walreceiver processes.
+				 */
+				pmState = (pmState == PM_RUN) ?
+					PM_WAIT_BACKUP : PM_WAIT_READONLY;
 			}
 
 			/*
@@ -2209,6 +2220,7 @@ pmdie(SIGNAL_ARGS)
 			}
 			if (pmState == PM_RUN ||
 				pmState == PM_WAIT_BACKUP ||
+				pmState == PM_WAIT_READONLY ||
 				pmState == PM_WAIT_BACKENDS ||
 				pmState == PM_RECOVERY_CONSISTENT)
 			{
@@ -2771,6 +2783,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 		pmState == PM_RECOVERY_CONSISTENT ||
 		pmState == PM_RUN ||
 		pmState == PM_WAIT_BACKUP ||
+		pmState == PM_WAIT_READONLY ||
 		pmState == PM_SHUTDOWN)
 		pmState = PM_WAIT_BACKENDS;
 }
@@ -2846,6 +2859,26 @@ PostmasterStateMachine(void)
 			pmState = PM_WAIT_BACKENDS;
 	}
 
+	if (pmState == PM_WAIT_READONLY)
+	{
+		/*
+		 * PM_WAIT_READONLY state ends when we have no regular backends that
+		 * have been started during recovery.  We kill the startup and
+		 * walreceiver processes and transition to PM_WAIT_BACKENDS.  Ideally,
+		 * we might like to kill these processes first and then wait for
+		 * backends to die off, but that doesn't work at present because
+		 * killing the startup process doesn't release its locks.
+		 */
+		if (CountChildren(BACKEND_TYPE_NORMAL) == 0)
+		{
+			if (StartupPID != 0)
+				signal_child(StartupPID, SIGTERM);
+			if (WalReceiverPID != 0)
+				signal_child(WalReceiverPID, SIGTERM);
+			pmState = PM_WAIT_BACKENDS;
+		}
+	}
+
 	/*
 	 * If we are in a state-machine state that implies waiting for backends to
 	 * exit, see if they're all gone, and change state if so.
#14Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#13)
Re: Remaining Streaming Replication Open Items

On Wed, 2010-04-07 at 07:40 -0400, Robert Haas wrote:

On Tue, Apr 6, 2010 at 3:14 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Tue, 2010-04-06 at 11:06 -0400, Robert Haas wrote:

* Redefine smart shutdown in standby mode?

Drop. Too big a change at this point.

We have a working patch for this - I want to commit it. I don't think
it's a big change, and the current behavior is extremely pathological.

Oh, ok. I didn't look at the latest patch, if it looks good to you, fine
with me.

I'll commit it tonight.

I don't see this on hackers. Have you posted it? I'd like to see what
you do before it gets committed. Thanks.

It's the same patch Fujii Masao posted previously, for which I
previously said I would fix up the comments and docs and commit. But
here is the adjusted version, which is hopefully more clear about what
we're doing at why we're doing it.

OK, that looks a lot less risky than I had understood from discussions.
The main thing for me is it doesn't interfere with Startup or
WalReceiver, so assuming it works I've got no objections. Thanks for
chasing this down, good addition.

--
Simon Riggs www.2ndQuadrant.com

#15Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#14)
Re: Remaining Streaming Replication Open Items

On Wed, Apr 7, 2010 at 8:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

OK, that looks a lot less risky than I had understood from discussions.
The main thing for me is it doesn't interfere with Startup or
WalReceiver, so assuming it works I've got no objections. Thanks for
chasing this down, good addition.

Thanks. Committed.

...Robert

#16Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#12)
Re: Remaining Streaming Replication Open Items

On Tue, Apr 6, 2010 at 4:09 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Tue, 2010-04-06 at 10:09 +0300, Heikki Linnakangas wrote:

    *  Walsender and dblink are not interruptible on win32. - related thread

I'd actually be happy to just leave it for 9.0, but it seems like
consensus has been reached on how to fix it, and Fujii is working on a
patch, so let's follow that through.

That one is a must, for me.

I would put relaying easily above any of the other stuff. That is a
truly useful feature that we are very close to being able to have in
this release. Adding things like quotes is not moving us forwards in any
important sense.

+1. I think this is easily the most important remaining issue that we
need to fix, with the possible exception of the shutdown checkpoint
issue.

...Robert

#17Fujii Masao
masao.fujii@gmail.com
In reply to: Robert Haas (#15)
Re: Remaining Streaming Replication Open Items

On Thu, Apr 8, 2010 at 10:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Apr 7, 2010 at 8:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

OK, that looks a lot less risky than I had understood from discussions.
The main thing for me is it doesn't interfere with Startup or
WalReceiver, so assuming it works I've got no objections. Thanks for
chasing this down, good addition.

Thanks.  Committed.

Thanks. The following TODO item should be removed?

"Redefine smart shutdown in standby mode to exist as soon as all
read-only connections are gone."
http://wiki.postgresql.org/wiki/Todo#Standby_server_mode

Or change it to something like?

"Change smart shutdown in standby mode so that it kills the startup
and walreceiver process before waiting for the regular backends to die off"

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#18Robert Haas
robertmhaas@gmail.com
In reply to: Fujii Masao (#17)
Re: Remaining Streaming Replication Open Items

On Thu, Apr 8, 2010 at 2:54 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Apr 8, 2010 at 10:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Apr 7, 2010 at 8:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

OK, that looks a lot less risky than I had understood from discussions.
The main thing for me is it doesn't interfere with Startup or
WalReceiver, so assuming it works I've got no objections. Thanks for
chasing this down, good addition.

Thanks.  Committed.

Thanks. The following TODO item should be removed?

"Redefine smart shutdown in standby mode to exist as soon as all
read-only connections are gone."
http://wiki.postgresql.org/wiki/Todo#Standby_server_mode

Or change it to something like?

"Change smart shutdown in standby mode so that it kills the startup
 and walreceiver process before waiting for the regular backends to die off"

Yeah, we should do one of those two things, but I don't much care which.

...Robert

#19Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#18)
Re: Remaining Streaming Replication Open Items

On Thu, 2010-04-08 at 06:58 -0400, Robert Haas wrote:

Thanks. Committed.

Thanks. The following TODO item should be removed?

"Redefine smart shutdown in standby mode to exist as soon as all
read-only connections are gone."
http://wiki.postgresql.org/wiki/Todo#Standby_server_mode

Or change it to something like?

"Change smart shutdown in standby mode so that it kills the startup
and walreceiver process before waiting for the regular backends to die off"

Yeah, we should do one of those two things, but I don't much care which.

I do. I see no reason to do the latter, ever, so should not be added to
any TODO.

--
Simon Riggs www.2ndQuadrant.com

#20Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#19)
Re: Remaining Streaming Replication Open Items

On Thu, Apr 8, 2010 at 7:37 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Thu, 2010-04-08 at 06:58 -0400, Robert Haas wrote:

Thanks.  Committed.

Thanks. The following TODO item should be removed?

"Redefine smart shutdown in standby mode to exist as soon as all
read-only connections are gone."
http://wiki.postgresql.org/wiki/Todo#Standby_server_mode

Or change it to something like?

"Change smart shutdown in standby mode so that it kills the startup
 and walreceiver process before waiting for the regular backends to die off"

Yeah, we should do one of those two things, but I don't much care which.

I do. I see no reason to do the latter, ever, so should not be added to
any TODO.

Well, stopping recovery earlier would mean fewer locks, which would
mean a better chance for the read-only backends to finish their work
and exit quickly. But I'm not sure how much it's worth worrying
about.

...Robert

#21Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#20)
Re: Remaining Streaming Replication Open Items

On Thu, 2010-04-08 at 07:53 -0400, Robert Haas wrote:

I do. I see no reason to do the latter, ever, so should not be added to
any TODO.

Well, stopping recovery earlier would mean fewer locks, which would
mean a better chance for the read-only backends to finish their work
and exit quickly. But I'm not sure how much it's worth worrying
about.

The purpose of the lock is to prevent access to objects when they are in
inappropriate states for access. If we stopped startup and allowed
access, how do we know that things are in sufficiently good state to
allow access? We don't. If the Startup process is holding a lock then
that is the only safe thing to do. Otherwise we might allow access to a
table with a partially built index or other screw ups.

--
Simon Riggs www.2ndQuadrant.com

#22Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#21)
Re: Remaining Streaming Replication Open Items

On Thu, Apr 8, 2010 at 8:00 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Thu, 2010-04-08 at 07:53 -0400, Robert Haas wrote:

I do. I see no reason to do the latter, ever, so should not be added to
any TODO.

Well, stopping recovery earlier would mean fewer locks, which would
mean a better chance for the read-only backends to finish their work
and exit quickly.  But I'm not sure how much it's worth worrying
about.

The purpose of the lock is to prevent access to objects when they are in
inappropriate states for access. If we stopped startup and allowed
access, how do we know that things are in sufficiently good state to
allow access? We don't. If the Startup process is holding a lock then
that is the only safe thing to do. Otherwise we might allow access to a
table with a partially built index or other screw ups.

Hmm. Good point. I guess you could really only stop the startup
process safely when it wasn't holding any locks anyhow - you couldn't
just kill it and have it release the locks.

...Robert

#23Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#22)
Re: Remaining Streaming Replication Open Items

On Thu, 2010-04-08 at 09:40 -0400, Robert Haas wrote:

On Thu, Apr 8, 2010 at 8:00 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Thu, 2010-04-08 at 07:53 -0400, Robert Haas wrote:

I do. I see no reason to do the latter, ever, so should not be added to
any TODO.

Well, stopping recovery earlier would mean fewer locks, which would
mean a better chance for the read-only backends to finish their work
and exit quickly. But I'm not sure how much it's worth worrying
about.

The purpose of the lock is to prevent access to objects when they are in
inappropriate states for access. If we stopped startup and allowed
access, how do we know that things are in sufficiently good state to
allow access? We don't. If the Startup process is holding a lock then
that is the only safe thing to do. Otherwise we might allow access to a
table with a partially built index or other screw ups.

Hmm. Good point. I guess you could really only stop the startup
process safely when it wasn't holding any locks anyhow - you couldn't
just kill it and have it release the locks.

... and if it isn't holding any locks at all, there is no reason to kill
Startup first => no TODO item.

--
Simon Riggs www.2ndQuadrant.com

#24Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#23)
Re: Remaining Streaming Replication Open Items

On Thu, Apr 8, 2010 at 9:56 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Thu, 2010-04-08 at 09:40 -0400, Robert Haas wrote:

On Thu, Apr 8, 2010 at 8:00 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Thu, 2010-04-08 at 07:53 -0400, Robert Haas wrote:

I do. I see no reason to do the latter, ever, so should not be added to
any TODO.

Well, stopping recovery earlier would mean fewer locks, which would
mean a better chance for the read-only backends to finish their work
and exit quickly.  But I'm not sure how much it's worth worrying
about.

The purpose of the lock is to prevent access to objects when they are in
inappropriate states for access. If we stopped startup and allowed
access, how do we know that things are in sufficiently good state to
allow access? We don't. If the Startup process is holding a lock then
that is the only safe thing to do. Otherwise we might allow access to a
table with a partially built index or other screw ups.

Hmm.  Good point.  I guess you could really only stop the startup
process safely when it wasn't holding any locks anyhow - you couldn't
just kill it and have it release the locks.

... and if it isn't holding any locks at all, there is no reason to kill
Startup first => no TODO item.

I think you could shut it down at the first point at which it is
holding no locks, rather than letting it continue recovering and
potentially retake some new locks. That would be more consistent with
the general idea of what a smart shutdown is supposed to be about. I
think the real question is whether it's worth the code complexity. I
suspect most people use fast shutdown most of the time anyway in
real-world applications.

...Robert

#25Fujii Masao
masao.fujii@gmail.com
In reply to: Robert Haas (#24)
Re: Remaining Streaming Replication Open Items

On Thu, Apr 8, 2010 at 11:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I think you could shut it down at the first point at which it is
holding no locks, rather than letting it continue recovering and
potentially retake some new locks.  That would be more consistent with
the general idea of what a smart shutdown is supposed to be about.  I
think the real question is whether it's worth the code complexity.

I don't think it's worth. So I agree to just remove the TODO item:
"Redefine smart shutdown in standby mode to exist as soon as all
read-only connections are gone."
http://wiki.postgresql.org/wiki/Todo#Standby_server_mode

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#26Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#5)
1 attachment(s)
Re: Remaining Streaming Replication Open Items

Robert Haas wrote:

On Tue, Apr 6, 2010 at 10:36 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Robert Haas wrote:

* If standby_mode is enabled, and neither primary_conninfo nor restore_command are set, the standby would get stuck.

It's not really stuck, it will replay any WAL files you drop into
pg_xlog. I concur with Robert Haas though that it shouldn't print the
message to the log every few seconds. It should print a message the
first time it hits the end of WAL, but subsequent messages should be
suppressed until some progress has been made.

Any idea how to implement this?

I'll take a look. It shouldn't be too hard.

The tricky part, I believe, is that there's more than one message that
can potentially be emitted, and you don't want ANY of them to repeat
every 2 s, so some thought needs to be given to where to hook in the
logic.

We have the emode_for_corrupt_record() function that's used in all the
errors that indicate a corrupt WAL record, that's a perfect place to
hook this into. See attached patch.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

only-complain-on-first-time-a-record-is-corrupt-1.patchtext/x-diff; name=only-complain-on-first-time-a-record-is-corrupt-1.patchDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index be86501..7804853 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -539,7 +539,7 @@ static int XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode,
 				   int sources);
 static bool XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
 			 bool randAccess);
-static int emode_for_corrupt_record(int emode);
+static int emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
 static bool RestoreArchivedFile(char *path, const char *xlogfname,
 					const char *recovername, off_t expectedSize);
@@ -3543,7 +3543,7 @@ RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
 		memcpy(&bkpb, blk, sizeof(BkpBlock));
 		if (bkpb.hole_offset + bkpb.hole_length > BLCKSZ)
 		{
-			ereport(emode,
+			ereport(emode_for_corrupt_record(emode, recptr),
 					(errmsg("incorrect hole size in record at %X/%X",
 							recptr.xlogid, recptr.xrecoff)));
 			return false;
@@ -3556,7 +3556,7 @@ RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
 	/* Check that xl_tot_len agrees with our calculation */
 	if (blk != (char *) record + record->xl_tot_len)
 	{
-		ereport(emode,
+		ereport(emode_for_corrupt_record(emode, recptr),
 				(errmsg("incorrect total length in record at %X/%X",
 						recptr.xlogid, recptr.xrecoff)));
 		return false;
@@ -3569,7 +3569,7 @@ RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
 
 	if (!EQ_CRC32(record->xl_crc, crc))
 	{
-		ereport(emode,
+		ereport(emode_for_corrupt_record(emode, recptr),
 		(errmsg("incorrect resource manager data checksum in record at %X/%X",
 				recptr.xlogid, recptr.xrecoff)));
 		return false;
@@ -3674,7 +3674,7 @@ retry:
 	}
 	else if (targetRecOff < pageHeaderSize)
 	{
-		ereport(emode_for_corrupt_record(emode),
+		ereport(emode_for_corrupt_record(emode, *RecPtr),
 				(errmsg("invalid record offset at %X/%X",
 						RecPtr->xlogid, RecPtr->xrecoff)));
 		goto next_record_is_invalid;
@@ -3682,7 +3682,7 @@ retry:
 	if ((((XLogPageHeader) readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
 		targetRecOff == pageHeaderSize)
 	{
-		ereport(emode_for_corrupt_record(emode),
+		ereport(emode_for_corrupt_record(emode, *RecPtr),
 				(errmsg("contrecord is requested by %X/%X",
 						RecPtr->xlogid, RecPtr->xrecoff)));
 		goto next_record_is_invalid;
@@ -3697,7 +3697,7 @@ retry:
 	{
 		if (record->xl_len != 0)
 		{
-			ereport(emode_for_corrupt_record(emode),
+			ereport(emode_for_corrupt_record(emode, *RecPtr),
 					(errmsg("invalid xlog switch record at %X/%X",
 							RecPtr->xlogid, RecPtr->xrecoff)));
 			goto next_record_is_invalid;
@@ -3705,7 +3705,7 @@ retry:
 	}
 	else if (record->xl_len == 0)
 	{
-		ereport(emode_for_corrupt_record(emode),
+		ereport(emode_for_corrupt_record(emode, *RecPtr),
 				(errmsg("record with zero length at %X/%X",
 						RecPtr->xlogid, RecPtr->xrecoff)));
 		goto next_record_is_invalid;
@@ -3714,14 +3714,14 @@ retry:
 		record->xl_tot_len > SizeOfXLogRecord + record->xl_len +
 		XLR_MAX_BKP_BLOCKS * (sizeof(BkpBlock) + BLCKSZ))
 	{
-		ereport(emode_for_corrupt_record(emode),
+		ereport(emode_for_corrupt_record(emode, *RecPtr),
 				(errmsg("invalid record length at %X/%X",
 						RecPtr->xlogid, RecPtr->xrecoff)));
 		goto next_record_is_invalid;
 	}
 	if (record->xl_rmid > RM_MAX_ID)
 	{
-		ereport(emode_for_corrupt_record(emode),
+		ereport(emode_for_corrupt_record(emode, *RecPtr),
 				(errmsg("invalid resource manager ID %u at %X/%X",
 						record->xl_rmid, RecPtr->xlogid, RecPtr->xrecoff)));
 		goto next_record_is_invalid;
@@ -3734,7 +3734,7 @@ retry:
 		 */
 		if (!XLByteLT(record->xl_prev, *RecPtr))
 		{
-			ereport(emode_for_corrupt_record(emode),
+			ereport(emode_for_corrupt_record(emode, *RecPtr),
 					(errmsg("record with incorrect prev-link %X/%X at %X/%X",
 							record->xl_prev.xlogid, record->xl_prev.xrecoff,
 							RecPtr->xlogid, RecPtr->xrecoff)));
@@ -3750,7 +3750,7 @@ retry:
 		 */
 		if (!XLByteEQ(record->xl_prev, ReadRecPtr))
 		{
-			ereport(emode_for_corrupt_record(emode),
+			ereport(emode_for_corrupt_record(emode, *RecPtr),
 					(errmsg("record with incorrect prev-link %X/%X at %X/%X",
 							record->xl_prev.xlogid, record->xl_prev.xrecoff,
 							RecPtr->xlogid, RecPtr->xrecoff)));
@@ -3779,7 +3779,7 @@ retry:
 		{
 			readRecordBufSize = 0;
 			/* We treat this as a "bogus data" condition */
-			ereport(emode_for_corrupt_record(emode),
+			ereport(emode_for_corrupt_record(emode, *RecPtr),
 					(errmsg("record length %u at %X/%X too long",
 							total_len, RecPtr->xlogid, RecPtr->xrecoff)));
 			goto next_record_is_invalid;
@@ -3819,7 +3819,7 @@ retry:
 			/* Check that the continuation record looks valid */
 			if (!(((XLogPageHeader) readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD))
 			{
-				ereport(emode_for_corrupt_record(emode),
+				ereport(emode_for_corrupt_record(emode, *RecPtr),
 						(errmsg("there is no contrecord flag in log file %u, segment %u, offset %u",
 								readId, readSeg, readOff)));
 				goto next_record_is_invalid;
@@ -3829,7 +3829,7 @@ retry:
 			if (contrecord->xl_rem_len == 0 ||
 				total_len != (contrecord->xl_rem_len + gotlen))
 			{
-				ereport(emode_for_corrupt_record(emode),
+				ereport(emode_for_corrupt_record(emode, *RecPtr),
 						(errmsg("invalid contrecord length %u in log file %u, segment %u, offset %u",
 								contrecord->xl_rem_len,
 								readId, readSeg, readOff)));
@@ -3847,7 +3847,7 @@ retry:
 				   contrecord->xl_rem_len);
 			break;
 		}
-		if (!RecordIsValid(record, *RecPtr, emode_for_corrupt_record(emode)))
+		if (!RecordIsValid(record, *RecPtr, emode))
 			goto next_record_is_invalid;
 		pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
 		EndRecPtr.xlogid = readId;
@@ -3861,7 +3861,7 @@ retry:
 	}
 
 	/* Record does not cross a page boundary */
-	if (!RecordIsValid(record, *RecPtr, emode_for_corrupt_record(emode)))
+	if (!RecordIsValid(record, *RecPtr, emode))
 		goto next_record_is_invalid;
 	EndRecPtr.xlogid = RecPtr->xlogid;
 	EndRecPtr.xrecoff = RecPtr->xrecoff + MAXALIGN(total_len);
@@ -3914,16 +3914,19 @@ ValidXLOGHeader(XLogPageHeader hdr, int emode)
 {
 	XLogRecPtr	recaddr;
 
+	recaddr.xlogid = readId;
+	recaddr.xrecoff = readSeg * XLogSegSize + readOff;
+
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
-		ereport(emode,
+		ereport(emode_for_corrupt_record(emode, recaddr),
 				(errmsg("invalid magic number %04X in log file %u, segment %u, offset %u",
 						hdr->xlp_magic, readId, readSeg, readOff)));
 		return false;
 	}
 	if ((hdr->xlp_info & ~XLP_ALL_FLAGS) != 0)
 	{
-		ereport(emode,
+		ereport(emode_for_corrupt_record(emode, recaddr),
 				(errmsg("invalid info bits %04X in log file %u, segment %u, offset %u",
 						hdr->xlp_info, readId, readSeg, readOff)));
 		return false;
@@ -3945,7 +3948,7 @@ ValidXLOGHeader(XLogPageHeader hdr, int emode)
 					 longhdr->xlp_sysid);
 			snprintf(sysident_str, sizeof(sysident_str), UINT64_FORMAT,
 					 ControlFile->system_identifier);
-			ereport(emode,
+			ereport(emode_for_corrupt_record(emode, recaddr),
 					(errmsg("WAL file is from different database system"),
 					 errdetail("WAL file database system identifier is %s, pg_control database system identifier is %s.",
 							   fhdrident_str, sysident_str)));
@@ -3953,14 +3956,14 @@ ValidXLOGHeader(XLogPageHeader hdr, int emode)
 		}
 		if (longhdr->xlp_seg_size != XLogSegSize)
 		{
-			ereport(emode,
+			ereport(emode_for_corrupt_record(emode, recaddr),
 					(errmsg("WAL file is from different database system"),
 					 errdetail("Incorrect XLOG_SEG_SIZE in page header.")));
 			return false;
 		}
 		if (longhdr->xlp_xlog_blcksz != XLOG_BLCKSZ)
 		{
-			ereport(emode,
+			ereport(emode_for_corrupt_record(emode, recaddr),
 					(errmsg("WAL file is from different database system"),
 					 errdetail("Incorrect XLOG_BLCKSZ in page header.")));
 			return false;
@@ -3969,17 +3972,15 @@ ValidXLOGHeader(XLogPageHeader hdr, int emode)
 	else if (readOff == 0)
 	{
 		/* hmm, first page of file doesn't have a long header? */
-		ereport(emode,
+		ereport(emode_for_corrupt_record(emode, recaddr),
 				(errmsg("invalid info bits %04X in log file %u, segment %u, offset %u",
 						hdr->xlp_info, readId, readSeg, readOff)));
 		return false;
 	}
 
-	recaddr.xlogid = readId;
-	recaddr.xrecoff = readSeg * XLogSegSize + readOff;
 	if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
 	{
-		ereport(emode,
+		ereport(emode_for_corrupt_record(emode, recaddr),
 				(errmsg("unexpected pageaddr %X/%X in log file %u, segment %u, offset %u",
 						hdr->xlp_pageaddr.xlogid, hdr->xlp_pageaddr.xrecoff,
 						readId, readSeg, readOff)));
@@ -3991,7 +3992,7 @@ ValidXLOGHeader(XLogPageHeader hdr, int emode)
 	 */
 	if (!list_member_int(expectedTLIs, (int) hdr->xlp_tli))
 	{
-		ereport(emode,
+		ereport(emode_for_corrupt_record(emode, recaddr),
 				(errmsg("unexpected timeline ID %u in log file %u, segment %u, offset %u",
 						hdr->xlp_tli,
 						readId, readSeg, readOff)));
@@ -4009,7 +4010,7 @@ ValidXLOGHeader(XLogPageHeader hdr, int emode)
 	 */
 	if (hdr->xlp_tli < lastPageTLI)
 	{
-		ereport(emode,
+		ereport(emode_for_corrupt_record(emode, recaddr),
 				(errmsg("out-of-sequence timeline ID %u (after %u) in log file %u, segment %u, offset %u",
 						hdr->xlp_tli, lastPageTLI,
 						readId, readSeg, readOff)));
@@ -9245,14 +9246,13 @@ retry:
 		readOff = 0;
 		if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
 		{
-			ereport(emode_for_corrupt_record(emode),
+			ereport(emode_for_corrupt_record(emode, *RecPtr),
 					(errcode_for_file_access(),
 					 errmsg("could not read from log file %u, segment %u, offset %u: %m",
 							readId, readSeg, readOff)));
 			goto next_record_is_invalid;
 		}
-		if (!ValidXLOGHeader((XLogPageHeader) readBuf,
-							 emode_for_corrupt_record(emode)))
+		if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode))
 			goto next_record_is_invalid;
 	}
 
@@ -9260,7 +9260,7 @@ retry:
 	readOff = targetPageOff;
 	if (lseek(readFile, (off_t) readOff, SEEK_SET) < 0)
 	{
-		ereport(emode_for_corrupt_record(emode),
+		ereport(emode_for_corrupt_record(emode, *RecPtr),
 				(errcode_for_file_access(),
 		 errmsg("could not seek in log file %u, segment %u to offset %u: %m",
 				readId, readSeg, readOff)));
@@ -9268,13 +9268,13 @@ retry:
 	}
 	if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
 	{
-		ereport(emode_for_corrupt_record(emode),
+		ereport(emode_for_corrupt_record(emode, *RecPtr),
 				(errcode_for_file_access(),
 		 errmsg("could not read from log file %u, segment %u, offset %u: %m",
 				readId, readSeg, readOff)));
 		goto next_record_is_invalid;
 	}
-	if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode_for_corrupt_record(emode)))
+	if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode))
 		goto next_record_is_invalid;
 
 	Assert(targetId == readId);
@@ -9316,10 +9316,17 @@ triggered:
  * 'emode' is the error mode that would be used to report a file-not-found
  * or legitimate end-of-WAL situation. It is upgraded to WARNING or PANIC
  * if a corrupt record is not expected at this point.
+ *
+ * NOTE: This function remembers the RecPtr value it was last called with,
+ * to suppress repeated messages about the same record. Only call this when
+ * you are about to ereport(), or you might cause a later message to be
+ * erroneously suppressed.
  */
 static int
-emode_for_corrupt_record(int emode)
+emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
+	static XLogRecPtr lastComplaint = {0, 0};
+
 	/*
 	 * We don't expect any invalid records in archive or in records streamed
 	 * from master. Files in the archive should be complete, and we should
@@ -9340,6 +9347,17 @@ emode_for_corrupt_record(int emode)
 		if (emode < WARNING)
 			emode = WARNING;
 	}
+	/*
+	 * If we retry reading a record in pg_xlog, only complain on the first
+	 * time to keep the noise down.
+	 */
+	else if (emode == LOG)
+	{
+		if (XLByteEQ(RecPtr, lastComplaint))
+			emode = DEBUG1;
+		else
+			lastComplaint = RecPtr;
+	}
 	return emode;
 }
 
#27Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#26)
Re: Remaining Streaming Replication Open Items

On Tue, Apr 13, 2010 at 11:49 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Robert Haas wrote:

On Tue, Apr 6, 2010 at 10:36 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Robert Haas wrote:

    * If standby_mode is enabled, and neither primary_conninfo nor restore_command are set, the standby would get stuck.

It's not really stuck, it will replay any WAL files you drop into
pg_xlog. I concur with Robert Haas though that it shouldn't print the
message to the log every few seconds. It should print a message the
first time it hits the end of WAL, but subsequent messages should be
suppressed until some progress has been made.

Any idea how to implement this?

I'll take a look. It shouldn't be too hard.

The tricky part, I believe, is that there's more than one message that
can potentially be emitted, and you don't want ANY of them to repeat
every 2 s, so some thought needs to be given to where to hook in the
logic.

We have the emode_for_corrupt_record() function that's used in all the
errors that indicate a corrupt WAL record, that's a perfect place to
hook this into. See attached patch.

The test for elog == LOG seems a bit fragile to me - why that
specifically? Maybe elog < PANIC? elog > DEBUG1? Both?

But it seems basically sensible to me.

...Robert

#28Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#26)
Re: Remaining Streaming Replication Open Items

On Wed, Apr 14, 2010 at 12:49 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

We have the emode_for_corrupt_record() function that's used in all the
errors that indicate a corrupt WAL record, that's a perfect place to
hook this into. See attached patch.

One problem of the patch is that even if the content of error message
is different from the past, it would be skipped when the location of
invalid record is the same of the past. For example, if there is a
partially-filled unbroken WAL file in the standby, the following
message would be written:

record with zero length at %X/%X

Then if you drop corrupted WAL file into pg_xlog, the following message
might have to be output, but would be skipped:

invalid magic number %04X in log file %u, segment %u, offset %u

But I think that we might be able to live with the issue since it's
a very corner case.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#29Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#28)
Re: Remaining Streaming Replication Open Items

Fujii Masao wrote:

One problem of the patch is that even if the content of error message
is different from the past, it would be skipped when the location of
invalid record is the same of the past. For example, if there is a
partially-filled unbroken WAL file in the standby, the following
message would be written:

record with zero length at %X/%X

Then if you drop corrupted WAL file into pg_xlog, the following message
might have to be output, but would be skipped:

invalid magic number %04X in log file %u, segment %u, offset %u

But I think that we might be able to live with the issue since it's
a very corner case.

Yeah, we can live with that. The user is not generally interested in
what exactly is wrong with the record. It just indicates that it has the
end of valid WAL in the standby.

Applied.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#30Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#27)
Re: Remaining Streaming Replication Open Items

Robert Haas wrote:

On Tue, Apr 13, 2010 at 11:49 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

We have the emode_for_corrupt_record() function that's used in all the
errors that indicate a corrupt WAL record, that's a perfect place to
hook this into. See attached patch.

The test for elog == LOG seems a bit fragile to me - why that
specifically? Maybe elog < PANIC? elog > DEBUG1? Both?

Suppressing anything >= ERROR wouldn't make sense, as ERRORs cause the
replay to abort. I didn't want to affect WARNINGs either, which indicate
that something is truly wrong. The only level left between DEBUG1, which
is what the message is downgraded to, and WARNING, is LOG.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com