Basebackups reported as idle

Started by Magnus Haganderabout 8 years ago15 messages
#1Magnus Hagander
magnus@hagander.net
1 attachment(s)

AFAICT, base backups running on the replication protocol are always
reported as "idle" in pg_stat_activity. This seems to have been an
oversight in the "include walsender backends in pg_stat_activity" in 10,
which does include it for walsenders in general, just not for the ones
sending base backups. (and was then improved on later with the "include all
non-standard backends" patch).

Unlike the regular walsender it also has to set it back to IDLE, since you
can actually finish a base backup without disconnecting.

PFA a patch that fixes this. I think this is bugfix-for-backpatch, I don't
think it has a large risk of breaking things. Thoughts?

Also, in setting this, there is no real way to differentiate between a
regular walsender and a basebackup walsender, other than looking at the
wait events. They're both listed as walsenders. Should there be? (That
might not be as easily backpatchable, so keeping that as a separate one)

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/&gt;
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/&gt;

Attachments:

base_backup_running.patchtext/x-patch; charset=US-ASCII; name=base_backup_running.patchDownload
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 05ca95bac2..9fc031b141 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -697,6 +697,9 @@ SendBaseBackup(BaseBackupCmd *cmd)
 
 	WalSndSetState(WALSNDSTATE_BACKUP);
 
+	/* Report to pgstat that this process is running */
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
 	if (update_process_title)
 	{
 		char		activitymsg[50];
@@ -707,6 +710,8 @@ SendBaseBackup(BaseBackupCmd *cmd)
 	}
 
 	perform_base_backup(&opt);
+
+	pgstat_report_activity(STATE_IDLE, NULL);
 }
 
 static void
#2David Steele
david@pgmasters.net
In reply to: Magnus Hagander (#1)
Re: Basebackups reported as idle

Hi Magnus,

On 12/19/17 4:56 AM, Magnus Hagander wrote:

AFAICT, base backups running on the replication protocol are always
reported as "idle" in pg_stat_activity. This seems to have been an
oversight in the "include walsender backends in pg_stat_activity" in 10,
which does include it for walsenders in general, just not for the ones
sending base backups. (and was then improved on later with the "include
all non-standard backends" patch).

Unlike the regular walsender it also has to set it back to IDLE, since
you can actually finish a base backup without disconnecting.

PFA a patch that fixes this. I think this is bugfix-for-backpatch, I
don't think it has a large risk of breaking things. Thoughts?

+1 for this being a bug, albeit a minor one.

Also, in setting this, there is no real way to differentiate between a
regular walsender and a basebackup walsender, other than looking at the
wait events. They're both listed as walsenders. Should there be?  (That
might not be as easily backpatchable, so keeping that as a separate one)

Maybe something like "walsender [backup]" or just "basebackup" since
walsender is pretty misleading? It think it would be nice to be able to
tell them apart, though I don't think it should be back-patched. People
might be relying on the name in the current versions.

Thanks!
--
-David
david@pgmasters.net

#3Michael Paquier
michael.paquier@gmail.com
In reply to: David Steele (#2)
Re: Basebackups reported as idle

On Tue, Dec 19, 2017 at 11:50 PM, David Steele <david@pgmasters.net> wrote:

On 12/19/17 4:56 AM, Magnus Hagander wrote:

AFAICT, base backups running on the replication protocol are always
reported as "idle" in pg_stat_activity. This seems to have been an
oversight in the "include walsender backends in pg_stat_activity" in 10,
which does include it for walsenders in general, just not for the ones
sending base backups. (and was then improved on later with the "include
all non-standard backends" patch).

Unlike the regular walsender it also has to set it back to IDLE, since
you can actually finish a base backup without disconnecting.

PFA a patch that fixes this. I think this is bugfix-for-backpatch, I
don't think it has a large risk of breaking things. Thoughts?

+1 for this being a bug, albeit a minor one.

+1 for adding calls to pgstat_report_activity() in WAL senders and
tracking the activity of those processes. Now I don't think that this
is the correct location as what matters is tracking if replication
commands are running or not, and not only BASE_BACKUP. So those calls
should be instead in exec_replication_command().

Also, in setting this, there is no real way to differentiate between a
regular walsender and a basebackup walsender, other than looking at the
wait events. They're both listed as walsenders. Should there be? (That
might not be as easily backpatchable, so keeping that as a separate one)

Maybe something like "walsender [backup]" or just "basebackup" since
walsender is pretty misleading? It think it would be nice to be able to
tell them apart, though I don't think it should be back-patched. People
might be relying on the name in the current versions.

You can already do a join with pg_stat_replication.pid and look for
the WAL sender state which is marked as "backup" in this case. So I am
-1 for making the current code more complicated. Please note as well
that pg_stat_activity.backend_type is set depending on the process
type, not based on what the process is doing so you would need to
invent a new logic.
--
Michael

#4Magnus Hagander
magnus@hagander.net
In reply to: Michael Paquier (#3)
1 attachment(s)
Re: Basebackups reported as idle

On Wed, Dec 20, 2017 at 6:11 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

On Tue, Dec 19, 2017 at 11:50 PM, David Steele <david@pgmasters.net>
wrote:

On 12/19/17 4:56 AM, Magnus Hagander wrote:

AFAICT, base backups running on the replication protocol are always
reported as "idle" in pg_stat_activity. This seems to have been an
oversight in the "include walsender backends in pg_stat_activity" in 10,
which does include it for walsenders in general, just not for the ones
sending base backups. (and was then improved on later with the "include
all non-standard backends" patch).

Unlike the regular walsender it also has to set it back to IDLE, since
you can actually finish a base backup without disconnecting.

PFA a patch that fixes this. I think this is bugfix-for-backpatch, I
don't think it has a large risk of breaking things. Thoughts?

+1 for this being a bug, albeit a minor one.

+1 for adding calls to pgstat_report_activity() in WAL senders and
tracking the activity of those processes. Now I don't think that this
is the correct location as what matters is tracking if replication
commands are running or not, and not only BASE_BACKUP. So those calls
should be instead in exec_replication_command().

Hmm. That does make sense. I guess I got stuck looking at the old code,
which is then *also* in the wrong place.

What about the attached?

Also, I noticed that the docs for exec_replication_command() says " *
Returns true if the cmd_string was recognized as WalSender command, false
* if not.". But it doesn't actually do that, it ereport(ERROR):s if it's
not. There is one "return false", but it's after an ERROR.

So all other exit points of the procedure will be ERROR, in which case we
should set back to idle automatically.

Independent of this change, that function should perhaps be made to return
void.

Also, in setting this, there is no real way to differentiate between a

regular walsender and a basebackup walsender, other than looking at the
wait events. They're both listed as walsenders. Should there be? (That
might not be as easily backpatchable, so keeping that as a separate one)

Maybe something like "walsender [backup]" or just "basebackup" since
walsender is pretty misleading? It think it would be nice to be able to
tell them apart, though I don't think it should be back-patched. People
might be relying on the name in the current versions.

You can already do a join with pg_stat_replication.pid and look for
the WAL sender state which is marked as "backup" in this case. So I am
-1 for making the current code more complicated. Please note as well
that pg_stat_activity.backend_type is set depending on the process
type, not based on what the process is doing so you would need to
invent a new logic.

Good point. I'm not sure why I didn't think of that.

It's still quite a bit weird that we call this process "walsender" when it
does other things as well. But the ship sailed on that many years ago,
changing that completely now would not be worth the breakage.

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/&gt;
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/&gt;

Attachments:

base_backup_running2.patchtext/x-patch; charset=US-ASCII; name=base_backup_running2.patchDownload
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 6a252fcf45..ba39d07232 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1504,6 +1504,9 @@ exec_replication_command(const char *cmd_string)
 	initStringInfo(&reply_message);
 	initStringInfo(&tmpbuf);
 
+	/* Report to pgstat that this process is running */
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
 	switch (cmd_node->type)
 	{
 		case T_IdentifySystemCmd:
@@ -1570,6 +1573,9 @@ exec_replication_command(const char *cmd_string)
 	/* Send CommandComplete message */
 	EndCommand("SELECT", DestRemote);
 
+	/* Report to pgstat that this process is now idle */
+	pgstat_report_activity(STATE_IDLE, NULL);
+
 	return true;
 }
 
@@ -2089,9 +2095,6 @@ WalSndLoop(WalSndSendDataCallback send_data)
 	last_reply_timestamp = GetCurrentTimestamp();
 	waiting_for_ping_response = false;
 
-	/* Report to pgstat that this process is running */
-	pgstat_report_activity(STATE_RUNNING, NULL);
-
 	/*
 	 * Loop until we reach the end of this timeline or the client requests to
 	 * stop streaming.
#5Michael Paquier
michael.paquier@gmail.com
In reply to: Magnus Hagander (#4)
Re: Basebackups reported as idle

On Wed, Dec 20, 2017 at 6:07 PM, Magnus Hagander <magnus@hagander.net> wrote:

What about the attached?

The new positions look good to me, still aren't you missing the case
where a SQL command is found and exec_replication_command returns
false? This should be switched to idle as well.

+   /* Report to pgstat that this process is running */
+   pgstat_report_activity(STATE_RUNNING, NULL);
Bonus points if cmd_string is used instead of string? This way, you
can know what is the replication command running ;)

Also, I noticed that the docs for exec_replication_command() says " *
Returns true if the cmd_string was recognized as WalSender command, false
* if not.". But it doesn't actually do that, it ereport(ERROR):s if it's
not. There is one "return false", but it's after an ERROR.

WAL senders can connect to a specific database, and if a WAL sender
command is not used in this case this goes through the normal query
processing.

It's still quite a bit weird that we call this process "walsender" when it
does other things as well. But the ship sailed on that many years ago,
changing that completely now would not be worth the breakage.

ps shows walsender as well, that's a reason why "walsender" has been decided.
--
Michael

#6Magnus Hagander
magnus@hagander.net
In reply to: Michael Paquier (#5)
Re: Basebackups reported as idle

On Wed, Dec 20, 2017 at 12:50 PM, Michael Paquier <michael.paquier@gmail.com

wrote:

On Wed, Dec 20, 2017 at 6:07 PM, Magnus Hagander <magnus@hagander.net>
wrote:

What about the attached?

The new positions look good to me, still aren't you missing the case
where a SQL command is found and exec_replication_command returns
false? This should be switched to idle as well.

Yes. Of course. I can't read. That's the same as the notice below about it
not returning false -- I managed to miss the extra if() there, and thought
it always exited with ERROR.

:O

+   /* Report to pgstat that this process is running */
+   pgstat_report_activity(STATE_RUNNING, NULL);
Bonus points if cmd_string is used instead of string? This way, you
can know what is the replication command running ;)

Do we want that though? That would be a compat break at least, wouldn't it?

It's still quite a bit weird that we call this process "walsender" when it

does other things as well. But the ship sailed on that many years ago,
changing that completely now would not be worth the breakage.

ps shows walsender as well, that's a reason why "walsender" has been
decided.

Right. It's just a weird term for what it is. But it's the term that we've
always used, so we can't go change it now.

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/&gt;
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/&gt;

#7Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Magnus Hagander (#1)
Re: Basebackups reported as idle

Magnus Hagander wrote:

PFA a patch that fixes this. I think this is bugfix-for-backpatch, I don't
think it has a large risk of breaking things. Thoughts?

Agreed. As long as it doesn't show up as idle-in-transaction afterwards
or something odd like that, it should be okay to backpatch. (I suppose
there's no way for it to end as being in-transaction anyway.)

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#8Michael Paquier
michael.paquier@gmail.com
In reply to: Magnus Hagander (#6)
Re: Basebackups reported as idle

On Wed, Dec 20, 2017 at 9:02 PM, Magnus Hagander <magnus@hagander.net> wrote:

On Wed, Dec 20, 2017 at 12:50 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Yes. Of course. I can't read. That's the same as the notice below about it
not returning false -- I managed to miss the extra if() there, and thought
it always exited with ERROR.

I think that the call to pgstat_report_activity in WalSndLoop() should
be kept as well. There is a small gap between the moment the process
is started and the first replication command is run.

+   /* Report to pgstat that this process is running */
+   pgstat_report_activity(STATE_RUNNING, NULL);
Bonus points if cmd_string is used instead of string? This way, you
can know what is the replication command running ;)

Do we want that though? That would be a compat break at least, wouldn't it?

Perhaps not, I found the idea funky but you actually don't want to
show a string in exec_replication_command for a T_SQLCmd node. That's
not complicated to check either. So let's discard such a thing for
now.
--
Michael

#9Magnus Hagander
magnus@hagander.net
In reply to: Michael Paquier (#8)
Re: Basebackups reported as idle

On Thu, Dec 21, 2017 at 1:38 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

On Wed, Dec 20, 2017 at 9:02 PM, Magnus Hagander <magnus@hagander.net>
wrote:

On Wed, Dec 20, 2017 at 12:50 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Yes. Of course. I can't read. That's the same as the notice below about

it

not returning false -- I managed to miss the extra if() there, and

thought

it always exited with ERROR.

I think that the call to pgstat_report_activity in WalSndLoop() should
be kept as well. There is a small gap between the moment the process
is started and the first replication command is run.

Eh. But WalSndLoop() is called *after* exec_replication_command(), isn't
it? exec_replication_command() is called from PostgresMain(), and then
calls WalSndLoop().

So I agree there is a small gap, but actually moving it to
exec_replication_command() makes that gap smaller than it was before, no?

+   /* Report to pgstat that this process is running */
+   pgstat_report_activity(STATE_RUNNING, NULL);
Bonus points if cmd_string is used instead of string? This way, you
can know what is the replication command running ;)

Do we want that though? That would be a compat break at least, wouldn't

it?

Perhaps not, I found the idea funky but you actually don't want to
show a string in exec_replication_command for a T_SQLCmd node. That's
not complicated to check either. So let's discard such a thing for
now.

Agreed :)

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/&gt;
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/&gt;

#10Michael Paquier
michael.paquier@gmail.com
In reply to: Magnus Hagander (#9)
Re: Basebackups reported as idle

On Thu, Dec 21, 2017 at 02:46:15PM +0100, Magnus Hagander wrote:

On Thu, Dec 21, 2017 at 1:38 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

I think that the call to pgstat_report_activity in WalSndLoop() should
be kept as well. There is a small gap between the moment the process
is started and the first replication command is run.

Eh. But WalSndLoop() is called *after* exec_replication_command(), isn't
it? exec_replication_command() is called from PostgresMain(), and then
calls WalSndLoop().

So I agree there is a small gap, but actually moving it to
exec_replication_command() makes that gap smaller than it was before, no?

My turn to read things wrong then, thinking that WalSndLoop() was the
main routine used for starting the WAL sender process. You are right removing
the call there is adapted.

Could you update the patch?
--
Michael

#11Magnus Hagander
magnus@hagander.net
In reply to: Michael Paquier (#10)
1 attachment(s)
Re: Basebackups reported as idle

On Fri, Dec 22, 2017 at 2:31 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

On Thu, Dec 21, 2017 at 02:46:15PM +0100, Magnus Hagander wrote:

On Thu, Dec 21, 2017 at 1:38 AM, Michael Paquier <

michael.paquier@gmail.com>

wrote:

I think that the call to pgstat_report_activity in WalSndLoop() should
be kept as well. There is a small gap between the moment the process
is started and the first replication command is run.

Eh. But WalSndLoop() is called *after* exec_replication_command(), isn't
it? exec_replication_command() is called from PostgresMain(), and then
calls WalSndLoop().

So I agree there is a small gap, but actually moving it to
exec_replication_command() makes that gap smaller than it was before, no?

My turn to read things wrong then, thinking that WalSndLoop() was the
main routine used for starting the WAL sender process. You are right
removing
the call there is adapted.

Could you update the patch?

I thought I had, but I can see now that email was a figment of my
imagination :)

PFA an actual patch.

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/&gt;
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/&gt;

Attachments:

base_backup_running3.patchtext/x-patch; charset=US-ASCII; name=base_backup_running3.patchDownload
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 6a252fcf45..f2e886f99f 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1504,6 +1504,9 @@ exec_replication_command(const char *cmd_string)
 	initStringInfo(&reply_message);
 	initStringInfo(&tmpbuf);
 
+	/* Report to pgstat that this process is running */
+	pgstat_report_activity(STATE_RUNNING, NULL);
+
 	switch (cmd_node->type)
 	{
 		case T_IdentifySystemCmd:
@@ -1555,6 +1558,9 @@ exec_replication_command(const char *cmd_string)
 				ereport(ERROR,
 						(errmsg("cannot execute SQL commands in WAL sender for physical replication")));
 
+			/* Report to pgstat that this process is now idle */
+			pgstat_report_activity(STATE_IDLE, NULL);
+
 			/* Tell the caller that this wasn't a WalSender command. */
 			return false;
 
@@ -1570,6 +1576,9 @@ exec_replication_command(const char *cmd_string)
 	/* Send CommandComplete message */
 	EndCommand("SELECT", DestRemote);
 
+	/* Report to pgstat that this process is now idle */
+	pgstat_report_activity(STATE_IDLE, NULL);
+
 	return true;
 }
 
@@ -2089,9 +2098,6 @@ WalSndLoop(WalSndSendDataCallback send_data)
 	last_reply_timestamp = GetCurrentTimestamp();
 	waiting_for_ping_response = false;
 
-	/* Report to pgstat that this process is running */
-	pgstat_report_activity(STATE_RUNNING, NULL);
-
 	/*
 	 * Loop until we reach the end of this timeline or the client requests to
 	 * stop streaming.
#12Michael Paquier
michael.paquier@gmail.com
In reply to: Magnus Hagander (#11)
Re: Basebackups reported as idle

On Thu, Dec 28, 2017 at 06:21:46PM +0100, Magnus Hagander wrote:

On Fri, Dec 22, 2017 at 2:31 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

Could you update the patch?

I thought I had, but I can see now that email was a figment of my
imagination :)

I'll take that as a fragment instead. The patch as proposed looks good
to me. Thanks for sending an updated version.
--
Michael

#13David Steele
david@pgmasters.net
In reply to: Michael Paquier (#12)
Re: Basebackups reported as idle

On 12/29/17 6:49 AM, Michael Paquier wrote:

On Thu, Dec 28, 2017 at 06:21:46PM +0100, Magnus Hagander wrote:

On Fri, Dec 22, 2017 at 2:31 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

Could you update the patch?

I thought I had, but I can see now that email was a figment of my
imagination :)

I'll take that as a fragment instead. The patch as proposed looks good
to me. Thanks for sending an updated version.

Looks good to me as well.

--
-David
david@pgmasters.net

#14Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Michael Paquier (#12)
Re: Basebackups reported as idle

Michael Paquier wrote:

On Thu, Dec 28, 2017 at 06:21:46PM +0100, Magnus Hagander wrote:

I thought I had, but I can see now that email was a figment of my
imagination :)

I'll take that as a fragment instead.

Not at all ... https://www.merriam-webster.com/dictionary/figment
"something made up or contrived"

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#15Magnus Hagander
magnus@hagander.net
In reply to: David Steele (#13)
Re: Basebackups reported as idle

On Fri, Dec 29, 2017 at 3:11 PM, David Steele <david@pgmasters.net> wrote:

On 12/29/17 6:49 AM, Michael Paquier wrote:

On Thu, Dec 28, 2017 at 06:21:46PM +0100, Magnus Hagander wrote:

On Fri, Dec 22, 2017 at 2:31 AM, Michael Paquier <

michael.paquier@gmail.com>

wrote:

Could you update the patch?

I thought I had, but I can see now that email was a figment of my
imagination :)

I'll take that as a fragment instead. The patch as proposed looks good
to me. Thanks for sending an updated version.

Looks good to me as well.

Thanks to both of you for the reviews. Backpatched and pushed!

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/&gt;
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/&gt;