Problem during Windows service start

Started by Sakai, Teppeiover 7 years ago10 messages
#1Sakai, Teppei
sakai.teppei@jp.fujitsu.com

Hi,

This is my first posting to the mailing list.

Currently our customer uses PostgreSQL 9.5 and hits problem during Windows service start.
The Windows service status of the instance is different from actual status.

We got the following situation.
1. Register service with 'pg_ctl register -N "PostgreSQL" -U postgres -P xxxx -D D:\data\inst1 -w'
2. Start the instance from the Windows service screen.
3. After 60 seconds, the startup process fails with a timeout.
Because crash recovery takes a lot of times.

Then, the service status of the instance become "STOPPED",
but the instance was running.
It cannot be stopped from the Windows service screen (it can be stopped only with pg_ctl).

PostgreSQL version : 9.5.12
Operating system : Windows Server 2012 R2

I think this is a bug.
I think it has not been fixed in the latest version, is my understanding correct?
If it is correct, I will fix it.

Regards,
SAKAI Teppei

#2Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Sakai, Teppei (#1)
Re: Problem during Windows service start

Sakai, Teppei wrote:

This is my first posting to the mailing list.

Currently our customer uses PostgreSQL 9.5 and hits problem during Windows service start.
The Windows service status of the instance is different from actual status.

We got the following situation.
1. Register service with 'pg_ctl register -N "PostgreSQL" -U postgres -P xxxx -D D:\data\inst1 -w'
2. Start the instance from the Windows service screen.
3. After 60 seconds, the startup process fails with a timeout.
Because crash recovery takes a lot of times.

Then, the service status of the instance become "STOPPED",
but the instance was running.
It cannot be stopped from the Windows service screen (it can be stopped only with pg_ctl).

PostgreSQL version : 9.5.12
Operating system : Windows Server 2012 R2

I think this is a bug.
I think it has not been fixed in the latest version, is my understanding correct?
If it is correct, I will fix it.

I agree that this is not nice.

How do you propose to fix it?

Yours,
Laurenz Albe

#3Higuchi, Daisuke
higuchi.daisuke@jp.fujitsu.com
In reply to: Laurenz Albe (#2)
1 attachment(s)
RE: Problem during Windows service start

Hi,

This thread is inactive, but I want to solve this problem.
I think this problem rarely occurs in 10 or later version because of commit [1]https://github.com/postgres/postgres/commit/f13ea95f9e473a43ee4e1baeb94daaf83535d37c. Because "pg_ctl start -w" wait for only PID file creation. It means that timeout is not occurred even if crash recovery takes a lot of times.
However, 9.6 or earlier still wait for long time recovery complete.

How do you propose to fix it?

I think there are two solutions.

One solution is that status of Windows Service should be changed to "SERVICE_RUNNING" even if timeout is occurred because of long time recovery. I attached the patch of this solution.

Another solution is to backport commit [1]https://github.com/postgres/postgres/commit/f13ea95f9e473a43ee4e1baeb94daaf83535d37c to 9.6 or earlier version. However this fix change the content of PID file, so incompatible change, I think.

I would appreciate it if you give any comments.

[1]: https://github.com/postgres/postgres/commit/f13ea95f9e473a43ee4e1baeb94daaf83535d37c

Regards,
Daisuke, Higuchi

Attachments:

windows_service_bug_fix.patchapplication/octet-stream; name=windows_service_bug_fix.patchDownload
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index efc0729..12021ac 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -1602,13 +1602,20 @@ pgwin32_ServiceMain(DWORD argc, LPTSTR *argv)
 	if (do_wait)
 	{
 		write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Waiting for server startup...\n"));
-		if (test_postmaster_connection(postmasterPID, true) != PQPING_OK)
+		switch (test_postmaster_connection(postmasterPID, true))
 		{
-			write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for server startup\n"));
-			pgwin32_SetServiceStatus(SERVICE_STOPPED);
-			return;
+			case PQPING_OK:
+				write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Server started and accepting connections\n"));
+				break;
+			case PQPING_REJECT:
+				write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Server startup timed out but might continue in the background\n"));
+				break;
+			case PQPING_NO_RESPONSE:
+			case PQPING_NO_ATTEMPT:
+				write_eventlog(EVENTLOG_ERROR_TYPE, _("Server startup failed because of wrong parameter or something\n"));
+				pgwin32_SetServiceStatus(SERVICE_STOPPED);
+				return;
 		}
-		write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Server started and accepting connections\n"));
 	}
 
 	pgwin32_SetServiceStatus(SERVICE_RUNNING);
#4Michael Paquier
michael@paquier.xyz
In reply to: Higuchi, Daisuke (#3)
Re: Problem during Windows service start

On Wed, Jan 09, 2019 at 05:28:29AM +0000, Higuchi, Daisuke wrote:

One solution is that status of Windows Service should be changed to
"SERVICE_RUNNING" even if timeout is occurred because of long time
recovery. I attached the patch of this solution.

You should register this patch to the next commit fest in the section
for bug fixes to not lose sight of it;
https://commitfest.postgresql.org/22/

+    case PQPING_NO_ATTEMPT:
+        write_eventlog(EVENTLOG_ERROR_TYPE,
_("Server startup failed because of wrong parameter or something\n"));
+        pgwin32_SetServiceStatus(SERVICE_STOPPED);

I haven't put much thoughts into what you propose here, but this status
message is not really helpful for the user.

Another solution is to backport commit [1] to 9.6 or earlier
version. However this fix change the content of PID file, so
incompatible change, I think.

Yeah, let's not go down that road.
--
Michael

#5Higuchi, Daisuke
higuchi.daisuke@jp.fujitsu.com
In reply to: Michael Paquier (#4)
1 attachment(s)
RE: Problem during Windows service start

Michael Paquier wrote:

You should register this patch to the next commit fest in the section for bug fixes to not lose sight of it;

Thank you for picking up my post. I registered to the next CF.

I haven't put much thoughts into what you propose here, but this status message is not really helpful for the user.

Thank you for review. I attached the updated patch. There are some reasons for server startup failing, so I changed the message like following:

+			case PQPING_NO_RESPONSE:
+			case PQPING_NO_ATTEMPT:
+				write_eventlog(EVENTLOG_ERROR_TYPE, _("Server startup failed. Examine the log output.\n"));
+				pgwin32_SetServiceStatus(SERVICE_STOPPED);

Regards,
Daisuke, Higuchi

Attachments:

windows_service_bug_fix_v2.patchapplication/octet-stream; name=windows_service_bug_fix_v2.patchDownload
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index efc0729..3ddab6c 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -1602,13 +1602,20 @@ pgwin32_ServiceMain(DWORD argc, LPTSTR *argv)
 	if (do_wait)
 	{
 		write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Waiting for server startup...\n"));
-		if (test_postmaster_connection(postmasterPID, true) != PQPING_OK)
+		switch (test_postmaster_connection(postmasterPID, true))
 		{
-			write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for server startup\n"));
-			pgwin32_SetServiceStatus(SERVICE_STOPPED);
-			return;
+			case PQPING_OK:
+				write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Server started and accepting connections\n"));
+				break;
+			case PQPING_REJECT:
+				write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Server startup timed out but might continue in the background\n"));
+				break;
+			case PQPING_NO_RESPONSE:
+			case PQPING_NO_ATTEMPT:
+				write_eventlog(EVENTLOG_ERROR_TYPE, _("Server startup failed. Examine the log output.\n"));
+				pgwin32_SetServiceStatus(SERVICE_STOPPED);
+				return;
 		}
-		write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Server started and accepting connections\n"));
 	}
 
 	pgwin32_SetServiceStatus(SERVICE_RUNNING);
#6Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Higuchi, Daisuke (#5)
RE: Problem during Windows service start

Hi Higuchi-san,

(1)
What made you think this problem rarely occurs in PG 10 or later? Looking at the following code, this seems to happen in PG 10+ too.

if (do_wait)
{
write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Waiting for server startup...\n"));
if (wait_for_postmaster(postmasterPID, true) != POSTMASTER_READY)
{
write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for server startup\n"));
pgwin32_SetServiceStatus(SERVICE_STOPPED);
return;
}
write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Server started and accepting connections\n"));
}

(2)
What state should we consider SERVICE_RUNNING as? Isn't it the state where the server has completed startup processing and accepts connections? If no, how is it different from SERVICE_STARTING?

(I know that when -w (wait) is not specified, the status becomes SERVICE_RUNNING whether or not the server completes startup processing...)

(3)
+ write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Server startup timed out but might continue in the background\n"));

This message is new, isn't it? I think the existing message "Time out..." is enough.

(4)
+ write_eventlog(EVENTLOG_ERROR_TYPE, _("Server startup failed. Examine the log output.\n"));

The first sentence (phenomenon) and the second line (detail or action) should be separated with a newline. Below are some examples in pg_ctl.c. Note that write_stderr() writes to the eventlog when running under a Windows service.

write_stderr(_("%s: could not start server\n"
"Examine the log output.\n"),

write_stderr(_("The program \"%s\" was found by \"%s\"\n"
"but was not the same version as %s.\n"
"Check your installation.\n"),

Regards
Takayuki Tsunakawa

#7Ramanarayana
raam.soft@gmail.com
In reply to: Tsunakawa, Takayuki (#6)
Re: Problem during Windows service start

Hi,

If wait_for_postmaster returns POSTMASTER_STILL_STARTING will it be correct
to set the status of windows service to SERVICE_START_PENDING ?

I would like to take this up if no one is working on this.

Regards,
Ram.

#8Higuchi, Daisuke
higuchi.daisuke@jp.fujitsu.com
In reply to: Ramanarayana (#7)
RE: Problem during Windows service start

Hi,

Thank you for picking up this and I'm sorry for delay reply.

If wait_for_postmaster returns POSTMASTER_STILL_STARTING will it
be correct to set the status of windows service to SERVICE_START_PENDING ?

Yes, I think this is the best. Currently, I do not have good solution to change the status from SERVICE_START_PENDING to SERVICE_RUNNING after recovery is completed.

I would like to take this up if no one is working on this.

Thank you for your proposing! I would like to ask if possible.
I do not have much time to think about this topic now...

Regards,
Higuchi

#9Ramanarayana
raam.soft@gmail.com
In reply to: Higuchi, Daisuke (#8)
1 attachment(s)
Re: Problem during Windows service start

Hi,

Please find the proposed patch for review. I will attach it to commitfest
as well

Regards,
Ram.

Attachments:

windows_service_bug_fix_v2.patchapplication/octet-stream; name=windows_service_bug_fix_v2.patchDownload
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index febb076..5ecc0e9 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -1653,13 +1653,20 @@
 	if (do_wait)
 	{
 		write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Waiting for server startup...\n"));
-		if (wait_for_postmaster(postmasterPID, true) != POSTMASTER_READY)
+		switch (wait_for_postmaster(postmasterPID, true))
 		{
-			write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for server startup\n"));
-			pgwin32_SetServiceStatus(SERVICE_STOPPED);
-			return;
+			case POSTMASTER_READY:
+				write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Server started and accepting connections\n"));
+				break;
+			case POSTMASTER_STILL_STARTING:
+				write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for server startup\n"));
+				pgwin32_SetServiceStatus(SERVICE_START_PENDING);
+				return;
+			case POSTMASTER_FAILED:
+				write_eventlog(EVENTLOG_ERROR_TYPE, _("could not start server\n"));
+				pgwin32_SetServiceStatus(SERVICE_STOPPED);
+				return;
 		}
-		write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Server started and accepting connections\n"));
 	}
 
 	pgwin32_SetServiceStatus(SERVICE_RUNNING);
#10Higuchi, Daisuke
higuchi.daisuke@jp.fujitsu.com
In reply to: Ramanarayana (#9)
RE: Problem during Windows service start

Hi,

+           case POSTMASTER_STILL_STARTING:
+               write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for server startup\n"));
+               pgwin32_SetServiceStatus(SERVICE_START_PENDING);
+               return;

Could this patch solve first post's problem [1]/messages/by-id/99C4246E73ED1B478BBB9671718426203E37F485@G01JPEXMBKW03 ?
I think no one could change the service status to SERVICE_RUNNING even if the server has been started properly after timeout is occurred.

In 9.6 or earlier, the main use case where the problem is occurred is when timeout is occured because of long time recovery. Even if recovery takes a lot of time and timeout is ocurred, recovery continues in background. In this case, I want to set the status to SERVICE_RUNNING after recovery is completed.

In 10 or later, I understand wait_for_postmaster does not wait until recovery is completed, so I think this problem rarely occurs in PG 10 or later.

[1]: /messages/by-id/99C4246E73ED1B478BBB9671718426203E37F485@G01JPEXMBKW03

Regards,
Daisuke, Higuchi