Cluster::restart dumping logs when stop fails

Started by Andres Freundabout 2 years ago10 messageshackers

andres@anarazel.de

about 2 years ago

Hi,

I recently started to be bothered by regress_* logs after some kinds of test
failures containing the whole log of a test failure. E.g. in
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-04-06%2016%3A28%3A38

...
### Restarting node "standby"
# Running: pg_ctl -w -D /home/bf/bf-build/serinus/HEAD/pgsql.build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata -l /home/bf/bf-build/serinus/HEAD/pgsql.build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_standby.log restart
waiting for server to shut down........................................................................................................................................................................................................................................................................................................................................................................... failed
pg_ctl: server does not shut down
# pg_ctl restart failed; logfile:
2024-04-06 16:33:37.496 UTC [2628363][postmaster][:0] LOG: starting PostgreSQL 17devel on x86_64-linux, compiled by gcc-14.0.1, 64-bit
2024-04-06 16:33:37.503 UTC [2628363][postmaster][:0] LOG: listening on Unix socket "/tmp/55kikMaTyW/.s.PGSQL.63274"
<many lines>

Looks like the printing of the entire log was added in:

commit 33774978c78175095da9e6c276e8bcdb177725f8
Author: Daniel Gustafsson <dgustafsson@postgresql.org>
Date: 2023-09-22 13:35:37 +0200

Avoid using internal test methods in SSL tests

It might be useful to print a few lines, but the whole log files can be
several megabytes worth of output. In the buildfarm that leads to the same
information being collected multiple times, and locally it makes it hard to
see where the "normal" contents of regress_log* continue.

Greetings,

Andres Freund

Daniel Gustafsson

daniel@yesql.se

about 2 years ago

In reply to: Andres Freund (#1)

Re: Cluster::restart dumping logs when stop fails

On 6 Apr 2024, at 23:44, Andres Freund <andres@anarazel.de> wrote:

It might be useful to print a few lines, but the whole log files can be
several megabytes worth of output.

The non-context aware fix would be to just print the last 1024 (or something)
bytes from the logfile:

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 54e1008ae5..53d4751ffc 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -951,8 +951,8 @@ sub start

 	if ($ret != 0)
 	{
-		print "# pg_ctl start failed; logfile:\n";
-		print PostgreSQL::Test::Utils::slurp_file($self->logfile);
+		print "# pg_ctl start failed; logfile excerpt:\n";
+		print substr PostgreSQL::Test::Utils::slurp_file($self->logfile), -1024;

# pg_ctl could have timed out, so check to see if there's a pid file;
# otherwise our END block will fail to shut down the new postmaster.

Would that be a reasonable fix?

--
Daniel Gustafsson

Andres Freund

andres@anarazel.de

about 2 years ago

In reply to: Daniel Gustafsson (#2)

Re: Cluster::restart dumping logs when stop fails

Hi,

On 2024-04-07 00:19:35 +0200, Daniel Gustafsson wrote:

On 6 Apr 2024, at 23:44, Andres Freund <andres@anarazel.de> wrote:

It might be useful to print a few lines, but the whole log files can be
several megabytes worth of output.

The non-context aware fix would be to just print the last 1024 (or something)
bytes from the logfile:

That'd be better, yes. I'd mainly log the path to the logfile though, that's
probably at least as helpful for actually investigating the issue?

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 54e1008ae5..53d4751ffc 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -951,8 +951,8 @@ sub start

if ($ret != 0)
{
-		print "# pg_ctl start failed; logfile:\n";
-		print PostgreSQL::Test::Utils::slurp_file($self->logfile);
+		print "# pg_ctl start failed; logfile excerpt:\n";
+		print substr PostgreSQL::Test::Utils::slurp_file($self->logfile), -1024;

# pg_ctl could have timed out, so check to see if there's a pid file;
# otherwise our END block will fail to shut down the new postmaster.

That's probably unnecessary optimization, but it seems a tad silly to read an
entire, potentially sizable, file to just use the last 1k. Not sure if the way
slurp_file() uses seek supports negative ofsets, the docs read to me like that
may only be supported with SEEK_END.

Greetings,

Andres Freund

Daniel Gustafsson

daniel@yesql.se

about 2 years ago

In reply to: Andres Freund (#3)

Re: Cluster::restart dumping logs when stop fails

On 7 Apr 2024, at 02:49, Andres Freund <andres@anarazel.de> wrote:
On 2024-04-07 00:19:35 +0200, Daniel Gustafsson wrote:

On 6 Apr 2024, at 23:44, Andres Freund <andres@anarazel.de> wrote:

The non-context aware fix would be to just print the last 1024 (or something)
bytes from the logfile:

That'd be better, yes. I'd mainly log the path to the logfile though, that's
probably at least as helpful for actually investigating the issue?

IIRC this was modelled around how it used to be earlier, in v14 in the
pre-refactored version we print the full logfile.

How about doing the below backpatched down to 15?

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 54e1008ae5..a2f9409842 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -951,8 +951,7 @@ sub start

 	if ($ret != 0)
 	{
-		print "# pg_ctl start failed; logfile:\n";
-		print PostgreSQL::Test::Utils::slurp_file($self->logfile);
+		print "# pg_ctl start failed; see logfile for details: " . $self->logfile . "\n";

# pg_ctl could have timed out, so check to see if there's a pid file;
# otherwise our END block will fail to shut down the new postmaster.
@@ -1090,8 +1089,7 @@ sub restart

 	if ($ret != 0)
 	{
-		print "# pg_ctl restart failed; logfile:\n";
-		print PostgreSQL::Test::Utils::slurp_file($self->logfile);
+		print "# pg_ctl restart failed; see logfile for details: " . $self->logfile . "\n";

# pg_ctl could have timed out, so check to see if there's a pid file;
# otherwise our END block will fail to shut down the new postmaster.

--
Daniel Gustafsson

Andrew Dunstan

andrew@dunslane.net

about 2 years ago

In reply to: Andres Freund (#3)

Re: Cluster::restart dumping logs when stop fails

On 2024-04-06 Sa 20:49, Andres Freund wrote:

That's probably unnecessary optimization, but it seems a tad silly to read an
entire, potentially sizable, file to just use the last 1k. Not sure if the way
slurp_file() uses seek supports negative ofsets, the docs read to me like that
may only be supported with SEEK_END.

We should enhance slurp_file() so it uses SEEK_END if the offset is
negative. It would be a trivial patch:

diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm b/src/test/perl/PostgreSQL/Test/Utils.pm
index 42d5a50dc8..8256573957 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -524,7 +524,7 @@ sub slurp_file

     if (defined($offset))
     {
-       seek($fh, $offset, SEEK_SET)
+       seek($fh, $offset, ($offset < 0 ? SEEK_END : SEEK_SET))
           or croak "could not seek \"$filename\": $!";
     }

cheers

andrew

cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com

Daniel Gustafsson

daniel@yesql.se

about 2 years ago

In reply to: Andrew Dunstan (#5)

Re: Cluster::restart dumping logs when stop fails

On 7 Apr 2024, at 14:51, Andrew Dunstan <andrew@dunslane.net> wrote:
On 2024-04-06 Sa 20:49, Andres Freund wrote:

That's probably unnecessary optimization, but it seems a tad silly to read an
entire, potentially sizable, file to just use the last 1k. Not sure if the way
slurp_file() uses seek supports negative ofsets, the docs read to me like that
may only be supported with SEEK_END.

We should enhance slurp_file() so it uses SEEK_END if the offset is negative.

Absolutely agree. Reading the thread I think Andres argues for not printing
anything at all in this case but we should support negative offsets anyways, it
will fort sure come in handy.

--
Daniel Gustafsson

Andres Freund

andres@anarazel.de

about 2 years ago

In reply to: Daniel Gustafsson (#6)

Re: Cluster::restart dumping logs when stop fails

On 2024-04-07 16:52:05 +0200, Daniel Gustafsson wrote:

On 7 Apr 2024, at 14:51, Andrew Dunstan <andrew@dunslane.net> wrote:
On 2024-04-06 Sa 20:49, Andres Freund wrote:

That's probably unnecessary optimization, but it seems a tad silly to read an
entire, potentially sizable, file to just use the last 1k. Not sure if the way
slurp_file() uses seek supports negative ofsets, the docs read to me like that
may only be supported with SEEK_END.

We should enhance slurp_file() so it uses SEEK_END if the offset is negative.

Absolutely agree. Reading the thread I think Andres argues for not printing
anything at all in this case but we should support negative offsets anyways, it
will fort sure come in handy.

I'm ok with printing path + some content or just the path.

- Andres

Daniel Gustafsson

daniel@yesql.se

about 2 years ago

In reply to: Andres Freund (#7)

Re: Cluster::restart dumping logs when stop fails

On 7 Apr 2024, at 18:28, Andres Freund <andres@anarazel.de> wrote:

On 2024-04-07 16:52:05 +0200, Daniel Gustafsson wrote:

On 7 Apr 2024, at 14:51, Andrew Dunstan <andrew@dunslane.net> wrote:
On 2024-04-06 Sa 20:49, Andres Freund wrote:

That's probably unnecessary optimization, but it seems a tad silly to read an
entire, potentially sizable, file to just use the last 1k. Not sure if the way
slurp_file() uses seek supports negative ofsets, the docs read to me like that
may only be supported with SEEK_END.

We should enhance slurp_file() so it uses SEEK_END if the offset is negative.

Absolutely agree. Reading the thread I think Andres argues for not printing
anything at all in this case but we should support negative offsets anyways, it
will fort sure come in handy.

I'm ok with printing path + some content or just the path.

I think printing the last 512 bytes or so would be a good approach, I'll take
care of it later tonight. That would be a backpatchable change IMHO.

--
Daniel Gustafsson

Andres Freund

andres@anarazel.de

about 2 years ago

In reply to: Daniel Gustafsson (#8)

Re: Cluster::restart dumping logs when stop fails

On 2024-04-07 18:51:40 +0200, Daniel Gustafsson wrote:

On 7 Apr 2024, at 18:28, Andres Freund <andres@anarazel.de> wrote:

On 2024-04-07 16:52:05 +0200, Daniel Gustafsson wrote:

On 7 Apr 2024, at 14:51, Andrew Dunstan <andrew@dunslane.net> wrote:
On 2024-04-06 Sa 20:49, Andres Freund wrote:

That's probably unnecessary optimization, but it seems a tad silly to read an
entire, potentially sizable, file to just use the last 1k. Not sure if the way
slurp_file() uses seek supports negative ofsets, the docs read to me like that
may only be supported with SEEK_END.

We should enhance slurp_file() so it uses SEEK_END if the offset is negative.

Absolutely agree. Reading the thread I think Andres argues for not printing
anything at all in this case but we should support negative offsets anyways, it
will fort sure come in handy.

I'm ok with printing path + some content or just the path.

I think printing the last 512 bytes or so would be a good approach, I'll take
care of it later tonight. That would be a backpatchable change IMHO.

+1 - thanks for quickly improving this.

#10

Daniel Gustafsson

daniel@yesql.se

about 2 years ago

In reply to: Daniel Gustafsson (#8)

Re: Cluster::restart dumping logs when stop fails

On 7 Apr 2024, at 18:51, Daniel Gustafsson <daniel@yesql.se> wrote:

On 7 Apr 2024, at 18:28, Andres Freund <andres@anarazel.de> wrote:

I'm ok with printing path + some content or just the path.

I think printing the last 512 bytes or so would be a good approach, I'll take
care of it later tonight. That would be a backpatchable change IMHO.

In the end I opted for just printing the path to avoid introducing logic (at
this point in the cycle) for ensuring that a negative offset doesn't exceed the
filesize. Since it changes behavior I haven't backpatched it, but can do that
if we think it's more of a fix than a change (I think it can be argued either
way, I have no strong feelings).

--
Daniel Gustafsson