Improve error reporting in 027_stream_regress test

Started by Nazir Bilal Yavuz6 months ago30 messages
#1Nazir Bilal Yavuz
Nazir Bilal Yavuz
byavuz81@gmail.com
1 attachment(s)

Hi,

This patch aims to improve 027_stream_regress test's regression test
error reporting per Andres' suggestion [1]/messages/by-id/k46gdpibwaevxgb3cefgkl4weykcggal6evlbg5pcei4aswtli@wrs732j5co3p. It basically dumps 50
lines from head and tail of the regression.diff file to the failure
message if both primary and standby is alive and the regression test
is failed.

I used the diag() function to dump regression.diff to the failure
message. I am not sure if that is the best function to use, open to
suggestions.

One disadvantage of this patch is that regress_log_027_stream_regress
has both complete and head+tail of the regression.diff file.

[1]: /messages/by-id/k46gdpibwaevxgb3cefgkl4weykcggal6evlbg5pcei4aswtli@wrs732j5co3p

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v1-0001-Improve-error-reporting-in-027_stream_regress-tes.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Improve-error-reporting-in-027_stream_regress-tes.patch
#2Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Nazir Bilal Yavuz (#1)
Re: Improve error reporting in 027_stream_regress test

Hi,

On 2025-06-30 16:01:04 +0300, Nazir Bilal Yavuz wrote:

This patch aims to improve 027_stream_regress test's regression test
error reporting per Andres' suggestion [1].

Thanks for working on that!

One thing I don't yet like is that I think we should report if the primary is
alive *before* reporting the diff and skipping reporting it if the primary
crashed. It's not interesting to report a long diff if the server crashed IMO.

Greetings,

Andres Freund

#3Nazir Bilal Yavuz
Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Andres Freund (#2)
1 attachment(s)
Re: Improve error reporting in 027_stream_regress test

Hi,

On Mon, 30 Jun 2025 at 18:01, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2025-06-30 16:01:04 +0300, Nazir Bilal Yavuz wrote:

This patch aims to improve 027_stream_regress test's regression test
error reporting per Andres' suggestion [1].

Thanks for working on that!

One thing I don't yet like is that I think we should report if the primary is
alive *before* reporting the diff and skipping reporting it if the primary
crashed. It's not interesting to report a long diff if the server crashed IMO.

I agree with you. So, the current logic is:

If primary is not alive: Do not report anything.
If only primary is alive: Report the entire diff file.
If both primary and standby are alive: Report entire diff file and add
head+tail of diff to the failure message.

Done like above in v2.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v2-0001-Improve-error-reporting-in-027_stream_regress-tes.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Improve-error-reporting-in-027_stream_regress-tes.patch
#4Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Nazir Bilal Yavuz (#3)
Re: Improve error reporting in 027_stream_regress test

On Tue, Jul 01, 2025 at 10:57:11AM +0300, Nazir Bilal Yavuz wrote:

On Mon, 30 Jun 2025 at 18:01, Andres Freund <andres@anarazel.de> wrote:

One thing I don't yet like is that I think we should report if the primary is
alive *before* reporting the diff and skipping reporting it if the primary
crashed. It's not interesting to report a long diff if the server crashed IMO.

Right. That's just more noise one needs to dig into to find out
what's wrong.

I agree with you. So, the current logic is:

If primary is not alive: Do not report anything.

Okay.

If only primary is alive: Report the entire diff file.

Hmm. Is that actually useful as we know that the standby has been
stalen down when running the test? Even if we report something, we
could always trim the output, like the standby case.

If both primary and standby are alive: Report entire diff file and add
head+tail of diff to the failure message.

+=item $node->is_alive()
+
+Check if the node is alive.
+
+=cut
+
+sub is_alive {
+    my ($self) = @_;
+
+    my $host = $self->host;
+    my $port = $self->port;
+    my $null = File::Spec->devnull;
+
+    my $cmd = "pg_isready -h $host -p $port";
+        return !system("$cmd >$null 2>&1");
+}

It's strange to re-add a dependency to system() while we have wrappers
around it, like system_log() in Utils.pm.

You may want to add a "local %ENV = $self->_get_env();" to make sure
that we use a command related to the build of the node initialized.

+sub regression_log_helper
+{
+	my ($diff_file, $lines_to_dump) = @_;
+	my @lines;
+
+	open my $fh, '<', $diff_file or die "couldn't open file: $diff_file\n";
+
+	# Read all lines to process them below
+	while (my $line = <$fh>)
+	{
+		push @lines, $line;
+	}
+	close $fh;

We could use that as a routine of its own in Utils.pm? It's not
really something only for diff files, more something that we can use
to trim the contents of a file, which could be the tail of a server
log file as well.. It's always difficult to measure what a "good"
number of lines is, but perhaps we could use an environment variable
to make that configurable rather than enforce a policy of 50 lines
because we think it's good enough?

The fact that we expect the primary and the standby to be alive once
the tests are run and the addition of two tests related to them seems
unrelated to the diff report improvements, so I'd suggest to do both
things separately.
--
Michael

#5Nazir Bilal Yavuz
Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Michael Paquier (#4)
Re: Improve error reporting in 027_stream_regress test

Hi,

Thanks for looking into this!

On Wed, 16 Jul 2025 at 04:39, Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Jul 01, 2025 at 10:57:11AM +0300, Nazir Bilal Yavuz wrote:

On Mon, 30 Jun 2025 at 18:01, Andres Freund <andres@anarazel.de> wrote:

One thing I don't yet like is that I think we should report if the primary is
alive *before* reporting the diff and skipping reporting it if the primary
crashed. It's not interesting to report a long diff if the server crashed IMO.

Right. That's just more noise one needs to dig into to find out
what's wrong.

I agree with you. So, the current logic is:

If primary is not alive: Do not report anything.

Okay.

If only primary is alive: Report the entire diff file.

Hmm. Is that actually useful as we know that the standby has been
stalen down when running the test? Even if we report something, we
could always trim the output, like the standby case.

I guess you are right. Also, I tried killing standby while the
027_stream_regress test was running and the test did not fail; instead
it continued until timeout. If that is always the case, then it makes
sense to report if and only if both primary and standby are alive.

If both primary and standby are alive: Report entire diff file and add
head+tail of diff to the failure message.

+=item $node->is_alive()
+
+Check if the node is alive.
+
+=cut
+
+sub is_alive {
+    my ($self) = @_;
+
+    my $host = $self->host;
+    my $port = $self->port;
+    my $null = File::Spec->devnull;
+
+    my $cmd = "pg_isready -h $host -p $port";
+        return !system("$cmd >$null 2>&1");
+}

It's strange to re-add a dependency to system() while we have wrappers
around it, like system_log() in Utils.pm.

system_log() logs which command is run. I actually do not want to log
the command because it needs to be run after the regression tests and
before the 'regression tests pass'. If I log the command it looks like
this:

# All 228 tests passed.
# Running: pg_isready -h /tmp/EXCcSldGZE -p 10493 >/dev/null 2>&1
# Running: pg_isready -h /tmp/EXCcSldGZE -p 10494 >/dev/null 2>&1
(21.939s) ok 2 - regression tests pass
(0.000s) ok 3 - primary is alive after the regression tests
(0.000s) ok 4 - standby is alive after the regression tests

I think it looks like these pg_isready commands run randomly.

You may want to add a "local %ENV = $self->_get_env();" to make sure
that we use a command related to the build of the node initialized.

You are right, I will apply this.

+sub regression_log_helper
+{
+       my ($diff_file, $lines_to_dump) = @_;
+       my @lines;
+
+       open my $fh, '<', $diff_file or die "couldn't open file: $diff_file\n";
+
+       # Read all lines to process them below
+       while (my $line = <$fh>)
+       {
+               push @lines, $line;
+       }
+       close $fh;

We could use that as a routine of its own in Utils.pm? It's not
really something only for diff files, more something that we can use
to trim the contents of a file, which could be the tail of a server
log file as well.. It's always difficult to measure what a "good"
number of lines is, but perhaps we could use an environment variable
to make that configurable rather than enforce a policy of 50 lines
because we think it's good enough?

I think this is a good idea. How about a function called file_trim()
with three arguments: the file name, a mode ('head' or 'tail'), and
the number of lines to trim from that end. This approach may require
reading the file more than once, but it is easy to use.

The fact that we expect the primary and the standby to be alive once
the tests are run and the addition of two tests related to them seems
unrelated to the diff report improvements, so I'd suggest to do both
things separately.

Correct, I will separate them.

--
Regards,
Nazir Bilal Yavuz
Microsoft

#6Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Nazir Bilal Yavuz (#5)
Re: Improve error reporting in 027_stream_regress test

On Wed, Jul 16, 2025 at 02:32:53PM +0300, Nazir Bilal Yavuz wrote:

On Wed, 16 Jul 2025 at 04:39, Michael Paquier <michael@paquier.xyz> wrote:

Hmm. Is that actually useful as we know that the standby has been
stalen down when running the test? Even if we report something, we
could always trim the output, like the standby case.

My fingers and brain have flipped here. I meant likely
s/stalen/taken/.

I guess you are right. Also, I tried killing standby while the
027_stream_regress test was running and the test did not fail; instead
it continued until timeout. If that is always the case, then it makes
sense to report if and only if both primary and standby are alive.

Okay. Understood. I'm pretty sure that somebody around you has also
run the test and crashed the standby on replay, to note that the
pattern on HEAD was bad.

system_log() logs which command is run. I actually do not want to log
the command because it needs to be run after the regression tests and
before the 'regression tests pass'. If I log the command it looks like
this:

# All 228 tests passed.
# Running: pg_isready -h /tmp/EXCcSldGZE -p 10493 >/dev/null 2>&1
# Running: pg_isready -h /tmp/EXCcSldGZE -p 10494 >/dev/null 2>&1
(21.939s) ok 2 - regression tests pass
(0.000s) ok 3 - primary is alive after the regression tests
(0.000s) ok 4 - standby is alive after the regression tests

I think it looks like these pg_isready commands run randomly.

FWIW, I'm usually in favor of a bit more logging, to understand what
happens in the test script under-the-hood. That's less guessing when
calling these routines. If I'm outvoted, that's fine.

We could use that as a routine of its own in Utils.pm? It's not
really something only for diff files, more something that we can use
to trim the contents of a file, which could be the tail of a server
log file as well.. It's always difficult to measure what a "good"
number of lines is, but perhaps we could use an environment variable
to make that configurable rather than enforce a policy of 50 lines
because we think it's good enough?

I think this is a good idea. How about a function called file_trim()
with three arguments: the file name, a mode ('head' or 'tail'), and
the number of lines to trim from that end. This approach may require
reading the file more than once, but it is easy to use.

Sounds fine to me. Thanks!
--
Michael

#7Nazir Bilal Yavuz
Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Michael Paquier (#6)
3 attachment(s)
Re: Improve error reporting in 027_stream_regress test

Hi,

On Thu, 17 Jul 2025 at 03:08, Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Jul 16, 2025 at 02:32:53PM +0300, Nazir Bilal Yavuz wrote:

On Wed, 16 Jul 2025 at 04:39, Michael Paquier <michael@paquier.xyz> wrote:

Hmm. Is that actually useful as we know that the standby has been
stalen down when running the test? Even if we report something, we
could always trim the output, like the standby case.

My fingers and brain have flipped here. I meant likely
s/stalen/taken/.

I guess you are right. Also, I tried killing standby while the
027_stream_regress test was running and the test did not fail; instead
it continued until timeout. If that is always the case, then it makes
sense to report if and only if both primary and standby are alive.

Okay. Understood. I'm pretty sure that somebody around you has also
run the test and crashed the standby on replay, to note that the
pattern on HEAD was bad.

Updated, now error reporting happens only if regression tests are
failed and both primary and standby are alive.

system_log() logs which command is run. I actually do not want to log
the command because it needs to be run after the regression tests and
before the 'regression tests pass'. If I log the command it looks like
this:

# All 228 tests passed.
# Running: pg_isready -h /tmp/EXCcSldGZE -p 10493 >/dev/null 2>&1
# Running: pg_isready -h /tmp/EXCcSldGZE -p 10494 >/dev/null 2>&1
(21.939s) ok 2 - regression tests pass
(0.000s) ok 3 - primary is alive after the regression tests
(0.000s) ok 4 - standby is alive after the regression tests

I think it looks like these pg_isready commands run randomly.

FWIW, I'm usually in favor of a bit more logging, to understand what
happens in the test script under-the-hood. That's less guessing when
calling these routines. If I'm outvoted, that's fine.

I don't have a strong opinion on this. My point was 'pg_isready' being
on top of 'regression test pass' and not being on top of
'primary|standby is alive after the regression tests' gives a false
impression to me. However, if you say this is okay; then I can update
it.

We could use that as a routine of its own in Utils.pm? It's not
really something only for diff files, more something that we can use
to trim the contents of a file, which could be the tail of a server
log file as well.. It's always difficult to measure what a "good"
number of lines is, but perhaps we could use an environment variable
to make that configurable rather than enforce a policy of 50 lines
because we think it's good enough?

I think this is a good idea. How about a function called file_trim()
with three arguments: the file name, a mode ('head' or 'tail'), and
the number of lines to trim from that end. This approach may require
reading the file more than once, but it is easy to use.

Sounds fine to me. Thanks!

I added that as 0001. I used a shifting method for the 'tail'
direction to not use too much memory. I found that there is
'File::ReadBackwards' in Perl but you need to install it, so I didn't
use it.

Now patch is divided into three:

0001: 'Add trim_file() helper to Utils.pm' -> Which effectively does
nothing, just adds a function to be used for a subsequent patch. This
function accepts 'line_count' as an argument but
'PG_TEST_FILE_TRIM_LINES' environment variable overrides it. Should I
document 'PG_TEST_FILE_TRIM_LINES' somewhere?

0002: 'Improve error reporting in 027_stream_regress test' -> Uses
trim_file() function to improve error reporting by including the head
and tail of regression.diffs directly in the failure message.

0003: 'Check primary and standby are alive after regression tests in
027_stream_regress' -> Add test for checking status of primary and
standby after the regression tests in 027_stream_regress. Also, it
makes error reporting happen only if regression tests are failed and
both primary and standby are alive.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v3-0001-Add-trim_file-helper-to-Utils.pm.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Add-trim_file-helper-to-Utils.pm.patch
v3-0002-Improve-error-reporting-in-027_stream_regress-tes.patchtext/x-patch; charset=US-ASCII; name=v3-0002-Improve-error-reporting-in-027_stream_regress-tes.patch
v3-0003-Check-primary-and-standby-are-alive-after-regress.patchtext/x-patch; charset=US-ASCII; name=v3-0003-Check-primary-and-standby-are-alive-after-regress.patch
#8Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Nazir Bilal Yavuz (#7)
Re: Improve error reporting in 027_stream_regress test

On Fri, Jul 18, 2025 at 11:57:07AM +0300, Nazir Bilal Yavuz wrote:

I added that as 0001. I used a shifting method for the 'tail'
direction to not use too much memory. I found that there is
'File::ReadBackwards' in Perl but you need to install it, so I didn't
use it.

Some old buildfarm may complain when it comes to that, like some of
Tom's stuff (?).

0001: 'Add trim_file() helper to Utils.pm' -> Which effectively does
nothing, just adds a function to be used for a subsequent patch. This
function accepts 'line_count' as an argument but
'PG_TEST_FILE_TRIM_LINES' environment variable overrides it. Should I
document 'PG_TEST_FILE_TRIM_LINES' somewhere?

Documentation is required in regress.sgml, yes.

0002: 'Improve error reporting in 027_stream_regress test' -> Uses
trim_file() function to improve error reporting by including the head
and tail of regression.diffs directly in the failure message.

The structure is strange, it seems to me that we should target things
so as we have only one PG_TEST_FILE_TRIM_LINES defined in the tree,
not two with one local to 027.

0003: 'Check primary and standby are alive after regression tests in
027_stream_regress' -> Add test for checking status of primary and
standby after the regression tests in 027_stream_regress. Also, it
makes error reporting happen only if regression tests are failed and
both primary and standby are alive.

This one looks acceptable to me, so applied to begin with something,
splitting things into two pieces for clarity with some tweaks. I have
changed things to use system_log() at the end, with fat-commas to link
the long options and their arguments.
--
Michael

#9Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#8)
Re: Improve error reporting in 027_stream_regress test

Michael Paquier <michael@paquier.xyz> writes:

On Fri, Jul 18, 2025 at 11:57:07AM +0300, Nazir Bilal Yavuz wrote:

I added that as 0001. I used a shifting method for the 'tail'
direction to not use too much memory. I found that there is
'File::ReadBackwards' in Perl but you need to install it, so I didn't
use it.

Some old buildfarm may complain when it comes to that, like some of
Tom's stuff (?).

We're trying to avoid requiring anything that's not in a "standard"
Perl installation, with the exception of course of IPC::Run which
there's no way to avoid using. Now, the "standard" set is a bit
fuzzy, but certainly if a module is not present in some major
distro's minimum Perl package set then using it is a problem.

regards, tom lane

#10Nazir Bilal Yavuz
Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Michael Paquier (#8)
Re: Improve error reporting in 027_stream_regress test

Hi,

On Sat, 19 Jul 2025 at 09:06, Michael Paquier <michael@paquier.xyz> wrote:

This one looks acceptable to me, so applied to begin with something,
splitting things into two pieces for clarity with some tweaks. I have
changed things to use system_log() at the end, with fat-commas to link
the long options and their arguments.

Thanks!

I realized that we actually don't trim the file, we do the opposite;
read the file from both ends. Sorry for not realizing earlier. I will
update the remaining patches according to that but I think trim_file()
is helpful, too. What do you think about adding both

```
trim_file() -> trims $n lines from head or tail of the file and
returns the remaining lines.
read_file_ends() -> returns $n lines from head or tail of the file.
```

although trim_file() will not be used in these particular patches?

--
Regards,
Nazir Bilal Yavuz
Microsoft

#11Nazir Bilal Yavuz
Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Michael Paquier (#8)
Re: Improve error reporting in 027_stream_regress test

Hi,

On Sat, 19 Jul 2025 at 09:06, Michael Paquier <michael@paquier.xyz> wrote:

The structure is strange, it seems to me that we should target things
so as we have only one PG_TEST_FILE_TRIM_LINES defined in the tree,
not two with one local to 027.

I see your point. Then the problem is regression_log_helper() can't
check if $lines_to_dump * 2 is bigger than the number of lines in the
regression.diffs file. So, it can't decide whether or not to dump all
of the regression.diffs. Would it be okay if we repeat some lines in
both the head and tail lines?

--
Regards,
Nazir Bilal Yavuz
Microsoft

#12Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Nazir Bilal Yavuz (#10)
Re: Improve error reporting in 027_stream_regress test

On Mon, Jul 21, 2025 at 11:53:00AM +0300, Nazir Bilal Yavuz wrote:

I realized that we actually don't trim the file, we do the opposite;
read the file from both ends. Sorry for not realizing earlier. I will
update the remaining patches according to that but I think trim_file()
is helpful, too. What do you think about adding both

I did not review the contents of the patch yet, as I was rather unsure
about the right semantics here.

```
trim_file() -> trims $n lines from head or tail of the file and
returns the remaining lines.
read_file_ends() -> returns $n lines from head or tail of the file.
```

although trim_file() will not be used in these particular patches?

And this made me wonder over the weekend if only showing the head
and/or tail of a file is always the best set of properties.
Then I came up that this could be something close to what git-grep
-A/-B is able to do. For example, for a crash, it would be much
cheaper to target a log entry that matches with what we see in the
usual crash stacks, then print the surroundings. The "trim" behavior
you have proposed is a subset of that, where the matching patterns are
the end and/or the beginning of the file to show.

So the API could com down to a pattern matching, with "head" and
"tail" being the optional number of lines we'd want to print around
the pattern we are looking for.
--
Michael

#13Nazir Bilal Yavuz
Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Michael Paquier (#12)
Re: Improve error reporting in 027_stream_regress test

Hi,

On Tue, 22 Jul 2025 at 03:56, Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Jul 21, 2025 at 11:53:00AM +0300, Nazir Bilal Yavuz wrote:

I realized that we actually don't trim the file, we do the opposite;
read the file from both ends. Sorry for not realizing earlier. I will
update the remaining patches according to that but I think trim_file()
is helpful, too. What do you think about adding both

I did not review the contents of the patch yet, as I was rather unsure
about the right semantics here.

```
trim_file() -> trims $n lines from head or tail of the file and
returns the remaining lines.
read_file_ends() -> returns $n lines from head or tail of the file.
```

although trim_file() will not be used in these particular patches?

And this made me wonder over the weekend if only showing the head
and/or tail of a file is always the best set of properties.
Then I came up that this could be something close to what git-grep
-A/-B is able to do. For example, for a crash, it would be much
cheaper to target a log entry that matches with what we see in the
usual crash stacks, then print the surroundings. The "trim" behavior
you have proposed is a subset of that, where the matching patterns are
the end and/or the beginning of the file to show.

So the API could com down to a pattern matching, with "head" and
"tail" being the optional number of lines we'd want to print around
the pattern we are looking for.

That makes sense and could be really useful in more general cases but
I think that discussion might be better suited for a separate thread.

For this specific case, we are dealing with regression.diffs; which
already contains the relevant diff output. We don't need to search for
patterns here; we just want to include a small portion of the file in
the error message, to spare users from having to open the file
manually.

--
Regards,
Nazir Bilal Yavuz
Microsoft

#14Nazir Bilal Yavuz
Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Nazir Bilal Yavuz (#13)
2 attachment(s)
Re: Improve error reporting in 027_stream_regress test

Hi,

On Thu, 24 Jul 2025 at 13:34, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

Hi,

On Tue, 22 Jul 2025 at 03:56, Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Jul 21, 2025 at 11:53:00AM +0300, Nazir Bilal Yavuz wrote:

I realized that we actually don't trim the file, we do the opposite;
read the file from both ends. Sorry for not realizing earlier. I will
update the remaining patches according to that but I think trim_file()
is helpful, too. What do you think about adding both

I did not review the contents of the patch yet, as I was rather unsure
about the right semantics here.

```
trim_file() -> trims $n lines from head or tail of the file and
returns the remaining lines.
read_file_ends() -> returns $n lines from head or tail of the file.
```

although trim_file() will not be used in these particular patches?

And this made me wonder over the weekend if only showing the head
and/or tail of a file is always the best set of properties.
Then I came up that this could be something close to what git-grep
-A/-B is able to do. For example, for a crash, it would be much
cheaper to target a log entry that matches with what we see in the
usual crash stacks, then print the surroundings. The "trim" behavior
you have proposed is a subset of that, where the matching patterns are
the end and/or the beginning of the file to show.

So the API could com down to a pattern matching, with "head" and
"tail" being the optional number of lines we'd want to print around
the pattern we are looking for.

That makes sense and could be really useful in more general cases but
I think that discussion might be better suited for a separate thread.

For this specific case, we are dealing with regression.diffs; which
already contains the relevant diff output. We don't need to search for
patterns here; we just want to include a small portion of the file in
the error message, to spare users from having to open the file
manually.

I wanted to show what is in my mind, v4 is attached. Summary is:

- 0001 introduces the read_file_ends() function, which reads lines
from either the beginning or end of a given file. It includes a
force_line_count argument that, when set to true, ensures that the
specified number of lines is read from the file regardless of whether
the PG_TEST_FILE_READ_LINES environment variable is set.

- 0002 is the actual patch that improves error reporting in the
027_stream_regress test by using the read_file_ends() function. It
adds a regression_log_helper() function, which reads the
PG_TEST_FILE_READ_LINES environment variable and then calls
read_file_ends() with force_line_count set to true. This approach
avoids any potential race condition where the environment variable
might be modified after being read in the regression_log_helper() and
before used in the read_file_ends().

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v4-0001-Add-read_file_ends-helper-to-Utils.pm.patchtext/x-patch; charset=US-ASCII; name=v4-0001-Add-read_file_ends-helper-to-Utils.pm.patch
v4-0002-Improve-error-reporting-in-027_stream_regress-tes.patchtext/x-patch; charset=US-ASCII; name=v4-0002-Improve-error-reporting-in-027_stream_regress-tes.patch
#15Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Nazir Bilal Yavuz (#14)
Re: Improve error reporting in 027_stream_regress test

On Mon, Jul 28, 2025 at 03:34:06PM +0300, Nazir Bilal Yavuz wrote:

I wanted to show what is in my mind, v4 is attached. Summary is:

- 0001 introduces the read_file_ends() function, which reads lines
from either the beginning or end of a given file. It includes a
force_line_count argument that, when set to true, ensures that the
specified number of lines is read from the file regardless of whether
the PG_TEST_FILE_READ_LINES environment variable is set.

Hmm. I am really wondering if 'head' and 'tail' should just be
shortcuts of some regexp patterns based on ^ and $, and that we could
allow the subroutine to use a regexp for the pattern matching. I'll
look at that in details, but it may take me a couple of days before
getting down to it.

- 0002 is the actual patch that improves error reporting in the
027_stream_regress test by using the read_file_ends() function. It
adds a regression_log_helper() function, which reads the
PG_TEST_FILE_READ_LINES environment variable and then calls
read_file_ends() with force_line_count set to true. This approach
avoids any potential race condition where the environment variable
might be modified after being read in the regression_log_helper() and
before used in the read_file_ends().

Why do you think the environment variable could be changed mid-flight
in the context of a single test run?
--
Michael

#16Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#15)
Re: Improve error reporting in 027_stream_regress test

Hi,

On 2025-07-29 10:46:16 +0900, Michael Paquier wrote:

On Mon, Jul 28, 2025 at 03:34:06PM +0300, Nazir Bilal Yavuz wrote:

I wanted to show what is in my mind, v4 is attached. Summary is:

- 0001 introduces the read_file_ends() function, which reads lines
from either the beginning or end of a given file. It includes a
force_line_count argument that, when set to true, ensures that the
specified number of lines is read from the file regardless of whether
the PG_TEST_FILE_READ_LINES environment variable is set.

Hmm. I am really wondering if 'head' and 'tail' should just be
shortcuts of some regexp patterns based on ^ and $, and that we could
allow the subroutine to use a regexp for the pattern matching. I'll
look at that in details, but it may take me a couple of days before
getting down to it.

I don't really get what the point of designing that mechanism is before we
have a usecase. If we need it, we can expand it at that time.

Greetings,

Andres Freund

#17Alexander Lakhin
Alexander Lakhin
exclusion@gmail.com
In reply to: Nazir Bilal Yavuz (#3)
Re: Improve error reporting in 027_stream_regress test

Hello Nazir and Michael!

01.07.2025 10:57, Nazir Bilal Yavuz wrote:

I agree with you. So, the current logic is:

If primary is not alive: Do not report anything.
If only primary is alive: Report the entire diff file.
If both primary and standby are alive: Report entire diff file and add
head+tail of diff to the failure message.

Done like above in v2.

The new check has failed on mamba [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&amp;dt=2025-07-28%2007%3A46%3A26, apparently because this animal is
too slow for pg_isready:

regress_log_027_stream_regress:
...
# Running: pg_isready --host /home/buildfarm/bf-data/tmp/ep8AOH4m7l --port 24781
/home/buildfarm/bf-data/tmp/ep8AOH4m7l:24781 - no response
[08:01:50.899](1505.313s) ok 2 - regression tests pass
[08:01:50.902](0.003s) ok 3 - primary alive after regression test run
[08:01:50.905](0.003s) not ok 4 - standby alive after regression test run
[08:01:50.908](0.003s)
[08:01:50.908](0.000s) #   Failed test 'standby alive after regression test run'
#   at t/027_stream_regress.pl line 104.
[08:01:50.909](0.001s) #          got: '0'
#     expected: '1'

027_stream_regress_standby_1.log:
2025-07-28 07:37:27.237 EDT [22920:1] LOG:  starting PostgreSQL 19devel on powerpc-unknown-netbsd10.1, compiled by cc
(nb3 20231008) 10.5.0, 32-bit
2025-07-28 07:37:27.239 EDT [22920:2] LOG:  listening on Unix socket "/home/buildfarm/bf-data/tmp/ep8AOH4m7l/.s.PGSQL.24781"
2025-07-28 07:37:27.281 EDT [25395:1] LOG:  database system was interrupted; last known up at 2025-07-28 07:36:48 EDT
2025-07-28 07:37:27.282 EDT [25395:2] LOG:  starting backup recovery with redo LSN 0/02000028, checkpoint LSN
0/02000080, on timeline ID 1
2025-07-28 07:37:27.283 EDT [25395:3] LOG:  entering standby mode
2025-07-28 07:37:27.287 EDT [25395:4] LOG:  redo starts at 0/02000028
...
2025-07-28 08:01:47.884 EDT [6985:1] [unknown] LOG:  connection received: host=[local]
2025-07-28 08:01:48.261 EDT [6985:2] [unknown] LOG:  connection authenticated: user="buildfarm" method=trust
(/home/buildfarm/bf-data/HEAD/pgsql.build/src/test/recovery/tmp_check/t_027_stream_regress_standby_1_data/pgdata/pg_hba.conf:117)
2025-07-28 08:01:48.261 EDT [6985:3] [unknown] LOG:  connection authorized: user=buildfarm database=postgres
application_name=027_stream_regress.pl
2025-07-28 08:01:51.552 EDT [6985:4] 027_stream_regress.pl LOG: could not send data to client: Broken pipe

### 3 seconds is the pg_isready's default timeout

2025-07-28 08:01:51.552 EDT [6985:5] 027_stream_regress.pl FATAL: connection to client lost
2025-07-28 08:01:51.552 EDT [6985:6] 027_stream_regress.pl LOG: disconnection: session time: 0:00:03.670 user=buildfarm
database=postgres host=[local]
...

What do you think of increasing the timeout (e.g. , to 10 seconds)?

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&amp;dt=2025-07-28%2007%3A46%3A26

Best regards,
Alexander

#18Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Lakhin (#17)
Re: Improve error reporting in 027_stream_regress test

Alexander Lakhin <exclusion@gmail.com> writes:

The new check has failed on mamba [1], apparently because this animal is
too slow for pg_isready:

There is something strange happening on mamba --- not sure what,
but its cycle time for the past week has been a lot more than normal.
I plan to power-cycle it tomorrow and see if that does anything.
In the meantime, I'd not put a lot of stock in that failure.

regards, tom lane

#19Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#18)
1 attachment(s)
Re: Improve error reporting in 027_stream_regress test

On Tue, Jul 29, 2025 at 12:41:39AM -0400, Tom Lane wrote:

Alexander Lakhin <exclusion@gmail.com> writes:

The new check has failed on mamba [1], apparently because this animal is
too slow for pg_isready:

There is something strange happening on mamba --- not sure what,
but its cycle time for the past week has been a lot more than normal.
I plan to power-cycle it tomorrow and see if that does anything.
In the meantime, I'd not put a lot of stock in that failure.

As far as I can see, based on the logs, the standby seems to be
lagging behind in terms of replay. Anyway, a consistent state is
reached way before the pg_isready call is done (07:37:27 vs 08:01:50),
so pg_isready should report something as the standby is ready for
connections. And it's true that 3s would be very short in smallish
environments.

We are getting PQPING_NO_RESPONSE meaning a lack of report activity
from the postmaster. An increase in timeout may help, but the host
seems like it's facing a high workload so it's not really possible to
come with a perfect number, just an estimation. How about adding a
--timeout to pg_isready based on PGCONNECT_TIMEOUT, like in the
attached? At least that would be more in line with the other tests,
and we'd have more leverage over the timing of is_alive(). Default is
180s.
--
Michael

Attachments:

0001-Add-optional-timeout-to-pg_isready-call-for-TAP-test.patchtext/x-diff; charset=us-ascii
#20Nazir Bilal Yavuz
Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Michael Paquier (#15)
Re: Improve error reporting in 027_stream_regress test

Hi,

On Tue, 29 Jul 2025 at 04:46, Michael Paquier <michael@paquier.xyz> wrote:

- 0002 is the actual patch that improves error reporting in the
027_stream_regress test by using the read_file_ends() function. It
adds a regression_log_helper() function, which reads the
PG_TEST_FILE_READ_LINES environment variable and then calls
read_file_ends() with force_line_count set to true. This approach
avoids any potential race condition where the environment variable
might be modified after being read in the regression_log_helper() and
before used in the read_file_ends().

Why do you think the environment variable could be changed mid-flight
in the context of a single test run?

I mean it is a run-time variable, so it is up to the user. The timing
between these two operations are extremely tight, so it is very
unlikely but not impossible. Do you think this edge case is not worth
accounting for?

--
Regards,
Nazir Bilal Yavuz
Microsoft

#21Nazir Bilal Yavuz
Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Michael Paquier (#19)
Re: Improve error reporting in 027_stream_regress test

Hi,

Thank you all for reporting the failure and sharing your inputs!

On Tue, 29 Jul 2025 at 09:07, Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Jul 29, 2025 at 12:41:39AM -0400, Tom Lane wrote:

Alexander Lakhin <exclusion@gmail.com> writes:

The new check has failed on mamba [1], apparently because this animal is
too slow for pg_isready:

There is something strange happening on mamba --- not sure what,
but its cycle time for the past week has been a lot more than normal.
I plan to power-cycle it tomorrow and see if that does anything.
In the meantime, I'd not put a lot of stock in that failure.

As far as I can see, based on the logs, the standby seems to be
lagging behind in terms of replay. Anyway, a consistent state is
reached way before the pg_isready call is done (07:37:27 vs 08:01:50),
so pg_isready should report something as the standby is ready for
connections. And it's true that 3s would be very short in smallish
environments.

We are getting PQPING_NO_RESPONSE meaning a lack of report activity
from the postmaster. An increase in timeout may help, but the host
seems like it's facing a high workload so it's not really possible to
come with a perfect number, just an estimation. How about adding a
--timeout to pg_isready based on PGCONNECT_TIMEOUT, like in the
attached? At least that would be more in line with the other tests,
and we'd have more leverage over the timing of is_alive(). Default is
180s.

I think it makes sense to add this timeout, as it would be more in
line with the other tests like you said.

--
Regards,
Nazir Bilal Yavuz
Microsoft

#22Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#19)
Re: Improve error reporting in 027_stream_regress test

Michael Paquier <michael@paquier.xyz> writes:

On Tue, Jul 29, 2025 at 12:41:39AM -0400, Tom Lane wrote:

There is something strange happening on mamba --- not sure what,
but its cycle time for the past week has been a lot more than normal.

We are getting PQPING_NO_RESPONSE meaning a lack of report activity
from the postmaster. An increase in timeout may help, but the host
seems like it's facing a high workload so it's not really possible to
come with a perfect number, just an estimation. How about adding a
--timeout to pg_isready based on PGCONNECT_TIMEOUT, like in the
attached?

Yeah, looking closely at the failure logs [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&amp;dt=2025-07-28%2007%3A46%3A26, the standby postmaster
reported pg_isready's connection request in good order at about
08:01:47, but then nothing happened for a few seconds:

2025-07-28 08:01:47.884 EDT [6985:1] [unknown] LOG: connection received: host=[local]
2025-07-28 08:01:48.261 EDT [6985:2] [unknown] LOG: connection authenticated: user="buildfarm" method=trust (/home/buildfarm/bf-data/HEAD/pgsql.build/src/test/recovery/tmp_check/t_027_stream_regress_standby_1_data/pgdata/pg_hba.conf:117)
2025-07-28 08:01:48.261 EDT [6985:3] [unknown] LOG: connection authorized: user=buildfarm database=postgres application_name=027_stream_regress.pl
2025-07-28 08:01:51.552 EDT [6985:4] 027_stream_regress.pl LOG: could not send data to client: Broken pipe
2025-07-28 08:01:51.552 EDT [6985:5] 027_stream_regress.pl FATAL: connection to client lost
2025-07-28 08:01:51.552 EDT [6985:6] 027_stream_regress.pl LOG: disconnection: session time: 0:00:03.670 user=buildfarm database=postgres host=[local]

The client side reported failure at 08:01:50 and a bit, matching
pg_isready's default 3-second timeout:

# Running: pg_isready --host /home/buildfarm/bf-data/tmp/ep8AOH4m7l --port 24780
/home/buildfarm/bf-data/tmp/ep8AOH4m7l:24780 - accepting connections
# Running: pg_isready --host /home/buildfarm/bf-data/tmp/ep8AOH4m7l --port 24781
/home/buildfarm/bf-data/tmp/ep8AOH4m7l:24781 - no response
[08:01:50.899](1505.313s) ok 2 - regression tests pass
[08:01:50.902](0.003s) ok 3 - primary alive after regression test run
[08:01:50.905](0.003s) not ok 4 - standby alive after regression test run
[08:01:50.908](0.003s)
[08:01:50.908](0.000s) # Failed test 'standby alive after regression test run'
# at t/027_stream_regress.pl line 104.
[08:01:50.909](0.001s) # got: '0'
# expected: '1'

So I agree that this looks a lot like pg_isready timed out. It's not
too surprising if the machine was heavily loaded at that point, since
the standby would probably still be replaying the last actions of the
regression tests.

+1 for lengthening the timeout, as I see you already did.

regards, tom lane

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&amp;dt=2025-07-28%2007%3A46%3A26

#23Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#22)
Re: Improve error reporting in 027_stream_regress test

I wrote:

Michael Paquier <michael@paquier.xyz> writes:

We are getting PQPING_NO_RESPONSE meaning a lack of report activity
from the postmaster. An increase in timeout may help, but the host
seems like it's facing a high workload so it's not really possible to
come with a perfect number, just an estimation. How about adding a
--timeout to pg_isready based on PGCONNECT_TIMEOUT, like in the
attached?

+1 for lengthening the timeout, as I see you already did.

I just marked the CF entry for this thread committed [1]https://commitfest.postgresql.org/patch/5900/, because I
saw that the cfbot was reporting its patch-of-record as committed.
However, looking back now at the thread, it looks like there was
some unfinished work from earlier patches. If you are still
interested in pursuing that, please post a current patch to
de-confuse the bot and set the CF entry back to needs-review.

regards, tom lane

[1]: https://commitfest.postgresql.org/patch/5900/

#24Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#23)
Re: Improve error reporting in 027_stream_regress test

On Sat, Nov 15, 2025 at 04:58:17PM -0500, Tom Lane wrote:

I just marked the CF entry for this thread committed [1], because I
saw that the cfbot was reporting its patch-of-record as committed.

Thanks. I've had this item in my bucket of things to do for some time
but got drifted away. I am not sure if I would be able to come back
to it in the next couple of weeks, either.
--
Michael

#25Brandon Tat
Brandon Tat
brandontat6@gmail.com
In reply to: Michael Paquier (#24)
Re: Improve error reporting in 027_stream_regress test

Hi Michael,

Regarding the function regression_log_helper(), this function reads all the lines in the logs at line 219 of src/test/recovery/t/027_stream_regress.pl. It seems wasteful to read the file again twice in read_file_ends(). Alternatively, we could read the file once within regression_log_helper() and index lines to emit the lines that we want.

- Brandon & Ben

#26Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Brandon Tat (#25)
Re: Improve error reporting in 027_stream_regress test

On Wed, Dec 03, 2025 at 11:01:31PM -0800, Brandon Tat wrote:

Regarding the function regression_log_helper(), this function reads
all the lines in the logs at line 219 of
src/test/recovery/t/027_stream_regress.pl. It seems wasteful to read
the file again twice in read_file_ends(). Alternatively, we could
read the file once within regression_log_helper() and index lines to
emit the lines that we want.

It seems to me that you are looking at v4-0001 and v4-0002 posted at
[1]: /messages/by-id/CAN55FZ09GfUAg5X5CFDgSA0K-u3W8e2HoWHW7CdMFdo_7Fa-Ug@mail.gmail.com -- Michael
read_file_ends() once with the file opened once, with three modes
instead of the two presented in the patch: fetch the head, the tail,
or both at the same time. Yes, that would be more efficient.

While looking at the patch with fresher eyes (didn't look at this
thread for a couple of months, sorry), it looks like there is no point
in having regression_log_helper() at all. We could just return the
tail and the head in a single call of read_file_ends() with two output
variables. Then we could embed in read_file_ends() the knowledge that
if we are dealing with a file that has less lines than twice
PG_TEST_FILE_READ_LINES, we can just print the whole file, returning
only the full contents as in a variable for what would have been the
head content, leaving the tail content empty.

If somebody would like to send a patch among these lines, feel free..

[1]: /messages/by-id/CAN55FZ09GfUAg5X5CFDgSA0K-u3W8e2HoWHW7CdMFdo_7Fa-Ug@mail.gmail.com -- Michael
--
Michael

#27Bilal Yavuz
Bilal Yavuz
byavuz81@gmail.com
In reply to: Michael Paquier (#26)
2 attachment(s)
Re: Improve error reporting in 027_stream_regress test

Hi Michael, Brandon and Ben,

Thank you for looking into this!

On Thu, 4 Dec 2025 at 10:42, Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Dec 03, 2025 at 11:01:31PM -0800, Brandon Tat wrote:

Regarding the function regression_log_helper(), this function reads
all the lines in the logs at line 219 of
src/test/recovery/t/027_stream_regress.pl. It seems wasteful to read
the file again twice in read_file_ends(). Alternatively, we could
read the file once within regression_log_helper() and index lines to
emit the lines that we want.

It seems to me that you are looking at v4-0001 and v4-0002 posted at
[1], which I did not author. So your suggestion would be to call
read_file_ends() once with the file opened once, with three modes
instead of the two presented in the patch: fetch the head, the tail,
or both at the same time. Yes, that would be more efficient.

While looking at the patch with fresher eyes (didn't look at this
thread for a couple of months, sorry), it looks like there is no point
in having regression_log_helper() at all. We could just return the
tail and the head in a single call of read_file_ends() with two output
variables. Then we could embed in read_file_ends() the knowledge that
if we are dealing with a file that has less lines than twice
PG_TEST_FILE_READ_LINES, we can just print the whole file, returning
only the full contents as in a variable for what would have been the
head content, leaving the tail content empty.

If somebody would like to send a patch among these lines, feel free..

I applied these feedbacks in v5. I wanted to cover all possible cases
so I think 0002 might be a bit more complicated than it needs to be.

What do you think about the current implementation?

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v5-0001-Add-read_file_ends-helper-to-Utils.pm.patchtext/x-patch; charset=US-ASCII; name=v5-0001-Add-read_file_ends-helper-to-Utils.pm.patch
v5-0002-Improve-error-reporting-in-027_stream_regress-tes.patchtext/x-patch; charset=US-ASCII; name=v5-0002-Improve-error-reporting-in-027_stream_regress-tes.patch
#28Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Bilal Yavuz (#27)
2 attachment(s)
Re: Improve error reporting in 027_stream_regress test

On Thu, Dec 04, 2025 at 03:33:30PM +0300, Bilal Yavuz wrote:

I applied these feedbacks in v5. I wanted to cover all possible cases
so I think 0002 might be a bit more complicated than it needs to be.

What do you think about the current implementation?

I'm finding that a bit bloated. My own attempt is the attached, which
is much simpler, returning only array for the lines of @tail and
@head. I am not sure to see the use-case in favor of enforcing the
line count for the caller of the new routine based on what's on this
thread, so I have left that out to simplify the patch.

v5 had a mistake: slurp_file() on the full diffs should be removed
once we dump only the tail and head. It is true that 027 would add an
extra line if regression.diffs is too short when written this way, but
the information reported is the same, while keeping the code simpler
in 027. A portion of the comment at the top of the block printing the
diffs could be removed: with the limitation of lines in place we don't
bloat the output anymore.
--
Michael

Attachments:

v6-0001-Add-read_head_tail-helper-to-Utils.pm.patchtext/x-diff; charset=us-ascii
v6-0002-Improve-error-reporting-in-027_stream_regress-tes.patchtext/x-diff; charset=us-ascii
#29Bilal Yavuz
Bilal Yavuz
byavuz81@gmail.com
In reply to: Michael Paquier (#28)
Re: Improve error reporting in 027_stream_regress test

Hi,

On Fri, 5 Dec 2025 at 07:37, Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Dec 04, 2025 at 03:33:30PM +0300, Bilal Yavuz wrote:

I applied these feedbacks in v5. I wanted to cover all possible cases
so I think 0002 might be a bit more complicated than it needs to be.

What do you think about the current implementation?

I'm finding that a bit bloated. My own attempt is the attached, which
is much simpler, returning only array for the lines of @tail and
@head. I am not sure to see the use-case in favor of enforcing the
line count for the caller of the new routine based on what's on this
thread, so I have left that out to simplify the patch.

I liked your version more. My only complaint was removing the
line_count argument but we can easily add it back when we need it.

v5 had a mistake: slurp_file() on the full diffs should be removed
once we dump only the tail and head. It is true that 027 would add an
extra line if regression.diffs is too short when written this way, but
the information reported is the same, while keeping the code simpler
in 027. A portion of the comment at the top of the block printing the
diffs could be removed: with the limitation of lines in place we don't
bloat the output anymore.

You are right, thank you for handling that.

v6 LGTM.

--
Regards,
Nazir Bilal Yavuz
Microsoft

#30Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Bilal Yavuz (#29)
Re: Improve error reporting in 027_stream_regress test

On Fri, Dec 05, 2025 at 10:50:25AM +0300, Bilal Yavuz wrote:

I liked your version more. My only complaint was removing the
line_count argument but we can easily add it back when we need it.

Yes, I am not sure about that at this stage. It would be good to see
more use cases for this new API before deciding anything, to let the
callers of this new routine enforce their fate.

You are right, thank you for handling that.

Okay, applied then. That should close this thread.
--
Michael