Instability in postgres_fdw regression tests

Started by Tom Lane4 months ago5 messageshackers

tgl@sss.pgh.pa.us

4 months ago

Several BF animals have intermittently shown this regression diff:

diff -U3 /home/bf/bf-build/culicidae/HEAD/pgsql/contrib/postgres_fdw/expected/postgres_fdw.out /home/bf/bf-build/culicidae/HEAD/pgsql.build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out
--- /home/bf/bf-build/culicidae/HEAD/pgsql/contrib/postgres_fdw/expected/postgres_fdw.out	2025-12-29 19:48:22.661603936 +0100
+++ /home/bf/bf-build/culicidae/HEAD/pgsql.build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out	2026-02-10 00:31:31.856460156 +0100
@@ -6519,6 +6519,7 @@
 UPDATE ft2 SET c3 = 'bar' WHERE postgres_fdw_abs(c1) > 2000 RETURNING *;
   c1  | c2 | c3  | c4 | c5 | c6 |     c7     | c8 
 ------+----+-----+----+----+----+------------+----
+ 2010 |  0 | bar |    |    |    | ft2        | 
  2001 |  1 | bar |    |    |    | ft2        | 
  2002 |  2 | bar |    |    |    | ft2        | 
  2003 |  3 | bar |    |    |    | ft2        | 
@@ -6528,7 +6529,6 @@
  2007 |  7 | bar |    |    |    | ft2        | 
  2008 |  8 | bar |    |    |    | ft2        | 
  2009 |  9 | bar |    |    |    | ft2        | 
- 2010 |  0 | bar |    |    |    | ft2        | 
 (10 rows)

EXPLAIN (verbose, costs off)

The above is from culicidae [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2026-02-09%2023%3A21%3A25, and greenfly has shown it a few times
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=greenfly&dt=2026-01-20%2004%3A32%3A29
and there are a few more in the past 90 days.

It's pretty clear what is happening: the rows we are looking at are
being returned by a seqscan, and they were just inserted a few lines
above into a table that has been modified multiple times already.
So the test is reliant on them being inserted in sequence at the
end of the table, yet sometimes the last row is going into free
space someplace earlier.

One's first instinct is to blame autovacuum, but the test already
goes out of its way to disable that:

-- Disable autovacuum for these tables to avoid unexpected effects of that
ALTER TABLE "S 1"."T 1" SET (autovacuum_enabled = 'false');

After experimenting for awhile I think I have (part of) the answer.
All of the failing animals are using meson, which means that this
"installcheck" test case is probably running in parallel with other
test scripts in other databases in the same cluster. I've not
reproduced the exact symptom seen in the buildfarm, but I can easily
make the postgres_fdw test put these rows at different ctids if
I leave a transaction sitting open in a different database. So I
think we are seeing some effect of opportunistic page pruning behaving
differently depending on whether there is a concurrent transaction.

It's not clear to me that it's worth running this to ground in any
more detail than that. The behavior is not wrong; it's the test's
fault to assume that these rows will be returned in a deterministic
order. So I think the right fix is to adjust the test query,
along the lines of

-UPDATE ft2 SET c3 = 'bar' WHERE postgres_fdw_abs(c1) > 2000 RETURNING *;
+WITH cte AS (
+  UPDATE ft2 SET c3 = 'bar' WHERE postgres_fdw_abs(c1) > 2000 RETURNING *
+) SELECT * FROM cte ORDER BY c1;

Thoughts, better ideas?

regards, tom lane

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2026-02-09%2023%3A21%3A25
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=greenfly&dt=2026-01-20%2004%3A32%3A29
[3]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=scorpion&dt=2025-12-09%2015%3A23%3A53
[4]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2025-12-04%2017%3A50%3A38

Nathan Bossart

nathandbossart@gmail.com

4 months ago

In reply to: Tom Lane (#1)

Re: Instability in postgres_fdw regression tests

On Tue, Feb 10, 2026 at 12:06:02PM -0500, Tom Lane wrote:

It's not clear to me that it's worth running this to ground in any
more detail than that. The behavior is not wrong; it's the test's
fault to assume that these rows will be returned in a deterministic
order. So I think the right fix is to adjust the test query,
along the lines of
-UPDATE ft2 SET c3 = 'bar' WHERE postgres_fdw_abs(c1) > 2000 RETURNING *;
+WITH cte AS (
+  UPDATE ft2 SET c3 = 'bar' WHERE postgres_fdw_abs(c1) > 2000 RETURNING *
+) SELECT * FROM cte ORDER BY c1;

+1. I faintly recall looking into this a while ago and, for some reason, I
was worried that this would become a game of Whac-A-Mole, so apparently I
didn't follow through. But fixing this query is still an improvement over
the status quo.

--
nathan

Alexander Lakhin

exclusion@gmail.com

4 months ago

In reply to: Tom Lane (#1)

Re: Instability in postgres_fdw regression tests

Hello Tom,

10.02.2026 19:06, Tom Lane wrote:

Several BF animals have intermittently shown this regression diff:

diff -U3 /home/bf/bf-build/culicidae/HEAD/pgsql/contrib/postgres_fdw/expected/postgres_fdw.out /home/bf/bf-build/culicidae/HEAD/pgsql.build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out
--- /home/bf/bf-build/culicidae/HEAD/pgsql/contrib/postgres_fdw/expected/postgres_fdw.out	2025-12-29 19:48:22.661603936 +0100
+++ /home/bf/bf-build/culicidae/HEAD/pgsql.build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out	2026-02-10 00:31:31.856460156 +0100
@@ -6519,6 +6519,7 @@
UPDATE ft2 SET c3 = 'bar' WHERE postgres_fdw_abs(c1) > 2000 RETURNING *;
c1  | c2 | c3  | c4 | c5 | c6 |     c7     | c8
------+----+-----+----+----+----+------------+----
+ 2010 |  0 | bar |    |    |    | ft2        |
2001 |  1 | bar |    |    |    | ft2        |
2002 |  2 | bar |    |    |    | ft2        |
2003 |  3 | bar |    |    |    | ft2        |
@@ -6528,7 +6529,6 @@
2007 |  7 | bar |    |    |    | ft2        |
2008 |  8 | bar |    |    |    | ft2        |
2009 |  9 | bar |    |    |    | ft2        |
- 2010 |  0 | bar |    |    |    | ft2        |
(10 rows)

EXPLAIN (verbose, costs off)

The above is from culicidae [1], and greenfly has shown it a few times
[2], and here's one from scorpion [3], and crake on a back branch [4],
and there are a few more in the past 90 days.

FWIW, all the failures of this ilk are tracked at [1]https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#postgres_fdw.sql_might_fail_due_to_autovacuum.

It's pretty clear what is happening: the rows we are looking at are
being returned by a seqscan, and they were just inserted a few lines
above into a table that has been modified multiple times already.
So the test is reliant on them being inserted in sequence at the
end of the table, yet sometimes the last row is going into free
space someplace earlier.

One's first instinct is to blame autovacuum, but the test already
goes out of its way to disable that:

-- Disable autovacuum for these tables to avoid unexpected effects of that
ALTER TABLE "S 1"."T 1" SET (autovacuum_enabled = 'false');

I reproduced and reported the failure before, please look at [2]/messages/by-id/867266ef-3dd1-44a9-a203-27cb5d2be58d@gmail.com —
probably some information there could be helpful...

[1]: https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#postgres_fdw.sql_might_fail_due_to_autovacuum
[2]: /messages/by-id/867266ef-3dd1-44a9-a203-27cb5d2be58d@gmail.com

Best regards,
Alexander

Tom Lane

tgl@sss.pgh.pa.us

4 months ago

In reply to: Nathan Bossart (#2)

Re: Instability in postgres_fdw regression tests

Nathan Bossart <nathandbossart@gmail.com> writes:

On Tue, Feb 10, 2026 at 12:06:02PM -0500, Tom Lane wrote:
It's not clear to me that it's worth running this to ground in any
more detail than that. The behavior is not wrong; it's the test's
fault to assume that these rows will be returned in a deterministic
order. So I think the right fix is to adjust the test query,
along the lines of
-UPDATE ft2 SET c3 = 'bar' WHERE postgres_fdw_abs(c1) > 2000 RETURNING *;
+WITH cte AS (
+  UPDATE ft2 SET c3 = 'bar' WHERE postgres_fdw_abs(c1) > 2000 RETURNING *
+) SELECT * FROM cte ORDER BY c1;

+1. I faintly recall looking into this a while ago and, for some reason, I
was worried that this would become a game of Whac-A-Mole, so apparently I
didn't follow through. But fixing this query is still an improvement over
the status quo.

Yeah, it's certainly fair to wonder where else we have
even-lower-probability test interactions. But I don't think
getting rid of the interaction is realistic, especially given
Alexander's results (which I confess to having forgotten about)
that show that autovacuum is involved in this somehow despite
being disabled on this particular table. So the answer has to
be to make the test case more robust against such things.

regards, tom lane

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

4 months ago

In reply to: Tom Lane (#4)

Re: Instability in postgres_fdw regression tests

On Wed, Feb 11, 2026 at 3:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:
On Tue, Feb 10, 2026 at 12:06:02PM -0500, Tom Lane wrote:
It's not clear to me that it's worth running this to ground in any
more detail than that. The behavior is not wrong; it's the test's
fault to assume that these rows will be returned in a deterministic
order. So I think the right fix is to adjust the test query,
along the lines of
-UPDATE ft2 SET c3 = 'bar' WHERE postgres_fdw_abs(c1) > 2000 RETURNING *;
+WITH cte AS (
+  UPDATE ft2 SET c3 = 'bar' WHERE postgres_fdw_abs(c1) > 2000 RETURNING *
+) SELECT * FROM cte ORDER BY c1;
+1. I faintly recall looking into this a while ago and, for some reason, I
was worried that this would become a game of Whac-A-Mole, so apparently I
didn't follow through. But fixing this query is still an improvement over
the status quo.

Yeah, it's certainly fair to wonder where else we have
even-lower-probability test interactions. But I don't think
getting rid of the interaction is realistic, especially given
Alexander's results (which I confess to having forgotten about)
that show that autovacuum is involved in this somehow despite
being disabled on this particular table. So the answer has to
be to make the test case more robust against such things.

+1 for that. I noticed this problem because of Alexander's report,
but I completely forgot about it...

Thanks!

Best regards,
Etsuro Fujita