Stabilizing the test_decoding checks, take N

Started by Tom Lane over 3 years ago · 12 messages
#1Tom Lane
tgl@sss.pgh.pa.us

My pet dinosaur prairiedog just failed in the contrib/test_decoding
tests [1]:

diff -U3 /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/expected/stream.out /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/results/stream.out
--- /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/expected/stream.out	2022-04-15 07:59:17.000000000 -0400
+++ /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/results/stream.out	2022-04-15 09:06:36.000000000 -0400
@@ -77,10 +77,12 @@
  streaming change for transaction
  streaming change for transaction
  streaming change for transaction
+ closing a streamed block for transaction
+ opening a streamed block for transaction
  streaming change for transaction
  closing a streamed block for transaction
  committing streamed transaction
-(13 rows)
+(15 rows)

Looking at the postmaster log, it's obvious where this extra transaction
came from: auto-analyze ran on pg_type concurrently with the test step
just before this one. That could only happen if the tests ran long enough
for autovacuum_naptime to elapse, but prairiedog is a pretty slow machine.
(And I hasten to point out that some other animals, such as those running
valgrind or CLOBBER_CACHE_ALWAYS, are even slower.)

We've seen this sort of problem before [2], and attempted to fix it [3]
by making these tests ignore empty transactions. But of course
auto-analyze's transaction wasn't empty, so that didn't help.

I think the most expedient way to prevent this type of failure is to run
the test_decoding tests with autovacuum_naptime cranked up so far as to
make it a non-issue, like maybe a day. Since test_decoding already adds
some custom settings to postgresql.conf, this'll take just a one-line
addition to test_decoding/logical.conf.

I wonder whether we ought to then revert these tests' use of
skip-empty-xacts, or at least start having a mix of cases.
It seems to me that we'd rather know about it if there are unexpected
empty transactions. Is there anything we're using that for other than
to hide the effects of autovacuum?

regards, tom lane

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2022-04-15%2011%3A59%3A16

[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-02-12%2010%3A24%3A22

[3]: https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=b779d7d8fdae088d70da5ed9fcd8205035676df3

#2Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#1)
Re: Stabilizing the test_decoding checks, take N

Hi,

On 2022-04-16 13:11:59 -0400, Tom Lane wrote:

My pet dinosaur prairiedog just failed in the contrib/test_decoding
tests [1]:

diff -U3 /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/expected/stream.out /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/results/stream.out
--- /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/expected/stream.out	2022-04-15 07:59:17.000000000 -0400
+++ /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/results/stream.out	2022-04-15 09:06:36.000000000 -0400
@@ -77,10 +77,12 @@
streaming change for transaction
streaming change for transaction
streaming change for transaction
+ closing a streamed block for transaction
+ opening a streamed block for transaction
streaming change for transaction
closing a streamed block for transaction
committing streamed transaction
-(13 rows)
+(15 rows)

Looking at the postmaster log, it's obvious where this extra transaction
came from: auto-analyze ran on pg_type concurrently with the test step
just before this one. That could only happen if the tests ran long enough
for autovacuum_naptime to elapse, but prairiedog is a pretty slow machine.
(And I hasten to point out that some other animals, such as those running
valgrind or CLOBBER_CACHE_ALWAYS, are even slower.)

We've seen this sort of problem before [2], and attempted to fix it [3]
by making these tests ignore empty transactions. But of course
auto-analyze's transaction wasn't empty, so that didn't help.

I don't quite understand this bit - the logic test_decoding uses to
decide if a transaction is "empty" is just whether a tuple was
output. And there shouldn't be any as part of auto-analyze, because we
don't decode catalog changes. I suspect there's something broken in the
streaming logic (potentially just in test_decoding) around
skip_empty_xacts.
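
The empty-transaction check Andres describes can be illustrated with a toy model (hypothetical Python, not PostgreSQL source; the real logic is C code in contrib/test_decoding): the BEGIN is deferred until the first decodable change is output, so a catalog-only transaction such as auto-analyze produces no output lines at all.

```python
# Toy model of skip-empty-xacts (hypothetical, not PostgreSQL code):
# BEGIN is only emitted once a real change has been output, so a
# transaction with no decodable changes leaves no trace.
def decode(xacts, skip_empty=True):
    """xacts: list of transactions, each a list of change kinds."""
    out = []
    for changes in xacts:
        # Catalog changes are never decoded, per the point above.
        body = [f"table {c}: INSERT" for c in changes if c != "catalog"]
        if body or not skip_empty:
            out += ["BEGIN", *body, "COMMIT"]
    return out

# An auto-analyze-like (catalog-only) transaction emits nothing:
print(decode([["catalog"], ["data"]]))
# ['BEGIN', 'table data: INSERT', 'COMMIT']
```

Note that in this model the filtering happens at output time only; the transaction is still decoded internally, which is consistent with suspecting the streaming path rather than the skip logic itself.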

I think the most expedient way to prevent this type of failure is to run
the test_decoding tests with autovacuum_naptime cranked up so far as to
make it a non-issue, like maybe a day. Since test_decoding already adds
some custom settings to postgresql.conf, this'll take just a one-line
addition to test_decoding/logical.conf.

I'm a bit worried about this approach - we've IIRC had past bugs that
came only to light because of autovacuum starting. I wonder if we rather
should do the opposite and reduce naptime so it'll be seen on fast
machines, rather than very slow ones.

Greetings,

Andres Freund

#3Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#2)
Re: Stabilizing the test_decoding checks, take N

Hi,

Adding Amit, I think this is stuff he worked on...

On 2022-04-17 07:31:04 -0700, Andres Freund wrote:

On 2022-04-16 13:11:59 -0400, Tom Lane wrote:

My pet dinosaur prairiedog just failed in the contrib/test_decoding
tests [1]:

diff -U3 /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/expected/stream.out /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/results/stream.out
--- /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/expected/stream.out	2022-04-15 07:59:17.000000000 -0400
+++ /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/results/stream.out	2022-04-15 09:06:36.000000000 -0400
@@ -77,10 +77,12 @@
streaming change for transaction
streaming change for transaction
streaming change for transaction
+ closing a streamed block for transaction
+ opening a streamed block for transaction
streaming change for transaction
closing a streamed block for transaction
committing streamed transaction
-(13 rows)
+(15 rows)

Looking at the postmaster log, it's obvious where this extra transaction
came from: auto-analyze ran on pg_type concurrently with the test step
just before this one. That could only happen if the tests ran long enough
for autovacuum_naptime to elapse, but prairiedog is a pretty slow machine.
(And I hasten to point out that some other animals, such as those running
valgrind or CLOBBER_CACHE_ALWAYS, are even slower.)

We've seen this sort of problem before [2], and attempted to fix it [3]
by making these tests ignore empty transactions. But of course
auto-analyze's transaction wasn't empty, so that didn't help.

I don't quite understand this bit - the logic test_decoding uses to
decide if a transaction is "empty" is just whether a tuple was
output. And there shouldn't be any as part of auto-analyze, because we
don't decode catalog changes. I suspect there's something broken in the
streaming logic (potentially just in test_decoding) around
skip_empty_xacts.

I think the most expedient way to prevent this type of failure is to run
the test_decoding tests with autovacuum_naptime cranked up so far as to
make it a non-issue, like maybe a day. Since test_decoding already adds
some custom settings to postgresql.conf, this'll take just a one-line
addition to test_decoding/logical.conf.

I'm a bit worried about this approach - we've IIRC had past bugs that
came only to light because of autovacuum starting. I wonder if we rather
should do the opposite and reduce naptime so it'll be seen on fast
machines, rather than very slow ones.

Greetings,

Andres Freund

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#3)
Re: Stabilizing the test_decoding checks, take N

Andres Freund <andres@anarazel.de> writes:

On 2022-04-16 13:11:59 -0400, Tom Lane wrote:

We've seen this sort of problem before [2], and attempted to fix it [3]
by making these tests ignore empty transactions. But of course
auto-analyze's transaction wasn't empty, so that didn't help.

I don't quite understand this bit - the logic test_decoding uses to
decide if a transaction is "empty" is just whether a tuple was
output. And there shouldn't be any as part of auto-analyze, because we
don't decode catalog changes. I suspect there's something broken in the
streaming logic (potentially just in test_decoding) around
skip_empty_xacts.

Hmm, I'll defer to somebody who knows that code better about whether
there's an actual bug. However ...

I think the most expedient way to prevent this type of failure is to run
the test_decoding tests with autovacuum_naptime cranked up so far as to
make it a non-issue, like maybe a day.

I'm a bit worried about this approach - we've IIRC had past bugs that
came only to light because of autovacuum starting. I wonder if we rather
should do the opposite and reduce naptime so it'll be seen on fast
machines, rather than very slow ones.

It seems likely to me that trying to make a test like this one blind to
autovacuum/autoanalyze activity will make it less useful, not more so.
Why is such blindness desirable?

regards, tom lane

#5Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#4)
Re: Stabilizing the test_decoding checks, take N

Hi

On 2022-04-17 12:01:53 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2022-04-16 13:11:59 -0400, Tom Lane wrote:

We've seen this sort of problem before [2], and attempted to fix it [3]
by making these tests ignore empty transactions. But of course
auto-analyze's transaction wasn't empty, so that didn't help.

I don't quite understand this bit - the logic test_decoding uses to
decide if a transaction is "empty" is just whether a tuple was
output. And there shouldn't be any as part of auto-analyze, because we
don't decode catalog changes. I suspect there's something broken in the
streaming logic (potentially just in test_decoding) around
skip_empty_xacts.

Hmm, I'll defer to somebody who knows that code better about whether
there's an actual bug. However ...

I think the most expedient way to prevent this type of failure is to run
the test_decoding tests with autovacuum_naptime cranked up so far as to
make it a non-issue, like maybe a day.

I'm a bit worried about this approach - we've IIRC had past bugs that
came only to light because of autovacuum starting. I wonder if we rather
should do the opposite and reduce naptime so it'll be seen on fast
machines, rather than very slow ones.

It seems likely to me that trying to make a test like this one blind to
autovacuum/autoanalyze activity will make it less useful, not more so.
Why is such blindness desirable?

Maybe I misunderstood - I thought you were proposing to prevent
autovacuum by increasing naptime? Won't that precisely blind us to
autovacuum/analyze? Hiding empty xacts happens "very late", so all the
decoding etc still happens.

Greetings,

Andres Freund

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#5)
Re: Stabilizing the test_decoding checks, take N

Andres Freund <andres@anarazel.de> writes:

On 2022-04-17 12:01:53 -0400, Tom Lane wrote:

It seems likely to me that trying to make a test like this one blind to
autovacuum/autoanalyze activity will make it less useful, not more so.
Why is such blindness desirable?

Maybe I misunderstood - I thought you were proposing to prevent
autovacuum by increasing naptime? Won't that precisely blind us to
autovacuum/analyze? Hiding empty xacts happens "very late", so all the
decoding etc still happens.

My concern is basically that if we hack the code so it does not report
autovacuum activity, that might result in it also not reporting other
things that are more interesting. So I think an external method of
suppressing test noise due to autovac is more advisable.

regards, tom lane

#7Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#3)
Re: Stabilizing the test_decoding checks, take N

On Sun, Apr 17, 2022 at 8:02 PM Andres Freund <andres@anarazel.de> wrote:

Adding Amit, I think this is stuff he worked on...

I'll look into this and share my findings.

--
With Regards,
Amit Kapila.

#8Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#1)
Re: Stabilizing the test_decoding checks, take N

On Sat, Apr 16, 2022 at 10:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

My pet dinosaur prairiedog just failed in the contrib/test_decoding
tests [1]:

diff -U3 /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/expected/stream.out /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/results/stream.out
--- /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/expected/stream.out 2022-04-15 07:59:17.000000000 -0400
+++ /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/results/stream.out  2022-04-15 09:06:36.000000000 -0400
@@ -77,10 +77,12 @@
streaming change for transaction
streaming change for transaction
streaming change for transaction
+ closing a streamed block for transaction
+ opening a streamed block for transaction
streaming change for transaction
closing a streamed block for transaction
committing streamed transaction
-(13 rows)
+(15 rows)

Looking at the postmaster log, it's obvious where this extra transaction
came from: auto-analyze ran on pg_type concurrently with the test step
just before this one. That could only happen if the tests ran long enough
for autovacuum_naptime to elapse, but prairiedog is a pretty slow machine.
(And I hasten to point out that some other animals, such as those running
valgrind or CLOBBER_CACHE_ALWAYS, are even slower.)

We've seen this sort of problem before [2], and attempted to fix it [3]
by making these tests ignore empty transactions. But of course
auto-analyze's transaction wasn't empty, so that didn't help.

The possible reason here is that this extra (auto-analyze) transaction
causes the logical decoding work mem limit to be reached before the last
change of the test's transaction. As can be seen from the logs, it just
closed the stream before the last change and then opened a new stream
for the last change. Now, it is true that the auto-analyze changes
won't be decoded as they don't perform DML operations on any
non-catalog table, but they could generate some invalidation messages
which need to be processed even though we won't send anything related
to them to the downstream.
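
That mechanism can be sketched as a toy model (hypothetical Python; the real accounting lives in the reorder buffer, and all byte counts here are invented): the concurrent transaction's buffered invalidation messages count against logical_decoding_work_mem, so the limit is hit one change early and the streamed block is split in two, turning the expected 13 output rows into 15.

```python
# Toy model (hypothetical; not PostgreSQL source) of how a concurrent
# transaction's buffered bytes can split one streamed block into two.
def stream(change_sizes, limit, concurrent_bytes=0):
    out, used, open_block = [], concurrent_bytes, False
    for size in change_sizes:
        # Limit reached: stream out what we have so far.
        if open_block and used + size > limit:
            out.append("closing a streamed block for transaction")
            open_block = False
            used = concurrent_bytes  # the concurrent xact stays buffered
        if not open_block:
            out.append("opening a streamed block for transaction")
            open_block = True
        out.append("streaming change for transaction")
        used += size
    if open_block:
        out.append("closing a streamed block for transaction")
    out.append("committing streamed transaction")
    return out

# 10 changes of 6 "bytes" against a 64-byte limit fit in one block...
print(len(stream([6] * 10, 64)))                       # 13 rows
# ...but 10 extra buffered bytes split off the last change, as in the
# buildfarm failure:
print(len(stream([6] * 10, 64, concurrent_bytes=10)))  # 15 rows
```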

This needs to be verified once by doing some manual testing as it may
not be easily reproducible every time. If this happens to be true then
I think your suggestion related to increasing autovacuum_naptime would
work.

--
With Regards,
Amit Kapila.

#9Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#8)
Re: Stabilizing the test_decoding checks, take N

On Mon, Apr 18, 2022 at 11:19 AM Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Sat, Apr 16, 2022 at 10:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

My pet dinosaur prairiedog just failed in the contrib/test_decoding
tests [1]:

diff -U3 /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/expected/stream.out /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/results/stream.out
--- /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/expected/stream.out 2022-04-15 07:59:17.000000000 -0400
+++ /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/results/stream.out  2022-04-15 09:06:36.000000000 -0400
@@ -77,10 +77,12 @@
streaming change for transaction
streaming change for transaction
streaming change for transaction
+ closing a streamed block for transaction
+ opening a streamed block for transaction
streaming change for transaction
closing a streamed block for transaction
committing streamed transaction
-(13 rows)
+(15 rows)

Looking at the postmaster log, it's obvious where this extra transaction
came from: auto-analyze ran on pg_type concurrently with the test step
just before this one. That could only happen if the tests ran long enough
for autovacuum_naptime to elapse, but prairiedog is a pretty slow machine.
(And I hasten to point out that some other animals, such as those running
valgrind or CLOBBER_CACHE_ALWAYS, are even slower.)

We've seen this sort of problem before [2], and attempted to fix it [3]
by making these tests ignore empty transactions. But of course
auto-analyze's transaction wasn't empty, so that didn't help.

The possible reason here is that this extra (auto-analyze) transaction
causes the logical decoding work mem to reach before the last change
of the test's transaction. As can be seen from the logs, it just
closed the stream before the last change and then opened a new stream
for the last change. Now, it is true that the auto-analyze changes
won't be decoded as they don't perform DML operation on any
non-catalog table but it could generate some invalidation message
which needs to be processed even though we won't send anything related
to it to the downstream.

This analysis seems right to me.

This needs to be verified once by doing some manual testing as it may
not be easily reproducible every time. If this happens to be true then
I think your suggestion related to increasing autovacuum_naptime would
work.

I will try to reproduce this, maybe by reducing the autovacuum_naptime or
by running, in parallel, some script that continuously performs DDL-only
transactions.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#10Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#9)
2 attachment(s)
Re: Stabilizing the test_decoding checks, take N

On Mon, Apr 18, 2022 at 3:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

This needs to be verified once by doing some manual testing as it may
not be easily reproducible every time. If this happens to be true then
I think your suggestion related to increasing autovacuum_naptime would
work.

I will try to reproduce this, maybe by reducing the autovacuum_naptime or by running, in parallel, some script that continuously performs DDL-only transactions.

I have reproduced it [1] by repeatedly running the attached
script (stream.sql) from one session and running VACUUM ANALYZE in
parallel from another session.

I have also changed the config for testing decoding to set the
autovacuum_naptime to 1d (patch attached)

[1]:
Result without vacuum analyze:
data
------------------------------------------
opening a streamed block for transaction
streaming change for transaction
streaming change for transaction
streaming change for transaction
streaming change for transaction
streaming change for transaction
streaming change for transaction
streaming change for transaction
streaming change for transaction
streaming change for transaction
streaming change for transaction
closing a streamed block for transaction
committing streamed transaction
(13 rows)

Result with VACUUM ANALYZE running in parallel:

data
------------------------------------------
opening a streamed block for transaction
streaming change for transaction
streaming change for transaction
streaming change for transaction
streaming change for transaction
streaming change for transaction
streaming change for transaction
streaming change for transaction
streaming change for transaction
streaming change for transaction
closing a streamed block for transaction
opening a streamed block for transaction
streaming change for transaction
closing a streamed block for transaction
committing streamed transaction
(15 rows)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

stream.sql (application/sql)
0001-Increase-autovacuum_naptime-for-test-decoding-testin.patch (text/x-patch; charset=US-ASCII)
From ab4854cd98cc24213d7fcee3e9b6bcb2dcb93d92 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Tue, 19 Apr 2022 11:26:56 +0530
Subject: [PATCH] Increase autovacuum_naptime for test decoding testing

Occasionally the test_decoding test cases fail on slow machines due to the
small autovacuum_naptime, because the invalidation messages generated by the
analyze command cause the logical_decoding_work_mem limit to be exhausted faster.
---
 contrib/test_decoding/logical.conf | 1 +
 1 file changed, 1 insertion(+)

diff --git a/contrib/test_decoding/logical.conf b/contrib/test_decoding/logical.conf
index 07c4d3d..fc042c6 100644
--- a/contrib/test_decoding/logical.conf
+++ b/contrib/test_decoding/logical.conf
@@ -1,3 +1,4 @@
 wal_level = logical
 max_replication_slots = 4
+autovacuum_naptime = 1d
 logical_decoding_work_mem = 64kB
-- 
1.8.3.1

#11Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#10)
Re: Stabilizing the test_decoding checks, take N

On Tue, Apr 19, 2022 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Apr 18, 2022 at 3:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

This needs to be verified once by doing some manual testing as it may
not be easily reproducible every time. If this happens to be true then
I think your suggestion related to increasing autovacuum_naptime would
work.

I will try to reproduce this, maybe by reducing the autovacuum_naptime or by running, in parallel, some script that continuously performs DDL-only transactions.

I have reproduced it [1] by repeatedly running the attached
script (stream.sql) from one session and running VACUUM ANALYZE in
parallel from another session.

I have also changed the config for testing decoding to set the
autovacuum_naptime to 1d (patch attached)

Thanks, I am also able to see similar results. This shows the analysis
was right. I will push the autovacuum_naptime change in HEAD and 14
(as both contain this test) tomorrow unless someone thinks otherwise.

--
With Regards,
Amit Kapila.

#12Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#11)
Re: Stabilizing the test_decoding checks, take N

On Tue, Apr 19, 2022 at 3:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Apr 19, 2022 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Apr 18, 2022 at 3:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

This needs to be verified once by doing some manual testing as it may
not be easily reproducible every time. If this happens to be true then
I think your suggestion related to increasing autovacuum_naptime would
work.

I will try to reproduce this, maybe by reducing the autovacuum_naptime or by running, in parallel, some script that continuously performs DDL-only transactions.

I have reproduced it [1] by repeatedly running the attached
script (stream.sql) from one session and running VACUUM ANALYZE in
parallel from another session.

I have also changed the config for testing decoding to set the
autovacuum_naptime to 1d (patch attached)

Thanks, I am also able to see similar results. This shows the analysis
was right. I will push the autovacuum_naptime change in HEAD and 14
(as both contain this test) tomorrow unless someone thinks otherwise.

Pushed.

--
With Regards,
Amit Kapila.