fairywren timeout failures on the pg_amcheck/005_opclass_damage test

Started by Alexander Lakhinover 1 year ago5 messages

exclusion@gmail.com

over 1 year ago

Hello hackers,

Please take a look at today's failure [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-07-25%2003%3A23%3A19, that raises several questions
at once:
117/244 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade TIMEOUT 3001.48s exit status 1
180/244 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage TIMEOUT 3001.43s exit status 1

Ok:                 227
Expected Fail:      0
Fail:               0
Unexpected Pass:    0
Skipped:            15
Timeout:            2

But looking at the previous test run [2]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-07-23%2023%3A01%3A55, marked 'OK', I can see almost
the same:
115/244 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade TIMEOUT 3001.54s exit status 1
176/244 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage TIMEOUT 3001.26s exit status 1

Ok:                 227
Expected Fail:      0
Fail:               0
Unexpected Pass:    0
Skipped:            15
Timeout:            2

So it's not clear to me, why is the latter test run considered failed
unlike the former?

As to the 005_opclass_damage failure itself, we can find successful test
runs with duration close to 3000s:
[3]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-05-02%2019%3A03%3A08
[4]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-04-30%2014%3A03%3A06

(The average duration across 35 successful runs in June was around 1300s;
but in July only 5 test runs were successful.)

The other question is: why is 005_opclass_damage taking so much time there?
Looking at the last three runs in REL_17_STABLE, I see:
87/243 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage OK               22.80s   10 subtests passed
87/243 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage OK               19.60s   10 subtests passed
87/243 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage OK                6.09s   10 subtests passed

I suppose, the most significant factor here is
'HEAD' => [
'debug_parallel_query = regress'
]
in the buildfarm client's config.

Indeed, it affects timing of the test very much, judging by what I'm
seeing locally, on Linux (I guess, process-related overhead might be
higher on Windows):
$ make -s check -C src/bin/pg_amcheck/ PROVE_TESTS="t/005*" PROVE_FLAGS="--timer"
[11:11:53] t/005_opclass_damage.pl .. ok 1370 ms ( 0.00 usr 0.00 sys + 0.10 cusr 0.07 csys = 0.17 CPU)

$ echo "debug_parallel_query = regress" >/tmp/extra.config
$ TEMP_CONFIG=/tmp/extra.config make -s check -C src/bin/pg_amcheck/ PROVE_TESTS="t/005*" PROVE_FLAGS="--timer"
[11:12:46] t/005_opclass_damage.pl .. ok 40854 ms ( 0.00 usr 0.00 sys + 0.10 cusr 0.10 csys = 0.20 CPU)

(Thus we can see 30x duration increase.)

It's worth to note that such increase is rather new (introduced by
5ae208720); in REL_16_STABLE I'm seeing:
$ make -s check -C src/bin/pg_amcheck/ PROVE_TESTS="t/005*" PROVE_FLAGS="--timer"
[11:18:52] t/005_opclass_damage.pl .. ok 1453 ms ( 0.00 usr 0.00 sys + 0.82 cusr 0.11 csys = 0.93 CPU)

$ TEMP_CONFIG=/tmp/extra.config make -s check -C src/bin/pg_amcheck/ PROVE_TESTS="t/005*" PROVE_FLAGS="--timer"
[11:19:18] t/005_opclass_damage.pl .. ok 8032 ms ( 0.00 usr 0.00 sys + 0.82 cusr 0.11 csys = 0.93 CPU)

So maybe at least this test should be improved for testing with
debug_parallel_query enabled, if such active use of parallel workers by
pg_amcheck can't be an issue to end users?

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-07-25%2003%3A23%3A19
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-07-23%2023%3A01%3A55
[3]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-05-02%2019%3A03%3A08
[4]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-04-30%2014%3A03%3A06

Best regards,
Alexander

Andrew Dunstan

andrew@dunslane.net

over 1 year ago

In reply to: Alexander Lakhin (#1)

Re: fairywren timeout failures on the pg_amcheck/005_opclass_damage test

On 2024-07-25 Th 8:00 AM, Alexander Lakhin wrote:

Hello hackers,

Please take a look at today's failure [1], that raises several questions
at once:
117/244 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
TIMEOUT        3001.48s   exit status 1
180/244 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage
TIMEOUT        3001.43s   exit status 1

Ok:                 227
Expected Fail:      0
Fail:               0
Unexpected Pass:    0
Skipped:            15
Timeout:            2

But looking at the previous test run [2], marked 'OK', I can see almost
the same:
115/244 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
TIMEOUT        3001.54s   exit status 1
176/244 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage
TIMEOUT        3001.26s   exit status 1

Ok:                 227
Expected Fail:      0
Fail:               0
Unexpected Pass:    0
Skipped:            15
Timeout:            2

So it's not clear to me, why is the latter test run considered failed
unlike the former?

Yesterday I changed the way we detect failure in the buildfarm client,
and pushed that to fairywren (and a few more of my animals). Previously
it did not consider a timeout to be a failure, but now it does. I'm
going to release that soon.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Alexander Lakhin

exclusion@gmail.com

over 1 year ago

In reply to: Alexander Lakhin (#1)

Re: fairywren timeout failures on the pg_amcheck/005_opclass_damage test

25.07.2024 15:00, Alexander Lakhin wrote:

The other question is: why is 005_opclass_damage taking so much time there?
...
$ make -s check -C src/bin/pg_amcheck/ PROVE_TESTS="t/005*" PROVE_FLAGS="--timer"
[11:11:53] t/005_opclass_damage.pl .. ok 1370 ms ( 0.00 usr 0.00 sys + 0.10 cusr 0.07 csys = 0.17 CPU)

$ echo "debug_parallel_query = regress" >/tmp/extra.config
$ TEMP_CONFIG=/tmp/extra.config make -s check -C src/bin/pg_amcheck/ PROVE_TESTS="t/005*" PROVE_FLAGS="--timer"
[11:12:46] t/005_opclass_damage.pl .. ok 40854 ms ( 0.00 usr 0.00 sys + 0.10 cusr 0.10 csys = 0.20 CPU)

...
So maybe at least this test should be improved for testing with
debug_parallel_query enabled, if such active use of parallel workers by
pg_amcheck can't be an issue to end users?

When running this test with "log_min_messages = DEBUG2" in my extra.config,
I see thousands of the following messages in the test log:
2024-07-26 09:32:54.544 UTC [2572189:46] DEBUG: postmaster received pmsignal signal
2024-07-26 09:32:54.544 UTC [2572189:47] DEBUG: registering background worker "parallel worker for PID 2572197"
2024-07-26 09:32:54.544 UTC [2572189:48] DEBUG: starting background worker process "parallel worker for PID 2572197"
2024-07-26 09:32:54.547 UTC [2572189:49] DEBUG: unregistering background worker "parallel worker for PID 2572197"
2024-07-26 09:32:54.547 UTC [2572189:50] DEBUG: background worker "parallel worker" (PID 2572205) exited with exit code 0
2024-07-26 09:32:54.547 UTC [2572189:51] DEBUG: postmaster received pmsignal signal
2024-07-26 09:32:54.547 UTC [2572189:52] DEBUG: registering background worker "parallel worker for PID 2572197"
2024-07-26 09:32:54.547 UTC [2572189:53] DEBUG: starting background worker process "parallel worker for PID 2572197"
2024-07-26 09:32:54.549 UTC [2572189:54] DEBUG: unregistering background worker "parallel worker for PID 2572197"
2024-07-26 09:32:54.549 UTC [2572189:55] DEBUG: background worker "parallel worker" (PID 2572206) exited with exit code 0
...

grep ' registering background worker' \
src/bin/pg_amcheck/tmp_check/log/005_opclass_damage_test.log | wc -l
15669

So this test launches more than 15000 processes when debug_parallel_query
is enabled.

As far as I can see, this is happening because of the "PARALLEL UNSAFE"
marking is ignored when the function is called by CREATE INDEX/amcheck.

Namely, with a function defined as:
CREATE FUNCTION int4_asc_cmp (a int4, b int4) RETURNS int LANGUAGE sql AS $$
SELECT CASE WHEN $1 = $2 THEN 0 WHEN $1 > $2 THEN 1 ELSE -1 END; $$;

SELECT int4_asc_cmp(1, 2);
executed without parallel workers. Whilst when it's used by an index:
CREATE OPERATOR CLASS int4_fickle_ops FOR TYPE int4 USING btree AS
...
OPERATOR 5 > (int4, int4), FUNCTION 1 int4_asc_cmp(int4, int4);

INSERT INTO int4tbl (SELECT * FROM generate_series(1,1000) gs);

CREATE INDEX fickleidx ON int4tbl USING btree (i int4_fickle_ops);
launches 1000 parallel workers.

(This is reminiscent of bug #18314.)

One way to workaround this is to disable debug_parallel_query in the test
and another I find possible is to set max_parallel_workers = 0.

Best regards,
Alexander

Andrew Dunstan

andrew@dunslane.net

over 1 year ago

In reply to: Alexander Lakhin (#3)

Re: fairywren timeout failures on the pg_amcheck/005_opclass_damage test

On 2024-07-26 Fr 7:00 AM, Alexander Lakhin wrote:

25.07.2024 15:00, Alexander Lakhin wrote:

The other question is: why is 005_opclass_damage taking so much time
there?
...
$ make -s check -C src/bin/pg_amcheck/ PROVE_TESTS="t/005*"
PROVE_FLAGS="--timer"
[11:11:53] t/005_opclass_damage.pl .. ok     1370 ms ( 0.00 usr 0.00
sys + 0.10 cusr 0.07 csys = 0.17 CPU)

$ echo "debug_parallel_query = regress" >/tmp/extra.config
$ TEMP_CONFIG=/tmp/extra.config make -s check -C src/bin/pg_amcheck/
PROVE_TESTS="t/005*" PROVE_FLAGS="--timer"
[11:12:46] t/005_opclass_damage.pl .. ok    40854 ms ( 0.00 usr 0.00
sys + 0.10 cusr 0.10 csys = 0.20 CPU)

...
So maybe at least this test should be improved for testing with
debug_parallel_query enabled, if such active use of parallel workers by
pg_amcheck can't be an issue to end users?

When running this test with "log_min_messages = DEBUG2" in my
extra.config,
I see thousands of the following messages in the test log:
2024-07-26 09:32:54.544 UTC [2572189:46] DEBUG: postmaster received
pmsignal signal
2024-07-26 09:32:54.544 UTC [2572189:47] DEBUG: registering
background worker "parallel worker for PID 2572197"
2024-07-26 09:32:54.544 UTC [2572189:48] DEBUG: starting background
worker process "parallel worker for PID 2572197"
2024-07-26 09:32:54.547 UTC [2572189:49] DEBUG: unregistering
background worker "parallel worker for PID 2572197"
2024-07-26 09:32:54.547 UTC [2572189:50] DEBUG: background worker
"parallel worker" (PID 2572205) exited with exit code 0
2024-07-26 09:32:54.547 UTC [2572189:51] DEBUG: postmaster received
pmsignal signal
2024-07-26 09:32:54.547 UTC [2572189:52] DEBUG: registering
background worker "parallel worker for PID 2572197"
2024-07-26 09:32:54.547 UTC [2572189:53] DEBUG: starting background
worker process "parallel worker for PID 2572197"
2024-07-26 09:32:54.549 UTC [2572189:54] DEBUG: unregistering
background worker "parallel worker for PID 2572197"
2024-07-26 09:32:54.549 UTC [2572189:55] DEBUG: background worker
"parallel worker" (PID 2572206) exited with exit code 0
...

grep ' registering background worker' \
src/bin/pg_amcheck/tmp_check/log/005_opclass_damage_test.log | wc -l
15669

So this test launches more than 15000 processes when debug_parallel_query
is enabled.

As far as I can see, this is happening because of the "PARALLEL UNSAFE"
marking is ignored when the function is called by CREATE INDEX/amcheck.

Namely, with a function defined as:
    CREATE FUNCTION int4_asc_cmp (a int4, b int4) RETURNS int LANGUAGE
sql AS $$
        SELECT CASE WHEN $1 = $2 THEN 0 WHEN $1 > $2 THEN 1 ELSE -1
END; $$;

SELECT int4_asc_cmp(1, 2);
executed without parallel workers. Whilst when it's used by an index:
CREATE OPERATOR CLASS int4_fickle_ops FOR TYPE int4 USING btree AS
...
OPERATOR 5 > (int4, int4), FUNCTION 1 int4_asc_cmp(int4, int4);

INSERT INTO int4tbl (SELECT * FROM generate_series(1,1000) gs);

CREATE INDEX fickleidx ON int4tbl USING btree (i int4_fickle_ops);
launches 1000 parallel workers.

(This is reminiscent of bug #18314.)

One way to workaround this is to disable debug_parallel_query in the test
and another I find possible is to set max_parallel_workers = 0.

But wouldn't either of those just be masking the problem?

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Alexander Lakhin

exclusion@gmail.com

over 1 year ago

In reply to: Andrew Dunstan (#4)

Re: fairywren timeout failures on the pg_amcheck/005_opclass_damage test

26.07.2024 15:41, Andrew Dunstan wrote:

One way to workaround this is to disable debug_parallel_query in the test
and another I find possible is to set max_parallel_workers = 0.

But wouldn't either of those just be masking the problem?

Yes, I'm inclined to consider this behavior a problem (what if the table
contained 1M rows?), that's why I called those solutions workarounds.

Of course, there are parallel_setup_cost and parallel_tuple_cost
parameters, which can prevent this from happening in the wild, but still...

Best regards.
Alexander