Instability of partition_prune regression test results

Started by Tom Laneover 6 years ago5 messageshackers

tgl@sss.pgh.pa.us

over 6 years ago

Every so often the partition_prune test falls over, for example
here, here, and here:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2019-08-15%2021%3A45%3A00
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2019-08-21%2022%3A19%3A23
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2019-09-11%2018%3A46%3A47

The reason for the failures is quite apparent: we sometimes don't
get as many workers as we hoped for. The test script is not quite
100% naive about that, but it's only designed to filter out the
"loops" counts of parallel scan nodes. As these examples show,
that's utterly inadequate. The "Workers Launched" field is variable
too, obviously, and so are the rows and loops counts for every plan
node up to the Gather.

I experimented with adjusting explain_parallel_append() to filter
more fields, but soon realized that we'd have to filter out basically
everything that makes it useful to run EXPLAIN ANALYZE at all.

Therefore, I think it's time to give up this testing methodology
as a bad idea, and fall back to the time-honored way of running a
plain EXPLAIN and then the actual query, as per the attached patch.

(Note: there's some roughly similar code in select_parallel.sql,
but as far as I could find it fails seldom if at all. Likely that
is because we don't run select_parallel in parallel with other
test scripts. So perhaps an argument could be made to leave
partition_prune.sql alone and just run it by itself. I do not
care for that answer though, as it will make the regression test
suite slower, plus I do not see any argument that this testing method
actually provides any info we don't get the traditional way.)

BTW, another aspect of this test script that could stand to be
nuked from orbit is this method for getting a custom plan:

-- Execute query 5 times to allow choose_custom_plan
-- to start considering a generic plan.
execute ab_q4 (1, 8);
execute ab_q4 (1, 8);
execute ab_q4 (1, 8);
execute ab_q4 (1, 8);
execute ab_q4 (1, 8);

We should drop that in favor of plan_cache_mode = force_custom_plan,
IMO. But I didn't include that change in this patch.

regards, tom lane

Amit Langote

Langote_Amit_f8@lab.ntt.co.jp

over 6 years ago

In reply to: Tom Lane (#1)

Re: Instability of partition_prune regression test results

On Fri, Sep 27, 2019 at 7:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Every so often the partition_prune test falls over, for example
here, here, and here:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2019-08-15%2021%3A45%3A00
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2019-08-21%2022%3A19%3A23
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2019-09-11%2018%3A46%3A47

The reason for the failures is quite apparent: we sometimes don't
get as many workers as we hoped for. The test script is not quite
100% naive about that, but it's only designed to filter out the
"loops" counts of parallel scan nodes. As these examples show,
that's utterly inadequate. The "Workers Launched" field is variable
too, obviously, and so are the rows and loops counts for every plan
node up to the Gather.

I experimented with adjusting explain_parallel_append() to filter
more fields, but soon realized that we'd have to filter out basically
everything that makes it useful to run EXPLAIN ANALYZE at all.

Therefore, I think it's time to give up this testing methodology
as a bad idea, and fall back to the time-honored way of running a
plain EXPLAIN and then the actual query, as per the attached patch.

Isn't the point of using ANALYZE here to show that the exec-param
based run-time pruning is working (those "never executed" strings)?

BTW, another aspect of this test script that could stand to be
nuked from orbit is this method for getting a custom plan:

-- Execute query 5 times to allow choose_custom_plan
-- to start considering a generic plan.
execute ab_q4 (1, 8);
execute ab_q4 (1, 8);
execute ab_q4 (1, 8);
execute ab_q4 (1, 8);
execute ab_q4 (1, 8);

We should drop that in favor of plan_cache_mode = force_custom_plan,
IMO.

Thanks,
Amit

Tom Lane

tgl@sss.pgh.pa.us

over 6 years ago

In reply to: Amit Langote (#2)

Re: Instability of partition_prune regression test results

Amit Langote <amitlangote09@gmail.com> writes:

On Fri, Sep 27, 2019 at 7:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I experimented with adjusting explain_parallel_append() to filter
more fields, but soon realized that we'd have to filter out basically
everything that makes it useful to run EXPLAIN ANALYZE at all.
Therefore, I think it's time to give up this testing methodology
as a bad idea, and fall back to the time-honored way of running a
plain EXPLAIN and then the actual query, as per the attached patch.

Isn't the point of using ANALYZE here to show that the exec-param
based run-time pruning is working (those "never executed" strings)?

Hm. Well, if you want to see those, we could do it as attached.

regards, tom lane

Amit Langote

Langote_Amit_f8@lab.ntt.co.jp

over 6 years ago

In reply to: Tom Lane (#3)

Re: Instability of partition_prune regression test results

On Sat, Sep 28, 2019 at 12:59 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Langote <amitlangote09@gmail.com> writes:

On Fri, Sep 27, 2019 at 7:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I experimented with adjusting explain_parallel_append() to filter
more fields, but soon realized that we'd have to filter out basically
everything that makes it useful to run EXPLAIN ANALYZE at all.
Therefore, I think it's time to give up this testing methodology
as a bad idea, and fall back to the time-honored way of running a
plain EXPLAIN and then the actual query, as per the attached patch.

Isn't the point of using ANALYZE here to show that the exec-param
based run-time pruning is working (those "never executed" strings)?

Hm. Well, if you want to see those, we could do it as attached.

Perfect, thanks.

Regards,
Amit

Tom Lane

tgl@sss.pgh.pa.us

over 6 years ago

In reply to: Amit Langote (#4)

Re: Instability of partition_prune regression test results

Amit Langote <amitlangote09@gmail.com> writes:

On Sat, Sep 28, 2019 at 12:59 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Langote <amitlangote09@gmail.com> writes:

Isn't the point of using ANALYZE here to show that the exec-param
based run-time pruning is working (those "never executed" strings)?

Hm. Well, if you want to see those, we could do it as attached.

Perfect, thanks.

OK, pushed that way.

regards, tom lane

Instability of partition_prune regression test results

Attachments:

Attachments: