neqjoinsel versus "refresh materialized view concurrently"

Started by Jeff Janes · almost 8 years ago · 9 messages
#1 Jeff Janes
jeff.janes@gmail.com

The following commit has caused a devastating performance regression
in concurrent refresh of MV:

commit 7ca25b7de6aefa5537e0dbe56541bc41c0464f97
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed Nov 29 22:00:29 2017 -0500

Fix neqjoinsel's behavior for semi/anti join cases.

The below reproduction goes from taking about 1 second to refresh, to
taking an amount of time I don't have the patience to measure.

drop table foobar2 cascade;
create table foobar2 as select * from generate_series(1,200000);
create materialized view foobar3 as select * from foobar2;
create unique index on foobar3 (generate_series );
analyze foobar3;
refresh materialized view CONCURRENTLY foobar3 ;

When I interrupt the refresh, I get a message including this line:

CONTEXT: SQL statement "SELECT newdata FROM pg_temp_3.pg_temp_16420
newdata WHERE newdata IS NOT NULL AND EXISTS (SELECT * FROM
pg_temp_3.pg_temp_16420 newdata2 WHERE newdata2 IS NOT NULL AND newdata2
OPERATOR(pg_catalog.*=) newdata AND newdata2.ctid OPERATOR(pg_catalog.<>)
newdata.ctid) LIMIT 1"

So it makes sense that the commit in question could have caused a change in
the execution plan. Because these are temp tables, I can't easily get my
hands on them to investigate further.

Cheers,

Jeff

#2 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Jeff Janes (#1)
Re: neqjoinsel versus "refresh materialized view concurrently"

On Wed, Mar 14, 2018 at 8:07 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

The following commit has caused a devastating performance regression
in concurrent refresh of MV:

commit 7ca25b7de6aefa5537e0dbe56541bc41c0464f97
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed Nov 29 22:00:29 2017 -0500

Fix neqjoinsel's behavior for semi/anti join cases.

The below reproduction goes from taking about 1 second to refresh, to taking
an amount of time I don't have the patience to measure.

drop table foobar2 cascade;
create table foobar2 as select * from generate_series(1,200000);
create materialized view foobar3 as select * from foobar2;
create unique index on foobar3 (generate_series );
analyze foobar3;
refresh materialized view CONCURRENTLY foobar3 ;

When I interrupt the refresh, I get a message including this line:

CONTEXT: SQL statement "SELECT newdata FROM pg_temp_3.pg_temp_16420 newdata
WHERE newdata IS NOT NULL AND EXISTS (SELECT * FROM pg_temp_3.pg_temp_16420
newdata2 WHERE newdata2 IS NOT NULL AND newdata2 OPERATOR(pg_catalog.*=)
newdata AND newdata2.ctid OPERATOR(pg_catalog.<>) newdata.ctid) LIMIT 1"

So it makes sense that the commit in question could have caused a change in
the execution plan. Because these are temp tables, I can't easily get my
hands on them to investigate further.

Ouch. A quadratic join. This looks like an invisible correlation problem.

load 'auto_explain';
set auto_explain.log_min_duration = 0;
set auto_explain.log_analyze = true;

drop table if exists t cascade;

create table t as select generate_series(1, 5000);
create materialized view mv as select * from t;
create unique index on mv(generate_series);
analyze mv;
refresh materialized view concurrently mv;

HEAD:

LOG: duration: 26101.612 ms plan:
Query Text: SELECT newdata FROM pg_temp_3.pg_temp_16452 newdata WHERE newdata IS NOT NULL AND EXISTS (SELECT * FROM pg_temp_3.pg_temp_16452 newdata2 WHERE newdata2 IS NOT NULL AND newdata2 OPERATOR(pg_catalog.*=) newdata AND newdata2.ctid OPERATOR(pg_catalog.<>) newdata.ctid) LIMIT 1
Limit  (cost=0.00..90.52 rows=1 width=28) (actual time=26101.608..26101.608 rows=0 loops=1)
  ->  Nested Loop Semi Join  (cost=0.00..225220.96 rows=2488 width=28) (actual time=26101.606..26101.606 rows=0 loops=1)
        Join Filter: ((newdata2.ctid <> newdata.ctid) AND (newdata.* *= newdata2.*))
        Rows Removed by Join Filter: 25000000
        ->  Seq Scan on pg_temp_16452 newdata  (cost=0.00..73.00 rows=4975 width=34) (actual time=0.022..15.448 rows=5000 loops=1)
              Filter: (newdata.* IS NOT NULL)
        ->  Materialize  (cost=0.00..97.88 rows=4975 width=34) (actual time=0.000..0.500 rows=5000 loops=5000)
              ->  Seq Scan on pg_temp_16452 newdata2  (cost=0.00..73.00 rows=4975 width=34) (actual time=0.010..4.033 rows=5000 loops=1)
                    Filter: (newdata2.* IS NOT NULL)

And with commit 7ca25b7de6aefa5537e0dbe56541bc41c0464f97 reverted:

LOG: duration: 36.358 ms plan:
Query Text: SELECT newdata FROM pg_temp_3.pg_temp_16470 newdata WHERE newdata IS NOT NULL AND EXISTS (SELECT * FROM pg_temp_3.pg_temp_16470 newdata2 WHERE newdata2 IS NOT NULL AND newdata2 OPERATOR(pg_catalog.*=) newdata AND newdata2.ctid OPERATOR(pg_catalog.<>) newdata.ctid) LIMIT 1
Limit  (cost=756.95..939.50 rows=1 width=28) (actual time=36.354..36.354 rows=0 loops=1)
  ->  Merge Semi Join  (cost=756.95..2947.51 rows=12 width=28) (actual time=36.352..36.352 rows=0 loops=1)
        Merge Cond: (newdata.* *= newdata2.*)
        Join Filter: (newdata2.ctid <> newdata.ctid)
        Rows Removed by Join Filter: 5000
        ->  Sort  (cost=378.48..390.91 rows=4975 width=34) (actual time=9.622..10.300 rows=5000 loops=1)
              Sort Key: newdata.* USING *<
              Sort Method: quicksort  Memory: 622kB
              ->  Seq Scan on pg_temp_16470 newdata  (cost=0.00..73.00 rows=4975 width=34) (actual time=0.021..4.986 rows=5000 loops=1)
                    Filter: (newdata.* IS NOT NULL)
        ->  Sort  (cost=378.48..390.91 rows=4975 width=34) (actual time=7.378..8.010 rows=5000 loops=1)
              Sort Key: newdata2.* USING *<
              Sort Method: quicksort  Memory: 622kB
              ->  Seq Scan on pg_temp_16470 newdata2  (cost=0.00..73.00 rows=4975 width=34) (actual time=0.017..3.034 rows=5000 loops=1)
                    Filter: (newdata2.* IS NOT NULL)

--
Thomas Munro
http://www.enterprisedb.com

#3 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#2)
Re: neqjoinsel versus "refresh materialized view concurrently"

On Wed, Mar 14, 2018 at 11:34 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

LOG: duration: 26101.612 ms plan:
Query Text: SELECT newdata FROM pg_temp_3.pg_temp_16452 newdata WHERE newdata IS NOT NULL AND EXISTS (SELECT * FROM pg_temp_3.pg_temp_16452 newdata2 WHERE newdata2 IS NOT NULL AND newdata2 OPERATOR(pg_catalog.*=) newdata AND newdata2.ctid OPERATOR(pg_catalog.<>) newdata.ctid) LIMIT 1
Limit  (cost=0.00..90.52 rows=1 width=28) (actual time=26101.608..26101.608 rows=0 loops=1)
  ->  Nested Loop Semi Join  (cost=0.00..225220.96 rows=2488 width=28) (actual time=26101.606..26101.606 rows=0 loops=1)
        Join Filter: ((newdata2.ctid <> newdata.ctid) AND (newdata.* *= newdata2.*))
        Rows Removed by Join Filter: 25000000
        ->  Seq Scan on pg_temp_16452 newdata  (cost=0.00..73.00 rows=4975 width=34) (actual time=0.022..15.448 rows=5000 loops=1)
              Filter: (newdata.* IS NOT NULL)
        ->  Materialize  (cost=0.00..97.88 rows=4975 width=34) (actual time=0.000..0.500 rows=5000 loops=5000)
              ->  Seq Scan on pg_temp_16452 newdata2  (cost=0.00..73.00 rows=4975 width=34) (actual time=0.010..4.033 rows=5000 loops=1)
                    Filter: (newdata2.* IS NOT NULL)

This plan is chosen because we're looking for just one row (LIMIT 1)
that has equal data but a different ctid. In this case we're not
going to find one, so we'll pay the full enormous cost of the nested
loop, but the startup cost is estimated as 0 and we think we are going
to find a row straight away. That's because we don't know that it's
unlikely for there to be a row with the same columns but a different
ctid.
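
To see that cost arithmetic in isolation, consider a generic example (hypothetical table "big", not from this thread; actual plan choices depend on the statistics):

explain select * from big a join big b on a.x = b.x;
-- no LIMIT: total costs are compared, so a merge or hash join can win

explain select * from big a join big b on a.x = b.x limit 1;
-- LIMIT 1: only the expected fraction of the join's total cost is
-- charged, so the zero-startup-cost nested loop can look nearly free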

There is a fundamental and complicated estimation problem lurking here
of course and I'm not sure what to think about that yet. Maybe there
is a very simple fix for this particular problem:

--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -660,7 +660,7 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
 					 "(SELECT * FROM %s newdata2 WHERE newdata2 IS NOT NULL "
 					 "AND newdata2 OPERATOR(pg_catalog.*=) newdata "
 					 "AND newdata2.ctid OPERATOR(pg_catalog.<>) "
-					 "newdata.ctid) LIMIT 1",
+					 "newdata.ctid)",
 					 tempname, tempname);
 	if (SPI_execute(querybuf.data, false, 1) != SPI_OK_SELECT)
 		elog(ERROR, "SPI_exec failed: %s", querybuf.data);

That gets me back to the sort-merge plan, but maybe it's too superficial.

--
Thomas Munro
http://www.enterprisedb.com

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#2)
Re: neqjoinsel versus "refresh materialized view concurrently"

Thomas Munro <thomas.munro@enterprisedb.com> writes:

This looks like an invisible correlation problem.

Yeah --- the planner has no idea that the join rows satisfying
newdata.* *= newdata2.* are likely to be exactly the ones not
satisfying newdata.ctid <> newdata2.ctid. It's very accidental
that we got a good plan before.

I've not looked to see where this query is generated, but I wonder
if we could make things better by dropping the LIMIT 1 and instead
using an executor count parameter to stop early.
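
(For reference, that count parameter already exists at this call site: SPI_execute(const char *src, bool read_only, long tcount) stops fetching after tcount rows, and matview.c already passes 1 here, as the hunk upthread shows. So the LIMIT is redundant for execution and only skews planning.)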

regards, tom lane

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#3)
Re: neqjoinsel versus "refresh materialized view concurrently"

Thomas Munro <thomas.munro@enterprisedb.com> writes:

There is a fundamental and complicated estimation problem lurking here
of course and I'm not sure what to think about that yet. Maybe there
is a very simple fix for this particular problem:

Ah, I see you thought of the same hack I did.

I think this may actually be a good fix, and here's the reason: this plan
is in fact being driven entirely off planner default estimates, because
we don't have any estimation code that knows what to do with
"wholerowvar *= wholerowvar". I'm suspicious that we could drop the
preceding ANALYZE as being a waste of cycles, except maybe it's finding
out the number of rows for us. In any case, LIMIT 1 is only a good idea
to the extent that the planner knows what it's doing, and this is an
example where it demonstrably doesn't and won't any time soon.
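
(To illustrate what the planner is being asked to estimate: "*=" is the byte-image record comparison operator. A minimal sketch with assumed results, not output captured from this thread:

select row(1, 'a'::text) operator(pg_catalog.*=) row(1, 'a'::text);  -- true
select row(1, 'a'::text) operator(pg_catalog.*=) row(1, 'b'::text);  -- false

No selectivity estimation code exists for it, whole-row vars or otherwise.)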

regards, tom lane

#6 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Tom Lane (#5)
1 attachment(s)
Re: neqjoinsel versus "refresh materialized view concurrently"

On Wed, Mar 14, 2018 at 12:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@enterprisedb.com> writes:

There is a fundamental and complicated estimation problem lurking here
of course and I'm not sure what to think about that yet. Maybe there
is a very simple fix for this particular problem:

Ah, I see you thought of the same hack I did.

I think this may actually be a good fix, and here's the reason: this plan
is in fact being driven entirely off planner default estimates, because
we don't have any estimation code that knows what to do with
"wholerowvar *= wholerowvar". I'm suspicious that we could drop the
preceding ANALYZE as being a waste of cycles, except maybe it's finding
out the number of rows for us. In any case, LIMIT 1 is only a good idea
to the extent that the planner knows what it's doing, and this is an
example where it demonstrably doesn't and won't any time soon.

Hmm. I wonder if the ANALYZE might have been needed to avoid the
nested loop plan at some point in history.

Here's a patch to remove LIMIT 1, which fixes the plan for Jeff's test
scenario and some smaller and larger examples I tried. The query is
already executed with SPI_execute(..., 1) so it'll give up after one
row anyway. The regression test includes a case that causes a row to
be produced here and that's passing ('ERROR: new data for
materialized view "mvtest_mv" contains duplicate rows without any null
columns').
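
For reference, a minimal sketch of that duplicate-row case (illustrative names, not the actual regression test):

create table dup_t (a int);
create materialized view dup_mv as select a from dup_t;
create unique index on dup_mv (a);
insert into dup_t values (1), (1);  -- duplicates reach the new-data temp table
refresh materialized view concurrently dup_mv;
-- expected: ERROR: new data for materialized view "dup_mv" contains
-- duplicate rows without any null columns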

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

0001-Fix-performance-regression-in-REFRESH-MATERIALIZED-V.patch (application/octet-stream)
From af3009cfef4f8750ac9a07a190db32f9c0fc7863 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Wed, 14 Mar 2018 12:32:37 +1300
Subject: [PATCH] Fix performance regression in REFRESH MATERIALIZED VIEW
 CONCURRENTLY.

Jeff Janes discovered that commit 7ca25b7de6aefa5537e0dbe56541bc41c0464f97
had made one of the queries run by REFRESH MATERIALIZED VIEW CONCURRENTLY
perform badly.  The root cause is bad cardinality estimation for correlated
quals, but a principled solution to that problem is some way off.  In this
case we don't expect the join to produce any rows so the whole join will run
to completion.  By removing the LIMIT clause, we can avoid a nested loop join
that would otherwise be chosen for its low startup cost.

Thomas Munro and Tom Lane
Discussion: https://postgr.es/m/CAMkU%3D1z-JoGymHneGHar1cru4F1XDfHqJDzxP_CtK5cL3DOfmg%40mail.gmail.com
---
 src/backend/commands/matview.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index ab6a889b129..e828f2bff58 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -660,7 +660,7 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
 					 "(SELECT * FROM %s newdata2 WHERE newdata2 IS NOT NULL "
 					 "AND newdata2 OPERATOR(pg_catalog.*=) newdata "
 					 "AND newdata2.ctid OPERATOR(pg_catalog.<>) "
-					 "newdata.ctid) LIMIT 1",
+					 "newdata.ctid)",
 					 tempname, tempname);
 	if (SPI_execute(querybuf.data, false, 1) != SPI_OK_SELECT)
 		elog(ERROR, "SPI_exec failed: %s", querybuf.data);
-- 
2.16.2

#7 Jeff Janes
jeff.janes@gmail.com
In reply to: Thomas Munro (#6)
Re: neqjoinsel versus "refresh materialized view concurrently"

On Tue, Mar 13, 2018 at 4:57 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

On Wed, Mar 14, 2018 at 12:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@enterprisedb.com> writes:

There is a fundamental and complicated estimation problem lurking here
of course and I'm not sure what to think about that yet. Maybe there
is a very simple fix for this particular problem:

Ah, I see you thought of the same hack I did.

I think this may actually be a good fix, and here's the reason: this plan
is in fact being driven entirely off planner default estimates, because
we don't have any estimation code that knows what to do with
"wholerowvar *= wholerowvar". I'm suspicious that we could drop the
preceding ANALYZE as being a waste of cycles, except maybe it's finding
out the number of rows for us. In any case, LIMIT 1 is only a good idea
to the extent that the planner knows what it's doing, and this is an
example where it demonstrably doesn't and won't any time soon.

Hmm. I wonder if the ANALYZE might have been needed to avoid the
nested loop plan at some point in history.

Here's a patch to remove LIMIT 1, which fixes the plan for Jeff's test
scenario and some smaller and larger examples I tried. The query is
already executed with SPI_execute(..., 1) so it'll give up after one
row anyway. The regression test includes a case that causes a row to
be produced here and that's passing ('ERROR: new data for
materialized view "mvtest_mv" contains duplicate rows without any null
columns').

Is there any good way to make the regression tests fail if the plan reverts
to the bad one? The only thing I can think of would be to make the table
bigger so the regression test becomes "noticeably slower", but that is
pretty vague, and it is not very friendly to have the test formally pass
while we just hope it is slow enough for someone to investigate.

Cheers,

Jeff

#8 Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Jeff Janes (#7)
Re: neqjoinsel versus "refresh materialized view concurrently"

On Wed, Mar 14, 2018 at 2:56 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Tue, Mar 13, 2018 at 4:57 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Here's a patch to remove LIMIT 1, which fixes the plan for Jeff's test
scenario and some smaller and larger examples I tried. The query is
already executed with SPI_execute(..., 1) so it'll give up after one
row anyway. The regression test includes a case that causes a row to
be produced here and that's passing ('ERROR: new data for
materialized view "mvtest_mv" contains duplicate rows without any null
columns').

Is there any good way to make the regression tests fail if the plan reverts
to the bad one? The only thing I can think of would be to make the table
bigger so the regression test becomes "noticeably slower", but that is
pretty vague, and it is not very friendly to have the test formally pass
while we just hope it is slow enough for someone to investigate.

I can't think of a good way. I guess it can still pick a nested loop
if it thinks there'll only be a couple of loops. This patch tells it
to pay attention to the total cost, not the startup cost, so as soon
as it thinks there is more than a handful of rows the quadratic cost
will exceed the sort/merge join's O(n log n) cost.

Since I've had hash joins on the mind recently I couldn't help
noticing that you can't get a hash join out of this query's "record
image" based join qual (or even a regular row-based =). Were these
operators rigged up specifically to allow merge joins for this very
matview code? Just for fun, I tried a quick and dirty hack to get
past that by naming the columns explicitly in the query instead.
Maybe that's unfair for a single-column table or doesn't have the right
semantics, I'm not sure... But here are the rough times from my
laptop with a 5 million row version of your test:

patched just to remove LIMIT 1: 85s, of which 61s in the dup-check query (sort)
hacked to name columns: 38s, of which 14s in the dup-check query (hash)
hacked to name columns, work_mem=1GB: 31s, of which 4s in the dup-check query (hash)
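
The rough shape of that hack, for a hypothetical single-column matview (my reconstruction of the idea, not the actual code I ran):

select newdata from pg_temp.tmp_new newdata
where exists (select 1 from pg_temp.tmp_new newdata2
              where newdata2.generate_series = newdata.generate_series
                and newdata2.ctid operator(pg_catalog.<>) newdata.ctid);

Plain "=" on named columns has hash support, unlike "*=", but it also doesn't treat NULLs as matching, hence the doubt about semantics above.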

The reason I thought about that is because Parallel Hash is really
good at scaling big stupid self-joins (though for that we'd need
UNLOGGED instead of TEMP tables, which I didn't have time to try out
today).

--
Thomas Munro
http://www.enterprisedb.com

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#8)
Re: neqjoinsel versus "refresh materialized view concurrently"

Thomas Munro <thomas.munro@enterprisedb.com> writes:

On Wed, Mar 14, 2018 at 2:56 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

Is there any good way to make the regression tests fail if the plan reverts
to the bad one? The only thing I can think of would be to make the table
bigger so the regression test becomes "noticeably slower", but that is
pretty vague, and it is not very friendly to have the test formally pass
while we just hope it is slow enough for someone to investigate.

I can't think of a good way. I guess it can still pick a nested loop
if it thinks there'll only be a couple of loops. This patch tells it
to pay attention to the total cost, not the startup cost, so as soon
as it thinks there is more than a handful of rows the quadratic cost
will exceed the sort/merge join's O(n log n) cost.

Right. After further thought, the key point here is that in non-error
cases the query will produce no rows, meaning that it must be executed
to completion before we can be sure of that. But applying a LIMIT
encourages the planner to pick a fast-start (slow-completion) plan,
which is not going to be what we want. If in fact the query were going
to produce a lot of rows, and the planner could accurately predict that,
then maybe a LIMIT would be useful --- but there's no hope of estimates
on wholerowvar *= wholerowvar being accurate any time soon, let alone
correctly handling the correlation with ctid <> ctid. So the LIMIT
is just an invitation to trouble and we may as well remove it.

Committed that way. I also changed EXISTS(SELECT * ...) to
EXISTS(SELECT 1 ...), in hopes of saving a few microseconds of
parsing effort.

Since I've had hash joins on the mind recently I couldn't help
noticing that you can't get a hash join out of this query's "record
image" based join qual (or even a regular row-based =).

Yeah, because there's no hash support for recordimage. So there's even
less reason to be worried about how smart the planner is for this query:
basically, it might do a nestloop for a very small number of rows, but
otherwise it's gonna have to go for a merge join.

My earlier thought that we might be able to skip the ANALYZE step
seems wrong, though. It's true that it does little for this query,
but the follow-on query to build a diff table can probably make
good use of the stats.
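
(That diff-building query is, roughly, a full join between the matview and the new data; paraphrasing refresh_by_match_merge from memory rather than quoting it:

create temp table diff as
select newdata.*, mv.ctid as tid
from matview mv
full join newdata_tab newdata
  -- paraphrase: the real ON clause also matches the unique index columns
  on (mv operator(pg_catalog.*=) newdata)
where newdata is null or mv is null;

Per-column statistics from the ANALYZE could plausibly help there, particularly for matching on the unique index columns.)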

regards, tom lane