Two issues with REFRESH MATERIALIZED VIEW CONCURRENTLY
I noticed the following two issues while looking at the code that handles
REFRESH MATERIALIZED VIEW CONCURRENTLY (refresh_by_match_merge() in matview.c):
1.
At the beginning of the function, there is some code that checks for duplicate
rows, but it does not catch the following case:
CREATE TABLE t(a text, b text);
INSERT INTO t VALUES('test', null);
CREATE MATERIALIZED VIEW m AS SELECT * FROM t;
CREATE UNIQUE INDEX ON m(a);
INSERT INTO t VALUES('test', null); -- t now contains two identical rows
REFRESH MATERIALIZED VIEW CONCURRENTLY m;
-> no error, but m still contains only one row!
REFRESH MATERIALIZED VIEW m;
-> error (as expected)
2.
Do I understand correctly that the join creating the "diff" table is given
equality conditions for all columns referenced in any unique indexes? This
led me to think that a unique index on a column with many null entries
would enlarge the "diff" table.
In the following example, creating the second unique index noticeably worsens
the performance of REFRESH MATERIALIZED VIEW CONCURRENTLY:
CREATE MATERIALIZED VIEW s AS SELECT generate_series as x, null as y FROM generate_series(1, 1000000);
CREATE UNIQUE INDEX ON s(x);
REFRESH MATERIALIZED VIEW CONCURRENTLY s;
-> runs for ~1700 ms
CREATE UNIQUE INDEX ON s(y);
REFRESH MATERIALIZED VIEW CONCURRENTLY s;
-> runs for ~9000 ms
Kind regards,
Giuliano
On Sun, 8 Feb 2026 at 22:49, Giuliano Gagliardi <gogi@gogi.tv> wrote:
I noticed the following two issues while looking at the code that handles
REFRESH MATERIALIZED VIEW CONCURRENTLY (refresh_by_match_merge() in
matview.c):1.
At the beginning of the function, there is some code that checks for
duplicate
rows, but it does not catch the following case:CREATE TABLE t(a text, b text);
INSERT INTO t VALUES('test', null);
CREATE MATERIALIZED VIEW m AS SELECT * FROM t;
CREATE UNIQUE INDEX ON m(a);
INSERT INTO t VALUES('test', null); -- t now contains two identical rowsREFRESH MATERIALIZED VIEW CONCURRENTLY m;
-> no error, but m still contains only one row!
REFRESH MATERIALIZED VIEW m;
-> error (as expected)
Interesting issue and thanks for pointing it out.
Going over the code in the function you mentioned(refresh_by_match_merge()
in matview.c), I found out that it is explicitly checking for the columns
where it is not NULL.
appendStringInfo(&querybuf,
"SELECT newdata.*::%s FROM %s newdata "
"WHERE newdata.* IS NOT NULL AND EXISTS "
"(SELECT 1 FROM %s newdata2 WHERE newdata2.* IS NOT NULL "
"AND newdata2.* OPERATOR(pg_catalog.*=) newdata.* "
"AND newdata2.ctid OPERATOR(pg_catalog.<>) "
"newdata.ctid)",
It is mentioned in the comments above as well that it checks for the
duplicates in the rows without NULLs.
However, if I changed the query as in the attached patch, it errors out as
otherwise I would have expected.
Honestly, I do not understand why it is checking for duplicates excluding
null values.
Behaviour wise this definitely seems like a bug, but I am not sure if the
attached patch is the right way to fix it.
--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH
Attachments:
0001-Check-for-duplicate-rows-with-NULLs.patchapplication/octet-stream; name=0001-Check-for-duplicate-rows-with-NULLs.patchDownload+3-4
Hi Giuliano,
Thank you for the test case, yes I am able to reproduce the behavior for
issue1
I noticed the following two issues while looking at the code that handles
REFRESH MATERIALIZED VIEW CONCURRENTLY (refresh_by_match_merge() in
matview.c):1.
At the beginning of the function, there is some code that checks for
duplicate
rows, but it does not catch the following case:CREATE TABLE t(a text, b text);
INSERT INTO t VALUES('test', null);
CREATE MATERIALIZED VIEW m AS SELECT * FROM t;
CREATE UNIQUE INDEX ON m(a);
INSERT INTO t VALUES('test', null); -- t now contains two identical rowsREFRESH MATERIALIZED VIEW CONCURRENTLY m;
-> no error, but m still contains only one row!
REFRESH MATERIALIZED VIEW m;
-> error (as expected)Adding the output here for a complete picture.
postgres=# CREATE TABLE t(a text, b text);
CREATE TABLE
postgres=# INSERT INTO t VALUES('test', null);
INSERT 0 1
postgres=# CREATE MATERIALIZED VIEW m AS SELECT * FROM t;
SELECT 1
postgres=# CREATE UNIQUE INDEX ON m(a);
CREATE INDEX
postgres=# INSERT INTO t VALUES('test', null);
INSERT 0 1
postgres=# REFRESH MATERIALIZED VIEW CONCURRENTLY m;
REFRESH MATERIALIZED VIEW
postgres=# SELECT * FROM m;
a | b
------+---
test |
(1 row)
postgres=# REFRESH MATERIALIZED VIEW m;
ERROR: could not create unique index "m_a_idx"
DETAIL: Key (a)=(test) is duplicated.
postgres=# SELECT * FROM m;
a | b
------+---
test |
(1 row)
Yes, I believe "REFRESH MATERIALIZED VIEW CONCURRENTLY m;" should ideally
throw the same error as REFRESH MATERIALIZED VIEW m;
I am still trying to understand the CONCURRENTLY behavior in detail and
will share more of my findings on this potential issue.
Regards,
Surya Poondla
Hi Giuliano,
Regarding the issue 1,
I noticed the following two issues while looking at the code that handles
REFRESH MATERIALIZED VIEW CONCURRENTLY (refresh_by_match_merge() in
matview.c):1.
At the beginning of the function, there is some code that checks for
duplicate
rows, but it does not catch the following case:CREATE TABLE t(a text, b text);
INSERT INTO t VALUES('test', null);
CREATE MATERIALIZED VIEW m AS SELECT * FROM t;
CREATE UNIQUE INDEX ON m(a);
INSERT INTO t VALUES('test', null); -- t now contains two identical rowsREFRESH MATERIALIZED VIEW CONCURRENTLY m;
-> no error, but m still contains only one row!
REFRESH MATERIALIZED VIEW m;
-> error (as expected)
Thank you for the pointers, I made a patch in refresh_by_match_merge()
which reports an error in the REFRESH MATERIALIZED VIEW CONCURRENTLY case
too.
The issue was REFRESH MATERIALIZED VIEW CONCURRENTLY was incorrectly
skipping duplicate detection for rows containing any NULL values. This was
happening because the "WHERE newdata.* IS NOT NULL" condition returns false
if any column contains NULL.
My patch removes the "IS NOT NULL" preconditions from the duplicate
detection query. The query now correctly checks all rows using the record
equality operator (*=), which treats NULL as equal to NULL (i.e True).
Here is the output with my patch:
postgres=# CREATE TABLE t(a text, b text);
CREATE TABLE
postgres=# INSERT INTO t VALUES('test', null);
INSERT 0 1
postgres=#
postgres=# CREATE MATERIALIZED VIEW m AS SELECT * FROM t;
SELECT 1
postgres=#
postgres=# CREATE UNIQUE INDEX ON m(a);
CREATE INDEX
postgres=# SELECT * FROM m;
a | b
------+---
test |
(1 row)
postgres=# SELECT * FROM t;
a | b
------+---
test |
(1 row)
postgres=# INSERT INTO t VALUES('test', null);
INSERT 0 1
postgres=# SELECT * FROM t;
a | b
------+---
test |
test |
(2 rows)
postgres=# SELECT * FROM m;
a | b
------+---
test |
(1 row)
postgres=# REFRESH MATERIALIZED VIEW CONCURRENTLY m;
2026-02-11 15:57:46.751 PST [39510] ERROR: new data for materialized view
"m" contains duplicate rows
2026-02-11 15:57:46.751 PST [39510] DETAIL: Row: (test,)
2026-02-11 15:57:46.751 PST [39510] STATEMENT: REFRESH MATERIALIZED VIEW
CONCURRENTLY m;
ERROR: new data for materialized view "m" contains duplicate rows
DETAIL: Row: (test,)
postgres=#
postgres=# REFRESH MATERIALIZED VIEW m;
2026-02-11 15:57:55.877 PST [39510] ERROR: could not create unique index
"m_a_idx"
2026-02-11 15:57:55.877 PST [39510] DETAIL: Key (a)=(test) is duplicated.
2026-02-11 15:57:55.877 PST [39510] STATEMENT: REFRESH MATERIALIZED VIEW m;
ERROR: could not create unique index "m_a_idx"
DETAIL: Key (a)=(test) is duplicated.
postgres=#
Regards,
Surya Poondla
Attachments:
0001-Fix-REFRESH-MATERIALIZED-VIEW-CONCURRENTLY-to-detect.patchapplication/octet-stream; name=0001-Fix-REFRESH-MATERIALIZED-VIEW-CONCURRENTLY-to-detect.patchDownload+11-10
Hi All,
Looks like postgres mailing threads has some delay, the mailing list was
not updated timely and I couldn't see the patch that Rafia sent earlier
yesterday. I could only see their patch today and coincedentally it matches
the patch I suggested.
Right now, I am exploring the issue 2.
Regards,
Surya Poondla
Show quoted text
Hi All,
Right now, I am exploring the issue 2.
I am not sure if anyone else has submitted a patch for issue 2. I don't see
any updates from the mailing list yet. I guess the mailing list has some
delays, apologies if my efforts look duplicated to other people's efforts.
I have a potential patch for issue 2.
I was able to reproduce the issue, saw the performance degradation, and,
with my patch (attached) I see an improvement in the REFRESH MATERIALIZED
VIEW CONCURRENTLY.
The main crux of issue 2 is: when a materialized view has unique index on a
nullable column, and when we did REFRESH MATERIALIZED VIEW CONCURRENTLY it
would include that column in the FULL OUTER
JOIN condition used to detect changes.
The nullable column was showing severe performance degradation because NULL
= NULL comparisons evaluate to NULL, making all rows appear different even
when unchanged!
The fix I explored is to skip nullable columns when building the FULL OUTER
JOIN conditions. Only include columns with NOT NULL constraints from unique
indexes. The record equality operator (*=) is always included and handles
nullable columns correctly.
Here is the output and performance improvement:
postgres=# \timing on
Timing is on.
postgres=# DROP MATERIALIZED VIEW IF EXISTS s CASCADE;
NOTICE: materialized view "s" does not exist, skipping
DROP MATERIALIZED VIEW
Time: 0.858 ms
postgres=#
postgres=# CREATE MATERIALIZED VIEW s AS SELECT generate_series as x, null
as y FROM generate_series(1, 1000000);
SELECT 1000000
Time: 1076.254 ms (00:01.076)
postgres=#
postgres=# CREATE UNIQUE INDEX ON s(x);
CREATE INDEX
Time: 375.026 ms
postgres=# REFRESH MATERIALIZED VIEW CONCURRENTLY s;
REFRESH MATERIALIZED VIEW
Time: 3807.143 ms (00:03.807)
postgres=# CREATE UNIQUE INDEX ON s(y);
CREATE INDEX
Time: 331.382 ms
postgres=# REFRESH MATERIALIZED VIEW CONCURRENTLY s;
REFRESH MATERIALIZED VIEW
Time: 3636.049 ms (00:03.636)
postgres=#
As we can see the REFRESH MATERIALIZED VIEW CONCURRENTLY now takes 3636.049
ms
With the current patch for issue 2, there is a trade-off.
The fix skips nullable columns from the join condition to avoid slowness
when NULLs exist (9s vs 3s in testing). This may slightly slow down cases
where nullable columns (unique index) never contain NULLs.
Users can restore full performance by adding the NOT NULL constraints to
the column if they know there will never be any nulls on that column.
I would love to hear any feedback on this tradeoff and am happy to
implement relevant changes.
Note: The attached patch addresses both issue 1, issue 2.
Regards,
Surya Poondla.
Attachments:
0002-Fix-REFRESH-MATERIALIZED-VIEW-CONCURRENTLY-NULL-hand.patchapplication/octet-stream; name=0002-Fix-REFRESH-MATERIALIZED-VIEW-CONCURRENTLY-NULL-hand.patchDownload+43-15
On Thu, 12 Feb 2026 at 18:08, surya poondla <suryapoondla4@gmail.com> wrote:
Hi All,
Right now, I am exploring the issue 2.
I am not sure if anyone else has submitted a patch for issue 2. I don't
see any updates from the mailing list yet. I guess the mailing list has
some delays, apologies if my efforts look duplicated to other people's
efforts.I have a potential patch for issue 2.
I was able to reproduce the issue, saw the performance degradation, and,
with my patch (attached) I see an improvement in the REFRESH MATERIALIZED
VIEW CONCURRENTLY.The main crux of issue 2 is: when a materialized view has unique index on
a nullable column, and when we did REFRESH MATERIALIZED VIEW CONCURRENTLY
it would include that column in the FULL OUTER
JOIN condition used to detect changes.
The nullable column was showing severe performance degradation because
NULL = NULL comparisons evaluate to NULL, making all rows appear different
even when unchanged!The fix I explored is to skip nullable columns when building the FULL
OUTER JOIN conditions. Only include columns with NOT NULL constraints from
unique indexes. The record equality operator (*=) is always included and
handles nullable columns correctly.Here is the output and performance improvement:
postgres=# \timing on
Timing is on.
postgres=# DROP MATERIALIZED VIEW IF EXISTS s CASCADE;
NOTICE: materialized view "s" does not exist, skipping
DROP MATERIALIZED VIEW
Time: 0.858 ms
postgres=#
postgres=# CREATE MATERIALIZED VIEW s AS SELECT generate_series as x, null
as y FROM generate_series(1, 1000000);SELECT 1000000
Time: 1076.254 ms (00:01.076)
postgres=#
postgres=# CREATE UNIQUE INDEX ON s(x);
CREATE INDEX
Time: 375.026 ms
postgres=# REFRESH MATERIALIZED VIEW CONCURRENTLY s;
REFRESH MATERIALIZED VIEW
Time: 3807.143 ms (00:03.807)
postgres=# CREATE UNIQUE INDEX ON s(y);
CREATE INDEX
Time: 331.382 ms
postgres=# REFRESH MATERIALIZED VIEW CONCURRENTLY s;
REFRESH MATERIALIZED VIEW
Time: 3636.049 ms (00:03.636)
postgres=#As we can see the REFRESH MATERIALIZED VIEW CONCURRENTLY now takes 3636.049
msWith the current patch for issue 2, there is a trade-off.
The fix skips nullable columns from the join condition to avoid slowness
when NULLs exist (9s vs 3s in testing). This may slightly slow down cases
where nullable columns (unique index) never contain NULLs.
Users can restore full performance by adding the NOT NULL constraints to
the column if they know there will never be any nulls on that column.I would love to hear any feedback on this tradeoff and am happy to
implement relevant changes.Note: The attached patch addresses both issue 1, issue 2.
Regards,
Surya Poondla.Thanks for working on this.
Firstly, since both are different issues, it makes sense to write patches
for each of them separately.
Secondly, for issue 1 it is important to understand why the code was
explicitly done for null columns, what are the scenarios in which this
modified code could cause issues.
Also, for issue 1, additional test case should be added.
For issue 2, it would be helpful if you may share some performance numbers
to confirm if this solution is only improving the performance and not
causing any regressions.
--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH
Hi All,
Also, for issue 1, additional test case should be added.
Sure, I will add test cases for issue 1.
For issue 2, it would be helpful if you may share some performance numbers
to confirm if this solution is only improving the performance and not
causing any regressions.
I ran check, check-world and didn't see any regressions.
Here is the output and performance improvement:
postgres=# \timing on
Timing is on.
postgres=# DROP MATERIALIZED VIEW IF EXISTS s CASCADE;
NOTICE: materialized view "s" does not exist, skipping
DROP MATERIALIZED VIEW
Time: 0.858 ms
postgres=#
postgres=# CREATE MATERIALIZED VIEW s AS SELECT generate_series as x,
null as y FROM generate_series(1, 1000000);SELECT 1000000
Time: 1076.254 ms (00:01.076)
postgres=#
postgres=# CREATE UNIQUE INDEX ON s(x);
CREATE INDEX
Time: 375.026 ms
postgres=# REFRESH MATERIALIZED VIEW CONCURRENTLY s;
REFRESH MATERIALIZED VIEW
Time: 3807.143 ms (00:03.807)
postgres=# CREATE UNIQUE INDEX ON s(y);
CREATE INDEX
Time: 331.382 ms
postgres=# REFRESH MATERIALIZED VIEW CONCURRENTLY s;
REFRESH MATERIALIZED VIEW
Time: 3636.049 ms (00:03.636)
postgres=#As we can see the REFRESH MATERIALIZED VIEW CONCURRENTLY now takes 3636.049
msRegrading the performance, (quoting the output from my previous message)
with unique index having NULL values we see that both "REFRESH MATERIALIZED
VIEW CONCURRENTLY s;" operations (operation 1 was after CREATE UNIQUE INDEX
ON s(x); and operation 2 was after CREATE UNIQUE INDEX ON s(x);) take about
the same time. Without the patch, operation 2 was taking around ~11000
ms, due to NULL = NULL comparison checks and this was causing the
degradation.
Regarding different commits to each issue, I don't have any
particular opinion but since both the issues are related to the same
function and NULL comparison, I feel we can have a single commit, but open
to create 2 commits too.
Regards,
Surya Poondla
Hi All,
Thank you Rafia for the suggestions.
I split both the bugs in 2 different commits, attaching the patches here.
For bug1, I added the test case for NULL values too.
For bug 2, I only changed matview.c and added no test case as the timings
are not constant.
I ran the regression tests for both the patches and all tests succeeded in
both cases.
Regards,
Surya Poondla
Show quoted text
Attachments:
0003-Fix-REFRESH-MATERIALIZED-VIEW-CONCURRENTLY-to-detect_bug1.patchapplication/octet-stream; name=0003-Fix-REFRESH-MATERIALIZED-VIEW-CONCURRENTLY-to-detect_bug1.patchDownload+36-11
0003-Fix-REFRESH-MATERIALIZED-VIEW-CONCURRENTLY-performance_bug2.patchapplication/octet-stream; name=0003-Fix-REFRESH-MATERIALIZED-VIEW-CONCURRENTLY-performance_bug2.patchDownload+27-4
On Mon, 2 Mar 2026 at 14:45, surya poondla <suryapoondla4@gmail.com> wrote:
Hi All,
Thank you Rafia for the suggestions.
I split both the bugs in 2 different commits, attaching the patches here.For bug1, I added the test case for NULL values too.
Thanks for working on this. This looks good to me.
+-- test that duplicate rows containing NULLs are also detected (bug fix)
I wouldn't use bug fix here, it looks fine without it.
--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH
In order to take this further, I think it would be good to add this to
commitfest.
On Wed, 4 Mar 2026 at 17:01, Rafia Sabih <rafia.pghackers@gmail.com> wrote:
On Mon, 2 Mar 2026 at 14:45, surya poondla <suryapoondla4@gmail.com>
wrote:Hi All,
Thank you Rafia for the suggestions.
I split both the bugs in 2 different commits, attaching the patches here.For bug1, I added the test case for NULL values too.
Thanks for working on this. This looks good to me.
+-- test that duplicate rows containing NULLs are also detected (bug fix)
I wouldn't use bug fix here, it looks fine without it.--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH
--
Regards,
Rafia Sabih
CYBERTEC PostgreSQL International GmbH
Hi Rafia,
Thank you for the suggestion.
I created the commit fest entries:
https://commitfest.postgresql.org/patch/6579/ (for bug 1)
https://commitfest.postgresql.org/patch/6580/ (for the performance
improvement)
Regards,
Surya Poondla
Hi,
I think we might want the "*=" operator treat NULL as not equal to NULL and
this is why we add "IS NOT NULL" to the duplicate detection query.
Your patch treats NULL as equal to NULL, which is different from the SQL
standard, may confuse users.
So I think we should make the "*=" operator treat NULL as not equal to NULL
or add a new operator to implement it. Thoughts?
--
Regards,
ChangAo Chen
I think we might want the "*=" operator treat NULL as not equal to NULL and
this is why we add "IS NOT NULL" to the duplicate detection query.Your patch treats NULL as equal to NULL, which is different from the SQL
standard, may confuse users.So I think we should make the "*=" operator treat NULL as not equal to NULL
or add a new operator to implement it. Thoughts?
Attach a patch. I add a new built-in function called record_image_eq_variant
which considers two NULLs not equal so that each row can match at most one
row during the full join.
--
Regards,
ChangAo Chen
Attachments:
v4-0001-Fix-wrong-result-of-refresh-matview-concurrently.patchapplication/octet-stream; charset=utf-8; name=v4-0001-Fix-wrong-result-of-refresh-matview-concurrently.patchDownload+75-20
Hi ChangAo,
Thank you for the detailed review.
For issue 1, my fix removes the IS NOT NULL guard from the pre-check so
that *= can detect all duplicate rows, including those containing NULLs.
(Note: The semantics of *= has always treated NULL as equal to NULL.)
The reasoning is straightforward: the JOIN uses *= to match newdata rows
against MV rows. If newdata contains two *=-equal rows, both would match
the same MV row in the JOIN, producing a wrong diff. The pre-check must
therefore use the same *= semantics to catch exactly those duplicates
which is what my fix does by removing the IS NOT NULL guard.
The IS NOT NULL guard was the bug as it was hiding real duplicates from
detection.
Your approach leaves the pre-check unchanged and instead replaces *= in
the JOIN with record_image_eq_variant (NULL != NULL). I see two concerns:
1. record_image_eq_variant applies NULL != NULL globally to all rows in
the JOIN, not just duplicate ones. This means any unchanged row
containing any NULL in any column will never match its counterpart
during the JOIN, causing a DELETE + INSERT for that row on every
refresh even when the data has not changed. The original issue 2 was
specifically about nullable indexed columns, your fix extends the
performance problem to all nullable columns anywhere in the row,
which makes the performance worse than issue 2.
2. The error surfaced becomes a unique_violation from the index rather
than the explicit "contains duplicate rows" message, which is harder
for users to diagnose.
Regards,
Surya Poondla
Hi Surya,
For issue 1, my fix removes the IS NOT NULL guard from the pre-check so
that *= can detect all duplicate rows, including those containing NULLs.
(Note: The semantics of *= has always treated NULL as equal to NULL.)The reasoning is straightforward: the JOIN uses *= to match newdata rows
against MV rows. If newdata contains two *=-equal rows, both would match
the same MV row in the JOIN, producing a wrong diff. The pre-check must
therefore use the same *= semantics to catch exactly those duplicates
which is what my fix does by removing the IS NOT NULL guard.
The IS NOT NULL guard was the bug as it was hiding real duplicates from detection.
If I understand correctly, your fix will break the following case which works well currently:
CREATE TABLE t (a int, b int);
INSERT INTO t VALUES (null, null);
CREATE MATERIALIZED VIEW m AS SELECT * FROM t;
CREATE UNIQUE INDEX ON m(a);
INSERT INTO t VALUES (null, null);
REFRESH MATERIALIZED VIEW CONCURRENTLY m;
Your fix will report an error because of the two (null, null) rows. On master, this case
works well because of the join condition "mv.a = newdada.a" which considers two
NULLs not equal, so we will get a correct diff table.
Your approach leaves the pre-check unchanged and instead replaces *= in
the JOIN with record_image_eq_variant (NULL != NULL). I see two concerns:
1. record_image_eq_variant applies NULL != NULL globally to all rows in
the JOIN, not just duplicate ones. This means any unchanged row
containing any NULL in any column will never match its counterpart
during the JOIN, causing a DELETE + INSERT for that row on every
refresh even when the data has not changed. The original issue 2 was
specifically about nullable indexed columns, your fix extends the
performance problem to all nullable columns anywhere in the row,
which makes the performance worse than issue 2.
Yeah, you're right. Any row containing any NULL in any column will get into the
diff table. But it's for correctness. Maybe user should avoid using CONCURRENTLY
with a lot of rows containing NULL.
2. The error surfaced becomes a unique_violation from the index rather
than the explicit "contains duplicate rows" message, which is harder
for users to diagnose.
I don't think it's a big issue.
--
Regards,
ChangAo Chen
Hi ChangAo,
Thank you for the continued review, your test case helped me find a
real issue in v3 patch, and I updated the fix accordingly.
You were correct that v3 was wrong. Removing IS NOT NULL and using *= alone
in the pre-check was too aggressive: *= treats NULL=NULL=true, which caused
(NULL,NULL)×2 with a unique index on a nullable column to be flagged
as duplicates even though the unique index allows multiple NULLs.
The updated patch takes a different approach. Instead of removing the guard
and relying on *= alone, the pre-check now uses the same per-column equality
operators as the FULL OUTER JOIN accumulated during the same
index-scanning loop. These operators treat NULL=NULL as false, which is
consistent with how
unique indexes actually work (NULLs are distinct).
For (test, NULL)×2 with a single index on a (non-null):
newdata2.a = newdata.a i.e 'test'='test' that is TRUE
newdata2.* *= newdata.* that is TRUE
Thus the duplicate is caught and error is raised
For (NULL, NULL)×2 with a single index on a (nullable):
newdata2.a = newdata.a i.e NULL=NULL that is NULL (false)
Thus no duplicate is caught, the refresh correctly succeeds, MV gets 2 rows
For (test, NULL)×2 with a composite index on (a, b):
newdata2.a = newdata.a i.e TRUE
newdata2.b = newdata.b i.e NULL=NULL that is NULL (false)
Combined: NULL is not caught, refresh correctly succeeds
Regarding your record_image_eq_variant approach: it correctly handles the
NULL-in-indexed-column case, but it introduces a performance regression for
unchanged rows where any non-indexed column contains NULL.
For example, an unchanged row (1, NULL) with a unique index on non-null
index on a would require a DELETE+INSERT on every CONCURRENTLY refresh
because
record_image_eq_variant((1,NULL),(1,NULL)) returns false.
This makes CONCURRENTLY impractical for any table where rows contain NULLs
in non-indexed columns.
The updated patch for bug 1.
Here is some additional tests I did:
postgres=# SET client_min_messages = WARNING;
SET
postgres=# -- Test 1: (test,NULL)×2, single index on 'a' should ERROR
postgres=# CREATE TABLE t(a text, b text);
INSERT INTO t VALUES('test', NULL);
CREATE MATERIALIZED VCREATE TABLE
postgres=# INSERT INTO t VALUES('test', NULL);
INSERT 0 1
postgres=# CREATE MATERIALIZED VIEW m AS SELECT * FROM t;
CREATE UNIQUE INDEX ON m(a);
INSERT INTO t VALUES('test', NULL);
REFSELECT 1
postgres=# CREATE UNIQUE INDEX ON m(a);
CREATE INDEX
postgres=# INSERT INTO t VALUES('test', NULL);
INSERT 0 1
postgres=# REFRESH MATERIALIZED VIEW CONCURRENTLY m; -- must error
ERROR: new data for materialized view "m" contains duplicate rows
DETAIL: Row: (test,)
postgres=# DROP TABLE t CASCADE;
DROP TABLE
postgres=# -- Test 2: (NULL,NULL)×2, single index on 'a' should SUCCEED
postgres=# CREATE TABLE t(a int, b int);
CREATE TABLE
postgres=# INSERT INTO t VALUES(NULL, NULL);
INSERT 0 1
postgres=# CREATE MATERIALIZED VIEW m AS SELECT * FROM t;
SELECT 1
postgres=# CREATE UNIQUE INDEX ON m(a);
CREATE INDEX
postgres=# INSERT INTO t VALUES(NULL, NULL);
INSERT 0 1
postgres=# REFRESH MATERIALIZED VIEW CONCURRENTLY m; -- must succeed
SELECT COUNT(*) FROM m;
DROP TABLE t CASCADE;REFRESH MATERIALIZED VIEW
postgres=# SELECT COUNT(*) FROM m; -- should be 2
count
-------
2
(1 row)
postgres=# DROP TABLE t CASCADE;
DROP TABLE
postgres=# --Test 3: (test,NULL)×2, composite index (a,b) should SUCCEED
postgres=# CREATE TABLE t(a text, b text);
CREATE TABLE
postgres=# INSERT INTO t VALUES('test', NULL);
INSERT 0 1
postgres=# CREATE MATERIALIZED VIEW m AS SELECT * FROM t;
SELECT 1
postgres=# CREATE UNIQUE INDEX ON m(a, b);
INSERT INTO t VALUES('test', NUL CREATE UNIQUE INDEX ON m(a, b);
CREATE INDEX
postgres=# INSERT INTO t VALUES('test', NULL);
INSERT 0 1
postgres=# REFRESH MATERIALIZED VIEW CONCURRENTLY m; -- must succeed
REFRESH MATERIALIZED VIEW
postgres=# SELECT COUNT(*) FROM m;
DROP TABLE t CASCADE;
SELECT COUNT(*) FROM m; -- must be 2
count
-------
2
(1 row)
postgres=# DROP TABLE t CASCADE;
DROP TABLE
postgres=# -- Test 4: unchanged (1,NULL), index on 'a' should SUCCEED.
postgres=# CREATE TABLE t(a int, b int);
CREATE TABLE
postgres=# INSERT INTO t VALUES(1, NULL);
INSERT 0 1
postgres=# CREATE MATERIALIZED VIEW m AS SELECT * FROM t;
SELECT 1
postgres=# CREATE UNIQUE INDEX ON m(a);
CREATE INDEX
postgres=# REFRESH MATERIALIZED VIEW CONCURRENTLY m; -- must succeed
REFRESH MATERIALIZED VIEW
postgres=# SELECT * FROM m; -- must still show (1,)
a | b
---+---
1 |
(1 row)
postgres=# DROP TABLE t CASCADE;
DROP TABLE
postgres=# -- Test 5: (1,NULL)×2, separate index on a AND b, should ERROR
postgres=# CREATE TABLE t(a int, b int);
CREATE TABLE
postgres=# INSERT INTO t VALUES(1, NULL);
INSERT 0 1
postgres=# CREATE MATERIALIZED VIEW m AS SELECT * FROM t;
SELECT 1
postgres=# CREATE UNIQUE INDEX ON m(a);
CREATE INDEX
postgres=# CREATE UNIQUE INDEX ON m(b);
CREATE INDEX
postgres=# INSERT INTO t VALUES(1, NULL);
INSERT 0 1
postgres=# REFRESH MATERIALIZED VIEW CONCURRENTLY m; -- must error
ERROR: duplicate key value violates unique constraint "m_a_idx"
DETAIL: Key (a)=(1) already exists.
CONTEXT: SQL statement "INSERT INTO public.m SELECT (diff.newdata).* FROM
pg_temp_2.pg_temp_16535_2 diff WHERE tid IS NULL"
postgres=# DROP TABLE t CASCADE;
DROP TABLE
postgres=#
Made some minor changes to bug2 patch too.
Regards,
Surya Poondla