[Patch] Add WHERE clause support to REFRESH MATERIALIZED VIEW

Started by Adam Brusselback, 2 days ago, 4 messages
#1 Adam Brusselback
adambrusselback@gmail.com
2 attachment(s)

Attached is a patch implementing support for a WHERE clause in REFRESH
MATERIALIZED VIEW.

The syntax allows for targeted refreshes:
REFRESH MATERIALIZED VIEW mv WHERE invoice_id = ANY('{1,2,3}');
REFRESH MATERIALIZED VIEW CONCURRENTLY mv WHERE customer_id = 42;
REFRESH MATERIALIZED VIEW mv WHERE order_date >= '2023-01-01';

I was inspired to implement this feature after watching the Hacking
Postgres discussion on the topic:
https://www.youtube.com/watch?v=6cZvHjDrmlQ

This allows the user to restrict the refresh operation to a subset of the
view. The qualification is applied to the view's output columns. The
optimizer can then push this condition down to the underlying base tables,
avoiding a full scan when only a known subset of data has changed.
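
For illustration, here is a minimal made-up example of the intended usage
(the orders table and customer_totals view are invented for this message):

CREATE TABLE orders (
    order_id    bigint PRIMARY KEY,
    customer_id integer NOT NULL,
    total       numeric NOT NULL
);

CREATE MATERIALIZED VIEW customer_totals AS
SELECT customer_id, sum(total) AS lifetime_total
FROM orders
GROUP BY customer_id;

CREATE UNIQUE INDEX ON customer_totals (customer_id);

-- Only the affected customer's row needs to be recomputed; the predicate
-- on the output column can be pushed down to the scan of orders.
REFRESH MATERIALIZED VIEW customer_totals WHERE customer_id = 42;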

Implementation notes:

1. The grammar accepts an optional WHERE clause. We forbid volatile
functions in the clause to ensure correctness.

2. Non-Concurrent Partial Refresh: When `CONCURRENTLY` is not specified,
the operation modifies the materialized view in place under a
`ROW EXCLUSIVE` lock.
  * This mode requires a unique index so that constraint violations are
handled correctly (e.g., when a row's values change such that it "drifts"
into or out of the `WHERE` clause's scope).
  * It executes a Prune + Upsert strategy (a rough SQL approximation is
sketched after this list):
    * `DELETE` all rows in the materialized view that match the `WHERE`
clause.
    * `INSERT` the new data from the source query.
  * The insert phase uses `ON CONFLICT DO UPDATE` to handle concurrency
edge cases, making the refresh robust against constraint violations.

3. Concurrent Partial Refresh: When `CONCURRENTLY` is specified, it uses
the existing diff/merge infrastructure (`refresh_by_match_merge`), limiting
the scope of the diff (and the temporary table population) to the rows
matching the predicate. This requires an `EXCLUSIVE` lock and a unique
index, consistent with existing concurrent refresh behavior. It is much
slower than the non-concurrent partial refresh.

4. The execution logic uses SPI to inject the predicate into the source
queries at refresh time.
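
To make the strategy in point 2 concrete, the non-concurrent partial
refresh behaves roughly like the following hand-written SQL (an
approximation of the strategy only, not the actual queries the patch
generates), reusing the customer_totals example above:

BEGIN;

-- Prune: remove the MV rows currently matching the predicate.
DELETE FROM customer_totals WHERE customer_id = 42;

-- Upsert: re-insert the matching slice from the view's defining query,
-- relying on the unique index to absorb any conflicting rows.
INSERT INTO customer_totals (customer_id, lifetime_total)
SELECT customer_id, sum(total)
FROM orders
WHERE customer_id = 42
GROUP BY customer_id
ON CONFLICT (customer_id)
    DO UPDATE SET lifetime_total = EXCLUDED.lifetime_total;

COMMIT;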

I have attached a benchmark suite to validate performance and correctness:

* `setup.sql`: Creates a schema `mv_benchmark` modeling an invoicing
system (`invoices` and `invoice_lines`). It includes an aggregated
materialized view (`invoice_summary`) and a control table
(`invoice_summary_table`).
* `workload_*.sql`: pgbench scripts simulating a high-churn environment
(45% inserts, 10% updates, 45% deletes) to maintain roughly stable dataset
sizes while generating significant refresh work.
* `run_benchmark_comprehensive.sh`: Orchestrates the benchmark across
multiple scale factors and concurrency levels.

The benchmark compares strategies for keeping a summary up to date,
measured relative to the baseline:
* Partial Refresh: Triggers on the base tables collect modified IDs and
execute `REFRESH MATERIALIZED VIEW ... WHERE ...` (a sketch of this
pattern follows the list).
* Materialized Table (Control): A standard table maintained via complex
PL/pgSQL triggers (the traditional manual workaround).
* Full Refresh (Legacy): Manually refresh the view after changes.
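
To give an idea of the trigger-driven pattern without opening the
attachment, here is a minimal sketch (the attached scripts are more
elaborate; names follow the setup description):

-- Collect touched invoice ids into a small "dirty" list.
CREATE TABLE invoice_summary_dirty (invoice_id bigint PRIMARY KEY);

CREATE OR REPLACE FUNCTION note_dirty_invoice() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO invoice_summary_dirty VALUES (OLD.invoice_id)
        ON CONFLICT DO NOTHING;
    ELSE
        INSERT INTO invoice_summary_dirty VALUES (NEW.invoice_id)
        ON CONFLICT DO NOTHING;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER invoice_lines_dirty
AFTER INSERT OR UPDATE OR DELETE ON invoice_lines
FOR EACH ROW EXECUTE FUNCTION note_dirty_invoice();

-- A maintenance step then refreshes only the dirty slice, e.g.:
-- REFRESH MATERIALIZED VIEW invoice_summary
--     WHERE invoice_id = ANY ('{...collected ids...}');
-- DELETE FROM invoice_summary_dirty;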

Results are below:
Concurrency: 1 client(s)
----------------------------------------------------------------------------------
Scale Batch | Baseline TPS | Full (Rel) Partial (Rel) Table (Rel)
---------- ------ | ------------ | ------------ ------------ ------------
20000 1 | 5309.05 | 0.002x 0.437x 0.470x
20000 50 | 1209.32 | 0.010x 0.600x 0.598x
20000 1000 | 56.05 | 0.164x 0.594x 0.576x
400000 1 | 5136.91 | 0x 0.450x 0.487x
400000 50 | 1709.17 | 0x 0.497x 0.482x
400000 1000 | 110.35 | 0.006x 0.507x 0.460x

Concurrency: 4 client(s)
----------------------------------------------------------------------------------
Scale Batch | Baseline TPS | Full (Rel) Partial (Rel) Table (Rel)
---------- ------ | ------------ | ------------ ------------ ------------
20000 1 | 19197.50 | 0x 0.412x 0.435x
20000 50 | 1016.14 | 0.007x 0.966x 1.036x
20000 1000 | 9.94 | 0.708x 1.401x 1.169x
400000 1 | 19637.36 | 0x 0.436x 0.483x
400000 50 | 4669.32 | 0x 0.574x 0.566x
400000 1000 | 23.26 | 0.029x 1.147x 0.715x

Concurrency: 8 client(s)
----------------------------------------------------------------------------------
Scale Batch | Baseline TPS | Full (Rel) Partial (Rel) Table (Rel)
---------- ------ | ------------ | ------------ ------------ ------------
20000 1 | 30358.32 | 0x 0.440x 0.457x
20000 50 | 262.75 | 0.026x 2.943x 2.740x
20000 1000 | 11.28 | 0.575x 0.840x 0.578x
400000 1 | 36007.15 | 0x 0.430x 0.464x
400000 50 | 6664.58 | 0x 0.563x 0.494x
400000 1000 | 11.61 | 0.058x 1.000x 1.277x

In these tests, the partial refresh behaves as O(delta) rather than
O(total), performing comparably to the manual PL/pgSQL approach but with
significantly lower code complexity for the user.

I recognize that adding a WHERE clause to REFRESH is an extension to the
SQL standard. I believe the syntax is intuitive, but I am open to
discussion regarding alternative implementation strategies or syntax if the
community feels a different approach is warranted.

New regression tests are included in the patch.

This is my first time submitting a patch to PostgreSQL, so please bear with
me if I've missed anything or made any procedural mistakes. I'm happy to
address any feedback.

Thanks,
Adam Brusselback

Attachments:

benchmark.zip (application/zip)
001_implement_mat_view_where.patch (text/x-patch; charset=US-ASCII)
#2 Kirk Wolak
wolakk@gmail.com
In reply to: Adam Brusselback (#1)
Re: [Patch] Add WHERE clause support to REFRESH MATERIALIZED VIEW

On Mon, Dec 8, 2025 at 3:58 PM Adam Brusselback <adambrusselback@gmail.com>
wrote:

Attached is a patch implementing support for a WHERE clause in REFRESH
MATERIALIZED VIEW.

The syntax allows for targeted refreshes:
REFRESH MATERIALIZED VIEW mv WHERE invoice_id = ANY('{1,2,3}');
REFRESH MATERIALIZED VIEW CONCURRENTLY mv WHERE customer_id = 42;
REFRESH MATERIALIZED VIEW mv WHERE order_date >= '2023-01-01';

I was inspired to implement this feature after watching the Hacking
Postgres discussion on the topic:
https://www.youtube.com/watch?v=6cZvHjDrmlQ

+1 (But I was in that hacking session).

Our situation was a wonderful MV with all the columns we needed (some
hard to calculate) to augment search data done millions of times/day. It
was a thing of beauty. Until we realized we needed to update 1 record
(vendor inventory UPDATE date/time) every time we processed a file
(something we do 24x7, a hundred times each hour!).

For that ONE field, we ended up doing REFRESH MV concurrently, OVER
2,000 times per day.

Our understanding is that many people run into this exact issue. The
cache needs small frequent updates.
(After reading the code that handles MVs, we just created our own TABLE,
and maintain it with a scheduler to rebuild HOURLY,
and when we process the file, a Simple UPDATE is issued for the one column).

While this "Works", the CONCEPT of this patch (untested by me, as of
yet), would have fixed this with far less effort,
and would be easier to maintain.

After I review the code, I will add additional comments.

I am curious what others think? (And FWIW, I believe that the larger the
MV, the MORE this feature is needed,
vs refreshing the ENTIRE view).

Regards...

#3 Nico Williams
nico@cryptonector.com
In reply to: Kirk Wolak (#2)
Re: [Patch] Add WHERE clause support to REFRESH MATERIALIZED VIEW

On Tue, Dec 09, 2025 at 12:08:58AM -0500, Kirk Wolak wrote:

Our understanding is that many people run into this exact issue. The
cache needs small frequent updates.
(After reading the code that handles MVs, we just created our own TABLE,
and maintain it with a scheduler to rebuild HOURLY,
and when we process the file, a Simple UPDATE is issued for the one column).

While this "Works", the CONCEPT of this patch (untested by me, as of
yet), would have fixed this with far less effort,
and would be easier to maintain.

After I review the code, I will add additional comments.

I am curious what others think? (And FWIW, I believe that the larger the
MV, the MORE this feature is needed,
vs refreshing the ENTIRE view).

What I do is I have my own materialized view infrastructure, written
entirely in PL/pgSQL, and I completely avoid PG's MV support. This
alternative MV scheme creates an actual table for each MV, which means:

- one can update the MV directly (and I do, via triggers)

- one can have triggers on the MV (e.g., to record history)

This has been very handy for me.

I also have a state table in which to keep track of whether an MV needs
a refresh, and I have a function I can use to mark an MV as needing a
refresh. Marking an MV as needing a refresh sends a NOTIFY, and then I
have a daemon that will refresh views as needed (with some
debouncing/coalescing of notifications).
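
In outline, the marking side looks something like this (a simplified
sketch, not the actual code; see the link below):

CREATE TABLE mv_refresh_state (
    mv_name       text PRIMARY KEY,
    needs_refresh boolean NOT NULL DEFAULT false
);

CREATE OR REPLACE FUNCTION mark_mv_needs_refresh(p_mv text) RETURNS void AS $$
BEGIN
    INSERT INTO mv_refresh_state (mv_name, needs_refresh)
    VALUES (p_mv, true)
    ON CONFLICT (mv_name) DO UPDATE SET needs_refresh = true;

    -- Wake the refresh daemon, which debounces notifications and
    -- refreshes the marked views as needed.
    PERFORM pg_notify('mv_refresh', p_mv);
END;
$$ LANGUAGE plpgsql;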

This way I can have MVs with very complex underlying queries: for some
kinds of updates I can easily write fast triggers, and for the others,
where I can't (or where triggers would slow down transactions too much),
I simply mark the MV as needing a refresh.

Typical MV queries this works very well for include transitive
reachability closure computations (e.g., all the groups a thing is a
member of, directly or indirectly, or vice versa -- basically recursive
CTEs). I now have triggers that can do a reasonably good job of
synchronously and quickly updating MVs with such queries, but I didn't
always.
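
A trivial example of the kind of query I mean (hypothetical schema):

CREATE TABLE group_members (
    member_id bigint NOT NULL,   -- a user or a group
    group_id  bigint NOT NULL,   -- the group it belongs to directly
    PRIMARY KEY (member_id, group_id)
);

CREATE MATERIALIZED VIEW group_closure AS
WITH RECURSIVE closure AS (
    SELECT member_id, group_id FROM group_members
    UNION
    SELECT c.member_id, gm.group_id
    FROM closure c
    JOIN group_members gm ON gm.member_id = c.group_id
)
SELECT member_id, group_id FROM closure;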

Refreshes are always 'concurrent'.

This is my 80/20 solution to the "Incremental View Maintenance" (IVM)
problem.

A not very current version is here:

https://github.com/twosigma/postgresql-contrib/blob/master/mat_views.sql

If you like it I might be able to get a newer version out. The version
above has a few minor issues:

- it uses DELETE FROM instead of TRUNCATE for its sort-of temp tables

- using TRUNCATE ends up requiring some care to avoid occasional
deadlocks with VACUUM that are due to using tables as types of the
columns of the deltas tables

- logging -- lots of logging in the newest version

Another issue is that I rely on NATURAL FULL OUTER JOIN to avoid having
to generate ON conditions, but that means that all columns of the
underlying VIEW must not have NULLs. As I've not needed to support
nullable columns in these MVs, I don't mind.

Nico
--

#4 Adam Brusselback
adambrusselback@gmail.com
In reply to: Kirk Wolak (#2)
Re: [Patch] Add WHERE clause support to REFRESH MATERIALIZED VIEW

Our situation was a wonderful MV with all the columns we needed (some
hard to calculate) to augment search data done millions of times/day. It
was a thing of beauty. Until we realized we needed to update 1 record
(vendor inventory UPDATE date/time) every time we processed a file
(something we do 24x7, a hundred times each hour!).

For that ONE field, we ended up doing REFRESH MV concurrently, OVER
2,000 times per day.

Thanks for the feedback and the use case, Kirk.

Regarding that specific scenario where a single column ("last updated" or
similar) churns significantly faster than the heavy-computation columns:
Even with this patch, you might find it beneficial to separate that
high-velocity column into its own small materialized view (or regular view)
and join it to the main MV at query time. That will reduce the bloat you
get on the main MV by quite a lot, especially if you have very wide rows
(which it seems like you do).
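
Something like the following is what I have in mind (all names here are
invented for the example):

-- Keep only the high-velocity column in a small side MV...
CREATE MATERIALIZED VIEW vendor_last_update AS
SELECT vendor_id, max(processed_at) AS last_updated
FROM vendor_file_loads
GROUP BY vendor_id;

CREATE UNIQUE INDEX ON vendor_last_update (vendor_id);

-- ...and join it to the wide, expensive MV at query time, so refreshing
-- the small MV never rewrites the wide rows.
CREATE VIEW vendor_search AS
SELECT s.*, u.last_updated
FROM vendor_search_mv s
LEFT JOIN vendor_last_update u USING (vendor_id);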

I initially tried to implement logic that would allow for direct UPDATEs
(which would enable HOT updates). However, to handle rows that matched the
predicate but were no longer present in the new source data, I had to run
an anti-join to identify them for deletion. That approach caused
performance issues, so I settled on the "Prune + Upsert" strategy (DELETE
matching rows, then INSERT from source).
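
For reference, the deletion step that made the direct-UPDATE approach
unattractive is essentially an anti-join of this shape (illustrative only;
a plain scan of invoices stands in for the view's defining query):

DELETE FROM invoice_summary s
WHERE s.invoice_id = ANY ('{1,2,3}')
  AND NOT EXISTS (
      SELECT 1
      FROM invoices i          -- stands in for the view's source query
      WHERE i.invoice_id = s.invoice_id
  );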

Because this patch performs a delete/insert cycle, updating that one
timestamp column will still result in rewriting the whole tuple in the MV.

For that ONE field, we ended up doing REFRESH MV concurrently, OVER
2,000 times per day.

That said, 2,000 refreshes per day is nothing for this implementation,
provided your updates are selective enough and your queries allow for
predicate push-down to the base tables.

I look forward to your thoughts after reviewing the code.

Thanks,
Adam Brusselback