Idea: GSoC - Query Rewrite with Materialized Views

Started by Eric Grinsteinalmost 11 years ago12 messages
#1Eric Grinstein
eric@aluno.puc-rio.br

Hello Everyone,

I'm Eric Grinstein, an Comp. Engineering undergrad at PUC-Rio, Brazil.
I'm very excited about contributing to Postgres, and thought GSoC would be
a very good program to get me started.

Some commercial RDBMS such as oracle implement a feature called 'Query
Rewrite'
using materialized views. When a query is being processed, the system
checks whether a materialized view created on a similar query could be
useful for the query. If so, it rewrites it so it can take advantage of the
MV, usually dispensing a lot of calculation.

I think the implementation of the feature would be a nice project to work
on, and very benefical to Postgres. Many people (including myself) miss the
feature on it, and

So, could you give me your opinion about it? If I made a proposal, would
the community's response be positive? If you have things to add, new ideas,
I'd be very excited to hear them.

Thank you for your time! Hope to make my first patch soon!

Eric

#2Kevin Grittner
kgrittn@ymail.com
In reply to: Eric Grinstein (#1)
Re: Idea: GSoC - Query Rewrite with Materialized Views

Eric Grinstein <eric@aluno.puc-rio.br> wrote:

I'm very excited about contributing to Postgres, and thought GSoC
would be a very good program to get me started.

Great!

Some commercial RDBMS such as oracle implement a feature called
'Query Rewrite' using materialized views. When a query is being
processed, the system checks whether a materialized view created on
a similar query could be useful for the query. If so, it rewrites
it so it can take advantage of the MV, usually dispensing a lot of
calculation.

Oracle, MS SQL Server, Sybase ASE, and IBM DB2 all have this.
(There may be others.) In essence they treat an MV a bit like an
index, as something you can create to speed up an existing query
without rewriting it. It would certainly be nice to have this in
PostgreSQL, too, in my opinion.

So, could you give me your opinion about it? If I made a proposal,
would the community's response be positive? If you have things to
add, new ideas, I'd be very excited to hear them.

That seems extraordinarily difficult for a GSoC project. Unless
you can demonstrate mastery of the concepts involved in such
optimizations, and a familiarity with the PostgreSQL planner, with
a plan to put forward for how you would do this I think you should
set a more modest goal. Perhaps you could find something to work
on related to the planner that is small enough to be achieved in
the limited time allowed for a GSoC project, that would move you
closer to taking on something this big.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Kevin Grittner (#2)
Re: Idea: GSoC - Query Rewrite with Materialized Views

On 20.2.2015 22:45, Kevin Grittner wrote:

Oracle, MS SQL Server, Sybase ASE, and IBM DB2 all have this. (There
may be others.) In essence they treat an MV a bit like an index, as
something you can create to speed up an existing query without
rewriting it. It would certainly be nice to have this in PostgreSQL,
too, in my opinion.

Yeah. The trouble is indexes are up-to-date, but MVs may not be - there
might be changes since the last REFRESH, which makes the rewrite more
complex. We don't want to use stale MVs for the rewrite, so we'd have to
identify the stale MVs somehow - AFAIK we don't have a flag for that.

That seems extraordinarily difficult for a GSoC project. Unless you
can demonstrate mastery of the concepts involved in such
optimizations, and a familiarity with the PostgreSQL planner, with a
plan to put forward for how you would do this I think you should set
a more modest goal. Perhaps you could find something to work on
related to the planner that is small enough to be achieved in the
limited time allowed for a GSoC project, that would move you closer
to taking on something this big.

I share the view that this would be very valuable, but the scope far
exceeds what can be done within a single GSoC project. But maybe we
could split that into multiple pieces, and Eric would implement only the
first piece?

For example the 'is_stale' flag for a MV would be really useful, making
it possible to refresh only the MVs that actually need a refresh.

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Kevin Grittner
kgrittn@ymail.com
In reply to: Tomas Vondra (#3)
Re: Idea: GSoC - Query Rewrite with Materialized Views

Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

I share the view that this would be very valuable, but the scope
far exceeds what can be done within a single GSoC project. But
maybe we could split that into multiple pieces, and Eric would
implement only the first piece?

For example the 'is_stale' flag for a MV would be really useful,
making it possible to refresh only the MVs that actually need a
refresh.

You may be on to something there. Frankly, though, I'm not sure
that we could even reach consensus within the community on a
detailed design for how we intend to track staleness (that will
hold up both now and once we have incremental maintenance of
materialized views working) within the time frame of a GSoC
project. This would need to be done with an eye toward how it
might be used in direct references (will we allow a "staleness
limit" on a reference from a query?), for use in a rewrite, and how
it will interact with changes to base tables and with both REFRESH
statements and incremental maintenance at various levels of
"eagerness". I'm not sure that staleness management wouldn't be
better left until we have some of those other parts for it to work
with.

Questions to consider:

Some other products allow materialized views to be partitioned and
staleness to be tracked by partition, and will check which
partitions will be accessed in determining staleness. Is that
something we want to allow for?

Once we have incremental maintenance, an MV maintained in an
"eager" fashion (changes are visible in the MV as soon as the
transaction modifying the underlying table commit) could be
accessed with a MVCC snapshots, with different snapshots seeing
different versions. It seems pretty clear that such an MV would
always be considered "fresh", so there would be no need to
constantly flipping to stale and back again as the underlying table
were changed and the changes were reflected in the MV. How do we
handle that?

If changes to an MV are less eager (they are queued for application
after COMMIT, as time permits) would we want to track the xid of
how far along they are, so that we can tell whether a particular
snapshot is safe to use? Do we want to allow a non-MVCC snapshot
that shows the latest version of each row? Only if staleness is
minimal?

What about MVs which don't have incremental maintenance? We can
still determine what xid they are current "as of", from the
creation or the latest refresh. Do we want to track that instead
of a simple boolean flag?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Kevin Grittner (#4)
Re: Idea: GSoC - Query Rewrite with Materialized Views

On 21.2.2015 00:20, Kevin Grittner wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

I share the view that this would be very valuable, but the scope
far exceeds what can be done within a single GSoC project. But
maybe we could split that into multiple pieces, and Eric would
implement only the first piece?

For example the 'is_stale' flag for a MV would be really useful,
making it possible to refresh only the MVs that actually need a
refresh.

You may be on to something there. Frankly, though, I'm not sure
that we could even reach consensus within the community on a
detailed design for how we intend to track staleness (that will
hold up both now and once we have incremental maintenance of
materialized views working) within the time frame of a GSoC
project. This would need to be done with an eye toward how it
might be used in direct references (will we allow a "staleness
limit" on a reference from a query?), for use in a rewrite, and how
it will interact with changes to base tables and with both REFRESH
statements and incremental maintenance at various levels of
"eagerness". I'm not sure that staleness management wouldn't be
better left until we have some of those other parts for it to work
with.

Doing that properly is going to be nontrivial, no doubt about that. I
was thinking about keeping a simple list of updated tables (oids) and
then at commit time, deciding which MVs to depend on that and setting
some sort of flag (or XID) for all those MVs. But maybe there's a better
way.

Questions to consider:

Some other products allow materialized views to be partitioned and
staleness to be tracked by partition, and will check which partitions
will be accessed in determining staleness. Is that something we want
to allow for?

I think we need to get this working for simple MVs, especially because
we don't have partitioned MVs (or the type of declarative partitioning
the other products do have).

Once we have incremental maintenance, an MV maintained in an "eager"
fashion (changes are visible in the MV as soon as the transaction
modifying the underlying table commit) could be accessed with a MVCC
snapshots, with different snapshots seeing different versions. It
seems pretty clear that such an MV would always be considered
"fresh", so there would be no need to constantly flipping to stale
and back again as the underlying table were changed and the changes
were reflected in the MV. How do we handle that?

Yes, incrementally updated MVs might be used more easily, without
tracking staleness. But we don't have that now, and it's going to take a
significant amount of time to get there.

Also, not all MVs can be updated incrementally, so either we allow only
simple MVs to be used for rewrites, or we'll have to implement the
'stale' flag anyway.

If changes to an MV are less eager (they are queued for application
after COMMIT, as time permits) would we want to track the xid of how
far along they are, so that we can tell whether a particular snapshot
is safe to use? Do we want to allow a non-MVCC snapshot that shows
the latest version of each row? Only if staleness is minimal?

Maybe. When I talk about 'flag' I actually mean a simple way to
determine whether the MV is up-to-date or not. Snapshots and XIDs are
probably the right way to do that in MVCC-based system.

What about MVs which don't have incremental maintenance? We can still
determine what xid they are current "as of", from the creation or the
latest refresh. Do we want to track that instead of a simple boolean
flag?

How would we use the 'as of' XID? IMHO it's unacceptable to quietly use
stale data unless the user explicitly references the MV, so we'd have to
assume we can't use that MV.

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Eric Grinstein
eric@aluno.puc-rio.br
In reply to: Tomas Vondra (#5)
Re: Idea: GSoC - Query Rewrite with Materialized Views

Thank you for your answers.
I would be very interested in tracking the staleness of the MV.
You see, I work in a research group in database tuning, and we have
implemented some solutions to take advantage of MV's and speed up queries.
The query rewrite feature would be extremely desirable for us.
Do you think that implementing the staleness check as suggested by Thomas
could get us started in the query rewrite business? Do you think I should
make a proposal
or there are more interesting subjects to GSoC? I'd be happy to hear
project suggestions, especially
related to the optimizer, tuning, etc.

Eric

2015-02-20 22:35 GMT-02:00 Tomas Vondra <tomas.vondra@2ndquadrant.com>:

Show quoted text

On 21.2.2015 00:20, Kevin Grittner wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

I share the view that this would be very valuable, but the scope
far exceeds what can be done within a single GSoC project. But
maybe we could split that into multiple pieces, and Eric would
implement only the first piece?

For example the 'is_stale' flag for a MV would be really useful,
making it possible to refresh only the MVs that actually need a
refresh.

You may be on to something there. Frankly, though, I'm not sure
that we could even reach consensus within the community on a
detailed design for how we intend to track staleness (that will
hold up both now and once we have incremental maintenance of
materialized views working) within the time frame of a GSoC
project. This would need to be done with an eye toward how it
might be used in direct references (will we allow a "staleness
limit" on a reference from a query?), for use in a rewrite, and how
it will interact with changes to base tables and with both REFRESH
statements and incremental maintenance at various levels of
"eagerness". I'm not sure that staleness management wouldn't be
better left until we have some of those other parts for it to work
with.

Doing that properly is going to be nontrivial, no doubt about that. I
was thinking about keeping a simple list of updated tables (oids) and
then at commit time, deciding which MVs to depend on that and setting
some sort of flag (or XID) for all those MVs. But maybe there's a better
way.

Questions to consider:

Some other products allow materialized views to be partitioned and
staleness to be tracked by partition, and will check which partitions
will be accessed in determining staleness. Is that something we want
to allow for?

I think we need to get this working for simple MVs, especially because
we don't have partitioned MVs (or the type of declarative partitioning
the other products do have).

Once we have incremental maintenance, an MV maintained in an "eager"
fashion (changes are visible in the MV as soon as the transaction
modifying the underlying table commit) could be accessed with a MVCC
snapshots, with different snapshots seeing different versions. It
seems pretty clear that such an MV would always be considered
"fresh", so there would be no need to constantly flipping to stale
and back again as the underlying table were changed and the changes
were reflected in the MV. How do we handle that?

Yes, incrementally updated MVs might be used more easily, without
tracking staleness. But we don't have that now, and it's going to take a
significant amount of time to get there.

Also, not all MVs can be updated incrementally, so either we allow only
simple MVs to be used for rewrites, or we'll have to implement the
'stale' flag anyway.

If changes to an MV are less eager (they are queued for application
after COMMIT, as time permits) would we want to track the xid of how
far along they are, so that we can tell whether a particular snapshot
is safe to use? Do we want to allow a non-MVCC snapshot that shows
the latest version of each row? Only if staleness is minimal?

Maybe. When I talk about 'flag' I actually mean a simple way to
determine whether the MV is up-to-date or not. Snapshots and XIDs are
probably the right way to do that in MVCC-based system.

What about MVs which don't have incremental maintenance? We can still
determine what xid they are current "as of", from the creation or the
latest refresh. Do we want to track that instead of a simple boolean
flag?

How would we use the 'as of' XID? IMHO it's unacceptable to quietly use
stale data unless the user explicitly references the MV, so we'd have to
assume we can't use that MV.

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Kevin Grittner
kgrittn@ymail.com
In reply to: Eric Grinstein (#6)
Re: Idea: GSoC - Query Rewrite with Materialized Views

Eric Grinstein <eric@aluno.puc-rio.br> wrote:

I would be very interested in tracking the staleness of the MV.
You see, I work in a research group in database tuning, and we
have implemented some solutions to take advantage of MV's and
speed up queries.

Please be sure, right up front, that there are no intellectual
property issues for what you pick. We're happy to give credit
where credit is due, but we can't accept contributions which have
any license restrictions incompatible with the PostgreSQL License:

http://www.postgresql.org/about/licence/

The query rewrite feature would be extremely desirable for us.
Do you think that implementing the staleness check as suggested
by Thomas could get us started in the query rewrite business?

There are many aspects related to the definition, maintenance, and
use of MVs that need work; it seems to me that many of them can be
pursued in parallel as long as people are communicating. Staleness
tracking is definitely one aspect that is needed. If you want to
put forward a proposal for that, which seems to be of a scope that
is possible in the context of GSoC, that would be great. If there
is any other aspect of the MV "big picture" that you can think of
that you would like to tackle and seems of appropriate scope,
please don't feel constrained to "staleness" as the only possible
project; it was just one suggestion of something that might be of
about the right size.

Do you think I should make a proposal or there are more
interesting subjects to GSoC? I'd be happy to hear project
suggestions, especially related to the optimizer, tuning, etc.

If you can find a project in any of those areas that is of interest
to you and of a scope that allows a deliverable within the time
frame of a GSoC project, go for it! The optimizer seems like a
pretty challenging place for a first contribution, though; I would
recommend looking back through recent git history at a few
optimizer changes to get a feel for what that looks like.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Kevin Grittner (#7)
Re: Idea: GSoC - Query Rewrite with Materialized Views

On 3/2/15 9:03 AM, Kevin Grittner wrote:

The query rewrite feature would be extremely desirable for us.

Do you think that implementing the staleness check as suggested
by Thomas could get us started in the query rewrite business?

There are many aspects related to the definition, maintenance, and
use of MVs that need work; it seems to me that many of them can be
pursued in parallel as long as people are communicating. Staleness
tracking is definitely one aspect that is needed. If you want to
put forward a proposal for that, which seems to be of a scope that
is possible in the context of GSoC, that would be great. If there
is any other aspect of the MV "big picture" that you can think of
that you would like to tackle and seems of appropriate scope,
please don't feel constrained to "staleness" as the only possible
project; it was just one suggestion of something that might be of
about the right size.

FWIW, what I would find most useful at this point is a way to get the
equivalent of an AFTER STATEMENT trigger that provided all changed rows
in a MV as the result of a statement. That would at least allow people
do do their own MV refresh work without needing to study the methods for
identifying how the results of a statement impact what should be in the
MV. I think even something that just does that in pure SQL/plpgsql would
be a big step forward, even if we wouldn't want it directly in the codebase.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Eric Grinstein
eric@aluno.puc-rio.br
In reply to: Jim Nasby (#8)
Re: Idea: GSoC - Query Rewrite with Materialized Views

I think even something that just does that in pure SQL/plpgsql would be a
big step forward, even if we wouldn't want it directly in the codebase.

Something like creating a trigger under the hood each time a MV is created,
that checks the changed rows on every statement against the query that
generated the MV? That also seems feasible, but wouldn't it be rather too
small for my three months of GSoC?

2015-03-02 20:22 GMT-03:00 Jim Nasby <Jim.Nasby@bluetreble.com>:

Show quoted text

On 3/2/15 9:03 AM, Kevin Grittner wrote:

The query rewrite feature would be extremely desirable for us.

Do you think that implementing the staleness check as suggested
by Thomas could get us started in the query rewrite business?

There are many aspects related to the definition, maintenance, and
use of MVs that need work; it seems to me that many of them can be
pursued in parallel as long as people are communicating. Staleness
tracking is definitely one aspect that is needed. If you want to
put forward a proposal for that, which seems to be of a scope that
is possible in the context of GSoC, that would be great. If there
is any other aspect of the MV "big picture" that you can think of
that you would like to tackle and seems of appropriate scope,
please don't feel constrained to "staleness" as the only possible
project; it was just one suggestion of something that might be of
about the right size.

FWIW, what I would find most useful at this point is a way to get the
equivalent of an AFTER STATEMENT trigger that provided all changed rows in
a MV as the result of a statement. That would at least allow people do do
their own MV refresh work without needing to study the methods for
identifying how the results of a statement impact what should be in the MV.
I think even something that just does that in pure SQL/plpgsql would be a
big step forward, even if we wouldn't want it directly in the codebase.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

#10Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jim Nasby (#8)
Re: Idea: GSoC - Query Rewrite with Materialized Views

Jim Nasby wrote:

FWIW, what I would find most useful at this point is a way to get the
equivalent of an AFTER STATEMENT trigger that provided all changed rows in a
MV as the result of a statement.

Ah, like
/messages/by-id/1402790204.65037.YahooMailNeo@web122301.mail.ne1.yahoo.com

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11David Fetter
david@fetter.org
In reply to: Alvaro Herrera (#10)
Re: Idea: GSoC - Query Rewrite with Materialized Views

On Tue, Mar 03, 2015 at 05:49:06PM -0300, Alvaro Herrera wrote:

Jim Nasby wrote:

FWIW, what I would find most useful at this point is a way to get
the equivalent of an AFTER STATEMENT trigger that provided all
changed rows in a MV as the result of a statement.

Ah, like
/messages/by-id/1402790204.65037.YahooMailNeo@web122301.mail.ne1.yahoo.com

Yes, very much like that.

Kevin, might you be able to give some guidance on this?

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: David Fetter (#11)
Re: Idea: GSoC - Query Rewrite with Materialized Views

On 3/3/15 3:34 PM, David Fetter wrote:

On Tue, Mar 03, 2015 at 05:49:06PM -0300, Alvaro Herrera wrote:

Jim Nasby wrote:

FWIW, what I would find most useful at this point is a way to get
the equivalent of an AFTER STATEMENT trigger that provided all
changed rows in a MV as the result of a statement.

Ah, like
/messages/by-id/1402790204.65037.YahooMailNeo@web122301.mail.ne1.yahoo.com

Yes, very much like that.

Actually, I was talking about the next step beyond that. I don't want
what changed in a single table; I want what changed *in the source of
the entire MV*. Kevin has a whitepaper that describes how to do this in
set notation; theoretically this is a matter of converting that to SQL.
IIRC this needs the deltas and current (or maybe NEW and OLD) for every
table in the MV. So one way you could model this is a function that
accepts a bunch of NEW and OLD recordsets.

Theoretically you could actually drive that with per-row triggers, but
the performance would presumably suck. Next best thing would be
providing NEW and OLD for AFTER STATEMENT triggers (what Kevin was
talking about in that email). Though, if you're driving this at a
statement level that means you can't actually reference the MV in a
statement that's performing DML on any of the dependent tables.

As you can see, this is all pretty involved. Doing just a small part of
this would make for a good GSoC project. AFTER STATEMENT NEW and OLD
might be a good project; I don't know how much more work Kevin's stuff
needs. But there's much greater value in creating something that would
take the definition for a MV and turn that into appropriate delta logic.
That would be the basis for detecting if a MV was stale (beyond just the
gross level check of were any of the tables involved touched), and is
what is needed to do *any* kind of incremental update.

That logic doesn't have to be driven by triggers. For example, you could
have PgQ or similar capture all DML on all tables for a MV and feed that
data to the delta logic on an async incremental basis. It's pretty easy
for an end user to setup PgQ or similar but doing the delta logic is
tightly coupled to the MV definition, which would be very hard for an
end user to deal with.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers