Relational Algebra and Aggregate Functions
I'm working on improving my background database theory, to aid in practice.
I've found learning relational algebra to be very helpful. One thing which
relational algebra doesn't cover is aggregate functions. Can anyone
recommend any papers or web pages which provide some good theoretical
background for aggregate functions?
On Sun, Jul 26, 2009 at 03:36:26PM -0400, Robert James wrote:
Can anyone
recommend any papers or web pages which provide some good theoretical
background for aggregate functions?
My knowledge of relational algebra is somewhat non-existent as well;
I tend to just think of them as a "fold" from normal functional
programming. The Wikipedia page is a reasonable description:
http://en.wikipedia.org/wiki/Fold_(higher-order_function)
Not sure how helpful that is though!
--
Sam http://samason.me.uk/
On Sun, 2009-07-26 at 15:36 -0400, Robert James wrote:
I'm working on improving my background database theory, to aid in
practice. I've found learning relational algebra to be very helpful.
One thing which relational algebra doesn't cover is aggregate
functions. Can anyone recommend any papers or web pages which provide
some good theoretical background for aggregate functions?
When it comes to relational theory, C.J. Date is a good author. "An
Introduction To Database Systems" covers pretty much everything.
There's a formal definition of a relational algebra (including
SUMMARIZE, which is the authors' version of an aggregate operator)
defined with only two operators here:
http://thethirdmanifesto.com/
(look for "Appendix A")
Although Appendix A is not easy to understand without some basic
familiarity with the authors' other works.
Regards,
Jeff Davis
Thanks for all the good replies (both on and off list). It seems the
consensus is for me to read Christopher Date. I found two relevant Date
books:
1) Introduction to Database Systems
http://www.amazon.com/Introduction-Database-Systems-Kannan-Swamynathan/dp/B001BVYKY4/ref=sr_1_5?ie=UTF8&s=books&qid=1248742811&sr=1-5
and
2) Database in Depth: Relational Theory for Practitioners
http://www.amazon.com/Database-Depth-Relational-Theory-Practitioners/dp/0596100124/ref=sr_1_7?ie=UTF8&s=books&qid=1248742811&sr=1-7
Any recommendations as to which? From the titles, I'd be inclined towards
the second, but not if the first is better. One thing I'm not interested in
is polemics against SQL and lamentations on how ignorant all practitioners
are.
On Mon, Jul 27, 2009 at 2:45 PM, Jeff Davis <pgsql@j-davis.com> wrote:
Show quoted text
On Sun, 2009-07-26 at 15:36 -0400, Robert James wrote:
I'm working on improving my background database theory, to aid in
practice. I've found learning relational algebra to be very helpful.
One thing which relational algebra doesn't cover is aggregate
functions. Can anyone recommend any papers or web pages which provide
some good theoretical background for aggregate functions?When it comes to relational theory, C.J. Date is a good author. "An
Introduction To Database Systems" covers pretty much everything.There's a formal definition of a relational algebra (including
SUMMARIZE, which is the authors' version of an aggregate operator)
defined with only two operators here:
http://thethirdmanifesto.com/
(look for "Appendix A")Although Appendix A is not easy to understand without some basic
familiarity with the authors' other works.Regards,
Jeff Davis
On Mon, 2009-07-27 at 21:05 -0400, Robert James wrote:
1) Introduction to Database Systems
http://www.amazon.com/Introduction-Database-Systems-Kannan-Swamynathan/dp/B001BVYKY4/ref=sr_1_5?ie=UTF8&s=books&qid=1248742811&sr=1-5and
2) Database in Depth: Relational Theory for Practitioners
http://www.amazon.com/Database-Depth-Relational-Theory-Practitioners/dp/0596100124/ref=sr_1_7?ie=UTF8&s=books&qid=1248742811&sr=1-7
I recommend #2. It's shorter and easier to read than "An Introduction to
Database Systems", and I think it will answer your question about
relational theory and aggregates (see "SUMMARIZE").
Appendix A is a part of "Databases, Types, and The Relational Model: The
Third Manifesto". That's interesting, but it's not an easy read, it's
describing a system in formal detail.
Regards,
Jeff Davis
Many wrote that the functional programming 'fold' is a good model for
relational aggregate functions. I have a few difficulties with this:
1. fold doesn't offer any type of GROUP BY, which is an essential component
of aggregation.
2. I don't believe fold can handle things like AVG() or STDDEV(). Can it?
Conversely, fold can handle non-commutative and non-associative operators,
which I don't believe can be used for aggregation.
3. fold is defined on sequences, not sets. This doesn't seem to be a
problem until you think about cases where there a duplicates of the
aggregated field. (For instance, there are 10 bags each weighing 5 lbs, and
you want SUM(weight) - you need to project weight onto a collection which
allows for 10 occurences, or define the aggregate function to work on the
whole tuple somehow... I know a man named Krug worked out a formal theory
for this...)
On Tue, Jul 28, 2009 at 09:14:38AM -0400, Robert James wrote:
Many wrote that the functional programming 'fold' is a good model for
relational aggregate functions. I have a few difficulties with this:
1. fold doesn't offer any type of GROUP BY, which is an essential component
of aggregation.
Not sure if I'd agree, a GROUP BY without any aggregate functions looks
pretty indistinguishable from just a DISTINCT on the same columns to me.
2. I don't believe fold can handle things like AVG() or STDDEV(). Can it?
An "aggregate" in a database bundles together several things that you
get direct access to in most functional languages; have a look at:
http://www.postgresql.org/docs/current/static/sql-createaggregate.html
In Haskell, to pick one particularly express language, I'd write AVG as:
avg = uncurry (/) . foldl (\(t,n) v -> (t+v,n+1)) (0,0)
running "avg [1,7,15]" gives me back approximately "7.6". If I move
things around so they look like they would in PG, I get:
CREATE AGGREGATE avg (float8) (
INITCOND = (0,0)
STYPE = (float8,float8),
SFUNC = foldl (\(t,n) v -> (t+v,n+1)),
FINALFUNC = uncurry (/),
);
Conversely, fold can handle non-commutative and non-associative operators,
which I don't believe can be used for aggregation.
Strictly speaking this is correct; but in practice PG doesn't care about
this. array_accum is a reasonable example (I think) as the resulting
array will be in the same order as it was given to the aggregation
function.
3. fold is defined on sequences, not sets. This doesn't seem to be a
problem until you think about cases where there a duplicates of the
aggregated field. (For instance, there are 10 bags each weighing 5 lbs, and
you want SUM(weight) - you need to project weight onto a collection which
allows for 10 occurences, or define the aggregate function to work on the
whole tuple somehow... I know a man named Krug worked out a formal theory
for this...)
I don't see why this is a problem at all; could you give a concrete
example?
--
Sam http://samason.me.uk/
On Jul 27, 2009, at 21:05 , Robert James wrote:
2) Database in Depth: Relational Theory for Practitioners
http://www.amazon.com/Database-Depth-Relational-Theory-Practitioners/dp/0596100124/ref=sr_1_7?ie=UTF8&s=books&qid=1248742811&sr=1-7
"Database in Depth" is good, though he's effectively rewritten it as
"SQL and Relational Theory: How to Write Accurate SQL Code"
http://www.amazon.com/SQL-Relational-Theory-Write-Accurate/dp/0596523068
Michael Glaesemann
grzm seespotcode net
Thanks! "SQL and Relational Theory: How to Write Accurate SQL Code" looks
like the best pick of the bunch.
On Tue, Jul 28, 2009 at 10:08 AM, Michael Glaesemann
<grzm@seespotcode.net>wrote:
Show quoted text
On Jul 27, 2009, at 21:05 , Robert James wrote:
2) Database in Depth: Relational Theory for Practitioners
"Database in Depth" is good, though he's effectively rewritten it as "SQL
and Relational Theory: How to Write Accurate SQL Code"http://www.amazon.com/SQL-Relational-Theory-Write-Accurate/dp/0596523068
Michael Glaesemann
grzm seespotcode net
On Tue, Jul 28, 2009 at 9:47 AM, Sam Mason <sam@samason.me.uk> wrote:
On Tue, Jul 28, 2009 at 09:14:38AM -0400, Robert James wrote:
Many wrote that the functional programming 'fold' is a good model for
relational aggregate functions. I have a few difficulties with this:
1. fold doesn't offer any type of GROUP BY, which is an essentialcomponent
of aggregation.
Not sure if I'd agree, a GROUP BY without any aggregate functions looks
pretty indistinguishable from just a DISTINCT on the same columns to me.DISTINCT will collapse duplicates, which is not what we want when computing
COUNT, SUM, or AVG - please see below.
3. fold is defined on sequences, not sets. This doesn't seem to be a
problem until you think about cases where there a duplicates of the
aggregated field. (For instance, there are 10 bags each weighing 5 lbs,and
you want SUM(weight) - you need to project weight onto a collection which
allows for 10 occurences, or define the aggregate function to work on the
whole tuple somehow... I know a man named Krug worked out a formal theory
for this...)I don't see why this is a problem at all; could you give a concrete
example?
Relation LUGGAGE = { (name:'ball', weight:3), (name:'bat', weight:3)}
How do we formalize SELECT SUM(weight) FROM LUGGAGE? We could
project_weight(LUGGAGE) and then apply SUM, except that would give us
{(weight:3), (weight:3)}, which is not a set (it has duplicates). We could
define a new operation: project_to_list (allowing duplicates), or we could
define SUM(weight) over the LUGGAGE relation as a whole - either way, we
need to extend the theory a bit.
On Tue, Jul 28, 2009 at 11:26:01AM -0400, Robert James wrote:
On Tue, Jul 28, 2009 at 9:47 AM, Sam Mason <sam@samason.me.uk> wrote:
On Tue, Jul 28, 2009 at 09:14:38AM -0400, Robert James wrote:
Many wrote that the functional programming 'fold' is a good model for
relational aggregate functions. I have a few difficulties with this:
1. fold doesn't offer any type of GROUP BY, which is an essentialcomponent
of aggregation.
Not sure if I'd agree, a GROUP BY without any aggregate functions looks
pretty indistinguishable from just a DISTINCT on the same columns to me.DISTINCT will collapse duplicates, which is not what we want when computing
COUNT, SUM, or AVG - please see below.
That's why I said "without any aggregate functions". E.g:
SELECT a,b FROM foo GROUP BY a,b;
vs.
SELECT DISTINCT a,b FROM foo;
I guess we're talking about different things. Folds are an example of
something called a Catamorphism (as wikipedia seems to know these days).
If you're really interested in this then I liked the "bananas in space"
paper:
http://citeseer.ist.psu.edu/293490.html
wikipedia seems to like:
http://citeseer.ist.psu.edu/meijer91functional.html
I don't think these are the same, but citeseer seems to be down at the
moment and I can't check.
3. fold is defined on sequences, not sets. This doesn't seem to be a
problem until you think about cases where there a duplicates of the
aggregated field. (For instance, there are 10 bags each weighing 5 lbs,and
you want SUM(weight) - you need to project weight onto a collection which
allows for 10 occurences, or define the aggregate function to work on the
whole tuple somehow... I know a man named Krug worked out a formal theory
for this...)I don't see why this is a problem at all; could you give a concrete
example?Relation LUGGAGE = { (name:'ball', weight:3), (name:'bat', weight:3)}
How do we formalize SELECT SUM(weight) FROM LUGGAGE? We could
project_weight(LUGGAGE) and then apply SUM, except that would give us
{(weight:3), (weight:3)}, which is not a set (it has duplicates). We could
define a new operation: project_to_list (allowing duplicates), or we could
define SUM(weight) over the LUGGAGE relation as a whole - either way, we
need to extend the theory a bit.
Which "theory" are we talking about here? I assume you're talking
about relational algebra, at which point I believe that aggregates have
to be handled explicitly and this whole conversation becomes somewhat
meaningless.
I mentioned the analogy with folds because their semantics are somewhat
more accessible to software people, if you're really interested
in relational algebra then I'd stay away from folds and deal with
aggregates directly.
--
Sam http://samason.me.uk/
On Tuesday 28 July 2009 04:38:03 Jeff Davis wrote:
On Mon, 2009-07-27 at 21:05 -0400, Robert James wrote:
1) Introduction to Database Systems
http://www.amazon.com/Introduction-Database-Systems-Kannan-Swamynathan/dp
/B001BVYKY4/ref=sr_1_5?ie=UTF8&s=books&qid=1248742811&sr=1-5and
2) Database in Depth: Relational Theory for Practitioners
http://www.amazon.com/Database-Depth-Relational-Theory-Practitioners/dp/0
596100124/ref=sr_1_7?ie=UTF8&s=books&qid=1248742811&sr=1-7I recommend #2. It's shorter and easier to read than "An Introduction to
Database Systems", and I think it will answer your question about
relational theory and aggregates (see "SUMMARIZE").
Is it weird that "Database in Depth" is shorter and easier than "Introduction
to Database Systems"? And they're by the same author, too.
On Wed, 2009-07-29 at 10:19 +0300, Peter Eisentraut wrote:
Is it weird that "Database in Depth" is shorter and easier than "Introduction
to Database Systems"? And they're by the same author, too.
I agree that it's a little strange. The former is more conceptual and
starts off assuming that you are familiar with things like
normalization. The latter is more like a textbook: more complete and
more formal.
Regards,
Jeff Davis