Berkeley and CMU classes adopt/extend PostgreSQL

Started by Joe Hellerstein almost 23 years ago, 5 messages
#1 Joe Hellerstein
jmh@cs.berkeley.edu

Hi all:
I emailed Marc Fournier on this topic some weeks back, but haven't
heard from him.

I am teaching the undergrad DB course at UC Berkeley, something I do
with some frequency. We have the usual 180 students we get every
semester (yep: 180!), but this year we've instituted 2 changes:

1) We changed the course projects to make the students hack PostgreSQL
internals, rather than the "minibase" eduware
2) We are coordinating the class with a class at CMU being taught by
Prof. Anastassia ("Natassa") Ailamaki

Our "Homework 2", which is being passed out this week, will ask the
students to implement a hash-based grouping that spills to disk. I
understand this topic has been batted about the pgsql-hackers list
recently. The TAs who've prepared the assignment (Sailesh
Krishnamurthy at Berkeley and Spiros Papadimitriou at CMU) have also
implemented a reference solution to the assignment. Once we've got the
students' projects all turned in, we'll be very happy to contribute our
code back to the PostgreSQL project.
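
For readers unfamiliar with the technique, here is a minimal sketch of hash-based grouping that spills to disk, written in Python rather than the C of the PostgreSQL executor. It is not the assignment or the reference solution; the function name and the `max_groups` and `fanout` parameters are assumptions made up for illustration. Groups that fit in the in-memory table are aggregated directly, rows for other groups are partitioned into temporary files by hash, and each partition is then processed recursively.

```python
import pickle
import tempfile


def _read_rows(f):
    """Yield the pickled rows stored in file f until end of file."""
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return


def hash_group_count(rows, key, max_groups=1000, fanout=4, _depth=0):
    """Yield (group_key, count) pairs, i.e. COUNT(*) ... GROUP BY key.

    At most max_groups groups are held in memory at once (a stand-in
    for a real memory budget). Rows belonging to any other group are
    written to one of `fanout` temporary files chosen by hash.
    """
    table = {}
    spill_files = None
    for row in rows:
        k = key(row)
        if k in table:
            table[k] += 1
        elif len(table) < max_groups:
            table[k] = 1
        else:
            # Table is full and this is a new group: spill the row.
            if spill_files is None:
                spill_files = [tempfile.TemporaryFile() for _ in range(fanout)]
            # Salt the hash with the recursion depth so that a partition
            # is split differently when it is processed again.
            part = hash((_depth, k)) % fanout
            pickle.dump(row, spill_files[part])

    # Groups in the table never had rows spilled, so they are complete.
    yield from table.items()

    if spill_files is not None:
        for f in spill_files:
            f.seek(0)
            # Every spilled group lives in exactly one partition.
            yield from hash_group_count(
                _read_rows(f), key, max_groups, fanout, _depth + 1
            )
            f.close()
```

For example, `dict(hash_group_count(rows, key=lambda r: r[0], max_groups=8))` groups on the first column while keeping at most eight groups in memory per pass. A real implementation would budget bytes rather than group counts and would carry arbitrary aggregate state, not just a counter.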

I'm hopeful this will lead to many good things:

1) Each year we can pick another feature to assign in class, and
contribute back. We'll need to come up with well-scoped engine
features that exercise concepts from the class -- eventually we'll run
out of tractable things that PGSQL needs, but not in the next couple
years I bet.

2) We'll raise a crop of good students who know Postgres internals.
Roughly half the Berkeley EECS undergrads take the DB class, and all of
them will be post-hackers! (Again, I don't know the stats at CMU.)

So consider this a heads up on the hash-agg front, and on the future
contributions front. I'll follow up with another email on
PostgreSQL-centered research in our group at Berkeley as well.

Another favor I'd ask is that people on the list be a bit hesitant
about helping our students with their homework! We would like them to
do it themselves, more or less :-)

Regards,
Joe Hellerstein

--

Joseph M. Hellerstein
Professor, EECS Computer Science Division
UC Berkeley
http://www.cs.berkeley.edu/~jmh

On Tuesday, February 11, 2003, at 06:54 PM, Sailesh Krishnamurthy
wrote:

From: Hannu Krosing <hannu@tm.ee>
Date: Tue Feb 11, 2003 12:21:26 PM US/Pacific
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: Bruno Wolff III <bruno@wolff.to>, Greg Stark <gsstark@mit.edu>,
pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Hash grouping, aggregates

Tom Lane wrote on Tue, 11.02.2003 at 18:39:

Bruno Wolff III <bruno@wolff.to> writes:

Tom Lane <tgl@sss.pgh.pa.us> wrote:

Greg Stark <gsstark@mit.edu> writes:

The neat thing is that hash aggregates would allow grouping on data
types that have = operators but no useful < operator.
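
As a small illustration of that point, here is a sketch in Python (not PostgreSQL code). `Point` is a hypothetical type that defines equality and a hash but no ordering; `group_by_hash` and `group_by_sort` are illustrative names. Hash-based grouping works on it, while sort-based grouping fails because it needs `<`.

```python
from itertools import groupby


class Point:
    """Hypothetical type with '=' and a hash function, but no '<'."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Equal points must hash equally.
        return hash((self.x, self.y))


def group_by_hash(values):
    """Count occurrences per group using only equality and hashing."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return counts


def group_by_sort(values):
    """Count occurrences per group by sorting first; requires '<'."""
    return [(k, len(list(g))) for k, g in groupby(sorted(values))]
```

Calling `group_by_sort` on a list of two or more `Point` objects raises `TypeError`, since Python has no `<` to sort them with, whereas `group_by_hash` handles the same list.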

Hm. Right now I think that would barf on you, because the parser wants
to find the '<' operator to label the grouping column with, even if the
planner later decides not to use it. It'd take some redesign of the
query data structure (specifically SortClause/GroupClause) to avoid
that.

I think another issue is that for some = operators you still might not
be able to use a hash. I would expect the discussion for hash joins in
http://developer.postgresql.org/docs/postgres/xoper-optimization.html
would apply to hash aggregates as well.

Right, the = operator must be hashable or you're out of luck. But we
could imagine tweaking the parser to allow GROUP BY if it finds a
hashable = operator and no sort operator. The only objection I can see
to this is that it means the planner *must* use hash aggregation, which
might be a bad move if there are too many distinct groups.
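
To make "hashable" concrete, here is a hedged sketch in Python of why the = operator and the hash function have to agree: values that compare equal must land in the same bucket. The bucketed table, the case-insensitive equality, and both hash functions are assumptions chosen for illustration, not PostgreSQL's implementation.

```python
def hash_group(values, eq, hash_fn, nbuckets=7):
    """Group values with an explicit bucket array, returning (value, count) pairs.

    Only values that fall into the same bucket are ever compared with eq,
    so eq(a, b) must imply hash_fn(a) == hash_fn(b) for correct results.
    """
    buckets = [[] for _ in range(nbuckets)]
    for v in values:
        bucket = buckets[hash_fn(v) % nbuckets]
        for entry in bucket:
            if eq(entry[0], v):
                entry[1] += 1
                break
        else:
            # No equal value in this bucket: start a new group.
            bucket.append([v, 1])
    return [tuple(e) for b in buckets for e in b]


def ci_equal(a, b):
    """Case-insensitive equality, the '=' operator of this example."""
    return a.lower() == b.lower()


def raw_hash(s):
    """Hashes the raw characters: inconsistent with ci_equal."""
    return sum(map(ord, s))


def folded_hash(s):
    """Hashes the case-folded characters: consistent with ci_equal."""
    return sum(map(ord, s.lower()))


words = ["foo", "Foo", "FOO", "bar"]
split_groups = hash_group(words, ci_equal, raw_hash)      # equal values scattered
merged_groups = hash_group(words, ci_equal, folded_hash)  # equal values together
```

With `raw_hash` the three spellings of "foo" fall into different buckets and are never compared, so they come out as separate groups; with `folded_hash` they form one group.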

If we run out of sort memory, we can always bail out later, preferably
with a descriptive error message. It is not as elegant as erroring out
at parse (or even plan/optimise) time, but the result is /almost/ the
same.

Relying on hash aggregation will become essential if we are ever going
to implement the "other" groupings (CUBE, ROLLUP, (), ...), so it would
be nice if hash aggregation could also overflow to disk - I suspect that
this will still be faster than running an independent scan for each
GROUP BY grouping and merging the results.
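
A rough sketch of the single-scan idea, in Python and under stated assumptions: `rollup_sum` is a made-up name, rows are plain dicts, and the grouping sets are those of ROLLUP (every prefix of the key list, down to the empty grouping). One hash table is kept per grouping set and all of them are filled in the same pass over the input, instead of one scan per grouping.

```python
from collections import defaultdict


def rollup_sum(rows, keys, value):
    """Compute SUM(value) for every ROLLUP(keys) grouping set in one scan.

    Returns a dict mapping each grouping set (a tuple of column names)
    to a dict of {group key tuple: sum}.
    """
    keys = tuple(keys)
    # ROLLUP(a, b) means the grouping sets (a, b), (a), and ().
    grouping_sets = [keys[:n] for n in range(len(keys), -1, -1)]
    tables = [defaultdict(int) for _ in grouping_sets]

    for row in rows:  # the single pass over the input
        for cols, table in zip(grouping_sets, tables):
            table[tuple(row[c] for c in cols)] += row[value]

    return {cols: dict(table) for cols, table in zip(grouping_sets, tables)}


sales = [
    {"region": "east", "year": 2002, "amount": 10},
    {"region": "east", "year": 2003, "amount": 5},
    {"region": "west", "year": 2002, "amount": 7},
]
totals = rollup_sum(sales, ("region", "year"), "amount")
```

Here `totals[("region",)]` holds the per-region sums and `totals[()]` the grand total. All the tables live in memory at once, which is exactly why overflow to disk matters for this kind of query.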

-----
Hannu

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joe Hellerstein (#1)
Re: Berkeley and CMU classes adopt/extend PostgreSQL

"Joe Hellerstein" <jmh@cs.berkeley.edu> writes:

I am teaching the undergrad DB course at UC Berkeley, something I do
with some frequency. We have the usual 180 students we get every
semester (yep: 180!), but this year we've instituted 2 changes:
1) We changed the course projects to make the students hack PostgreSQL
internals, rather than the "minibase" eduware

Cool.

2) We are coordinating the class with a class at CMU being taught by
Prof. Anastassia ("Natassa") Ailamaki

Double cool. I'm just down the road, if Natassa needs a visiting
lecturer.

Our "Homework 2", which is being passed out this week, will ask the
students to implement a hash-based grouping that spills to disk. I
understand this topic has been batted about the pgsql-hackers list
recently.

Yes. As of CVS tip, we have hash-based grouping but it doesn't spill
to disk. Want to ask them to start from CVS tip and fix that little
detail? Or fix the various other loose ends that have been mentioned
lately? (make it work with DISTINCT, improve the estimation logic,
some other things I'm forgetting)

I'm hopeful this will lead to many good things:

Yes, let's see what we can do with this ... seems like Postgres may
be coming full circle ;-)

regards, tom lane

#3 Anastassia Ailamaki
natassa+@cs.cmu.edu
In reply to: Tom Lane (#2)
Re: Berkeley and CMU classes adopt/extend PostgreSQL

Hi everyone,

with some frequency. We have the usual 180 students we get every
semester (yep: 180!), but this year we've instituted 2 changes:

We're looking at >100 students taking the class here every year.

Double cool. I'm just down the road, if Natassa needs a visiting
lecturer.

Tom - that's really super-cool! Tom, let's take it offline to schedule a
visit. We will be delighted to have you lecture.

Yes. As of CVS tip, we have hash-based grouping but it doesn't spill
to disk. Want to ask them to start from CVS tip and fix that little
detail? Or fix the various other loose ends that have been mentioned
lately? (make it work with DISTINCT, improve the estimation logic,
some other things I'm forgetting)

As Joe said, this is what we are doing. We intend to use your todo-list to
design projects for future semesters... so all such suggestions
are greatly appreciated.

Natassa

#4 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Anastassia Ailamaki (#3)
Re: Berkeley and CMU classes adopt/extend PostgreSQL

Any chance of giving them all separate TODO items? That way, we would
get more items completed; greedy request, I know. ;-)


-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

#5 Marc G. Fournier
scrappy@hub.org
In reply to: Joe Hellerstein (#1)
Re: Berkeley and CMU classes adopt/extend PostgreSQL

On Tue, 11 Feb 2003, Joe Hellerstein wrote:

Hi all:
I emailed Marc Fournier on this topic some weeks back, but haven't
heard from him.

And most public apologies for that ... this past month has been a complete
nightmare all around ... we're just finishing up moving our office, and
finally have phone lines again, and hope to have internet again starting
tomorrow ... :(
