pgsql-server/ oc/src/sgml/runtime.sgml rc/back ...

Started by Nonameabout 23 years ago7 messages

tgl@postgresql.org

about 23 years ago

CVSROOT: /cvsroot
Module name: pgsql-server
Changes by: tgl@postgresql.org 02/11/20 19:42:20

Modified files:
doc/src/sgml : runtime.sgml
src/backend/optimizer/path: costsize.c
src/backend/optimizer/plan: createplan.c planmain.c planner.c
src/backend/utils/misc: guc.c postgresql.conf.sample
src/bin/psql : tab-complete.c
src/include/optimizer: cost.h planmain.h
src/test/regress/expected: aggregates.out rangefuncs.out
rules.out select_having.out
select_having_1.out
select_implicit.out
select_implicit_1.out subselect.out
src/test/regress/input: misc.source
src/test/regress/output: misc.source
src/test/regress/sql: aggregates.sql rules.sql select_having.sql
select_implicit.sql subselect.sql

Log message:
Finish implementation of hashed aggregation. Add enable_hashagg GUC
parameter to allow it to be forced off for comparison purposes.
Add ORDER BY clauses to a bunch of regression test queries that will
otherwise produce randomly-ordered output in the new regime.

Christopher Kings-Lynne

chriskl@familyhealth.com.au

about 23 years ago

In reply to: Noname (#1)

Re: pgsql-server/ oc/src/sgml/runtime.sgml rc/back ...

Log message:
Finish implementation of hashed aggregation. Add enable_hashagg GUC
parameter to allow it to be forced off for comparison purposes.
Add ORDER BY clauses to a bunch of regression test queries that will
otherwise produce randomly-ordered output in the new regime.

Out of interest (since I was away while this was proposed I assume),
what's the idea with hashed aggergation? I assume each group is now in a
hash bucket? How did it work before?

Chris

Bruce Momjian

pgman@candle.pha.pa.us

about 23 years ago

In reply to: Noname (#1)

Re: [COMMITTERS] pgsql-server/ oc/src/sgml/runtime.sgml rc/back ...

Tom Lane wrote:

Log message:
Finish implementation of hashed aggregation. Add enable_hashagg GUC
parameter to allow it to be forced off for comparison purposes.
Add ORDER BY clauses to a bunch of regression test queries that will
otherwise produce randomly-ordered output in the new regime.

Tom, do we really want to add a GUC that is used just for comparison of
performance? I know we have the seqscan on/off, but there are valid
reasons to do that. Do you think there will be cases where it will
faster to have this hash setting off?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Bruce Momjian

pgman@candle.pha.pa.us

about 23 years ago

In reply to: Christopher Kings-Lynne (#2)

Re: pgsql-server/ oc/src/sgml/runtime.sgml rc/back ...

Christopher Kings-Lynne wrote:

Log message:
Finish implementation of hashed aggregation. Add enable_hashagg GUC
parameter to allow it to be forced off for comparison purposes.
Add ORDER BY clauses to a bunch of regression test queries that will
otherwise produce randomly-ordered output in the new regime.

Out of interest (since I was away while this was proposed I assume),
what's the idea with hashed aggergation? I assume each group is now in a
hash bucket? How did it work before?

It sequential scanned the group of possible matches. How it hashes the
value and looks for matches that way --- much faster and the way most
db's do it.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Tom Lane

tgl@sss.pgh.pa.us

about 23 years ago

In reply to: Bruce Momjian (#3)

Re: [COMMITTERS] pgsql-server/ oc/src/sgml/runtime.sgml rc/back ...

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom, do we really want to add a GUC that is used just for comparison of
performance? I know we have the seqscan on/off, but there are valid
reasons to do that. Do you think there will be cases where it will
faster to have this hash setting off?

Sure --- that's why the planner code is going to great lengths to try to
choose the faster one. Even if I didn't think that, it'll be at least
as useful as, say, enable_indexscan.

regards, tom lane

Bruce Momjian

pgman@candle.pha.pa.us

about 23 years ago

In reply to: Tom Lane (#5)

Re: [COMMITTERS] pgsql-server/ oc/src/sgml/runtime.sgml rc/back

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom, do we really want to add a GUC that is used just for comparison of
performance? I know we have the seqscan on/off, but there are valid
reasons to do that. Do you think there will be cases where it will
faster to have this hash setting off?

Sure --- that's why the planner code is going to great lengths to try to
choose the faster one. Even if I didn't think that, it'll be at least
as useful as, say, enable_indexscan.

Oh, OK. Just checking. I was confused about your commit message
because you seemed to be saying it was mostly for testing, and I thought
you meant testing to see if the hash code is an improvement over what we
had, rather than to see if the hash code is an improvement over the
sequential scan GROUP BY path, which is still in the code.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Tom Lane

tgl@sss.pgh.pa.us

about 23 years ago

In reply to: Christopher Kings-Lynne (#2)

Re: pgsql-server/ oc/src/sgml/runtime.sgml rc/back ...

Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:

Out of interest (since I was away while this was proposed I assume),
what's the idea with hashed aggergation?

Old method: scan rows in order by the GROUP BY columns (requiring
a sort, or if you're lucky an indexscan), and execute one aggregation
at a time.

New method: scan rows in any old order (typically a seqscan), and run
all the per-group aggregates in parallel. It's a hash aggregation
because we use an in-memory hashtable indexed by the values of the GROUP
BY columns to keep track of the running state of each aggregate.

The hash method avoids a sort before aggregation, at the cost of a sort
afterwards if you want the results in non-random order. But the
post-sort is only sorting one row per group, which is usually a lot less
data than the input rows.

One case where the old method can still win is where you have
SELECT ... GROUP BY foo ORDER BY foo LIMIT n;
for small n. The hash method does not produce any output till it's
read all the input; the old method can produce a few rows very cheaply
if foo is indexed.

Also, of course, the hash method fails if you have too many groups to
permit the hashtable to fit in memory.

regards, tom lane