sigh

Started by Thomas Zehetbauerover 22 years ago27 messageshackers

thomasz@hostmaster.org

over 22 years ago

psql cannot \i a file which name contains a blank character. Why doesn't
it take the rest of the command line as the name of the file?!? Why
doesn't it use any of the standard command line parsing libraries which
respect "quoting" and \e\s\c\a\p\i\n\g?!? psql \i even does autocomplete
to the file name it fails to load...

Also will the BUG which causes postgresql to execute a sequential scan
when using min()/max()/count() ever be fixed? min()/max() can be
rewritten as SELECT $column ORDER BY $column ASC/DESC LIMIT 1 but this
should be done by the database, NOT by the user!

Tom

Peter Eisentraut

peter_e@gmx.net

over 22 years ago

In reply to: Thomas Zehetbauer (#1)

Re: *sigh*

Thomas Zehetbauer writes:

psql cannot \i a file which name contains a blank character.

\i 'blah blah blah'

--
Peter Eisentraut peter_e@gmx.net

Bruce Momjian

bruce@momjian.us

over 22 years ago

In reply to: Thomas Zehetbauer (#1)

Re: *sigh*

Thomas Zehetbauer <thomasz@hostmaster.org> writes:

Also will the BUG which causes postgresql to execute a sequential scan
when using min()/max()/count() ever be fixed? min()/max() can be
rewritten as SELECT $column ORDER BY $column ASC/DESC LIMIT 1 but this
should be done by the database, NOT by the user!

First of all, you should take COUNT() out of that list. While MIN/MAX could be
implemented to take advantage of indexes like "DISTINCT ON" (however it's much
more complex than your rewrite indicates), COUNT() *cannot* be done that way.

Nobody is currently working on this or planning to work on this soon. So no,
at least currently it appears this issue will not be changed. Postgresql is
open source and this is the hackers mailing list. Feel free to contribute a
patch.

--
greg

Joshua D. Drake

jd@commandprompt.com

over 22 years ago

In reply to: Bruce Momjian (#3)

Re: *sigh*

Greg Stark wrote:

Thomas Zehetbauer <thomasz@hostmaster.org> writes:

Also will the BUG which causes postgresql to execute a sequential scan
when using min()/max()/count() ever be fixed? min()/max() can be
rewritten as SELECT $column ORDER BY $column ASC/DESC LIMIT 1 but this
should be done by the database, NOT by the user!

I would add that this is not a bug as much as a feature request. count()
works. It may not be as feature
filled as we would like (e.g; it won't use an index) but it does work.

Nobody is currently working on this or planning to work on this soon. So no,
at least currently it appears this issue will not be changed. Postgresql is
open source and this is the hackers mailing list. Feel free to contribute a
patch.

Personally I think there are greater things that need to be patched
versus count(). As you can implement
procedures on your own to deliver faster counts.

Sincerely,

Joshua D. Drake

-- 
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-222-2783 - jd@commandprompt.com - http://www.commandprompt.com
Editor-N-Chief - PostgreSQl.Org - http://www.postgresql.org

Bruce Momjian

bruce@momjian.us

over 22 years ago

In reply to: Joshua D. Drake (#4)

Re: *sigh*

"Joshua D. Drake" <jd@commandprompt.com> writes:

Greg Stark wrote:

Thomas Zehetbauer <thomasz@hostmaster.org> writes:

Also will the BUG which causes postgresql to execute a sequential scan
when using min()/max()/count() ever be fixed? min()/max() can be
rewritten as SELECT $column ORDER BY $column ASC/DESC LIMIT 1 but this
should be done by the database, NOT by the user!

I would add that this is not a bug as much as a feature request. count() works.
It may not be as feature
filled as we would like (e.g; it won't use an index) but it does work.

count will use an index just fine where it's useful. If you say "select
count(*) where foo = ?" and there's an index on foo it will use the index. If
there's a partial index that helps with that clause it'll consider that too.

You're thinking of min/max. min/max can use an index to avoid traversing all
of the table. count(*) has to see all the rows to count them.

To optimize count effectively would require a very powerful materalized view
infrastructure with incremental updates. Something I don't believe any
database has, and that I doubt postgres will get any time soon.

You can implement it with triggers, which would be effectively equivalent to
what mysql does, but then you would be introducing a massive point of
contention and deadlocks.

--
greg

Randolf Richardson

rr@8x.ca

over 22 years ago

In reply to: Thomas Zehetbauer (#1)

Re: *sigh*

[sNip]

I would add that this is not a bug as much as a feature request.
count() works. It may not be as feature
filled as we would like (e.g; it won't use an index) but it does work.

count will use an index just fine where it's useful. If you say "select
count(*) where foo = ?" and there's an index on foo it will use the
index. If there's a partial index that helps with that clause it'll
consider that too.

You're thinking of min/max. min/max can use an index to avoid traversing
all of the table. count(*) has to see all the rows to count them.

To optimize count effectively would require a very powerful materalized
view infrastructure with incremental updates. Something I don't believe
any database has, and that I doubt postgres will get any time soon.

You can implement it with triggers, which would be effectively
equivalent to what mysql does, but then you would be introducing a
massive point of contention and deadlocks.

What about adding a "total number of rows" value to the internal
header of each table which gets incremented/decremented after each row is
INSERT/DELETE has been committed. This way, a generic "count(*)" by itself
could simply return this value without any delay at all.

--
Randolf Richardson - rr@8x.ca
Vancouver, British Columbia, Canada

Please do not eMail me directly when responding
to my postings in the newsgroups.

Doug McNaught

doug@mcnaught.org

over 22 years ago

In reply to: Randolf Richardson (#6)

Re: *sigh*

Randolf Richardson <rr@8x.ca> writes:

What about adding a "total number of rows" value to the internal
header of each table which gets incremented/decremented after each row is
INSERT/DELETE has been committed. This way, a generic "count(*)" by itself
could simply return this value without any delay at all.

Because different sessions have a (validly) different concept of what
that number should be, due to MVCC.

-Doug

Gaetano Mendola

mendola@bigfoot.com

over 22 years ago

In reply to: Doug McNaught (#7)

Re: *sigh*

Doug McNaught wrote:

Randolf Richardson <rr@8x.ca> writes:

What about adding a "total number of rows" value to the internal
header of each table which gets incremented/decremented after each row is
INSERT/DELETE has been committed. This way, a generic "count(*)" by itself
could simply return this value without any delay at all.

Because different sessions have a (validly) different concept of what
that number should be, due to MVCC.

The count(*) information can be revisioned too, am I wrong ? I'm able to
create a trigger that store the count(*) information in a special table,
why not implement the same in a way "builded in" ?

Regards
Gaetano Mendola

Doug McNaught

doug@mcnaught.org

over 22 years ago

In reply to: Gaetano Mendola (#8)

Re: *sigh*

Gaetano Mendola <mendola@bigfoot.com> writes:

Doug McNaught wrote:

Because different sessions have a (validly) different concept of what
that number should be, due to MVCC.

The count(*) information can be revisioned too, am I wrong ? I'm able to
create a trigger that store the count(*) information in a special table,
why not implement the same in a way "builded in" ?

This has been discussed extensively before (last time was a few months
ago); please search the archives if you're curious.

-Doug

#10

Chris Browne

cbbrowne@acm.org

over 22 years ago

In reply to: Thomas Zehetbauer (#1)

Re: *sigh*

Oops! mendola@bigfoot.com (Gaetano Mendola) was seen spray-painting on a wall:

Doug McNaught wrote:

Randolf Richardson <rr@8x.ca> writes:

What about adding a "total number of rows" value to the
internal header of each table which gets incremented/decremented
after each row is INSERT/DELETE has been committed. This way, a
generic "count(*)" by itself could simply return this value without
any delay at all.

Because different sessions have a (validly) different concept of what
that number should be, due to MVCC.

The count(*) information can be revisioned too, am I wrong ? I'm able to
create a trigger that store the count(*) information in a special table,
why not implement the same in a way "builded in" ?

You could do this; the cost would be pretty high, as there would be an
extra table update done every time rows were inserted or deleted.
Doing it on _every_ table would be a heavy overhead that is not
worthwhile.

What might make sense would be to set up triggers to do this on those
large tables where you frequently need COUNT(*).
--
"cbbrowne","@","cbbrowne.com"
http://www.ntlug.org/~cbbrowne/lisp.html
As of next month, MACLISP "/" will be flushed in favor of "\".
Please update the WORLD.

#11

Bruce Momjian

bruce@momjian.us

over 22 years ago

In reply to: Gaetano Mendola (#8)

Re: *sigh*

Gaetano Mendola <mendola@bigfoot.com> writes:

The count(*) information can be revisioned too, am I wrong ? I'm able to
create a trigger that store the count(*) information in a special table,
why not implement the same in a way "builded in" ?

Then every insert or delete would have to lock that count. Nobody else would
be able to insert or delete any records until you either commit or roll back.

That would lead to much lower concurrency, much more contention for locks, and
tons of deadlocks.

--
greg

#12

Randolf Richardson

rr@8x.ca

over 22 years ago

In reply to: Thomas Zehetbauer (#1)

Re: *sigh*

The count(*) information can be revisioned too, am I wrong ? I'm able
to create a trigger that store the count(*) information in a special
table, why not implement the same in a way "builded in" ?

Then every insert or delete would have to lock that count. Nobody else
would be able to insert or delete any records until you either commit or
roll back.

That would lead to much lower concurrency, much more contention for
locks, and tons of deadlocks.

What about queueing all these updates for a separate low-priority
thread? The thread would be the only one with access to update this field.

--
Randolf Richardson - rr@8x.ca
Vancouver, British Columbia, Canada

Please do not eMail me directly when responding
to my postings in the newsgroups.

#13

Chris Browne

cbbrowne@acm.org

over 22 years ago

In reply to: Thomas Zehetbauer (#1)

Re: *sigh*

Martha Stewart called it a Good Thing when Randolf Richardson <rr@8x.ca> wrote:

The count(*) information can be revisioned too, am I wrong ? I'm able
to create a trigger that store the count(*) information in a special
table, why not implement the same in a way "builded in" ?

Then every insert or delete would have to lock that count. Nobody else
would be able to insert or delete any records until you either commit or
roll back.

That would lead to much lower concurrency, much more contention for
locks, and tons of deadlocks.

What about queueing all these updates for a separate
low-priority thread? The thread would be the only one with access
to update this field.

If updates are "queued," then how do you get to use them if the
"update thread" isn't running because it's not high enough in
priority?

I am not being facetious.

The one way that is expected to be successful would be to have a
trigger that, upon seeing an insert of 5 rows to table "ABC", puts,
into table "count_detail", something like:

insert into count_detail (table, value) values ('ABC', 5);

You then replace
select count(*) from abc;

with
select sum(value) from count_detail where table = 'ABC';

The "low priority" thread would be a process that does something akin
to vacuuming, where it would replace the contents of the table every
so often...

for curr_table in (select table from count_detail) do
new_total = select sum(value) from count_detail
where table = curr_table;
delete from count_detail where table = curr_table;
insert into count_detail (table, value) values (curr_table,
new_total);
done

The point of this being to try to keep the number of rows to 1 per
table.

Note that this gets _real_ expensive for tables that see lots of
single row inserts and deletes. There isn't a cheaper way that will
actually account for the true numbers of records that have been
committed.

For a small table, it will be cheaper to walk through and calculate
count(*) directly from the tuples themselves.

The situation where it may be worthwhile to do this is a table which
is rather large (thus count(*) is expensive) where there is some
special reason to truly care how many rows there are in the table.
For _most_ tables, it seems unlikely that this will be true. For
_most_ tables, it is absolutely not worth the cost of tracking the
information.
--
(format nil "~S@~S" "cbbrowne" "acm.org")
http://cbbrowne.com/info/spreadsheets.html
Predestination was doomed from the start.

#14

Mark Kirkwood

mark.kirkwood@catalyst.net.nz

over 22 years ago

In reply to: Chris Browne (#13)

Re: *sigh*

How about:

Implement a function "estimated_count" that can be used instead of
"count". It could use something like the algorithm in
src/backend/commands/analyze.c to get a reasonably accurate psuedo count
quickly.

The advantage of this approach is that "count" still means (exact)count
(for your xact snapshot anyway). Then the situation becomes:

Want a fast count? - use estimated_count(*)
Want an exact count - use count(*)

regards

Mark

Christopher Browne wrote:

Show quoted text

For a small table, it will be cheaper to walk through and calculate
count(*) directly from the tuples themselves.

The situation where it may be worthwhile to do this is a table which
is rather large (thus count(*) is expensive) where there is some
special reason to truly care how many rows there are in the table.
For _most_ tables, it seems unlikely that this will be true. For
_most_ tables, it is absolutely not worth the cost of tracking the
information.

#15

Shridhar Daithankar

shridhar_daithankar@myrealbox.com

over 22 years ago

In reply to: Mark Kirkwood (#14)

Re: *sigh*

On Wednesday 03 December 2003 13:59, Mark Kirkwood wrote:

How about:

Implement a function "estimated_count" that can be used instead of
"count". It could use something like the algorithm in
src/backend/commands/analyze.c to get a reasonably accurate psuedo count
quickly.

The advantage of this approach is that "count" still means (exact)count
(for your xact snapshot anyway). Then the situation becomes:

Want a fast count? - use estimated_count(*)
Want an exact count - use count(*)

Something like select reltuples from pg_class where relname='foo'?

Shridhar

#16

Christoph Haller

ch@rodos.fzk.de

over 22 years ago

In reply to: Shridhar Daithankar (#15)

Re: *sigh*

Fairly good idea IMHO, especially considering Christopher's point
about the unlikeliness of needing an exact count anyway.

Regards, Christoph

Show quoted text

How about:

Implement a function "estimated_count" that can be used instead of
"count". It could use something like the algorithm in
src/backend/commands/analyze.c to get a reasonably accurate psuedo count
quickly.

The advantage of this approach is that "count" still means (exact)count
(for your xact snapshot anyway). Then the situation becomes:

Want a fast count? - use estimated_count(*)
Want an exact count - use count(*)

regards

Mark

Christopher Browne wrote:

For a small table, it will be cheaper to walk through and calculate
count(*) directly from the tuples themselves.

The situation where it may be worthwhile to do this is a table which
is rather large (thus count(*) is expensive) where there is some
special reason to truly care how many rows there are in the table.
For _most_ tables, it seems unlikely that this will be true. For
_most_ tables, it is absolutely not worth the cost of tracking the
information.

Import Notes

Reply to msg id not found: fromMarkKirkwoodatDec3103937am | Resolved by subject fallback

#17

Mark Kirkwood

mark.kirkwood@catalyst.net.nz

over 22 years ago

In reply to: Shridhar Daithankar (#15)

Re: *sigh*

Shridhar Daithankar wrote:

Something like select reltuples from pg_class where relname='foo'?

Shridhar

[chuckles] - I had envisaged something more accurate that the last
ANALYZE, "estimate_count" would effectively *do* acquire_sample_rows()
then and there for you...

regards

Mark

#18

Randolf Richardson

rr@8x.ca

over 22 years ago

In reply to: Thomas Zehetbauer (#1)

Re: *sigh*

"Christopher Browne <cbbrowne@acm.org>" wrote in
comp.databases.postgresql.hackers:

Martha Stewart called it a Good Thing when Randolf Richardson <rr@8x.ca>
wrote:

[sNip]

What about queueing all these updates for a separate
low-priority thread? The thread would be the only one with access
to update this field.

If updates are "queued," then how do you get to use them if the
"update thread" isn't running because it's not high enough in
priority?

That would be an administration issue. This background process would
need to have enough priority in order for this to be functional, yet could
also be completely disabled by administrators who know their systems don't
need to use "count(*)" at all.

Also, if the thread was well-designed, then it could combine all the
queued entries for a single table first in order to reduce disk I/O when
updating each table.

I am not being facetious.

Oh, I see that. Don't worry, I know better than to take things
personally on newsgroups -- go ahead and be blunt if you like. =D

The one way that is expected to be successful would be to have a
trigger that, upon seeing an insert of 5 rows to table "ABC", puts,
into table "count_detail", something like:

insert into count_detail (table, value) values ('ABC', 5);

You then replace
select count(*) from abc;

with
select sum(value) from count_detail where table = 'ABC';

The "low priority" thread would be a process that does something akin
to vacuuming, where it would replace the contents of the table every
so often...

for curr_table in (select table from count_detail) do
new_total = select sum(value) from count_detail
where table = curr_table;
delete from count_detail where table = curr_table;
insert into count_detail (table, value) values (curr_table,
new_total);
done

The point of this being to try to keep the number of rows to 1 per
table.

Interesting. A different way of solving the same problem, but
wouldn't it require more disk I/O on the table being updated then a
separate tracker would?

Note that this gets _real_ expensive for tables that see lots of
single row inserts and deletes. There isn't a cheaper way that will
actually account for the true numbers of records that have been
committed.

For a small table, it will be cheaper to walk through and calculate
count(*) directly from the tuples themselves.

The situation where it may be worthwhile to do this is a table which
is rather large (thus count(*) is expensive) where there is some
special reason to truly care how many rows there are in the table.
For _most_ tables, it seems unlikely that this will be true. For
_most_ tables, it is absolutely not worth the cost of tracking the
information.

Ah, but that's the point -- do we truly care how many rows are in the
table, or is the purpose of "count(*)" to just give us a general idea?

This statistic would be delayed because it's being updated by a
background process, thus "count" won't always be accurate, but at least it
won't be slow -- it could be the fastest "count" in the industry! =)

--
Randolf Richardson - rr@8x.ca
Vancouver, British Columbia, Canada

Please do not eMail me directly when responding
to my postings in the newsgroups.

#19

Randolf Richardson

rr@8x.ca

over 22 years ago

In reply to: Thomas Zehetbauer (#1)

Re: *sigh*

"markir@paradise.net.nz (Mark Kirkwood)" wrote in
comp.databases.postgresql.hackers:

[sNip]

How about:

Implement a function "estimated_count" that can be used instead of
"count". It could use something like the algorithm in
src/backend/commands/analyze.c to get a reasonably accurate psuedo count
quickly.

The advantage of this approach is that "count" still means (exact)count
(for your xact snapshot anyway). Then the situation becomes:

Want a fast count? - use estimated_count(*)
Want an exact count - use count(*)

I think this is an excellent solution.

--
Randolf Richardson - rr@8x.ca
Vancouver, British Columbia, Canada

Please do not eMail me directly when responding
to my postings in the newsgroups.

#20

Paul Punett

paul.punett@shazamteam.com

over 22 years ago

In reply to: Randolf Richardson (#19)

COPY TABLE TO

Hi,

I need to write a tab separated text file such that the first row contains
number of records in the table.
I insert first row with '0' (zero) as first column & rest columns NULL.
Then at the end of writing records to table I do a select into Variable
count(*) from table.
& update the first record with the count returned.
Unfortunately after the update the first row becomes the last row & hence
COPY TO FileName sends the count as the last record.

I need count as the first record? Any suggestions please ?
The documentation says indexing does not affect copy order.

I am developing on C++ with PostGre on windows. I need to port to Linux
later. Any suggestions on linking C++ code to PostGre (queries & functions)

Thanks
Paul

#21