Large selects handled inefficiently?

Started by Jules Bean over 25 years ago, 17 messages, general
#1Jules Bean
jules@jellybean.co.uk

Hiya,

I am running a very large SELECT - it selects every row from a
~10,000,000 row table. I'm running this in Perl using DBD::Pg, with the
general sequence:

$sth = $dbh->prepare("SELECT $fields FROM $from") || return 0;
$fh = new FileHandle(">$file") || die "Can't open $file : $!";
$sth->execute() || return 0;
$sth->bind_columns(undef,\(@fields));
while ($sth->fetch){
print $fh join("\t",@fields),"\n";
}
if ($sth->err) { return 0; }
close $fh;

...which works fine in most circumstances, but on this really big
select:

* the client process's memory gradually climbs to around 1 gigabyte
(ouch!)
* eventually the stdout of the process shows over and over again:

Backend sent D message without prior T

It seems as if the client is fetching the /whole/ query result, or
trying to, all at once. I expected it to only actually fetch result
rows as I called fetch().

Is this:

* A deficiency in DBD::Pg?
* A deficiency in the postgresql client libraries?
* A silly bug in my code?

I believe I can work around this problem using cursors (although I
don't know how well DBD::Pg copes with cursors). However, that
doesn't seem right -- cursors shouldn't be needed to fetch a large query
without having it all in memory at once...

Jules

#2Jules Bean
jules@jellybean.co.uk
In reply to: Jules Bean (#1)
Re: Large selects handled inefficiently?

On Thu, Aug 31, 2000 at 12:22:36AM +1000, Andrew Snow wrote:

I believe I can work around this problem using cursors (although I
don't know how well DBD::Pg copes with cursors). However, that
doesn't seem right -- cursors shouldn't be needed to fetch a large query
without having it all in memory at once...

Actually, I think that's why cursors were invented in the first place ;-) A
cursor is what you are using if you're not fetching all the results of a
query.

I really can't agree with you there.

A cursor is another slightly foolish SQL hack.

A query language specifies the syntax of queries ('SELECT ...'). It
doesn't specify the manner in which these are actually returned. It
seems totally within the bounds of the remit of a decent client-side
library (and a decent back-end) to realise that in practice a client
will want some control over the speed with which rows are returned.

Whilst explicit cursors are needed for some (IMO ugly) procedural SQL
code, explicit cursors should not be necessary for the simple (and
common) task of carrying out a SELECT which takes up more memory than
you wish to have available at any single time.

Jules

#3Andrew Snow
als@fl.net.au
In reply to: Jules Bean (#1)
RE: Large selects handled inefficiently?

I believe I can work around this problem using cursors (although I
don't know how well DBD::Pg copes with cursors). However, that
doesn't seem right -- cursors shouldn't be needed to fetch a large query
without having it all in memory at once...

Actually, I think that's why cursors were invented in the first place ;-) A
cursor is what you are using if you're not fetching all the results of a
query.

- Andrew

#4Zlatko Calusic
zlatko@iskon.hr
In reply to: Andrew Snow (#3)
Re: Large selects handled inefficiently?

"Andrew Snow" <als@fl.net.au> writes:

I believe I can work around this problem using cursors (although I
don't know how well DBD::Pg copes with cursors). However, that
doesn't seem right -- cursors shouldn't be needed to fetch a large query
without having it all in memory at once...

Yes, I have noticed that particular bad behaviour, too.
With DBD::Pg and DBD::mysql.

At the same time, DBD::Oracle, DBD::InterBase and DBD::Sybase work as
expected. Rows are fetched with fetchrow...() functions instead of all
being sucked up into memory at the time execute() is called.

Anybody know why that is happening?

Actually, I think that's why cursors were invented in the first place ;-) A
cursor is what you are using if you're not fetching all the results of a
query.

What bothers me is different behaviour of different DBD drivers. But,
yes, I have just subscribed to dbi-users list which is the right place
to ask that question.
--
Zlatko

#5Andrew
andrew@ugh.net.au
In reply to: Jules Bean (#1)
Re: Large selects handled inefficiently?

On Wed, 30 Aug 2000, Jules Bean wrote:

* the client process's memory gradually climbs to around 1 gigabyte
(ouch!)

Have you tried playing with RowCacheSize in DBI? Not sure if it works with
DBD::Pg.

Andrew
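[For reference, the attribute Andrew mentions is a standard DBI connection hint. A minimal sketch of passing it follows; the DSN, database, and credentials are invented for illustration, and a driver that doesn't implement the hint simply ignores it.]

```perl
use strict;
use warnings;
use DBI;

# RowCacheSize hints at how many rows the driver should buffer per round
# trip. Per the DBI spec it is only advisory; drivers may ignore it.
my $dbh = DBI->connect(
    "dbi:Pg:dbname=mydb", "someuser", "secret",
    { RaiseError => 1, RowCacheSize => 1000 },
);

# Reading the attribute back after connecting shows what the driver
# settled on; undef or 0 usually means the hint is not supported.
my $cache = $dbh->{RowCacheSize};
```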

#6Andrew Snow
als@fl.net.au
In reply to: Jules Bean (#2)
RE: Large selects handled inefficiently?

A cursor is another slightly foolish SQL hack.

A query language specifies the syntax of queries ('SELECT ...'). It
doesn't specify the manner in which these are actually returned. It
seems totally within the bounds of the remit of a decent client-side
library (and a decent back-end) to realise that in practice a client
will want some control over the speed with which rows are returned.

Whilst explicit cursors are needed for some (IMO ugly) procedural SQL
code, explicit cursors should not be necessary for the simple (and
common) task of carrying out a SELECT which takes up more memory than
you wish to have available at any single time.

Hmm, I agree. So, does the PostgreSQL protocol support some form of non-SQL
cursor?

- Andrew

#7Chris Bitmead
chrisb@nimrod.itg.telstra.com.au
In reply to: Jules Bean (#1)
Re: Large selects handled inefficiently?

Jules Bean wrote:

On Thu, Aug 31, 2000 at 12:22:36AM +1000, Andrew Snow wrote:

I believe I can work around this problem using cursors (although I
don't know how well DBD::Pg copes with cursors). However, that
doesn't seem right -- cursors shouldn't be needed to fetch a large query
without having it all in memory at once...

Actually, I think that's why cursors were invented in the first place ;-) A
cursor is what you are using if you're not fetching all the results of a
query.

I really can't agree with you there.

A cursor is another slightly foolish SQL hack.

Not quite, but it is true that this is a flaw in postgres. There has been
discussion on hackers from time to time about implementing a "streaming"
interface. This means that the client doesn't absorb all the results
before allowing access to them. You can start processing results
as and when they become available by blocking in the client. The main
changes would be to the libpq client library, but there would also be
other issues to address, like what happens if an error occurs half way
through. In short, I'm sure this will be fixed at some stage, but for
now cursors are the only real answer.

#8Jules Bean
jules@jellybean.co.uk
In reply to: Chris Bitmead (#7)
Re: Large selects handled inefficiently?

On Thu, Aug 31, 2000 at 03:28:14PM +1100, Chris wrote:

Jules Bean wrote:

On Thu, Aug 31, 2000 at 12:22:36AM +1000, Andrew Snow wrote:

I believe I can work around this problem using cursors (although I
don't know how well DBD::Pg copes with cursors). However, that
doesn't seem right -- cursors shouldn't be needed to fetch a large query
without having it all in memory at once...

Actually, I think that's why cursors were invented in the first place ;-) A
cursor is what you are using if you're not fetching all the results of a
query.

I really can't agree with you there.

A cursor is another slightly foolish SQL hack.

Not quite,

All right ;) Can we say that, like some other SQL features of which
I'm not fond (e.g. NULLs), cursors do have a use, but many of the ways
they're used in practice are bad practice...

but it is true that this is a flaw in postgres. There has been
discussion on hackers from time to time about implementing a "streaming"
interface. This means that the client doesn't absorb all the results
before allowing access to them. You can start processing results
as and when they become available by blocking in the client. The main
changes would be to the libpq client library, but there would also be
other issues to address, like what happens if an error occurs half way
through. In short, I'm sure this will be fixed at some stage, but for
now cursors are the only real answer.

Or ...LIMIT...OFFSET, I guess. [As long as I remember to set the
transaction isolation to serializable. *sigh* Why isn't that the
default?]

I shall investigate whether LIMIT...OFFSET or cursors seems to be
better for my application.

Jules
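[The LIMIT...OFFSET workaround could be sketched roughly as below; the table, columns, and connection details are invented. Note that an ORDER BY is needed for the pages to be deterministic, and that large OFFSETs get progressively slower because the backend must scan and discard all the skipped rows.]

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:Pg:dbname=mydb", "someuser", "secret",
                       { RaiseError => 1, AutoCommit => 0 });

# One serializable transaction so every page sees the same snapshot.
$dbh->do("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE");

my $page   = 10_000;   # rows fetched per round trip
my $offset = 0;
while (1) {
    my $rows = $dbh->selectall_arrayref(
        "SELECT id, name FROM big_table ORDER BY id LIMIT ? OFFSET ?",
        undef, $page, $offset);
    last unless @$rows;             # no more pages
    print join("\t", @$_), "\n" for @$rows;
    $offset += $page;
}
$dbh->commit;
```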

#9Jules Bean
jules@jellybean.co.uk
In reply to: Jules Bean (#8)
Re: Large selects handled inefficiently?

On Thu, Aug 31, 2000 at 09:58:34AM +0100, Jules Bean wrote:

On Thu, Aug 31, 2000 at 03:28:14PM +1100, Chris wrote:

but it is true that this is a flaw in postgres. There has been
discussion on hackers from time to time about implementing a "streaming"
interface. This means that the client doesn't absorb all the results
before allowing access to them. You can start processing results
as and when they become available by blocking in the client. The main
changes would be to the libpq client library, but there would also be
other issues to address, like what happens if an error occurs half way
through. In short, I'm sure this will be fixed at some stage, but for
now cursors are the only real answer.

Or ...LIMIT...OFFSET, I guess. [As long as I remember to set the
transaction isolation to serializable. *sigh* Why isn't that the
default?]

I shall investigate whether LIMIT...OFFSET or cursors seems to be
better for my application.

OK, I'm using cursors (after having checked that they work with
DBD::Pg!). I'm a little confused about transaction isolation levels,
though. I'm setting the level to 'serializable' --- this seems
important, since other INSERTs might occur during my SELECT. However,
the documentation for DECLARE CURSOR suggests that the 'INSENSITIVE'
keyword is useless, which seems to me to be equivalent to saying that
the transaction level is always SERIALIZABLE?

Jules
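[For reference, the cursor approach could look roughly like this with DBD::Pg; the table and column names are invented. Cursors only live inside a transaction, hence AutoCommit is off.]

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:Pg:dbname=mydb", "someuser", "secret",
                       { RaiseError => 1, AutoCommit => 0 });

# DECLARE ties the query to a server-side cursor; the result set stays
# in the backend instead of being shipped to the client all at once.
$dbh->do("DECLARE big_csr CURSOR FOR SELECT id, name FROM big_table");

while (1) {
    # Pull a modest batch per round trip instead of the whole table.
    my $sth = $dbh->prepare("FETCH 1000 FROM big_csr");
    $sth->execute;
    my $got = 0;
    while (my @row = $sth->fetchrow_array) {
        $got++;
        print join("\t", @row), "\n";
    }
    last unless $got;               # an empty FETCH means we're done
}
$dbh->do("CLOSE big_csr");
$dbh->commit;
```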

#10Pablo Prieto
pablo.prieto@dulcesol.es
In reply to: Jules Bean (#1)
Error installing ODBC in NT

Hi!,

I'm installing the ODBC driver, and everything goes right on all the clients
(NT Workst.) except one of them, where I get this error at 85% of the setup
process:

"Unable to create ODBC Core Subkey"

I'm sure it's not the ODBC driver's fault (I guess), but if someone knows a
way around it, that would be nice, since I'm migrating from MS-SQL Server to
PostgreSQL, and that computer is an important one, so it is delaying the
process

Thanks in advance!

Pablo.

#11Joel Burton
jburton@scw.org
In reply to: Pablo Prieto (#10)
Re: Error installing ODBC in NT

Hi!,

I'm installing the ODBC driver, and everything goes right on all the
clients (NT Workst.) except one of them, where I get this error at
85% of the setup process:

"Unable to create ODBC Core Subkey"

I'm sure it's not the ODBC driver's fault (I guess), but if someone
knows a way around it, that would be nice, since I'm migrating from
MS-SQL Server to PostgreSQL, and that computer is an important one, so
it is delaying the process

Under Win98, I had similar problems installing ODBC drivers (at some
point, the ODBC manager seemed to become corrupted, and I couldn't
add new drivers.) Are you able to add another new ODBC driver?
(You could download MySQL's, for example, & see if that works.)

I solved my problem by going into the registry, deleting the
ODBC manager's keys, and reinstalling PostgreSQL w/driver manager.
I lost my other ODBC datasources, but reinstalled those. It's hardly
elegant, and I'm sure there are better ways, but, hey, it worked for
me. Your mileage may *definitely* vary, you've been warned, etc.,
etc.

--
Joel Burton, Director of Information Systems -*- jburton@scw.org
Support Center of Washington (www.scw.org)

#12Steve Wolfe
steve@iboats.com
In reply to: Joel Burton (#11)
PostgreSQL on Alpha

Can any of you running PostgreSQL on Alphas comment on the performance vs.
Intel-based servers?

steve

#13Pablo Prieto
pablo.prieto@dulcesol.es
In reply to: Joel Burton (#11)
Re: Error installing ODBC in NT

I solved my problem by going into the registry, deleting the
ODBC manager's keys, and reinstalling PostgreSQL w/driver manager.
I lost my other ODBC datasources, but reinstalled those. It's hardly
elegant, and I'm sure there are better ways, but, hey, it worked for
me. Your mileage may *definitely* vary, you've been warned, etc.,
etc.

Hi. Thanks for the reply. I had a vision and I solved the problem. But my
solution is, perhaps, harder than yours: I deleted the user and created it
again. Quick for me, but the user looked angrrrry :).

The driver then installed perfectly as usual. Maybe it was a corruption of
the registry. Bloody Windows :(

Well, that's another emergency solution.

Pablo.

#14Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Jules Bean (#9)
RE: Large selects handled inefficiently?

-----Original Message-----
From: Jules Bean

On Thu, Aug 31, 2000 at 09:58:34AM +0100, Jules Bean wrote:

On Thu, Aug 31, 2000 at 03:28:14PM +1100, Chris wrote:

but it is true that this is a flaw in postgres. There has been
discussion on hackers from time to time about implementing a "streaming"
interface. This means that the client doesn't absorb all the results
before allowing access to them. You can start processing results
as and when they become available by blocking in the client. The main
changes would be to the libpq client library, but there would also be
other issues to address, like what happens if an error occurs half way
through. In short, I'm sure this will be fixed at some stage, but for
now cursors are the only real answer.

Or ...LIMIT...OFFSET, I guess. [As long as I remember to set the
transaction isolation to serializable. *sigh* Why isn't that the
default?]

I shall investigate whether LIMIT...OFFSET or cursors seems to be
better for my application.

OK, I'm using cursors (after having checked that they work with
DBD::Pg!). I'm a little confused about transaction isolation levels,
though. I'm setting the level to 'serializable' --- this seems
important, since other INSERTs might occur during my SELECT. However,
the documentation for DECLARE CURSOR suggests that the 'INSENSITIVE'
keyword is useless, which seems to me to be equivalent to saying that
the transaction level is always SERIALIZABLE?

The default transaction isolation level of PostgreSQL is READ COMMITTED.
However, transaction isolation levels seem to be irrelevant to your problem.
PostgreSQL cursors don't see any changes made by other backends after
they are declared (opened), regardless of the transaction isolation level.
INSENSITIVE cursors aren't implemented yet. Cursors aren't fully INSENSITIVE,
because they would see the changes made by the declaring backend itself.
That, too, holds regardless of the transaction isolation level.

Regards.

Hiroshi Inoue
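[Hiroshi's point about snapshot behaviour could be illustrated with two connections, roughly as below. The table name and connection details are invented; the second backend's committed INSERT never shows up in the already-open cursor, whatever the isolation level.]

```perl
use strict;
use warnings;
use DBI;

my $reader = DBI->connect("dbi:Pg:dbname=mydb", "someuser", "secret",
                          { RaiseError => 1, AutoCommit => 0 });
my $writer = DBI->connect("dbi:Pg:dbname=mydb", "someuser", "secret",
                          { RaiseError => 1, AutoCommit => 1 });

# Open the cursor first; this fixes the set of rows it will ever see.
$reader->do("DECLARE csr CURSOR FOR SELECT count(*) FROM items");

# Another backend commits a new row after the cursor was opened...
$writer->do("INSERT INTO items DEFAULT VALUES");

# ...but the cursor still reports the count as of DECLARE time.
my ($n) = $reader->selectrow_array("FETCH 1 FROM csr");
```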

#15Jules Bean
jules@jellybean.co.uk
In reply to: Zlatko Calusic (#4)
Re: Large selects handled inefficiently?

On Thu, Aug 31, 2000 at 12:55:35AM +0200, Zlatko Calusic wrote:

"Andrew Snow" <als@fl.net.au> writes:

I believe I can work around this problem using cursors (although I
don't know how well DBD::Pg copes with cursors). However, that
doesn't seem right -- cursors shouldn't be needed to fetch a large query
without having it all in memory at once...

Yes, I have noticed that particular bad behaviour, too.
With DBD::Pg and DBD::mysql.

At the same time, DBD::Oracle, DBD::InterBase and DBD::Sybase work as
expected. Rows are fetched with fetchrow...() functions instead of all
being sucked up into memory at the time execute() is called.

Anybody know why that is happening?

Yes. It's a defect in 'libpq', the underlying PostgreSQL client library. This
library is not capable of reading partial results - it always reads the whole
result set.

Actually, I think that's why cursors were invented in the first place ;-) A
cursor is what you are using if you're not fetching all the results of a
query.

What bothers me is different behaviour of different DBD drivers. But,
yes, I have just subscribed to dbi-users list which is the right place
to ask that question.

No, it's not really the DBD driver's fault. There is no (easy) way
around it, since the flaw lies in the C library it uses. If the DBD
driver wished to change behaviour, it could 'secretly' use cursors,
but that would involve parsing queries to detect selects, which might
be fragile.

Jules

#16Franck Martin
Franck@sopac.org
In reply to: Andrew Snow (#3)
Update table based on OID

I would like to update a tuple based on its oid

update table set name='text' where oid=234

does not seem to update anything...

franck@sopac.org

#17Ross J. Reedstrom
reedstrm@rice.edu
In reply to: Jules Bean (#15)
Re: Large selects handled inefficiently?

On Wed, Sep 06, 2000 at 09:33:07AM +0100, Jules Bean wrote:

No, it's not really the DBD driver's fault. There is no (easy) way
around it, since the flaw lies in the C library it uses. If the DBD
driver wished to change behaviour, it could 'secretly' use cursors,
but that would involve parsing queries to detect selects, which might
be fragile.

Recent experience points out the fragility: the ODBC driver has an option
to try this, and there was just a bug report involving the driver's
attempts to use cursors with SELECT FOR UPDATE.

Ross
--
Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu>
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St., Houston, TX 77005