Problems with inequalities on numeric fields in 6.5

Started by Martin Weinbergabout 27 years ago9 messagesgeneral

weinberg@osprey.phast.umass.edu

about 27 years ago

Folks,

I have som large data basees (from 7 million to 20 million
records and growing) and want to optimize selection in a
range between two float4 values. I have made indices on the
relevant variables.

I notice that a query such as:

select count(*) from mydata where x='3.4';

executes in under 10 seconds. However, a query such as

select count(*) from mydata where x>'3.4' and x<'3.5';

takes at least 20 minutes. EXPLAIN suggests that both
are using an index scan.

Not forcing float4 conversion does much better, e.g.

select count(*) from mydata where x>3.4 and x<3.5;

returns in under a minute although EXPLAIN suggests
that an sequential scan is used.

Can anyone give me some guidance on what that best
strategy is do select records in a range of float4
values?

Thanks!

--Martin

P.S. This behavior is also present in 6.4.2.

===========================================================================

Martin Weinberg Phone: (413) 545-3821
Dept. of Physics and Astronomy FAX: (413) 545-2117/0648
530 Graduate Research Tower
University of Massachusetts
Amherst, MA 01003-4525

Martin Weinberg

weinberg@osprey.phast.umass.edu

about 27 years ago

In reply to: Martin Weinberg (#1)

Re: [GENERAL] Problems with inequalities on numeric fields in 6.5

Bruce Momjian wrote on Tue, 06 Jul 1999 23:24:12 EDT

Not forcing float4 conversion does much better, e.g.

select count(*) from mydata where x>3.4 and x<3.5;

OK, let me ask. Vacuum analyze. What does pg_statistics show for
min/max values? What does EXPLAIN show?

The _true_ variable name is called "j_m" in the table "lmctot" and
the database is called lmc.

From "select * from pg_statistic", I have min and max to be

2.731 and 99.999 for that float4 field.

Explain for the converted values:
--------------------------------
lmc=> explain select count(*) from lmctot where j_m>'3.4' and j_m<'3.5';
NOTICE: QUERY PLAN:

Aggregate (cost=62349.97 rows=788100 width=4)
-> Index Scan using j on lmctot (cost=62349.97 rows=788100 width=4)

And for the uncast values:
-------------------------

lmc=> explain select count(*) from lmctot where j_m>'3.4' and j_m<'3.5';
NOTICE: QUERY PLAN:

Aggregate (cost=62349.97 rows=788100 width=4)
-> Index Scan using j on lmctot (cost=62349.97 rows=788100 width=4)

lmc=> explain select count(*) from lmctot where j_m>3.4 and
j_m<3.5;NOTICE: QUERY PLAN:

Aggregate (cost=423901.50 rows=788100 width=4)
-> Seq Scan on lmctot (cost=423901.50 rows=788100 width=4)

EXPLAIN

Any clues? I tried looking at the "verbose" output but that
is beyond me.

I appreciate the help. I need to figure out whether pgsql
will do the job for this application.

--M

===========================================================================

Martin Weinberg Phone: (413) 545-3821
Dept. of Physics and Astronomy FAX: (413) 545-2117/0648
530 Graduate Research Tower
University of Massachusetts
Amherst, MA 01003-4525

Import Notes

Reply to msg id not found: 199907070324.XAA02358@candle.pha.pa.us | Resolved by subject fallback

Bruce Momjian

bruce@momjian.us

about 27 years ago

In reply to: Martin Weinberg (#1)

Re: [GENERAL] Problems with inequalities on numeric fields in 6.5

Not forcing float4 conversion does much better, e.g.

select count(*) from mydata where x>3.4 and x<3.5;

OK, let me ask. Vacuum analyze. What does pg_statistics show for
min/max values? What does EXPLAIN show?

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Bruce Momjian

bruce@momjian.us

about 27 years ago

In reply to: Martin Weinberg (#2)

Re: [GENERAL] Problems with inequalities on numeric fields in 6.5

lmc=> explain select count(*) from lmctot where j_m>'3.4' and j_m<'3.5';
NOTICE: QUERY PLAN:

Aggregate (cost=62349.97 rows=788100 width=4)
-> Index Scan using j on lmctot (cost=62349.97 rows=788100 width=4)

Please try this:

lmc=> explain select count(*) from lmctot where j_m>3.4::float4 and
j_m<3.5::float4

Also, given your min/max, I am not sure why it thinks it is going to get
788,100 rows. How many rows in the table again?

Does (3.5-3.4)/(max-min) * #rows = 788k?

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Martin Weinberg

weinberg@osprey.phast.umass.edu

about 27 years ago

In reply to: Bruce Momjian (#4)

Re: [GENERAL] Problems with inequalities on numeric fields in 6.5

Bruce Momjian wrote on Wed, 07 Jul 1999 04:07:00 EDT

lmc=> explain select count(*) from lmctot where j_m>'3.4' and j_m<'3.5';
NOTICE: QUERY PLAN:

Aggregate (cost=62349.97 rows=788100 width=4)
-> Index Scan using j on lmctot (cost=62349.97 rows=788100 width=4)

Please try this:

lmc=> explain select count(*) from lmctot where j_m>3.4::float4 and
j_m<3.5::float4

Also, given your min/max, I am not sure why it thinks it is going to get
788,100 rows. How many rows in the table again?

Does (3.5-3.4)/(max-min) * #rows = 788k?

Thanks, Bruce!

Yes, I tried the latter query and it's the same:

--------------------------------------------------

lmc=> explain select count(*) from lmctot where j_m>3.4::float4 and j_m<3.5::float4;
NOTICE: QUERY PLAN:

Aggregate (cost=62349.97 rows=788100 width=4)
-> Index Scan using j on lmctot (cost=62349.97 rows=788100 width=4)

EXPLAIN

--------------------------------------------------
I've tried all permutations of the conversions in the ranges with
similar results (and vacuum analyzed several times as well as
dumped and reloaded and reloaded from scracth). We have
a larger database with 20M rows which has a similar behavior.

There are 7092894 rows in database "lmc". So:

(3.5-3.4)/(99.999-2.731) = 7292.1 != 788k

A clue?

Thanks again,

--M

===========================================================================

Martin Weinberg Phone: (413) 545-3821
Dept. of Physics and Astronomy FAX: (413) 545-2117/0648
530 Graduate Research Tower
University of Massachusetts
Amherst, MA 01003-4525

Jonathan davis

haj@idianet.net

about 27 years ago

In reply to: Martin Weinberg (#5)

Re: [GENERAL] Problems with inequalities on numeric fields in 6.5

hello all

i want to : update virtual_table set field1 = 3 where index = 4;

create table virtual_table(index int4, field1 int4);

create table table1 ( ....)inherits(virtual_table);

create table table2( ....)inherits(virtual_table);

insert into table1(index, field1) values(4, 0);

and i want to change the value of field1

Import Notes

Resolved by subject fallback

Bruce Momjian

bruce@momjian.us

about 27 years ago

In reply to: Martin Weinberg (#5)

Re: [GENERAL] Problems with inequalities on numeric fields in 6.5

Thanks, Bruce!

Yes, I tried the latter query and it's the same:

--------------------------------------------------

lmc=> explain select count(*) from lmctot where j_m>3.4::float4 and j_m<3.5::float4;
NOTICE: QUERY PLAN:

Aggregate (cost=62349.97 rows=788100 width=4)
-> Index Scan using j on lmctot (cost=62349.97 rows=788100 width=4)

EXPLAIN

--------------------------------------------------
I've tried all permutations of the conversions in the ranges with
similar results (and vacuum analyzed several times as well as
dumped and reloaded and reloaded from scracth). We have
a larger database with 20M rows which has a similar behavior.

There are 7092894 rows in database "lmc". So:

(3.5-3.4)/(99.999-2.731) = 7292.1 != 788k

A clue?

I have just fixed a problem with index size estimates. Try adding
#include <math.h> to the top of backend/optimizer/util/plancat.c. That
may fix the estimated number of tuples returned. However, it don't
think you are going to get better performance, since you are already
using the index in the above case. The only big win I can think of is
to use CLUSTER on that field. That should speed things up quite a bit.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Martin Weinberg

weinberg@osprey.phast.umass.edu

about 27 years ago

In reply to: Bruce Momjian (#7)

Re: [GENERAL] Problems with inequalities on numeric fields in 6.5

Bruce Momjian wrote on Wed, 07 Jul 1999 12:29:13 EDT

Thanks, Bruce!

Yes, I tried the latter query and it's the same:

--------------------------------------------------

lmc=> explain select count(*) from lmctot where j_m>3.4::float4 and j_m<3.5:

:float4;

NOTICE: QUERY PLAN:

Aggregate (cost=62349.97 rows=788100 width=4)
-> Index Scan using j on lmctot (cost=62349.97 rows=788100 width=4)

EXPLAIN

--------------------------------------------------
I've tried all permutations of the conversions in the ranges with
similar results (and vacuum analyzed several times as well as
dumped and reloaded and reloaded from scracth). We have
a larger database with 20M rows which has a similar behavior.

There are 7092894 rows in database "lmc". So:

(3.5-3.4)/(99.999-2.731) = 7292.1 != 788k

A clue?

I have just fixed a problem with index size estimates. Try adding
#include <math.h> to the top of backend/optimizer/util/plancat.c. That
may fix the estimated number of tuples returned. However, it don't
think you are going to get better performance, since you are already
using the index in the above case. The only big win I can think of is
to use CLUSTER on that field. That should speed things up quite a bit.

Hi Bruce,

Ok. Sorry about the delay.

I added the math.h but that doesn't seem to change the
query plan output.

I then dropped all the indices, made a new one on three of the
variables and clustered:

create index m_col on lmctot using btree (j_m, h_m, k_m);
cluster m_col on lmctot;
vacuum analyze;

where the j_m, h_m, k_m are three float4 fields.

The cluster took about 18 hours on my 7.1 million records
(this is a dual 450Mhz Xeon Linux box). Not sure why
this was so slow.

Anyway, this *hugely* improved queries of form:

select count(*) from lmctot where j_m>3.4::float4 and j_m<3.5:

although the explain query plan output is identical. However
using h_m or k_m (not the first variable in the index) appears
to be doing a sequential scan. Is that right?

I then made indices on h_m and k_m, vacuum analyzed and tried
again, but got identical performance. If this is the way
it is, so be it, but I have the feeling that something is
not working properly.

Any ideas?

Again, with _heaps_ of thanks,

--Martin

===========================================================================

Martin Weinberg Phone: (413) 545-3821
Dept. of Physics and Astronomy FAX: (413) 545-2117/0648
530 Graduate Research Tower
University of Massachusetts
Amherst, MA 01003-4525

Bruce Momjian

bruce@momjian.us

about 27 years ago

In reply to: Martin Weinberg (#8)

Re: [GENERAL] Problems with inequalities on numeric fields in 6.5

Hi Bruce,

Ok. Sorry about the delay.

I added the math.h but that doesn't seem to change the
query plan output.

I then dropped all the indices, made a new one on three of the
variables and clustered:

create index m_col on lmctot using btree (j_m, h_m, k_m);
cluster m_col on lmctot;
vacuum analyze;

where the j_m, h_m, k_m are three float4 fields.

The cluster took about 18 hours on my 7.1 million records
(this is a dual 450Mhz Xeon Linux box). Not sure why
this was so slow.

That is a long time.

Anyway, this *hugely* improved queries of form:

select count(*) from lmctot where j_m>3.4::float4 and j_m<3.5:

although the explain query plan output is identical. However
using h_m or k_m (not the first variable in the index) appears
to be doing a sequential scan. Is that right?

Yes, that is right. The index is only on the one field, and can only
use secondary index variables after the first one is matched.

I then made indices on h_m and k_m, vacuum analyzed and tried
again, but got identical performance. If this is the way
it is, so be it, but I have the feeling that something is
not working properly.

The big problem is that you can only cluster on one index.

What cluster has done is prevent the system from bouncing all over the
disk getting matching rows. They are all sequential on the disk.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026