Incorrect index being used

Started by Jesse Longalmost 13 years ago12 messagesgeneral

jpl@unknown.za.net

almost 13 years ago

Hi PostgreSQL community,

I have the following query, run immediately after executing VACUUM in
the database. There is only one connection to the database.

The query runs for much longer than I expect it to run for, and I think
this is due to it using the incorrect subplan. As you can see, subplans
1 and 3 make use of and index, but these subplans are not used.
Subplans and 4 are seqscan, and they are used.

How can I get PostgreSQL to use subplan 1 and 3?

Thanks,
Jesse

testdb=> explain analyse SELECT * FROM ARCHIVE_DOCUMENT AS r0 WHERE r0.NODE_ID = 29 AND r0.ARCHIVE_DATE >= '2013-07-08 18:28:00' AND (EXISTS (SELECT r1.* FROM ARCHIVE_DOCUMENT_INDEX AS r1 WHERE r1.ARCHIVE_ID = r0.ID AND r1.NODE_ID = r0.NODE_ID AND r1.VALUE = 'BSH70002152' ) OR EXISTS ( SELECT r2.* FROM ARCHIVE_DOCUMENT_INDEX AS r2 WHERE r2.ARCHIVE_ID = r0.ID AND r2.NODE_ID = r0.NODE_ID AND r2.VALUE = 'TC212592' ) ) ORDER BY r0.ARCHIVE_DATE DESC LIMIT 10;

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Limit (cost=0.56..151.79 rows=10 width=122) (actual time=44601.350..97649.196 rows=2 loops=1)

-> Index Scan Backward using idx_archive_document_x_node_id_archive_date on archive_document r0 (cost=0.56..8258406.24 rows=546105 width=122) (actual time=44601.33

Index Cond: ((node_id = 29) AND (archive_date >= '2013-07-08 18:28:00'::timestamp without time zone))

Filter: ((alternatives: SubPlan 1 or hashed SubPlan 2) OR (alternatives: SubPlan 3 or hashed SubPlan 4))

Rows Removed by Filter: 710851

SubPlan 1

-> Index Only Scan using archive_document_index_x_archive_id_node_id_value on archive_document_index r1 (cost=0.57..4.59 rows=1 width=0) (never executed)

Index Cond: ((archive_id = r0.id) AND (node_id = r0.node_id) AND (value = 'BSH70002152'::text))

Heap Fetches: 0

SubPlan 2

-> Seq Scan on archive_document_index r1_1 (cost=0.00..1958104.00 rows=1520 width=16) (actual time=44418.383..44558.293 rows=4 loops=1)

Filter: ((value)::text = 'BSH70002152'::text)

Rows Removed by Filter: 95009919

SubPlan 3

-> Index Only Scan using archive_document_index_x_archive_id_node_id_value on archive_document_index r2 (cost=0.57..4.59 rows=1 width=0) (never executed)

Index Cond: ((archive_id = r0.id) AND (node_id = r0.node_id) AND (value = 'TC212592'::text))

Heap Fetches: 0

SubPlan 4

-> Seq Scan on archive_document_index r2_1 (cost=0.00..1958104.00 rows=1520 width=16) (actual time=41659.464..41663.342 rows=1 loops=1)

Filter: ((value)::text = 'TC212592'::text)

Rows Removed by Filter: 95009922

Total runtime: 97683.836 ms

(22 rows)

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Laurenz Albe

laurenz.albe@cybertec.at

almost 13 years ago

In reply to: Jesse Long (#1)

Re: Incorrect index being used

Jesse Long wrote:

I have the following query, run immediately after executing VACUUM in
the database. There is only one connection to the database.

You should run ANALYZE, not VACUUM.

The query runs for much longer than I expect it to run for, and I think
this is due to it using the incorrect subplan. As you can see, subplans
1 and 3 make use of and index, but these subplans are not used.
Subplans and 4 are seqscan, and they are used.

How can I get PostgreSQL to use subplan 1 and 3?

They are only possible if an "Index Only Scan" is possible, which
can only be used if the respective table entries are visible for
all transactions.

testdb=> explain analyse SELECT * FROM ARCHIVE_DOCUMENT AS r0 WHERE r0.NODE_ID = 29 AND
r0.ARCHIVE_DATE >= '2013-07-08 18:28:00' AND (EXISTS (SELECT r1.* FROM ARCHIVE_DOCUMENT_INDEX AS r1
WHERE r1.ARCHIVE_ID = r0.ID AND r1.NODE_ID = r0.NODE_ID AND r1.VALUE = 'BSH70002152' ) OR EXISTS (
SELECT r2.* FROM ARCHIVE_DOCUMENT_INDEX AS r2 WHERE r2.ARCHIVE_ID = r0.ID AND r2.NODE_ID = r0.NODE_ID
AND r2.VALUE = 'TC212592' ) ) ORDER BY r0.ARCHIVE_DATE DESC LIMIT 10;

[...]

Filter: ((alternatives: SubPlan 1 or hashed SubPlan 2) OR (alternatives: SubPlan 3 or hashed
SubPlan 4))

Rows Removed by Filter: 710851

SubPlan 1

-> Index Only Scan using archive_document_index_x_archive_id_node_id_value on
archive_document_index r1 (cost=0.57..4.59 rows=1 width=0) (never executed)

Index Cond: ((archive_id = r0.id) AND (node_id = r0.node_id) AND (value =
'BSH70002152'::text))

Heap Fetches: 0

SubPlan 2

-> Seq Scan on archive_document_index r1_1 (cost=0.00..1958104.00 rows=1520 width=16)
(actual time=44418.383..44558.293 rows=4 loops=1)

Filter: ((value)::text = 'BSH70002152'::text)

Rows Removed by Filter: 95009919

SubPlan 3

-> Index Only Scan using archive_document_index_x_archive_id_node_id_value on
archive_document_index r2 (cost=0.57..4.59 rows=1 width=0) (never executed)

Index Cond: ((archive_id = r0.id) AND (node_id = r0.node_id) AND (value =
'TC212592'::text))

Heap Fetches: 0

SubPlan 4

-> Seq Scan on archive_document_index r2_1 (cost=0.00..1958104.00 rows=1520 width=16)
(actual time=41659.464..41663.342 rows=1 loops=1)

Filter: ((value)::text = 'TC212592'::text)

Rows Removed by Filter: 95009922

The estimates are quite off.
Does "ANALYZE archive_document", possibly after increasing
default_statistics_target, make a difference?

Yours,
Laurenz Albe

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Jesse Long

jpl@unknown.za.net

almost 13 years ago

In reply to: Laurenz Albe (#2)

Re: Incorrect index being used

On 09/10/2013 12:10, Albe Laurenz wrote:

Jesse Long wrote:

I have the following query, run immediately after executing VACUUM in
the database. There is only one connection to the database.

You should run ANALYZE, not VACUUM.

The query runs for much longer than I expect it to run for, and I think
this is due to it using the incorrect subplan. As you can see, subplans
1 and 3 make use of and index, but these subplans are not used.
Subplans and 4 are seqscan, and they are used.

How can I get PostgreSQL to use subplan 1 and 3?

They are only possible if an "Index Only Scan" is possible, which
can only be used if the respective table entries are visible for
all transactions.

testdb=> explain analyse SELECT * FROM ARCHIVE_DOCUMENT AS r0 WHERE r0.NODE_ID = 29 AND
r0.ARCHIVE_DATE >= '2013-07-08 18:28:00' AND (EXISTS (SELECT r1.* FROM ARCHIVE_DOCUMENT_INDEX AS r1
WHERE r1.ARCHIVE_ID = r0.ID AND r1.NODE_ID = r0.NODE_ID AND r1.VALUE = 'BSH70002152' ) OR EXISTS (
SELECT r2.* FROM ARCHIVE_DOCUMENT_INDEX AS r2 WHERE r2.ARCHIVE_ID = r0.ID AND r2.NODE_ID = r0.NODE_ID
AND r2.VALUE = 'TC212592' ) ) ORDER BY r0.ARCHIVE_DATE DESC LIMIT 10;

[...]

Filter: ((alternatives: SubPlan 1 or hashed SubPlan 2) OR (alternatives: SubPlan 3 or hashed
SubPlan 4))

Rows Removed by Filter: 710851

SubPlan 1

-> Index Only Scan using archive_document_index_x_archive_id_node_id_value on
archive_document_index r1 (cost=0.57..4.59 rows=1 width=0) (never executed)

Index Cond: ((archive_id = r0.id) AND (node_id = r0.node_id) AND (value =
'BSH70002152'::text))

Heap Fetches: 0

SubPlan 2

-> Seq Scan on archive_document_index r1_1 (cost=0.00..1958104.00 rows=1520 width=16)
(actual time=44418.383..44558.293 rows=4 loops=1)

Filter: ((value)::text = 'BSH70002152'::text)

Rows Removed by Filter: 95009919

SubPlan 3

-> Index Only Scan using archive_document_index_x_archive_id_node_id_value on
archive_document_index r2 (cost=0.57..4.59 rows=1 width=0) (never executed)

Index Cond: ((archive_id = r0.id) AND (node_id = r0.node_id) AND (value =
'TC212592'::text))

Heap Fetches: 0

SubPlan 4

-> Seq Scan on archive_document_index r2_1 (cost=0.00..1958104.00 rows=1520 width=16)
(actual time=41659.464..41663.342 rows=1 loops=1)

Filter: ((value)::text = 'TC212592'::text)

Rows Removed by Filter: 95009922

The estimates are quite off.
Does "ANALYZE archive_document", possibly after increasing
default_statistics_target, make a difference?

Yours,
Laurenz Albe

Hi Laurenz,

Thank you for the feedback.

There is no problem with row visibility, there is only one connection to
the database - the connection I am using to do these selects.

Thanks you for the advise regarding ANALYZE. I ran ANALYZE on both
tables concerned, but not much changed:

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Limit (cost=0.56..151.85 rows=10 width=122) (actual time=40841.984..85668.213 rows=2 loops=1)

-> Index Scan Backward using idx_archive_document_x_node_id_archive_date on archive_document r0 (cost=0.56..7627640.20 rows=504186 width=122) (actual time=40841.98

Index Cond: ((node_id = 29) AND (archive_date >= '2013-07-08 18:28:00'::timestamp without time zone))

Filter: ((alternatives: SubPlan 1 or hashed SubPlan 2) OR (alternatives: SubPlan 3 or hashed SubPlan 4))

Rows Removed by Filter: 710851

SubPlan 1

-> Index Only Scan using archive_document_index_x_archive_id_node_id_value on archive_document_index r1 (cost=0.57..4.59 rows=1 width=0) (never executed)

Index Cond: ((archive_id = r0.id) AND (node_id = r0.node_id) AND (value = 'BSH70002152'::text))

Heap Fetches: 0

SubPlan 2

-> Seq Scan on archive_document_index r1_1 (cost=0.00..1958101.80 rows=1568 width=16) (actual time=36633.365..40841.909 rows=4 loops=1)

Filter: ((value)::text = 'BSH70002152'::text)

Rows Removed by Filter: 95009919

SubPlan 3

-> Index Only Scan using archive_document_index_x_archive_id_node_id_value on archive_document_index r2 (cost=0.57..4.59 rows=1 width=0) (never executed)

Index Cond: ((archive_id = r0.id) AND (node_id = r0.node_id) AND (value = 'TC212592'::text))

Heap Fetches: 0

SubPlan 4

-> Seq Scan on archive_document_index r2_1 (cost=0.00..1958101.80 rows=1568 width=16) (actual time=40241.599..44462.485 rows=1 loops=1)

Filter: ((value)::text = 'TC212592'::text)

Rows Removed by Filter: 95009922

Total runtime: 85676.734 ms

(22 rows)

Thanks,
Jesse

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Jesse Long

jpl@unknown.za.net

almost 13 years ago

In reply to: Jesse Long (#3)

Re: Incorrect index being used

On 09/10/2013 12:57, Jesse Long wrote:

On 09/10/2013 12:10, Albe Laurenz wrote:

Jesse Long wrote:

I have the following query, run immediately after executing VACUUM in
the database. There is only one connection to the database.

You should run ANALYZE, not VACUUM.

The query runs for much longer than I expect it to run for, and I think
this is due to it using the incorrect subplan. As you can see, subplans
1 and 3 make use of and index, but these subplans are not used.
Subplans and 4 are seqscan, and they are used.

How can I get PostgreSQL to use subplan 1 and 3?

They are only possible if an "Index Only Scan" is possible, which
can only be used if the respective table entries are visible for
all transactions.

testdb=> explain analyse SELECT * FROM ARCHIVE_DOCUMENT AS r0 WHERE
r0.NODE_ID = 29 AND
r0.ARCHIVE_DATE >= '2013-07-08 18:28:00' AND (EXISTS (SELECT r1.*
FROM ARCHIVE_DOCUMENT_INDEX AS r1
WHERE r1.ARCHIVE_ID = r0.ID AND r1.NODE_ID = r0.NODE_ID AND r1.VALUE
= 'BSH70002152' ) OR EXISTS (
SELECT r2.* FROM ARCHIVE_DOCUMENT_INDEX AS r2 WHERE r2.ARCHIVE_ID =
r0.ID AND r2.NODE_ID = r0.NODE_ID
AND r2.VALUE = 'TC212592' ) ) ORDER BY r0.ARCHIVE_DATE DESC LIMIT 10;

[...]

Filter: ((alternatives: SubPlan 1 or hashed SubPlan 2) OR
(alternatives: SubPlan 3 or hashed
SubPlan 4))

Rows Removed by Filter: 710851

SubPlan 1

-> Index Only Scan using
archive_document_index_x_archive_id_node_id_value on
archive_document_index r1 (cost=0.57..4.59 rows=1 width=0) (never
executed)

Index Cond: ((archive_id = r0.id) AND (node_id =
r0.node_id) AND (value =
'BSH70002152'::text))

Heap Fetches: 0

SubPlan 2

-> Seq Scan on archive_document_index r1_1
(cost=0.00..1958104.00 rows=1520 width=16)
(actual time=44418.383..44558.293 rows=4 loops=1)

Filter: ((value)::text = 'BSH70002152'::text)

Rows Removed by Filter: 95009919

SubPlan 3

-> Index Only Scan using
archive_document_index_x_archive_id_node_id_value on
archive_document_index r2 (cost=0.57..4.59 rows=1 width=0) (never
executed)

Index Cond: ((archive_id = r0.id) AND (node_id =
r0.node_id) AND (value =
'TC212592'::text))

Heap Fetches: 0

SubPlan 4

-> Seq Scan on archive_document_index r2_1
(cost=0.00..1958104.00 rows=1520 width=16)
(actual time=41659.464..41663.342 rows=1 loops=1)

Filter: ((value)::text = 'TC212592'::text)

Rows Removed by Filter: 95009922

The estimates are quite off.
Does "ANALYZE archive_document", possibly after increasing
default_statistics_target, make a difference?

Yours,
Laurenz Albe

Hi Laurenz,

Thank you for the feedback.

There is no problem with row visibility, there is only one connection
to the database - the connection I am using to do these selects.

Thanks you for the advise regarding ANALYZE. I ran ANALYZE on both
tables concerned, but not much changed:

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Limit (cost=0.56..151.85 rows=10 width=122) (actual
time=40841.984..85668.213 rows=2 loops=1)

-> Index Scan Backward using
idx_archive_document_x_node_id_archive_date on archive_document r0
(cost=0.56..7627640.20 rows=504186 width=122) (actual time=40841.98

Index Cond: ((node_id = 29) AND (archive_date >= '2013-07-08
18:28:00'::timestamp without time zone))

Filter: ((alternatives: SubPlan 1 or hashed SubPlan 2) OR
(alternatives: SubPlan 3 or hashed SubPlan 4))

Rows Removed by Filter: 710851

SubPlan 1

-> Index Only Scan using
archive_document_index_x_archive_id_node_id_value on
archive_document_index r1 (cost=0.57..4.59 rows=1 width=0) (never
executed)

Index Cond: ((archive_id = r0.id) AND (node_id =
r0.node_id) AND (value = 'BSH70002152'::text))

Heap Fetches: 0

SubPlan 2

-> Seq Scan on archive_document_index r1_1
(cost=0.00..1958101.80 rows=1568 width=16) (actual
time=36633.365..40841.909 rows=4 loops=1)

Filter: ((value)::text = 'BSH70002152'::text)

Rows Removed by Filter: 95009919

SubPlan 3

-> Index Only Scan using
archive_document_index_x_archive_id_node_id_value on
archive_document_index r2 (cost=0.57..4.59 rows=1 width=0) (never
executed)

Index Cond: ((archive_id = r0.id) AND (node_id =
r0.node_id) AND (value = 'TC212592'::text))

Heap Fetches: 0

SubPlan 4

-> Seq Scan on archive_document_index r2_1
(cost=0.00..1958101.80 rows=1568 width=16) (actual
time=40241.599..44462.485 rows=1 loops=1)

Filter: ((value)::text = 'TC212592'::text)

Rows Removed by Filter: 95009922

Total runtime: 85676.734 ms

(22 rows)

Sorry, I neglected to mention that this is on PostgreSQL 9.3.0.

Thanks,
Jesse

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Laurenz Albe

laurenz.albe@cybertec.at

almost 13 years ago

In reply to: Jesse Long (#3)

Re: Incorrect index being used

Jesse Long wrote:

There is no problem with row visibility, there is only one connection to
the database - the connection I am using to do these selects.

No idea why the plans cannot be used.

It might be helpful to see the table and index definitions.

Thanks you for the advise regarding ANALYZE. I ran ANALYZE on both
tables concerned, but not much changed:

Did you try increasing default_statistics_target before ANALYZE?

Yours,
Laurenz Albe

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Tom Lane

tgl@sss.pgh.pa.us

almost 13 years ago

In reply to: Jesse Long (#1)

Re: Incorrect index being used

Jesse Long <jpl@unknown.za.net> writes:

The query runs for much longer than I expect it to run for, and I think
this is due to it using the incorrect subplan. As you can see, subplans
1 and 3 make use of and index, but these subplans are not used.
Subplans and 4 are seqscan, and they are used.
How can I get PostgreSQL to use subplan 1 and 3?

You can't, and you would not like the results if you did.

The construct that's being described (perhaps not very intelligibly)
by this EXPLAIN output is an alternative pair of subplans. Actually
there are two such alternative pairs in this example. The indexscan
variants are subplans that would be fast if executed only once or
twice. The seqscan variants, if used, are used to load a hashtable
that is then probed for each row of the outer plan. If there are a
lot of rows to be considered in the outer plan, then it's better to
pay the price of loading the hashtable, because each hashtable probe
will be a lot cheaper than doing a fresh indexscan with the comparison
value from the current outer row.

In this example, we can see that the outer scan that the subplans
are attached to eliminated 710851 rows by means of the subplan filters,
meaning that the subplans were probed 710851+2 times. If each of those
probes had been done with a separate indexscan, you'd likely still be
waiting for the result. Using the seqscan+hashtable was definitely the
right choice here.

BTW, the reason it looks like this rather than just hard-wiring the
seqscan choice is a planner implementation artifact --- at the time
that the subplan plans are created, we don't know how many rows are
expected to pass through the outer plan level. So we plan it both
ways and leave the choice to be made during executor startup.

What I'd suggest is that you see if you can't get rid of the "EXISTS() OR
EXISTS()" construction in favor of a single EXISTS clause --- I'm too lazy
to work out the details but it looks like you could do the OR in the WHERE
clause of a single EXISTS sub-select. That would allow the planner to
convert the EXISTS into a semi-join, which might work better than what
you've got. As is, you're dealing with fairly generic sub-select logic
that isn't going to be terribly well optimized.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Jesse Long

jpl@unknown.za.net

almost 13 years ago

In reply to: Laurenz Albe (#5)

Re: Incorrect index being used

On 09/10/2013 15:20, Albe Laurenz wrote:

Jesse Long wrote:

There is no problem with row visibility, there is only one connection to
the database - the connection I am using to do these selects.

No idea why the plans cannot be used.

It might be helpful to see the table and index definitions.

Thanks you for the advise regarding ANALYZE. I ran ANALYZE on both
tables concerned, but not much changed:

Did you try increasing default_statistics_target before ANALYZE?

Yours,
Laurenz Albe

Hi Laurenz,

After running VACUUM ANALYZE, the query did run the preferred sub plan,
at least once. This was before making any changes.

I did ALTER COLUMN SET STATISTICS 10000 for each column involved after
that (is that what you meant?). But it did not make much difference, but
I have read the manual regarding this setting and think I understand it.

Would it be a bad idea to ALTER COLUMN SET STATISTICS 100000? Would that
have a very negative impact on disk usage or performace?

Thanks,
Jesse

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Jesse Long

jpl@unknown.za.net

almost 13 years ago

In reply to: Tom Lane (#6)

Re: Incorrect index being used

On 09/10/2013 18:06, Tom Lane wrote:

Jesse Long <jpl@unknown.za.net> writes:

The query runs for much longer than I expect it to run for, and I think
this is due to it using the incorrect subplan. As you can see, subplans
1 and 3 make use of and index, but these subplans are not used.
Subplans and 4 are seqscan, and they are used.
How can I get PostgreSQL to use subplan 1 and 3?

You can't, and you would not like the results if you did.

The construct that's being described (perhaps not very intelligibly)
by this EXPLAIN output is an alternative pair of subplans. Actually
there are two such alternative pairs in this example. The indexscan
variants are subplans that would be fast if executed only once or
twice. The seqscan variants, if used, are used to load a hashtable
that is then probed for each row of the outer plan. If there are a
lot of rows to be considered in the outer plan, then it's better to
pay the price of loading the hashtable, because each hashtable probe
will be a lot cheaper than doing a fresh indexscan with the comparison
value from the current outer row.

In this example, we can see that the outer scan that the subplans
are attached to eliminated 710851 rows by means of the subplan filters,
meaning that the subplans were probed 710851+2 times. If each of those
probes had been done with a separate indexscan, you'd likely still be
waiting for the result. Using the seqscan+hashtable was definitely the
right choice here.

BTW, the reason it looks like this rather than just hard-wiring the
seqscan choice is a planner implementation artifact --- at the time
that the subplan plans are created, we don't know how many rows are
expected to pass through the outer plan level. So we plan it both
ways and leave the choice to be made during executor startup.

What I'd suggest is that you see if you can't get rid of the "EXISTS() OR
EXISTS()" construction in favor of a single EXISTS clause --- I'm too lazy
to work out the details but it looks like you could do the OR in the WHERE
clause of a single EXISTS sub-select. That would allow the planner to
convert the EXISTS into a semi-join, which might work better than what
you've got. As is, you're dealing with fairly generic sub-select logic
that isn't going to be terribly well optimized.

Hi Tom,

I am very grateful for your detailed reply. I have not had much time to
pursue this issue further, but as soon as I have I will investigate and
study what you have written.

Thanks for taking the time to write your thoughts in detail.

Cheers,
Jesse

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Laurenz Albe

laurenz.albe@cybertec.at

almost 13 years ago

In reply to: Jesse Long (#7)

Re: Incorrect index being used

Jesse Long wrote:

I did ALTER COLUMN SET STATISTICS 10000 for each column involved after
that (is that what you meant?). But it did not make much difference, but
I have read the manual regarding this setting and think I understand it.

Would it be a bad idea to ALTER COLUMN SET STATISTICS 100000? Would that
have a very negative impact on disk usage or performace?

No, in the light of what Tom wrote, I was on the wrong track entirely
and your best bet is probably to rewrite the query.

Yours,
Laurenz Albe

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#10

Jesse Long

jpl@unknown.za.net

almost 13 years ago

In reply to: Tom Lane (#6)

Re: Incorrect index being used

On 09/10/2013 18:06, Tom Lane wrote:

Jesse Long <jpl@unknown.za.net> writes:

The query runs for much longer than I expect it to run for, and I think
this is due to it using the incorrect subplan. As you can see, subplans
1 and 3 make use of and index, but these subplans are not used.
Subplans and 4 are seqscan, and they are used.
How can I get PostgreSQL to use subplan 1 and 3?

You can't, and you would not like the results if you did.

The construct that's being described (perhaps not very intelligibly)
by this EXPLAIN output is an alternative pair of subplans. Actually
there are two such alternative pairs in this example. The indexscan
variants are subplans that would be fast if executed only once or
twice. The seqscan variants, if used, are used to load a hashtable
that is then probed for each row of the outer plan. If there are a
lot of rows to be considered in the outer plan, then it's better to
pay the price of loading the hashtable, because each hashtable probe
will be a lot cheaper than doing a fresh indexscan with the comparison
value from the current outer row.

In this example, we can see that the outer scan that the subplans
are attached to eliminated 710851 rows by means of the subplan filters,
meaning that the subplans were probed 710851+2 times. If each of those
probes had been done with a separate indexscan, you'd likely still be
waiting for the result. Using the seqscan+hashtable was definitely the
right choice here.

BTW, the reason it looks like this rather than just hard-wiring the
seqscan choice is a planner implementation artifact --- at the time
that the subplan plans are created, we don't know how many rows are
expected to pass through the outer plan level. So we plan it both
ways and leave the choice to be made during executor startup.

What I'd suggest is that you see if you can't get rid of the "EXISTS() OR
EXISTS()" construction in favor of a single EXISTS clause --- I'm too lazy
to work out the details but it looks like you could do the OR in the WHERE
clause of a single EXISTS sub-select. That would allow the planner to
convert the EXISTS into a semi-join, which might work better than what
you've got. As is, you're dealing with fairly generic sub-select logic
that isn't going to be terribly well optimized.

Hi Tom,

I understand what you say about using the index (archive_id, node_id,
value) to do a separate lookup for each row in the archive_document
table that would be filtered. I understand that this would be constly.

However, the seqscan is killing me. I have another index on
archive_document_index which has been there all along - (node_id,
value). Would it not be better for PostgreSQL to use this index to
perform an index scan instead of a seqscan when populating the hash table?

explain select * from archive_document_index where node_id = 29 and
value = 'BSH70002152';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Index Scan using archive_document_index_node_id_value_idx on
archive_document_index (cost=0.57..36.13 rows=14 width=33)
Index Cond: ((node_id = 29) AND ((value)::text = 'BSH70002152'::text))
(2 rows)

Thanks,
Jesse

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#11

BladeOfLight16

bladeoflight16@gmail.com

almost 13 years ago

In reply to: Jesse Long (#10)

Re: Incorrect index being used

On Fri, Oct 11, 2013 at 9:32 AM, Jesse Long <jpl@unknown.za.net> wrote:

explain select * from archive_document_index where node_id = 29 and value
= 'BSH70002152';
QUERY PLAN
------------------------------**------------------------------**
------------------------------**------------------------------**--
Index Scan using archive_document_index_node_**id_value_idx on
archive_document_index (cost=0.57..36.13 rows=14 width=33)
Index Cond: ((node_id = 29) AND ((value)::text = 'BSH70002152'::text))
(2 rows)

I believe that this is what Tom is suggesting:

SELECT *
FROM ARCHIVE_DOCUMENT AS ad
WHERE ad.NODE_ID = 29
AND ad.ARCHIVE_DATE >= '2013-07-08 18:28:00'
AND EXISTS (SELECT *
FROM ARCHIVE_DOCUMENT_INDEX AS adi
WHERE adi.ARCHIVE_ID = ad.ID
AND adi.NODE_ID = ad.NODE_ID
AND (adi.VALUE = 'BSH70002152' OR adi.VALUE = 'TC212592')
)
ORDER BY ad.ARCHIVE_DATE DESC
LIMIT 10;

Forgive my changing of the aliases. I found names like r0 difficult to
interpret easily.

This is a much simpler query since it only requires one subselect, and I
believe Tom is suggesting that this query may be able to make use of the
index or at least find some more efficient plan. How does this perform?

#12

Jesse Long

jpl@unknown.za.net

almost 13 years ago

In reply to: BladeOfLight16 (#11)

Re: Incorrect index being used

On 12/10/2013 01:45, BladeOfLight16 wrote:

On Fri, Oct 11, 2013 at 9:32 AM, Jesse Long <jpl@unknown.za.net
<mailto:jpl@unknown.za.net>> wrote:

explain select * from archive_document_index where node_id = 29
and value = 'BSH70002152';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Index Scan using archive_document_index_node_id_value_idx on
archive_document_index (cost=0.57..36.13 rows=14 width=33)
Index Cond: ((node_id = 29) AND ((value)::text =
'BSH70002152'::text))
(2 rows)

I believe that this is what Tom is suggesting:

SELECT *
FROM ARCHIVE_DOCUMENT AS ad
WHERE ad.NODE_ID = 29
AND ad.ARCHIVE_DATE >= '2013-07-08 18:28:00'
AND EXISTS (SELECT *
FROM ARCHIVE_DOCUMENT_INDEX AS adi
WHERE adi.ARCHIVE_ID = ad.ID
AND adi.NODE_ID = ad.NODE_ID
AND (adi.VALUE = 'BSH70002152' OR adi.VALUE = 'TC212592')
)
ORDER BY ad.ARCHIVE_DATE DESC
LIMIT 10;

Forgive my changing of the aliases. I found names like r0 difficult to
interpret easily.

This is a much simpler query since it only requires one subselect, and
I believe Tom is suggesting that this query may be able to make use of
the index or at least find some more efficient plan. How does this
perform?

Indeed it does perform very much better. With the modified query the
plan and the execution time are excellent.

However, my previous question remains - in the original query plan,
there are two hash tables being populated using seqscan + filter. During
each seqscan, over 95 million records were wastefully read. I think that
this could have been dramatically improved by using an index scan. Am I
mistaken? Is it a optimisation that could be implemented but has not
been implemented yet? Is it a bug?

Thanks,
Jesse