How effectively does indexing work in Postgres in such cases?
Hello Experts,
We have a requirement in which queries will be formed like the one below: two
partitioned tables joined together, with filters that may apply to columns of
both tables or to only one of them.
These queries are run very frequently and are critical to customers, as they
back search screens, so we want the indexing to be effective enough that such
queries return rows in no more than ~1 second. Both tables are range
partitioned by day on the column "part_date", and the volume of data per
day/partition will be ~700 million rows in each table.
The customer can search over a duration ranging from one day up to a maximum
of ~1 month, i.e. part_date spanning up to ~1 month. The search should show
the latest transactions first, which is why the ORDER BY .. LIMIT clause is
used. OFFSET is used because the customer can scroll to the next page in the
UI, where he can see the next 100 rows, and so on. On the first screen the
count of the results is also expected, so that the customer gets an immediate
idea of the total number of transactions matching his search criteria.
So,
1) In the query below, if the optimizer chooses tab1 as the driving table, is
an index on just col1 enough, or should it be on (col1, tab1_id)? Similarly,
if it chooses tab2 as the driving table, should the index be on
(col2, tab2_id)? Or is indexing just the filtered columns enough, i.e.
individual indexes on col1 of tab1 and col2 of tab2 respectively?
2) In scenarios where the customer has a lot of matching transactions (say in
the millions) after all filters are applied, and the customer only needs to
see the latest 100 rows, the sorting will be a bottleneck. What can be done so
that such queries return the latest transactions quickly on the search screen?
3) Since the count also has to happen in the first step so the customer knows
the total number of rows (which may be in the millions), what additional index
will support this requirement?
Or is there any other optimization strategy we can follow for these queries?
select * from tab1, tab2
where tab1.part_date between '1-jan-2024' and '31-jan-2024'
and tab1.part_date=tab2.part_date
and tab1.tab1_id=tab2.tab2_id
and tab1.col1=<:input_col1>
and tab2.col2=<:input_col2>
order by tab1.create_timestamp desc
limit 100 offset 100;
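For illustration, a simplified sketch of the table layout described above (the
real tables have more columns; the types here are only representative):

create table tab1 (
    tab1_id          bigint,
    part_date        date,
    col1             text,
    create_timestamp timestamptz
) partition by range (part_date);

create table tab1_20240101 partition of tab1
    for values from ('2024-01-01') to ('2024-01-02');
-- ... one partition per day ...

create table tab2 (
    tab2_id   bigint,
    part_date date,
    col2      text
) partition by range (part_date);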
Regards
Sud
Your questions are a little too vague to answer well, but let me try a bit.
1) In the query below, if the optimizer chooses tab1 as the driving table, is
an index on just col1 enough, or should it be on (col1, tab1_id)?
No way to tell without trying it yourself. We need information on how the
tables are joined, the cardinality, general distribution, etc. But as a
rough general rule, yes, indexes on the column of interest should be able
to handle the job well by themselves.
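For example, you could build each variant and compare them yourself (the index
names here are just placeholders):

-- individual indexes on the filter columns only
create index tab1_col1_idx on tab1 (col1);
create index tab2_col2_idx on tab2 (col2);

-- composite alternatives that also cover the join key
create index tab1_col1_id_idx on tab1 (col1, tab1_id);
create index tab2_col2_id_idx on tab2 (col2, tab2_id);

Load realistic data, run EXPLAIN (ANALYZE, BUFFERS) on the query with each set
of indexes in place, and see which one the planner actually uses; that depends
on the join strategy it picks.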
2) In scenarios where the customer has a lot of matching transactions (say in
the millions) after all filters are applied, and the customer only needs to
see the latest 100 rows, the sorting will be a bottleneck. What can be done so
that such queries return the latest transactions quickly on the search screen?
Sorting can be quick if you hit an index (b-trees are already sorted):
Postgres can look at only the first X rows returned and does not need to read
and sort the whole thing. So a well-designed index is the key here.
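As a sketch (whether the planner can actually use it depends on how the join
is executed, so verify with EXPLAIN on your own data), an index that leads
with the filter column and ends with the sort column lets each partition be
scanned already in create_timestamp order:

create index tab1_col1_ts_idx on tab1 (col1, create_timestamp desc);

Ideally EXPLAIN then shows the index being walked in descending timestamp
order across the January partitions instead of a separate Sort node over
millions of rows.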
3) Since the count also has to happen in the first step so the customer knows
the total number of rows (which may be in the millions), what additional index
will support this requirement?
Again, a very vague question, but for counts in the millions an estimate is
usually sufficient, so you might be able to do something like
SELECT count(*) FROM mytab WHERE mydate BETWEEN x AND y; and use that as
your answer. Compare it to the full query to see how close it is. You might
even have cutoffs, where if the result of that first query is < 10,000, you
switch to a more accurate version that has more filtering (i.e. the joins
and more where conditions).
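A rough sketch of that two-step idea (the 10,000 cutoff and the date literals
are just placeholders):

-- step 1: cheap count over the date range only, shown as the approximate total
select count(*) from tab1
where part_date between '2024-01-01' and '2024-01-31';

-- step 2: only if step 1 comes back under the cutoff, run the exact count
-- with the joins and the remaining filters
select count(*) from tab1, tab2
where tab1.part_date between '2024-01-01' and '2024-01-31'
  and tab1.part_date = tab2.part_date
  and tab1.tab1_id = tab2.tab2_id
  and tab1.col1 = <:input_col1>
  and tab2.col2 = <:input_col2>;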
Or is there any other optimization strategy we can follow for these queries?
select * from tab1, tab2
where tab1.part_date between '1-jan-2024' and '31-jan-2024'
and tab1.part_date=tab2.part_date
and tab1.tab1_id=tab2.tab2_id
and tab1.col1=<:input_col1>
and tab2.col2=<:input_col2>
order by tab1.create_timestamp desc
limit 100 offset 100;
It probably would help to see the exact tables and queries. Why are you
joining on part_date? Is tab1_id / tab2_id unique in either table? Running
EXPLAIN on these while you try out indexes, change the joins, etc. is a great
exercise to help you learn how Postgres works. As far as asking on lists for
help with specific queries goes, there is a range between totally abstract,
generic queries that nobody can help you with, and large, complex, specific
queries that nobody wants to unravel and help you with. You are definitely on
the former side: try to create some actually runnable sample queries that are
small, self-contained, and reproduce the problem you are trying to solve.
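For example, a tiny self-contained reproducer along these lines (all names,
sizes, and values are invented) is something people on the list can actually
run and reason about:

-- one partitioned table, one partition, some generated data
create table t1 (id int, part_date date, col1 int, create_timestamp timestamptz)
    partition by range (part_date);
create table t1_jan partition of t1
    for values from ('2024-01-01') to ('2024-02-01');
insert into t1
    select g, '2024-01-01'::date + (g % 31), g % 100,
           now() - (g || ' seconds')::interval
    from generate_series(1, 100000) g;
create index on t1 (col1, create_timestamp desc);
explain (analyze, buffers)
    select * from t1 where col1 = 42
    order by create_timestamp desc limit 100;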
Cheers,
Greg