Proposition for better performance
Hi,
I'm trying to create an availability table for existing products, and I'm
looking for a suitable schema design that gives good performance when I
fetch products from a table containing millions of rows.
I'm thinking of a schema like this:

create table availability (product_id integer, product_name varchar(255),
start_date date, end_date date);
E.g., given a product 'product-1' which has product_id = 1 and is available
from 27-03-2018 to 31-03-2018:

product_id | product_name | start_date | end_date
-----------+--------------+------------+-----------
         1 | product-1    | 27-03-2018 | 31-03-2018
However, if the product is not available on 29-03-2018, I have to divide the
period into 2 intervals:

product_id | product_name | start_date | end_date
-----------+--------------+------------+-----------
         1 | product-1    | 27-03-2018 | 28-03-2018
         1 | product-1    | 30-03-2018 | 31-03-2018

With a table containing millions of products, if I divide the availability
into intervals I will get a huge number of rows, and fetching data will
degrade performance. Is there any solution, such as range types or anything
else, that handles the unavailable dates? Based on the example above, can I
use a data type like a range type that takes the start_date and end_date and
excludes the unavailable dates?
On 03/27/2018 07:42 AM, hmidi slim wrote:
Hi,
I'm trying to create an availability table for existing products, and I'm
looking for a suitable schema design that gives good performance when I
fetch products from a table containing millions of rows.
I'm thinking of a schema like this:

create table availability (product_id integer, product_name varchar(255),
start_date date, end_date date);
I would use a tstzrange (or daterange) instead of separate start_date
and end_date columns. Then you can create an exclusion constraint that
has `EXCLUDE USING gist (id WITH =, available_during WITH &&)`. That
will automatically add a GiST index on those columns that should help
with faster lookups. (It will also prevent contradictions where a
product has two overlapping rows.)
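For example, a minimal sketch of that (I'm using the OP's product_id column
where the snippet above says id; note that btree_gist is needed for plain
integer equality inside a GiST exclusion constraint):

create extension if not exists btree_gist;  -- needed for integer = in GiST

create table availability (
    product_id       integer not null,
    product_name     varchar(255),
    available_during daterange not null,
    -- this also creates the GiST index used for range lookups
    exclude using gist (product_id with =, available_during with &&)
);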
You didn't mention the queries you want to be fast, but that index
should cover the reasonable ones, I think.
Is there any solution, such as range types or anything else, that handles
the unavailable dates?
For any product, there should be the same number of unavailable periods as
available ones, right (or often one more)? So I don't see any performance
benefit in doing it that way, and it certainly seems less intuitive to
store when something is *not* available.
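If you ever do need the gaps, they can be derived from the stored
availability rows instead. A sketch (assuming the available_during column
from above and non-overlapping rows per product):

select product_id,
       upper(available_during) as unavailable_from,
       next_start              as unavailable_to
from (
    select product_id,
           available_during,
           -- start of this product's next availability period, if any
           lead(lower(available_during)) over (
               partition by product_id
               order by lower(available_during)
           ) as next_start
    from availability
) s
where next_start is not null
  and next_start > upper(available_during);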
--
Paul ~{:-)
pj@illuminatedcomputing.com
The query that I used to fetch products was:
select * from availability
where ('27-03-2018' between start_date and end_date)
and ('31-03-2018' between start_date and end_date);
I added another column named during, of type daterange, and I created a gist
index:

create index idx on availability(during);

select * from availability
where during @> daterange('27-03-2018', '31-03-2018');
With a table of 15M rows I got an execution time of 1 minute. That's why I
asked whether it is a good solution to divide the availability of a product
into intervals.
If I assume that a product has 5 availability intervals every month, then in
12 months we will have 60 intervals. For 1M products that would be 60M rows,
which will affect performance.
On 03/27/2018 10:04 AM, hmidi slim wrote:
The query that I used to fetch products was:
select * from availability
where ('27-03-2018' between start_date and end_date)
and ('31-03-2018' between start_date and end_date);

I added another column named during, of type daterange, and I created a
gist index:
create index idx on availability(during);
That creates a b-tree index.
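For a range column you would want the GiST form instead, e.g.:

create index idx on availability using gist (during);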
Also, what did you set `during` to?
select * from availability
where during @> daterange('27-03-2018', '31-03-2018');

With a table of 15M rows I got an execution time of 1 minute.
What was the query plan? Did you confirm that it used the index?
--
Paul ~{:-)
pj@illuminatedcomputing.com
I updated the example:

create table product_availabilities(product_id integer, start_date date,
end_date date);

insert into product_availabilities(product_id, start_date, end_date)
select a, '2018-01-01', '2018-05-03'
from generate_series(1,15000000) as a;

alter table product_availabilities add column during daterange;

update product_availabilities set during = daterange(start_date, end_date);

CREATE INDEX idx_time ON product_availabilities USING gist(during);

explain analyze SELECT product_id, start_date, end_date, during
FROM product_availabilities
WHERE during @> daterange('2018-02-01', '2018-04-01');
Query plan:

Bitmap Heap Scan on product_availabilities  (cost=33728.79..236086.04 rows=878500 width=26) (actual time=2775.058..5792.842 rows=15000000 loops=1)
  Recheck Cond: (during @> '[2018-02-01,2018-04-01)'::daterange)
  Heap Blocks: exact=31040 lossy=79255
  ->  Bitmap Index Scan on idx_time  (cost=0.00..33509.17 rows=878500 width=0) (actual time=2767.262..2767.262 rows=15000000 loops=1)
        Index Cond: (during @> '[2018-02-01,2018-04-01)'::daterange)
Planning time: 0.063 ms
Execution time: 6408.456 ms

And running the plain query:

SELECT product_id, start_date, end_date, during
FROM product_availabilities
WHERE during @> daterange('2018-02-01', '2018-04-01');

returns a query runtime of 2 min.
(Including the list....)
On 03/27/2018 10:49 AM, hmidi slim wrote:
Sorry, I didn't copy it very well:
create index idx on availability using gist(during);
and during = daterange(start_date, end_date)
And the query plan used was seq scan not index scan.
It sounds like there must be some important information missing. Here is
my attempt to mock up some fake data:
insert into availability
select p.id,
       concat('prod ', p.id::text),
       daterange((now() + concat(r.v, ' days')::interval)::date,
                 (now() + concat((r.v + 1 + random() * 21)::int, ' days')::interval)::date)
from (select * from generate_series(1, 1000000)) p(id)
cross join (select * from generate_series(1, 15)) n(i)
join lateral (
    select p2.id, n2.i, (random() * 600 - 300)::int AS v
    from generate_series(1, 1000000) p2(id),
         generate_series(1, 15) n2(i)
) r
on r.id = p.id and r.i = n.i
;
Then
explain select * from availability where during @>
daterange('2018-03-27', '2018-03-31');
                                  QUERY PLAN
-----------------------------------------------------------------------------
 Index Scan using idx on availability  (cost=0.39..1644.41 rows=1 width=552)
   Index Cond: (during @> '[2018-03-27,2018-03-31)'::daterange)
(2 rows)
Running that query for real I get:
...
(215044 rows)
Time: 1450.099 ms (00:01.450)
So if the index isn't being used you'll have to do some digging to find
out why.
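One quick check (a debugging sketch only, not a setting to keep on):
discourage seq scans for the session and re-run EXPLAIN, to see whether the
planner is able to use the index at all:

set enable_seqscan = off;  -- planner now avoids seq scans when possible
explain analyze
select * from availability
where during @> daterange('2018-03-27', '2018-03-31');
reset enable_seqscan;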
--
Paul ~{:-)
pj@illuminatedcomputing.com
On 03/27/2018 11:14 AM, hmidi slim wrote:
Query plan:
Bitmap Heap Scan on product_availabilities (cost=33728.79..236086.04
rows=878500 width=26) (actual time=2775.058..5792.842 rows=15000000 loops=1)
Recheck Cond: (during @> '[2018-02-01,2018-04-01)'::daterange)
Heap Blocks: exact=31040 lossy=79255
-> Bitmap Index Scan on idx_time (cost=0.00..33509.17 rows=878500
width=0) (actual time=2767.262..2767.262 rows=15000000 loops=1)
Index Cond: (during @> '[2018-02-01,2018-04-01)'::daterange)
Planning time: 0.063 ms
Execution time: 6408.456 ms

SELECT product_id, start_date, end_date, during
FROM product_availabilities
WHERE during @> daterange('2018-02-01', '2018-04-01')

returns query runtime: 2 min
Is it 6 seconds or 2 minutes? Where is the 2-minute number coming from?
Are you sure that's all Postgres? With 878k rows even sending them over
the network is going to take a while, and then more time to JSONify them
or whatever else you need to do.
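One way to separate the two (a sketch; EXPLAIN ANALYZE executes the query
but never ships the result set to the client, while psql's \timing shows
the full round trip):

-- server-side execution only, no rows sent to the client:
explain analyze
select product_id, start_date, end_date, during
from product_availabilities
where during @> daterange('2018-02-01', '2018-04-01');

-- in psql: full round trip including transfer
\timing on
select product_id, start_date, end_date, during
from product_availabilities
where during @> daterange('2018-02-01', '2018-04-01');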
--
Paul ~{:-)
pj@illuminatedcomputing.com
This is the message that I got:
Successfully run. Total query runtime: 2 min.
15000000 rows affected.
On 03/27/2018 11:42 AM, hmidi slim wrote:
This is the message that I got:
Successfully run. Total query runtime: 2 min.
15000000 rows affected.
Sorry, I don't think this is enough information to suggest anything.
--
Paul ~{:-)
pj@illuminatedcomputing.com
On Mar 27, 2018, at 4:25 PM, Paul Jungwirth <pj@illuminatedcomputing.com> wrote:
On 03/27/2018 11:42 AM, hmidi slim wrote:
This is the message that I got:
Successfully run. Total query runtime: 2 min.
15000000 rows affected.

Sorry, I don't think this is enough information to suggest anything.
--
Paul ~{:-)
pj@illuminatedcomputing.com
I took the case presented and ran it in a sandbox:
UPDATE 15000000
explain analyze SELECT product_id, start_date, end_date, during
FROM product_availabilities
WHERE during @> daterange('2018-02-01', '2018-04-01')
Bitmap Heap Scan on product_availabilities  (cost=1156.67..46856.67 rows=75000 width=44) (actual time=5413.792..11367.379 rows=15000000 loops=1)
Recheck Cond: (during @> '[2018-02-01,2018-04-01)'::daterange)
Heap Blocks: exact=110295
-> Bitmap Index Scan on idx_time (cost=0.00..1137.92 rows=75000 width=0) (actual time=5325.844..5325.844 rows=15000000 loops=1)
Index Cond: (during @> '[2018-02-01,2018-04-01)'::daterange)
Planning time: 0.145 ms
Execution time: 14055.666 ms
(7 rows)
But, considering the update (I did not check for bloat or anything, but still):
vacuum full product_availabilities;
analyze product_availabilities;
The plan changes to
Seq Scan on product_availabilities (cost=0.00..242647.91 rows=15000033 width=26) (actual time=0.034..7207.697 rows=15000000 loops=1)
Filter: (during @> '[2018-02-01,2018-04-01)'::daterange)
Planning time: 6.701 ms
Execution time: 9238.285 ms
And the runtime does get into the two-minute range:
time psql -U csidba -d armandp <f.sql > /dev/null
real 2m39.767s
user 1m45.576s
sys 0m12.324s
Not sure if that confirms the OP's findings, but to me a first question would
be whether it is expected that the execution time reported by explain analyze
is nowhere close to the actual run time.
BTW, was the Postgres version reported? I ran the presented case on 9.5.8.
— Armand