partitioning question

Started by Alex Samadover 8 years ago7 messagesgeneral
Jump to latest
#1Alex Samad
alex@samad.com.au

Hi

I was about to partition a large (?) approx 3T of data 2B rows into
partition tables but broken up into YYYYMM ...

Now I have been reading about limiting the number of partitions otherwise
it could slow down the parser.

My reasoning for limiting to YYYYMM was that most of the request would be
monthly based.

Should I be making the partitioning based on YYYY instead and have lots
more indexs.

If I have an index on the timestamp field will it help limiting to YYYYMM ?

A

#2Justin Pryzby
pryzby@telsasoft.com
In reply to: Alex Samad (#1)
Re: partitioning question

On Mon, Jul 31, 2017 at 09:15:29AM +1000, Alex Samad wrote:

Hi

I was about to partition a large (?) approx 3T of data 2B rows into
partition tables but broken up into YYYYMM ...

Now I have been reading about limiting the number of partitions otherwise
it could slow down the parser.

My reasoning for limiting to YYYYMM was that most of the request would be
monthly based.

Should I be making the partitioning based on YYYY instead and have lots
more indexs.

If I have an index on the timestamp field will it help limiting to YYYYMM ?

The major advantages of partitions are enumerated here:
https://www.postgresql.org/docs/10/static/ddl-partitioning.html#ddl-partitioning-overview

For your case, it might be that seq scans of an entire "monthly" partition turn
out to be very advantageous, compared with index scan (or seq scan of entire
3TB data).

Also DROPing the oldest partition every month is commonly very much more
efficient than DELETEing it..

There are warnings like these:

|All constraints on all partitions of the master table are examined during
|constraint exclusion, so large numbers of partitions are likely to increase
|query planning time considerably. Partitioning using these techniques will work
|well with up to perhaps a hundred partitions; don't try to use many thousands
|of partitions.

Unless you have 100s of years of data I don't think it would be a problem.

For us, having hundreds of partitions hasn't been an issue (planning time is
insignificant for our analytic report queries). But there's an overhead to
partitions and at some point the cost becomes significant. (Actually, I think
one cost which *did* hit us, while experimenting with *daily* partition
granularity of every table, was probably due to very large pg_statistics and
pg_attributes tables, which no longer fit in buffer cache).

Justin

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Alex Samad
alex@samad.com.au
In reply to: Justin Pryzby (#2)
Re: partitioning question

Hi

I note that you link to P10 and I am currently looking at 9.6. The changes
do look nice for partitioning for p10.

Interesting your suggest that the YYYYMM parition isn't that bad.

I will add currently we don't delete anything, we will keep adding to it.

Also I am thinking my insert trigger becomes a lot smaller and easier if I
leave it at yearly.

Also thinking if P10 was the current recommended version right now I would
probably look at YYYYMM because it looks like it makes partitioning easier

Alex

On 31 July 2017 at 09:54, Justin Pryzby <pryzby@telsasoft.com> wrote:

Show quoted text

On Mon, Jul 31, 2017 at 09:15:29AM +1000, Alex Samad wrote:

Hi

I was about to partition a large (?) approx 3T of data 2B rows into
partition tables but broken up into YYYYMM ...

Now I have been reading about limiting the number of partitions otherwise
it could slow down the parser.

My reasoning for limiting to YYYYMM was that most of the request would be
monthly based.

Should I be making the partitioning based on YYYY instead and have lots
more indexs.

If I have an index on the timestamp field will it help limiting to

YYYYMM ?

The major advantages of partitions are enumerated here:
https://www.postgresql.org/docs/10/static/ddl-partitioning.html#ddl-
partitioning-overview

For your case, it might be that seq scans of an entire "monthly" partition
turn
out to be very advantageous, compared with index scan (or seq scan of
entire
3TB data).

Also DROPing the oldest partition every month is commonly very much more
efficient than DELETEing it..

There are warnings like these:

|All constraints on all partitions of the master table are examined during
|constraint exclusion, so large numbers of partitions are likely to
increase
|query planning time considerably. Partitioning using these techniques
will work
|well with up to perhaps a hundred partitions; don't try to use many
thousands
|of partitions.

Unless you have 100s of years of data I don't think it would be a problem.

For us, having hundreds of partitions hasn't been an issue (planning time
is
insignificant for our analytic report queries). But there's an overhead to
partitions and at some point the cost becomes significant. (Actually, I
think
one cost which *did* hit us, while experimenting with *daily* partition
granularity of every table, was probably due to very large pg_statistics
and
pg_attributes tables, which no longer fit in buffer cache).

Justin

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Justin Pryzby
pryzby@telsasoft.com
In reply to: Alex Samad (#3)
Re: partitioning question

On Mon, Jul 31, 2017 at 10:25:54AM +1000, Alex Samad wrote:

I note that you link to P10 and I am currently looking at 9.6. The changes
do look nice for partitioning for p10.

Yes sorry, pg10 is beta - avoid using it except for testing purposes.

I will add currently we don't delete anything, we will keep adding to it.

Also I am thinking my insert trigger becomes a lot smaller and easier if I
leave it at yearly.

Note: the trigger function can either be a static function updated monthly (to
handle the next month), preferably with the most recent months tested first (so
a typical newly-inserted rows only goes through one if/case test).

Alternately, the trigger function can dynamically compute the table into which
to insert using plpgsql "format()" similar to here:
https://www.postgresql.org/docs/9.1/static/plpgsql-statements.html#PLPGSQL-QUOTE-LITERAL-EXAMPLE

Justin

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#5Alex Samad
alex@samad.com.au
In reply to: Justin Pryzby (#4)
Re: partitioning question

How expensive is dynamic over static. I'm looking at storing yearly now,
so I figure if my if then clause has the latest year at the top it should
be very quick.

On 31 July 2017 at 11:07, Justin Pryzby <pryzby@telsasoft.com> wrote:

Show quoted text

On Mon, Jul 31, 2017 at 10:25:54AM +1000, Alex Samad wrote:

I note that you link to P10 and I am currently looking at 9.6. The

changes

do look nice for partitioning for p10.

Yes sorry, pg10 is beta - avoid using it except for testing purposes.

I will add currently we don't delete anything, we will keep adding to it.

Also I am thinking my insert trigger becomes a lot smaller and easier if

I

leave it at yearly.

Note: the trigger function can either be a static function updated monthly
(to
handle the next month), preferably with the most recent months tested
first (so
a typical newly-inserted rows only goes through one if/case test).

Alternately, the trigger function can dynamically compute the table into
which
to insert using plpgsql "format()" similar to here:
https://www.postgresql.org/docs/9.1/static/plpgsql-
statements.html#PLPGSQL-QUOTE-LITERAL-EXAMPLE

Justin

#6Scott Marlowe
scott.marlowe@gmail.com
In reply to: Alex Samad (#5)
Re: partitioning question

On Sun, Jul 30, 2017 at 7:13 PM, Alex Samad <alex@samad.com.au> wrote:

How expensive is dynamic over static. I'm looking at storing yearly now, so
I figure if my if then clause has the latest year at the top it should be
very quick.

Assuming you're not doing anything particularly crazy it's minimal.
But what is good performance for one application may not be acceptable
for others. Generally the cost of inserting is MUCH higher than the
cost of dynamically setting the target, esp if you stick to plpgsql
and don't try to use rules to accomplish it.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#7Alex Samad
alex@samad.com.au
In reply to: Scott Marlowe (#6)
Re: partitioning question

The comparison would be

a if then else end if .. about 8 of them 2013-> and a static insert into

v's

making a dynamic string and using execute, my presumption would be the
execute would be expensive verses a INSERT command

A

On 1 August 2017 at 07:04, Scott Marlowe <scott.marlowe@gmail.com> wrote:

Show quoted text

On Sun, Jul 30, 2017 at 7:13 PM, Alex Samad <alex@samad.com.au> wrote:

How expensive is dynamic over static. I'm looking at storing yearly

now, so

I figure if my if then clause has the latest year at the top it should be
very quick.

Assuming you're not doing anything particularly crazy it's minimal.
But what is good performance for one application may not be acceptable
for others. Generally the cost of inserting is MUCH higher than the
cost of dynamically setting the target, esp if you stick to plpgsql
and don't try to use rules to accomplish it.