Partitioning with overlapping and non-overlapping constraints

Started by Tim Uckun, about 11 years ago · 9 messages · pgsql-general
#1 Tim Uckun
timuckun@gmail.com

I have two partitioning questions I am hoping somebody can help me with.

I have a fairly busy metric(ish) table. It gets a few million records per
day; the data is transactional for a while but then settles down and is
used for analytical purposes later.

When a metric is reported, both the UTC time and the local time are stored
along with the other data belonging to the metric. I want to partition
this table to both make it faster to query and also to spread out the
writes. Ideally the partitions would be based on the UTC timestamp and the
sending location. For example:

metrics_location_XXXXX_2015_01_01

The first problem with this approach is that there could be tens of thousands
of locations, so this is going to result in hundreds of thousands of tables.
I know there is no upper limit on how many tables there can be, but I am
thinking this might really get me into trouble later.

The second and possibly more vexing problem is that the local time is often
queried. Ideally I would like to put three constraints on the child
tables: location id, UTC timestamp, and local time. But obviously the
local timestamps would overlap with other locations in the same timezone,
and even if I were to partition only by UTC, the local timestamps would
overlap between tables.

So the questions, in a nutshell, are:

1. Should I be worried about having possibly hundreds of thousands of
shards?
2. Is PG smart enough to handle overlapping constraints on tables and limit
its querying to only those tables that have the correct time constraint?

Thanks.

#2 David G. Johnston
david.g.johnston@gmail.com
In reply to: Tim Uckun (#1)
Re: Partitioning with overlapping and non-overlapping constraints

Tim Uckun wrote:

> 1. Should I be worried about having possibly hundreds of thousands of
> shards?

IIRC, yes.

> 2. Is PG smart enough to handle overlapping constraints on tables and limit
> its querying to only those tables that have the correct time constraint?

Probably yes, but it seems easy enough to verify.

All constraints are checked for each partition, and if any returns false the
entire partition will be excluded; which means multiple partitions can be
included.

Note, this is a large reason why #1 poses a problem.
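One way to verify the "easy enough to verify" part is to build a tiny inheritance setup and EXPLAIN a query against it. The table and column names below are illustrative assumptions, not Tim's actual schema:

```sql
-- Hypothetical two-partition setup to check constraint exclusion.
CREATE TABLE metrics (
    location_id bigint,
    utc_time    timestamptz,
    local_time  timestamp,
    value       numeric
);
CREATE TABLE metrics_2015_01 (
    CHECK (utc_time >= '2015-01-01' AND utc_time < '2015-02-01')
) INHERITS (metrics);
CREATE TABLE metrics_2015_02 (
    CHECK (utc_time >= '2015-02-01' AND utc_time < '2015-03-01')
) INHERITS (metrics);

-- 'partition' is the default, shown here for clarity.
SET constraint_exclusion = partition;

-- The plan should scan only metrics_2015_01 (plus the empty parent).
EXPLAIN SELECT * FROM metrics WHERE utc_time = '2015-01-15';
```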

David J.

--
View this message in context: http://postgresql.nabble.com/Partioning-with-overlapping-and-non-overlapping-constraints-tp5836869p5836871.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3 Melvin Davidson
melvin6925@gmail.com
In reply to: David G. Johnston (#2)
Re: Partitioning with overlapping and non-overlapping constraints

Perhaps I do not fully understand, but would it not be simpler
to just rearrange the key (and partition) by date & location?
EG: 2015_01_01_metrics_location_XXXXX

That way you would only have 365 partitions per year at most. But you
also have the option to break it down by week, month, or year.

EG:

EXTRACT(YEAR FROM utc_time) = 2015 AND
EXTRACT(WEEK FROM utc_time) = 1

or
EXTRACT(YEAR FROM utc_time) = 2015 AND
EXTRACT(MONTH FROM utc_time) = 1

or just
EXTRACT(YEAR FROM utc_time) = 2015
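A hedged sketch of what one such child table could look like (table and column names are assumptions). One caveat worth noting: if `utc_time` is a `timestamptz`, `EXTRACT` depends on the session TimeZone and the planner will not use it for constraint exclusion, so a plain range CHECK is usually the safer form:

```sql
-- Weekly child table using an EXTRACT-based CHECK (illustrative only;
-- works for plain timestamp columns, not excludable for timestamptz).
CREATE TABLE metrics_2015_w01 (
    CHECK (EXTRACT(YEAR FROM utc_time) = 2015 AND
           EXTRACT(WEEK FROM utc_time) = 1)
) INHERITS (metrics);

-- Range-based equivalent (approximate ISO week boundary), which the
-- planner can exclude against ordinary
-- "utc_time >= ... AND utc_time < ..." predicates:
CREATE TABLE metrics_2015_w01_range (
    CHECK (utc_time >= '2015-01-01' AND utc_time < '2015-01-08')
) INHERITS (metrics);
```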


--
*Melvin Davidson*
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.

#4 Tim Uckun
timuckun@gmail.com
In reply to: Melvin Davidson (#3)
Re: Partitioning with overlapping and non-overlapping constraints

Partitioning by day would result in fewer partitions, but of course it would
create a "hot" table where all the writes go.

Actually, I have thought of an alternative and I'd be interested in your
opinion of it.

I leave the metrics table alone; the current code continues to read and
write from metrics. Every night I create a table named
metrics_YYYYMMDD which inherits from metrics and move data (using the
"ONLY" clause in the delete) into that table, then set a constraint on
that table for that day. I also adjust the constraint on the metrics table
itself, which basically says "where timestamp > YYYYMMDD".

This way there is no trigger on the parent table to slow down the inserts,
and I still have partitions that will speed up read queries. I realize
that moving large amounts of data is going to be painful, but perhaps I can
do it in chunks.
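The nightly move described above could be sketched roughly like this (all names, dates, and columns are assumptions; the writable-CTE form needs PostgreSQL 9.1 or later):

```sql
-- Hypothetical nightly job: move yesterday's rows out of the hot parent
-- into a dated child partition.
BEGIN;

CREATE TABLE metrics_20150101 (
    CHECK (utc_time >= '2015-01-01' AND utc_time < '2015-01-02')
) INHERITS (metrics);

-- DELETE FROM ONLY touches just the parent, leaving existing children
-- alone; RETURNING feeds the moved rows straight into the child.
WITH moved AS (
    DELETE FROM ONLY metrics
    WHERE utc_time >= '2015-01-01' AND utc_time < '2015-01-02'
    RETURNING *
)
INSERT INTO metrics_20150101 SELECT * FROM moved;

COMMIT;
```

Doing this in one transaction means readers never see the rows missing from both tables, though a large move will hold locks for its duration, which is where Tim's "in chunks" idea comes in.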


#5 Marc Mamin
M.Mamin@intershop.de
In reply to: Tim Uckun (#1)
Re: Partitioning with overlapping and non-overlapping constraints

> I have two partitioning questions I am hoping somebody can help me with.
>
> I have a fairly busy metric(ish) table. It gets a few million records per day; the data is transactional for a while but then settles down and is used for analytical purposes later.
>
> When a metric is reported, both the UTC time and the local time are stored along with the other data belonging to the metric.

Don't you have duplicate information within your UTC, location and local_time data?
Maybe you can just attach a timezone to each location...

> I want to partition this table to both make it faster to query and also to spread out the writes. Ideally the partitions would be based on the UTC timestamp and the sending location. For example:
>
> metrics_location_XXXXX_2015_01_01
>
> The first problem with this approach is that there could be tens of thousands of locations, so this is going to result in hundreds of thousands of tables. I know there is no upper limit on how many tables there can be, but I am thinking this might really get me into trouble later.

With only a few million rows per day, weekly or even monthly partitions without regard to location should be sufficient for older data.
It should be possible to partition your hot data differently. Instead of using one partition per location, you may use a hash/modulo approach to keep the number of partitions at a reasonable count, if partitioning is required at all (this can be helpful: https://github.com/markokr/pghashlib). Here I would avoid including time information except for the boundary between old and hot tables. Depending on the pattern and performance requirements of your analytic queries, this may be sufficient (i.e. don't partition on time at all).
With smaller partitions for hot data, it should be quite fast to move them one by one to the old data. I have no experience with the trigger-based partitioning of Postgres (we handle partitions logically at the application level), so I'm not sure how difficult this approach is. I suppose that you'll need a function that moves data from hot to old partitions and fixes the triggers accordingly.
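The hash/modulo idea could look something like the following (bucket count, table names, and columns are assumptions; a simple modulo stands in for pghashlib's hash functions):

```sql
-- Sketch of modulo-based hot partitions: every location maps to one of a
-- fixed, small number of buckets, regardless of how many locations exist.
CREATE TABLE metrics_hot_0 (CHECK (location_id % 8 = 0)) INHERITS (metrics);
CREATE TABLE metrics_hot_1 (CHECK (location_id % 8 = 1)) INHERITS (metrics);
-- ... metrics_hot_2 through metrics_hot_7 ...

-- Constraint exclusion can only kick in if the query repeats the same
-- expression the CHECK uses, so the bucket predicate must be spelled out:
EXPLAIN SELECT * FROM metrics
WHERE location_id = 42 AND location_id % 8 = 42 % 8;
```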

> The second and possibly more vexing problem is that the local time is often queried. Ideally I would like to put three constraints on the child tables: location id, UTC timestamp, and local time. But obviously the local timestamps would overlap with other locations in the same timezone, and even if I were to partition only by UTC, the local timestamps would overlap between tables.
>
> So the questions, in a nutshell, are:
>
> 1. Should I be worried about having possibly hundreds of thousands of shards?
> 2. Is PG smart enough to handle overlapping constraints on tables and limit its querying to only those tables that have the correct time constraint?

If you partition on the UTC time only, you don't have overlapping constraints. When querying on the local time, the planner will consider all partitions, but an additional index or constraint on this column should be sufficient as long as your partition count remains small.
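A minimal sketch of that arrangement, assuming UTC-bounded children named as in the earlier examples: the partitions exclude on `utc_time`, while a per-child index keeps local-time scans cheap.

```sql
-- Partitions are bounded on utc_time only; local_time merely gets an
-- index in each child, so a local-time query scans every partition but
-- each scan is an index scan rather than a seq scan.
CREATE INDEX metrics_2015_01_local_time_idx ON metrics_2015_01 (local_time);
CREATE INDEX metrics_2015_02_local_time_idx ON metrics_2015_02 (local_time);
```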

regards,
Marc Mamin

#6 Melvin Davidson
melvin6925@gmail.com
In reply to: Marc Mamin (#5)
Re: Partitioning with overlapping and non-overlapping constraints

Well, without knowing too much about your application, it certainly sounds
like using metrics_YYYYMMDD is the way to go. As for modifying the
constraint daily, couldn't you just use

where timestamp > current_date - interval '1 day'

?


#7 Brent Wood
Brent.Wood@niwa.co.nz
In reply to: Melvin Davidson (#6)
Re: Partitioning with overlapping and non-overlapping constraints

Hi Marc,

We have a somewhat similar situation: years of continuous data managed in PostGIS. The tables are partitioned into annual subsets. The previous (static) years' underlying tables have a clustered index on the UTC timestamp; the current-year table has a conventional index. It works well, with 20-odd partitions to date.

An annual basis for partitions may not be ideal in your case, but you have not specified how long it takes for your data to become fixed, or whether there is a way the database can tell that records are now static. If there is, a scheduled task which migrates such records from a live to a fixed partition would perhaps be appropriate.

Organising your data by UTC timestamp may be the simplest approach for you.

Cheers

Brent Wood

Programme leader: Environmental Information Delivery
NIWA
DDI: +64 (4) 3860529

#8 Tim Uckun
timuckun@gmail.com
In reply to: Marc Mamin (#5)
Re: Partitioning with overlapping and non-overlapping constraints

> Don't you have duplicate information within your UTC, location and
> local_time data? Maybe you can just attach a timezone to each location...

Yes, there is duplicate information, but dealing with time zones is a PITA,
and the easiest way to solve the myriad problems I have is to store the
local time as it was at that instant.
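One common shape for that denormalized design could look like this (all names here are assumptions, not Tim's actual schema):

```sql
-- Keep the unambiguous instant as timestamptz and the wall-clock reading
-- as a plain timestamp, both written at insert time.
CREATE TABLE metrics_example (
    location_id bigint      NOT NULL,
    utc_time    timestamptz NOT NULL,  -- the instant, normalized to UTC
    local_time  timestamp   NOT NULL,  -- wall-clock time at the location
    value       numeric
);

-- Marc's alternative: store a timezone name per location and derive the
-- local time on the fly, at the cost of a join and timezone conversion
-- in every local-time query, e.g.:
--   SELECT m.utc_time AT TIME ZONE l.tz AS local_time
--   FROM metrics_example m JOIN locations l USING (location_id);
```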

> With smaller partitions for hot data, it should be quite fast to move them
> one by one to the old data. I have no experience with the trigger-based
> partitioning of Postgres (we handle partitions logically at the application
> level), so I'm not sure how difficult this approach is. I suppose that
> you'll need a function that moves data from hot to old partitions and
> fixes the triggers accordingly.

This I think would be a good approach, but it does involve a heavy rewrite
of the app.


#9 Tim Uckun
timuckun@gmail.com
In reply to: Melvin Davidson (#6)
Re: Partitioning with overlapping and non-overlapping constraints

This approach wouldn't work, I think. The data keeps growing in the "hot"
table.
