Dynamic Partitioning using Segment Visibility Maps
Happy New Year, everybody.
This proposal follows on from previous thinking about partitioning,
where I've taken up Andrew Sullivan's suggestion to re-examine the
current partitioning concept of using tables as partitions. So I've come
up with an alternative concept to allow us to discuss the particular
merits of each. ISTM that this new proposal has considerable potential.
Recap: Very Large Table Use Case
--------------------------------
Many large tables have a regular access pattern:
- new rows inserted frequently
- some updates and deletes on the "recent" data
- older data becomes effectively read-only
- at some point data may be removed in bulk from the table
Tables vary in what "recent" means; some are insert-only, many are not.
A typical example might be a large history table where the last 6 months
data is updated, but after that the data has almost zero change.
Partitioning relies on knowledge of the physical location of data to
exclude sections of data from a scan.
Currently the partitioning knowledge is provided by user declarations in
the form of constraints on tables linked together by views or
inheritance. We also currently do this in "constraint-first" fashion,
i.e. we put the data where the constraints say we should, rather than
deriving the constraints from the data.
In-table Partitioning
---------------------
In the common use-case there is a clear affinity between certain data
columns and the physical placement of data in a table. This is true for
columns such as LOG_DATE or TRANSACTION_DATE, but is also true for
any columns where the id column is assigned by a Sequence. HOT helps
this to remain true over time.
Given the above use case, when there is an affinity between data values
and data placement *and* we know that older data tends to become read
only, we can then realise that certain physical sections of a table will
become effectively read only. (My experience is that the larger the
table, the less random the write pattern - whatever the guys that wrote
TPC-C say).
If we were able to keep track of which sections of a table are now
read-only then we would be able to record information about the data in
that section and use it to help solve queries. This is turning the
current thinking on its head: we could derive the constraints from the
data, rather than placing the data according to the constraints. That
would be much more natural: load data into the table and have the system
work out the rest.
We can imagine doing this within a single table: split a table up into
100 sections, then record the min/max values of all tuples
in those sections. When we perform a SeqScan we could skip over sections
that provably can never satisfy the query predicate. Same thing
as partitioning, but within the table.
Currently tables are already broken up into 1GB file segments, so it
seems fairly natural to pick that as our section size. That fits
reasonably well with recommendations for ideal partition size.
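To make the idea concrete, here is a toy sketch (plain Python, not
PostgreSQL code; the section size and function names are invented for
illustration) of how per-section min/max values let a scan skip sections
that can never satisfy a range predicate:

```python
SECTION_SIZE = 4  # rows per section; the real proposal uses 1GB segments


def build_sections(rows):
    """Record (start, min, max) for each fixed-size section of the table."""
    sections = []
    for i in range(0, len(rows), SECTION_SIZE):
        chunk = rows[i:i + SECTION_SIZE]
        sections.append((i, min(chunk), max(chunk)))
    return sections


def scan(rows, sections, lo, hi):
    """SeqScan that skips sections provably outside [lo, hi]."""
    out = []
    for start, smin, smax in sections:
        if smax < lo or smin > hi:
            continue  # segment exclusion: no row here can qualify
        out.extend(r for r in rows[start:start + SECTION_SIZE]
                   if lo <= r <= hi)
    return out
```

The min/max pairs play the role of the "implicit constraints" described
below: derived from the data after the fact, not declared up front.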
Segment Exclusion
-----------------
After we note that a segment is read-only we can scan the segment and
record min/max values for all columns. These are then "implicit
constraints", which can then be used for segment exclusion in a similar
manner as we do with the current constraint exclusion mechanism.
The implicit constraints can then allow SeqScans to assess segment
exclusion for each segment at execution time, i.e. a scan can skip a
whole segment if constraints don't match. If a segment does not (yet)
have implicit constraints it will always be scanned in full. This allows
partitioning, synch scans and buffer recycling to work much better
together than they currently do.
Other scan types might also use segment exclusion, though this would
only be useful for scans retrieving many rows, otherwise the overhead of
segment exclusion might not be worthwhile.
This would also allow a Merge Join to utilise exclusion for the inner
plan at execution time. Instead of scanning the whole inner table plan
from the start, we would be able to start the scan from the appropriate
segment. This would require us to pass the current value of the outer
plan down into the inner plan. The typical executor nodes on the inner
plan would be a SortNode and below that a SeqScan. In that case the
executor would need to pass the outer value from the MergeJoinNode down
thru the SortNode to the SeqScan node. The SeqScan node could then
perform partition exclusion, reducing the time for that scan and also
reducing the time for the resulting sort. This sounds like it might be
worth doing in normal cases also. It might turn out that the potentially
applicable cases are already excluded during planning, I haven't thought
about that aspect in enough detail yet.
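As a rough illustration of the Merge Join idea (assuming the affinity
property holds, i.e. segments are physically ordered by the join column;
the function name is invented), the inner scan's start point is just a
search over segment boundaries:

```python
def first_useful_segment(segments, outer_value):
    """segments: list of (min, max) per segment, in physical order.

    Given the current value from the outer plan, return the index of the
    first segment the inner SeqScan still needs to visit; segments before
    it can never produce a matching row.
    """
    for i, (_smin, smax) in enumerate(segments):
        if smax >= outer_value:
            return i
    return len(segments)  # no segment can match; scan nothing
```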
If we collect data for all columns then many of our implicit constraints
would be useless. e.g. if a column only has a few values and these are
present in all segments. Matching our predicate against all constraints
would be expensive, so we must weed out poor constraints. We would do
this by removing any constraint that overlapped more than 10% of other
segments. Various heuristics would likely need to be discovered during
development to make this work without resorting to manual commands.
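One plausible shape for that weeding heuristic, sketched in Python (the
10% figure is from the paragraph above; the pairwise-overlap test and
names are illustrative, not a settled design):

```python
def overlap(a, b):
    """True if (min, max) ranges a and b intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def keep_constraint(ranges, idx, threshold=0.10):
    """Keep the implicit constraint for segment idx only if its (min, max)
    range overlaps no more than `threshold` of the other segments.

    A column whose values appear in nearly every segment produces a range
    that overlaps almost everything, so its constraint is discarded.
    """
    others = [r for i, r in enumerate(ranges) if i != idx]
    overlapping = sum(1 for r in others if overlap(ranges[idx], r))
    return overlapping <= threshold * len(others)
```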
Note that all of this exclusion is happening in the executor, not the
planner. That allows this form of exclusion to work with stable
functions and parameters without problem.
Noting which segments are read-only
-----------------------------------
Everything so far has relied upon our ability to note which segments of
a table are read-only. We could do this in two ways
1) have the system automatically keep track of non-changing data
2) have the DBA issue a command to "mark" a segment as read-only now
Probably a combination of the two is better, so we have three states for
segments
- READ_WRITE_ALLOWED
- EFFECTIVE_READ_ONLY (set by 1 only)
- MARKED_READ_ONLY (set by 2 only)
Having said that, I want to concentrate on (1), though I'll consider (2)
as well if reviewers request it.
Visibility Map
--------------
So *how* would we mark segments as read only? It turns out that
attempting to do this is extremely similar to Heikki's Visibility Map
idea, so I'll describe it in those terms, even though there are some
differences. It also brings to mind Andrew Sullivan's read-only tuples
concept.
We would keep a dynamic visibility map at *segment* level, showing which
segments have all rows as 100% visible. No freespace map data would be
held at this level.
Almost everything about Heikki's Visibility Map idea holds, just that we
have a bit for each segment in the table, rather than each block. The
map is very small and hardly ever changes for a table, so we can cache
it easily on each backend. If the map does change we can perform a
relcache invalidation to make everybody re-read the map.
No dynamic shared memory cache is required because any concurrent
changes to the table would be ignored by a Scan anyway, so it doesn't
matter if an INSERT, UPDATE or DELETE occurs while we are scanning. Any
new scans that start will attempt to lock the table and then perform a
rel cache check before continuing. So the visibility will be set
correctly for *that* scan at least.
In most cases the visibility map can be summarised as a single boolean
to show whether any 100% visible segments exist. That makes accessing
the map very cheap in the common, unset case.
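A toy model of the per-segment map and its one-boolean summary
(illustrative Python, not the proposed C implementation; the class and
field names are invented):

```python
class SegmentVisibilityMap:
    """One bit per segment, plus a cheap summary boolean so the common
    "nothing is 100% visible" case costs a single check."""

    def __init__(self, nsegments):
        self.bits = [False] * nsegments
        self.any_visible = False  # summary over all bits

    def set_visible(self, seg):
        """Mark segment `seg` as 100% visible (e.g. after VACUUM)."""
        self.bits[seg] = True
        self.any_visible = True

    def clear(self, seg):
        """Unset on UPDATE/DELETE to a supposedly read-only segment."""
        self.bits[seg] = False
        self.any_visible = any(self.bits)
```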
Setting the Visibility Map
--------------------------
VACUUM would scan all READ_WRITE_ALLOWED segments and mark some of
them as EFFECTIVE_READ_ONLY if 100% of the remaining rows are visible to
all current backends. Marking the map will cause a rel cache
invalidation.
We would never mark the highest numbered segment
EFFECTIVE_READ_ONLY, since it is likely to expand when INSERTs occur.
This is also a useful way of excluding smaller tables from the overheads
involved for this feature.
In an insert-only table this would mean that only the last 1 or 2
segments would be read write, so VACUUM would always run in a bounded
time, no matter how large the table.
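The marking rule can be stated compactly (illustrative sketch;
`all_visible` stands in for VACUUM's actual check that 100% of remaining
rows are visible to all current backends):

```python
def segments_to_mark(all_visible, nsegments):
    """Return the segments VACUUM may set EFFECTIVE_READ_ONLY.

    The highest-numbered segment is always excluded, since it is still
    being extended by INSERTs; this also keeps small (single-segment)
    tables out of the feature's overheads entirely.
    """
    return [s for s in range(nsegments - 1) if all_visible(s)]
```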
When would we attempt to set the visibility map? If we do this
aggressively, then it will be quickly unset. If we wait too long then we
might lose some of the benefits of this approach.
First, we recognise that read only segments will change the autovacuum
calculation - we simply exclude them entirely. This will mean that
VACUUMs are run more regularly than they are now, but when they do run
they will be quicker. (Note that currently, VACUUMs on a growing table
get further and further apart as the table grows, assuming constant
write rate).
My feeling is that VACUUMs are sufficiently infrequent that we should
set the visibility map aggressively after each VACUUM. If the map is
quickly unset, then we wait some time before trying again. Unsetting
seems to be relatively low contention (see below), so that seems
acceptable.
We would then re-scan newly marked segments to derive min/max values for
implicit constraints, performed by autovacuum worker processes. The best
way seems to be that an ANALYZE command would automatically detect that
a segment has been recently set EFFECTIVE_READ_ONLY and perform an
all-rows scan rather than just a sample. (We would keep min/max values
only for datatypes with valid btree sort operators, just as ANALYZE
already does). The data for this can be stored as part of the stats
histogram for each column. pg_class.reltuples would become an array of
row counts in each segment, known accurate for read only segments.
A later VACUUM will still be required to freeze the rows, though that
can happen later when we hit the frozen limit, or earlier if we run an
explicit VACUUM FREEZE. We would need to store the oldest xid for each
segment, so we know when to kick off a FREEZE on that part of the table.
We would do this by making pg_class.relfrozenxid an array also.
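With relfrozenxid kept per segment, the trigger for a partial FREEZE
might look like this toy sketch (xids simplified to plain integers, with
no wraparound handling; names are illustrative):

```python
def segments_needing_freeze(relfrozenxid, current_xid, freeze_limit):
    """relfrozenxid[i] is the oldest unfrozen xid in segment i.

    Return the segments whose oldest xid has aged past the freeze limit,
    i.e. the only parts of the table a FREEZE pass must visit.
    """
    return [i for i, oldest in enumerate(relfrozenxid)
            if current_xid - oldest > freeze_limit]
```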
This departs completely from the idea that the visibility map and the
freespace map are related somehow. However, there is a change required
in the FSM to ensure the proposed technique is stable and useful: If we
scan a segment and it is 100% visible, if there is freespace in just one
of the blocks in the segment then we will soon find that our
visibility-bit will be set to off again very quickly.
If there is more than 5% freespace total in the segment then we will not
set EFFECTIVE_READ_ONLY, nor report blocks in a segment to the FSM. That
way, small amounts of freespace won't destroy the benefits of segment
exclusion. VACUUM currently overwrites FSM information, though if this
was not the case then we would have to actively remove it for the newly
read only segments. Perhaps that limit would be (100 - fillfactor)%,
which is normally set according to a user's expectation of the number of
updates likely on a table.
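The freespace threshold test might be sketched as follows (illustrative
only: whether the flat 5% or the (100 - fillfactor)% limit should win is
left open by the proposal, and this code simply combines the two as one
possible reading):

```python
def may_mark_read_only(free_bytes, segment_bytes, fillfactor=100):
    """True if the segment's free space is small enough that marking it
    EFFECTIVE_READ_ONLY (and withholding its blocks from the FSM) is
    unlikely to be undone quickly by new INSERTs."""
    threshold = max(100 - fillfactor, 5) / 100.0
    return free_bytes <= threshold * segment_bytes
```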
Unsetting the visibility map
----------------------------
For EFFECTIVE_READ_ONLY segments, INSERTs, UPDATEs and DELETEs are still
possible. For MARKED_READ_ONLY they would be explicitly refused.
The relcache invalidation caused by making a segment EFFECTIVE_READ_ONLY
can flush the rd_targBlock for the relation in each backend. Other
INSERTs are not possible since all current code-paths either use
extension or the FSM when the table is non-empty, and neither of those
will cause a problem. Extension only touches the last segment, which will
never be EFFECTIVE_READ_ONLY. The FSM contains no entries for an
EFFECTIVE_READ_ONLY segment.
UPDATEs or DELETEs are possible on EFFECTIVE_READ_ONLY segments, but we
don't expect it often. If this occurs, we will simply reset the
visibility map for the table. This can be handled with a
non-transactional overwrite - the bit always exists already, so the
overwritten data is always the same size. That's pessimistic, since it
resets the state even if the UPDATE/DELETE aborts, but it's better than
holding the row locked until the transaction completes, which might
prevent other things from happening. We probably also want VACUUM to
clear up any aborted transactions anyway.
The user may also wish to clear down very old data, so allowing DELETEs
can ensure the user can still remove old data from the table. By
carefully choosing the values to be deleted, a whole segment can be
deleted and then returned to the FSM.
Running a huge DELETE will likely perform a SeqScan that avoids all but
the deleting data. The following VACUUM will also ignore the other
segments, so removing older data seems fairly well optimised already, in
comparison with now. It would be possible to optimise this to perform a
segment-level truncate, but since the partitioning will already improve
the DELETE and VACUUM, I'm not proposing that yet, if ever.
SHARE locks currently write to data blocks, but since they represent
transient state we do not need to unset either the partitioning or the
visibility map.
Overlap with Visibility Map Proposal
------------------------------------
This proposal offers many of the advantages of the earlier Visibility
Map proposal, yet without major changes to heap structure. Since the
segment-level visibility map is more granular it would only be 80% as
effective as the more detailed block-level map. Having said that, the
bookkeeping overheads will also be considerably reduced, so it does seem
a good joint approach. It also allows freezing to be handled fully,
which was a problem with the original visibility map proposal. WAL
logging visibility map changes is also now much easier.
My thoughts have been targeted directly at partitioning, yet I have to
admit that this idea overlaps, and in my world view, replaces the
Visibility Map proposal. I very much like the name Visibility Map
though. I've re-read the earlier thread and agree with Jeff Davis that
we *could* have multiple kinds of visibility map, but also agree with
Heikki that it seems like too much work. It was never my intent to
address both challenges in one proposal and this is just pure
coincidence, or possibly just luck, depending upon how you like the
proposal.
If we do need to differentiate between the two proposals, we can refer
to this one as the Segment Visibility Map (SVM).
Index Scans
-----------
Index-only scans would look at the Visibility Map, just as they would
have done in the earlier proposal.
It would be costly for an index scan to repeatedly check the visibility
map, but probably worth it to do so.
Other Changes
-------------
We can handle select count(*) by scanning the non-100% visible segments
of a table, then adding the stored counts for each segment to get a
final total. Not sure if it's really worth doing, but it does sound like
an added bonus.
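A sketch of that count(*) strategy (illustrative Python; `scan_count`
stands in for actually scanning a non-100%-visible segment, and
`reltuples` is the proposed per-segment array of stored row counts):

```python
def fast_count(reltuples, visible, scan_count):
    """count(*): trust the stored count for 100%-visible segments (where
    it is known accurate), and scan only the remaining segments."""
    total = 0
    for i, stored in enumerate(reltuples):
        total += stored if visible[i] else scan_count(i)
    return total
```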
There would be additional complexity in selectivity estimation and plan
costing. The above proposal allows dynamic segment exclusion, which
cannot be assessed at planning time anyway, so suggestions welcome...
I think it would be prudent to have an override option on VACUUM to
allow us to scan a table in full, checking rather than using the
visibility map. This mode would be called VACUUM CHECK.
None of the above blocks earlier proposals for read only tables, and the
two features could work together very straightforwardly.
Comparison with other Partitioning approaches
---------------------------------------------
Once I realised this was possible in a fairly automatic way, I've tried
hard to keep away from manual overrides, commands and new DDL.
Declarative partitioning is a big overhead, though worth it for large
databases. No overhead is *much* better though.
This approach to partitioning solves the following challenges
- allows automated table extension, so works automatically with Slony
- responds dynamically to changing data
- allows stable functions, nested loop joins and parametrised queries
- allows RI via SHARE locks
- avoids the need for operator push-down through Append nodes
- allows unique indexes
- allows global indexes (because we only have one table)
- allows advanced planning/execution using read-only/visible data
- works naturally with synchronous scans and buffer recycling
All of the above are going to take considerably longer to do in any of
the other ways I've come up with so far...
And yes, this is a different conclusion to earlier discussions,
particularly those in 2005 where I was still thinking in terms of
declarative partitioning.
Conclusions
-----------
This technique would be useful for any table with historical data keyed
by date or timestamp. It would also be useful for data where a
time-of-insert component is implicit, such as many major entity tables
where the object ids are assigned by a sequence. e.g. an Orders table
with an OrderId as PK. Once all orders placed in a period have been
shipped/resolved/closed then the segments will be marked read-only.
It's not really going to change the code path much for small tables, yet
seems likely to work reasonably well for large tables of all shapes and
sizes. If a segment is being updated, we leave it alone, and maybe never
actually set the visibility map at all. So overall, this idea seems to
cover the main use case well, yet with only minimal impact on the
existing use cases we support.
As before, I will maintain this proposal on the PG developer Wiki, once
we get down to detailed design.
Like it?
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
On Wed, Jan 02, 2008 at 05:56:14PM +0000, Simon Riggs wrote:
Like it?
Sounds good. I've only given it a quick scan though. Would read-only
segments retain the same disk-level format as is currently? It seems
possible to remove the MVCC fields and hence get more tuples per page---
whether this would actually be a net performance gain/loss seems like
a difficult question to answer; it would definitely be a
complexity increase though.
Reading this reminds me of the design of the store for a persistent
operating system called EROS. It has a very good paper[1] describing
the design (implementation and careful benchmarking thereof) that I
think could be a useful read.
[1] http://www.eros-os.org/papers/storedesign2002.pdf
A lot of your design sounds like the EROS store, with the
"Checkpoint Area" being, in current and all previous versions of
Postgres, the only place data is stored. Data in EROS also has a "Home
Location" which is where the data ends up after a checkpoint, and sounds
somewhat like the proposed read-only.
Checkpoints here serve a similar purpose to checkpoints in PG, so the
following analogy may get a bit confusing. When you're reading the
paper I'd try and imagine the checkpoints not occurring as one event,
but spread across time as the database recognizes that data is now (or
has been marked as) read-only. The home locations would then store
only the read-only copies of the data and all the churn would (if the
recognition of read-only data works) be done in the checkpoint area.
Sam
On Thu, 2008-01-03 at 00:41 +0000, Sam Mason wrote:
On Wed, Jan 02, 2008 at 05:56:14PM +0000, Simon Riggs wrote:
Like it?
Sounds good. I've only given it a quick scan though. Would read-only
segments retain the same disk-level format as is currently?
Yes, no changes at all to the table. So your app would just work,
without any DDL changes. Existing partitioning apps would not change.
It seems
possible to remove the MVCC fields and hence get more tuples per page---
whether this would actually be a net performance gain/loss seems like
a difficult question to answer; it would definitely be a
complexity increase though.
I've been looking at general compression at table and segment level, but
that's further down the track. Removing the MVCC fields is too much work,
I think.
Reading this reminds me of the design of the store for a persistent
operating system called EROS. It has a very good paper[1] describing
the design (implementation and careful benchmarking thereof) that I
think could be a useful read.
Thanks, will do.
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Hi Simon,
Looks like a novel idea. I just want to confirm my
understanding of the proposal.
a) This proposal would work for the kind of table organizations which are
currently partitioned and maintained based on some kind of timestamp.
Consider one of the use-cases. A large Retail firm has a lot of stores. A DBA
maintains and updates the inventories of those stores in hash-partitions
based on store-no. As the inventory gets updated, the corresponding
partition receives the update and it goes like that..
Here all the partitions are going to get constantly updated. So no
partition can possibly become a read-only partition. You have clearly called
it out in your paragraph; I am just re-confirming that. Or do you have
something for this in your solution?
In my limited experience, most partition strategies are based on some form
of time-stamp. If the proposed solution can cater to those, it has a lot of
use-cases.
Thanks,
Gokul.
On Fri, 2008-01-04 at 13:06 +0530, Gokulakannan Somasundaram wrote:
a) This proposal would work for the kind of table organizations which
are currently partitioned and maintained based on some kind of
timestamp. Consider one of the use-case. A large Retail firm has a lot
of stores. DBA maintains and updates the inventories of those stores
in hash-partitions based on store-no. As the inventory gets updated,
the corresponding partition receives the update and it goes like
that..
Here all the partitions are going to get constantly updated.
So no partition can possibly become a read-only partition. You have
clearly called it out in your paragraph; I am just re-confirming that.
Or do you have something for this in your solution?
In my limited experience, most partition strategies are based on some
form of time-stamp. If the proposed solution can cater to those, it has
a lot of use-cases.
I don't think it would apply to an Inventory table. That is a current
state table.
It is designed for any large tables that would grow naturally over time
if we left them to do so. Solutions that it would work for:
- any Fact table where measurements/observations/events are accumulated
e.g.
Web Hits (any Internet events)
Call Detail Records
Sales
Security Events
Scientific Measurements
Process Control
- any Major Entity where new entities are created from a sequence
e.g.
Orders, OrderItems
Invoices
Shipments, Returns
most SCM/DCM events
It's not aimed at any particular benchmark, just real usage scenarios.
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote:
We would keep a dynamic visibility map at *segment* level, showing which
segments have all rows as 100% visible. No freespace map data would be
held at this level.
Small dumb-user question.
I take it you've considered some more flexible consecutive-run-of-blocks
unit of flagging rather than file-segments. That obviously complicates
the tracking but means you can cope with infrequent updates as well as
mark most of the "most recent" segment for log-style tables.
--
Richard Huxton
Archonet Ltd
On Fri, 2008-01-04 at 10:22 +0000, Richard Huxton wrote:
Simon Riggs wrote:
We would keep a dynamic visibility map at *segment* level, showing which
segments have all rows as 100% visible. No freespace map data would be
held at this level.

Small dumb-user question.
I take it you've considered some more flexible consecutive-run-of-blocks
unit of flagging rather than file-segments. That obviously complicates
the tracking but means you can cope with infrequent updates as well as
mark most of the "most recent" segment for log-style tables.
I'm writing the code to abstract that away, so yes.
Now you mention it, it does seem straightforward to have a table storage
parameter for partition size, which defaults to 1GB. The partition size
is simply a number of consecutive blocks, as you say.
The smaller the partition size, the greater the overhead of managing it.
Also I've been looking at read-only tables and compression, as you may
know. My idea was that in the future we could mark segments as either
- read-only
- compressed
- able to be shipped off to hierarchical storage
Those ideas work best if the partitioning is based around the physical
file sizes we use for segments.
Changing the partition size would simply reset the visibility map for
that table, in its easiest implementation.
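The block-to-partition mapping implied above is just integer division
(illustrative sketch; the 8KB block size and 1GB default follow standard
PostgreSQL build settings, the names are invented):

```python
BLOCK_SIZE = 8192  # default PostgreSQL block size in bytes

# default storage parameter: 1GB worth of 8KB blocks per partition
DEFAULT_PARTITION_BLOCKS = (1 << 30) // BLOCK_SIZE  # 131072 blocks


def partition_of(block_number, partition_blocks=DEFAULT_PARTITION_BLOCKS):
    """A partition is a run of consecutive blocks, so locating the
    partition holding a given block is a single division."""
    return block_number // partition_blocks
```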
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote:
On Fri, 2008-01-04 at 10:22 +0000, Richard Huxton wrote:
Simon Riggs wrote:
We would keep a dynamic visibility map at *segment* level, showing which
segments have all rows as 100% visible. No freespace map data would be
held at this level.

Small dumb-user question.
I take it you've considered some more flexible consecutive-run-of-blocks
unit of flagging rather than file-segments. That obviously complicates
the tracking but means you can cope with infrequent updates as well as
mark most of the "most recent" segment for log-style tables.

I'm writing the code to abstract that away, so yes.
Now you mention it, it does seem straightforward to have a table storage
parameter for partition size, which defaults to 1GB. The partition size
is simply a number of consecutive blocks, as you say.

The smaller the partition size, the greater the overhead of managing it.
Oh, obviously, but with smaller partition sizes this also becomes useful
for low-end systems as well as high-end ones. Skipping 80% of a seq-scan
on a date-range query is a win for even small (by your standards)
tables. I shouldn't be surprised if the sensible-number-of-partitions
remained more-or-less constant as you scaled the hardware, but the
partition size grew.
Also I've been looking at read-only tables and compression, as you may
know. My idea was that in the future we could mark segments as either
- read-only
- compressed
- able to be shipped off to hierarchical storage

Those ideas work best if the partitioning is based around the physical
file sizes we use for segments.
I can see why you've chosen file segments. It certainly makes things easier.
Hmm - thinking about the date-range scenario above, it occurs to me that
for seq-scan purposes the correct partition size depends upon the
data value you are interested in. What I want to know is what blocks Jan
07 covers (or rather what blocks it doesn't) rather than knowing blocks
1-9999999 cover 2005-04-12 to 2007-10-13. Of course that means that
you'd eventually want different partition sizes tracking visibility for
different columns (e.g. id, timestamp).
I suspect the same would be true for read-only/compressed/archived
flags, but I can see how they are tightly linked to physical files
(particularly the last two).
--
Richard Huxton
Archonet Ltd
Hello Simon,
Simon Riggs wrote:
I've come
up with an alternative concept to allow us to discuss the particular
merits of each. ISTM that this new proposal has considerable potential.
Hm.. interesting idea.
If we were able to keep track of which sections of a table are now
read-only then we would be able to record information about the data in
that section and use it to help solve queries. This is turning the
current thinking on its head: we could derive the constraints from the
data, rather than placing the data according to the constraints. That
would be much more natural: load data into the table and have the system
work out the rest.
Yeah, but that's also the most limiting factor of your approach: it
covers only horizontal partitioning by time (or to be more precise, by
columns which are very likely to increase or decrease with time). All
other columns will very likely contain values from the full range of
possible values.
As you have pointed out, that might be a very frequent use case. I can't
argue about that, however, I think it's important to be well aware of
that limitation.
Other scan types might also use segment exclusion, though this would
only be useful for scans retrieving many rows, otherwise the overhead of
segment exclusion might not be worthwhile.
Uh.. the overhead of checking against min/max values doesn't seem that
big to me.
I rather think the gain for index scans would be prohibitively small,
because (given frequent enough vacuuming) an index scan shouldn't return
many pointers to tuples in segments which could be optimized away by
segment exclusion.
If we collect data for all columns then many of our implicit constraints
would be useless. e.g. if a column only has a few values and these are
present in all segments. Matching our predicate against all constraints
would be expensive, so we must weed out poor constraints. We would do
this by removing any constraint that overlapped more than 10% of other
segments. Various heuristics would likely need to be discovered during
development to make this work without resorting to manual commands.
Uh, well, that's about the limitation I've pointed out above. But is it
really worth maintaining statistics about overlapping values and
removing min/max checks for certain columns?
It would save you the min/max check per segment and scan, but cost
maintaining the statistics and checking against the statistics once per
scan. AFAICS the block with the min/max tuple per segment will often
remain cached anyway... dunno.
Noting which segments are read-only
-----------------------------------
Everything so far has relied upon our ability to note which segments of
a table are read-only. We could do this in two ways
1) have the system automatically keep track of non-changing data
2) have the DBA issue a command to "mark" a segment as read-only now
Probably a combination of the two is better, so we have three states for
segments
- READ_WRITE_ALLOWED
- EFFECTIVE_READ_ONLY (set by 1 only)
- MARKED_READ_ONLY (set by 2 only)
Having said that, I want to concentrate on (1), though I'll consider (2)
as well if reviewers request it.
Hm.. AFAICT, horizontal partitioning often serves manageability, which
is quite limited when all data is in one table (you can't move a single
segment to a different tablespace). Thus I think option 2 is pretty
constrained in usability. What would the DBA gain by setting a segment
to read-only? How does the DBA figure out which segment a tuple is
stored in (so she can decide to mark it read-only)?
The user may also wish to clear down very old data, so allowing DELETEs
can ensure the user can still remove old data from the table. By
carefully choosing the values to be deleted, a whole segment can be
deleted and then returned to the FSM.
Oh, yeah, that sounds like a good optimization. Bulk deletes, yummie!
This proposal offers many of the advantages of the earlier Visibility
Map proposal, yet without major changes to heap structure. Since the
segment-level visibility map is more granular it would only be 80% as
effective as the more detailed block-level map. Having said that, the
bookkeeping overheads will also be considerably reduced, so it does seem
a good joint approach. It also allows freezing to be handled fully,
which was a problem with the original visibility map proposal. WAL
logging visibility map changes is also now much easier.
I generally agree, although I'm somewhat dubious about the 80% factor.
My thoughts have been targeted directly at partitioning, yet I have to
admit that this idea overlaps, and in my world view, replaces the
Visibility Map proposal. I very much like the name Visibility Map
though.
I would even say, that partitioning is somewhat of a misleading term
here, because it normally allows the DBA to decide on where to split.
Given that we are operating on segments here, to which the DBA has very
limited information and access, I prefer the term "Segment Exclusion". I
think of that as an optimization of sequential scans on tables with the
above mentioned characteristics.
If we do need to differentiate between the two proposals, we can refer
to this one as the Segment Visibility Map (SVM).
I'm clearly in favor of separating between the two proposals. SVM is a
good name, IMHO.
We can handle select count(*) by scanning the non-100% visible segments
of a table, then adding the stored counts for each segment to get a
final total. Not sure if it's really worth doing, but it does sound like
an added bonus.
Yup, sounds tempting, although it's contrary to Postgres' position so
far. And one could argue that you'd then have to maintain not only
count(), but also avg(), sum(), etc. for all tuples in the 100% visible
segments as well.
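The arithmetic being discussed is simple to sketch. The structures below are invented (no such per-segment metadata exists in PostgreSQL today); `read_only` stands for the SVM marking a segment 100% visible, whose stored count could then be trusted:

```python
# Hypothetical sketch: select count(*) using per-segment stored counts.
# Only segments the SVM marks 100% visible can have their stored count
# trusted; the active tail must still be scanned the usual MVCC way.
segments = [
    {"read_only": True,  "stored_count": 1_000_000, "tuples": None},
    {"read_only": True,  "stored_count": 2_500_000, "tuples": None},
    {"read_only": False, "stored_count": None, "tuples": list(range(42))},
]

def table_count(segments, scan):
    total = 0
    for seg in segments:
        if seg["read_only"]:
            total += seg["stored_count"]   # 100% visible: no scan needed
        else:
            total += scan(seg)             # active tail: real scan
    return total

# scan() stands in for counting tuples visible to our snapshot
total = table_count(segments, scan=lambda seg: len(seg["tuples"]))
```

Only the one non-read-only segment needs scanning; the rest of the total comes straight from the stored counts.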
There would be additional complexity in selectivity estimation and plan
costing. The above proposal allows dynamic segment exclusion, which
cannot be assessed at planning time anyway, so suggestions welcome...
Hm.. that looks like a rather bad downside of an executor-only optimization.
Comparison with other Partitioning approaches
---------------------------------------------
Once I realised this was possible in a fairly automatic way, I've tried
hard to keep away from manual overrides, commands and new DDL.
Declarative partitioning is a big overhead, though worth it for large
databases. No overhead is *much* better though.
This approach to partitioning solves the following challenges:
- allows automated table extension, so works automatically with Slony
- responds dynamically to changing data
- allows stable functions, nested loop joins and parametrised queries
- allows RI via SHARE locks
- avoids the need for operator push-down through Append nodes
- allows unique indexes
- allows global indexes (because we only have one table)
- allows advanced planning/execution using read-only/visible data
- works naturally with synchronous scans and buffer recycling
All of the above are going to take considerably longer to do in any of
the other ways I've come up with so far...
I fully agree. But as I tried to point out above, the gains in
manageability from Segment Exclusion are also pretty close to zero. So
I'd argue it only fulfills part of the needs for general horizontal
partitioning.
This technique would be useful for any table with historical data keyed
by date or timestamp. It would also be useful for data where a
time-of-insert component is implicit, such as many major entity tables
where the object ids are assigned by a sequence. e.g. an Orders table
with an OrderId as PK. Once all orders placed in a period have been
shipped/resolved/closed then the segments will be marked read-only.
Agreed. Just a minor note: I find "marked read-only" too strong, as it
implies an impossibility to write. I propose speaking about mostly-read
segments, or optimized for reading or similar.
It's not really going to change the code path much for small tables, yet
seems likely to work reasonably well for large tables of all shapes and
sizes.
That sounds a bit too optimistic to me. For Segment Exclusion, it takes
only *one* tuple to enlarge the min/max range dramatically in any
direction. So it's not the overall correlation between column values and
storage location, but rather only the min/max column values which
matter. Have you ever checked how well these min/max values correlate
with the segment number?
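The objection is easy to demonstrate. A minimal sketch (the function and the values are invented for illustration): Segment Exclusion keys off per-segment min/max values only, so a single stray tuple widens the range and defeats the exclusion.

```python
# Sketch of the min/max fragility: one outlier tuple is enough to stop
# a segment from being excluded. Function and data are invented.
def can_exclude(seg_min, seg_max, lo, hi):
    """True if the segment can be skipped for the predicate lo <= col <= hi."""
    return seg_max < lo or seg_min > hi

# A well-correlated segment holding ids 1..100000 is skipped when
# scanning for ids 500000..600000:
skip_before = can_exclude(1, 100_000, 500_000, 600_000)

# After one late UPDATE places id 550000 into that segment, its max
# value jumps and the segment must be scanned again:
skip_after = can_exclude(1, 550_000, 500_000, 600_000)
```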
Pretty much the same argument applies to SVM: an update to only one
tuple in a segment is enough to remove the optimization for reading
(EFFECTIVE_READ_ONLY property) for the segment. The assumption here
(being that updates happen mostly to newer segments) is not quite the
same as above.
Maybe a combination with CLUSTERing would be worthwhile? Or even
enforced CLUSTERing for the older segments?
If a segment is being updated, we leave it alone, and maybe never
actually set the visibility map at all. So overall, this idea seems to
cover the main use case well, yet with only minimal impact on the
existing use cases we support.
Yup.
As before, I will maintain this proposal on the PG developer Wiki, once
we get down to detailed design.
Cool.
Thanks for working out yet another great proposal. I hope to have been
of help with my questions and remarks. ;-)
Regards
Markus
Hi,
Simon Riggs wrote:
- any Fact table where measurements/observations/events are accumulated
e.g.
Web Hits (any Internet events)
Call Detail Records
Sales
Security Events
Scientific Measurements
Process Control
- any Major Entity where new entities are created from a sequence
e.g.
Orders, OrderItems
Invoices
Shipments, Returns
most SCM/DCM events
...and only changed very seldom after a while, I would add. Because
changing an old tuple would invalidate the optimization for the affected
segment.
That's why this optimization can't help for inventory tables, where an
id might correlate with time and storage location, but write access
doesn't correlate with storage location (segment number) and time.
Regards
Markus
Hi,
Simon Riggs wrote:
The smaller the partition size, the greater the overhead of managing it.
Also I've been looking at read-only tables and compression, as you may
know. My idea was that in the future we could mark segments as either
- read-only
- compressed
- able to be shipped off to hierarchical storage
Those ideas work best if the partitioning is based around the physical
file sizes we use for segments.
As much as I'd like this simplification, I'm still thinking of these
segments as an implementation detail of Postgres, and not something the
user should have to deal with.
Allowing the DBA to move segments to a different table space and giving
him the possibility to check which tuples are in which segment seems
awkward from a user's perspective, IMO.
Regards
Markus
On Fri, 2008-01-04 at 13:29 +0100, Markus Schiltknecht wrote:
Given that we are operating on segments here, to which the DBA has very
limited information and access, I prefer the term "Segment Exclusion". I
think of that as an optimization of sequential scans on tables with the
above mentioned characteristics.
If we do need to differentiate between the two proposals, we can refer
to this one as the Segment Visibility Map (SVM).
I'm clearly in favor of separating between the two proposals. SVM is a
good name, IMHO.
OK, I'll refer to this proposal as SVM.
There would be additional complexity in selectivity estimation and plan
costing. The above proposal allows dynamic segment exclusion, which
cannot be assessed at planning time anyway, so suggestions welcome...
Hm.. that looks like a rather bad downside of an executor-only optimization.
I think that's generally true. We already have that problem with planned
statements and work_mem, for example, and parameterised query planning
is a difficult problem. Stable functions are already estimated at plan
time, so we hopefully should be getting that right. I don't see any show
stoppers here, just more of the usual problems of query optimization.
Comparison with other Partitioning approaches
---------------------------------------------
Once I realised this was possible in a fairly automatic way, I've tried
hard to keep away from manual overrides, commands and new DDL.
Declarative partitioning is a big overhead, though worth it for large
databases. No overhead is *much* better though.
This approach to partitioning solves the following challenges:
- allows automated table extension, so works automatically with Slony
- responds dynamically to changing data
- allows stable functions, nested loop joins and parametrised queries
- allows RI via SHARE locks
- avoids the need for operator push-down through Append nodes
- allows unique indexes
- allows global indexes (because we only have one table)
- allows advanced planning/execution using read-only/visible data
- works naturally with synchronous scans and buffer recycling
All of the above are going to take considerably longer to do in any of
the other ways I've come up with so far...
I fully agree. But as I tried to point out above, the gains in
manageability from Segment Exclusion are also pretty close to zero. So
I'd argue it only fulfills part of the needs for general horizontal
partitioning.
Agreed.
My focus for this proposal wasn't manageability, as it had been in other
recent proposals. I think there are some manageability wins to be had as
well, but we need to decide what sort of partitioning we want/need
first.
So in the case of SVM, enhanced manageability is really a phase 2 thing.
Plus, you can always combine a design with constraint and segment
exclusion.
Maybe a combination with CLUSTERing would be worthwhile? Or even
enforced CLUSTERing for the older segments?
I think there's merit in Heikki's maintain cluster order patch and that
should do an even better job of maintaining locality.
Thanks for detailed comments. I'll do my best to include all of the
viewpoints you've expressed as the design progresses.
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
On Fri, Jan 04, 2008 at 01:29:55PM +0100, Markus Schiltknecht wrote:
Agreed. Just a minor note: I find "marked read-only" too strong, as it
implies an impossibility to write. I propose speaking about mostly-read
segments, or optimized for reading or similar.
I do want some segments to be _marked_ read-only: I want attempted writes to
them to _fail_.
A
On Fri, 2008-01-04 at 13:06 -0500, Andrew Sullivan wrote:
On Fri, Jan 04, 2008 at 01:29:55PM +0100, Markus Schiltknecht wrote:
Agreed. Just a minor note: I find "marked read-only" too strong, as it
implies an impossibility to write. I propose speaking about mostly-read
segments, or optimized for reading or similar.
I do want some segments to be _marked_ read-only: I want attempted writes to
them to _fail_.
I think Markus thought that we would mark them read only automatically,
which was not my intention. I believe it's possible to have this in a way
that will make you both happy. Some more explanation:
There would be three different states for a segment:
1. read write
2. "optimized for reading", as Markus says it
3. marked read only by explicit command
Transition 1 -> 2 is by autovacuum under the SVM proposal, transition 2
-> 3 is by user command only. So throwing an ERROR is acceptable for
segments in state 3.
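Those transitions are simple enough to write down. A toy sketch (the class and names are invented for illustration; this is not how the server would implement it):

```python
# Toy sketch of the three segment states and transitions described
# above (all names invented; not PostgreSQL source).
READ_WRITE, OPTIMIZED, READ_ONLY = 1, 2, 3

class SegmentState:
    def __init__(self):
        self.state = READ_WRITE

    def autovacuum_pass(self):
        # 1 -> 2: autovacuum finds the segment 100% visible
        if self.state == READ_WRITE:
            self.state = OPTIMIZED

    def alter_table_set_read_only(self):
        # 2 -> 3: explicit user command only
        if self.state != OPTIMIZED:
            raise ValueError("segment not yet optimized for reading")
        self.state = READ_ONLY

    def write(self):
        if self.state == READ_ONLY:
            raise PermissionError("segment is marked read only")  # ERROR
        # a write to an optimized segment silently demotes it: 2 -> 1
        self.state = READ_WRITE
```

Note the write() path: a write to a state-2 segment quietly drops it back to state 1, matching the earlier observation that a single update removes the optimized-for-reading property; only state 3 throws an ERROR.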
I came up with a complex scheme for going from 1 -> 3 previously, but I
don't think it's needed any longer (for this, at least). It's trivial to
go from 2 -> 3 using an ALTER TABLE statement, along the lines of ALTER
TABLE .... WHERE ....
Files that are completely in state 3 can then be archived by a
hierarchical storage manager without problem.
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Hi,
Simon Riggs wrote:
On Fri, 2008-01-04 at 13:06 -0500, Andrew Sullivan wrote:
On Fri, Jan 04, 2008 at 01:29:55PM +0100, Markus Schiltknecht wrote:
Agreed. Just a minor note: I find "marked read-only" too strong, as it
implies an impossibility to write. I propose speaking about mostly-read
segments, or optimized for reading or similar.
Hm.. yeah, after rereading, I realize that I've mixed up states no. 2
and 3 here, sorry.
I do want some segments to be _marked_ read-only: I want attempted writes to
them to _fail_.
Well, I can see use cases for marking tuples or complete relations as
read only. But segments?
I'm still puzzled about how a DBA is expected to figure out which
segments to mark. Simon, are you assuming we are going to pass on
segment numbers to the DBA one day?
If not, a more user friendly command like "MARK READ ONLY WHERE date <=
2006" would have to move tuples around between segments, so as to be
able to satisfy the split point exactly, right?
I think Markus thought that we would mark them read only automatically,
which was not my intention. I believe its possible to have this in a way
that will make you both happy. Some more explanation:
There would be three different states for a segment:
1. read write
2. "optimized for reading", as Markus says it
3. marked read only by explicit command
Thanks for clarification.
Regards
Markus
On Fri, Jan 04, 2008 at 10:26:54PM +0100, Markus Schiltknecht wrote:
I'm still puzzled about how a DBA is expected to figure out which
segments to mark.
I think that part might be hand-wavy still. But once this facility is
there, what's to prevent the current active segment (and the rest) from also
getting this mark, which would mean "the table is read only"?
A
On Fri, 2008-01-04 at 22:26 +0100, Markus Schiltknecht wrote:
I'm still puzzled about how a DBA is expected to figure out which
segments to mark. Simon, are you assuming we are going to pass on
segment numbers to the DBA one day?
No Way!
That would prevent Richard's idea of making the segment stride
configurable, apart from being a generally ugly thing.
If not, a more user friendly command like "MARK READ ONLY WHERE date <=
2006" would have to move tuples around between segments, so as to be
able to satisfy the split point exactly, right?
Yes, just a simple WHERE clause that we can translate into segments
under the covers. It would be an alter table, so we get an exclusive
lock.
ALTER TABLE foo SET READ ONLY WHERE ....
possibly with a few restrictions on the WHERE clause. Anyway this is
just futures and dreams, so far, so let's just say something like that is
possible in the future and work out more when we pass the first few
hurdles.
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
On Friday 04 January 2008 17:01, Simon Riggs wrote:
On Fri, 2008-01-04 at 22:26 +0100, Markus Schiltknecht wrote:
I'm still puzzled about how a DBA is expected to figure out which
segments to mark. Simon, are you assuming we are going to pass on
segment numbers to the DBA one day?
No Way!
That would stop Richard's idea to make the segment stride configurable,
apart from being a generally ugly thing.
If not, a more user friendly command like "MARK READ ONLY WHERE date <=
2006" would have to move tuples around between segments, so as to be
able to satisfy the split point exactly, right?
Yes, just a simple WHERE clause that we can translate into segments
under the covers. It would be an alter table, so we get an exclusive
lock.
ALTER TABLE foo SET READ ONLY WHERE ....
possibly with a few restrictions on the WHERE clause. Anyway this is
just futures and dreams, so far, so let's just say something like that is
possible in the future and work out more when we pass the first few
hurdles.
Not to be negative, but istm how this feature would be managed is as important
as the bits under the hood. Or at least we have to believe there will be
some practical way to manage this, of which as of yet I am skeptical.
--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
On Fri, 2008-01-04 at 22:31 -0500, Robert Treat wrote:
Not to be negative, but istm how this feature would be managed is as important
as the bits under the hood.
Agreed. On this part of the thread, we've been discussing an extension
to the basic proposal, which is why I have not been concentrating there.
Core management-wise, the basic proposal showed how we would be able to
have VACUUM run much faster than before and how DELETE will also be
optimised naturally by this approach. Loading isn't any slower than it
is now; loading does need some work, but that's another story.
Or at least we have to believe there will be
some practical way to manage this, which as of yet I am skeptical.
Skepticism is OK, but I'd like to get your detailed thoughts on this.
I've been an advocate of the multi-tables approach now for many years,
so I don't expect everybody to switch their beliefs on my say-so
overnight. Let me make a few more comments in this area:
The main proposal deliberately has few, if any, knobs and dials. That's
a point of philosophy that I've had views on previously: my normal
stance is that we need some knobs to allow the database to be tuned to
individual circumstances.
In this case, partitioning is way too complex to administer effectively
and requires application changes that make it impossible to use for
packaged applications. The latest Oracle TPC-H benchmark uses 10 pages
of DDL to set it up and if I can find a way to avoid that, I'd recommend
it to all. I do still want some knobs and dials, just not 10 pages
worth, though I'd like yours and others' guidance on what those should
be. Oracle have been responding to feedback with their new interval
partitioning, but it's still a multi-table approach in essence.
My observation of partitioned databases is that they all work
beautifully at the design stage, but problems emerge over time. A
time-based range partitioned table can often have different numbers of
rows per partition, giving inconsistent response times. A
height-balanced approach where we make the partitions all the same size,
yet vary the data value boundaries will give much more consistent query
times and can be completely automated much more easily.
The SVM concept doesn't cover everything that you can do with
partitioning, but my feeling is it covers the main use cases well. If
that's not true, in broad strokes or in the detail, then we need to
uncover that. Everybody's help in doing that is appreciated, whatever
the viewpoint and whatever the outcome.
It's probably worth examining existing applications to see how well they
would migrate to the segmented tables approach. The following query will
analyse one column of a table to produce a list of boundary values,
given a segment size of 131072 blocks (1 GB).
select
  substr(ctid::text, 2, strpos(ctid::text, ',') - 2)::integer/131072 as seg,
  min(PrimaryKey), max(PrimaryKey)
from bigtable
group by seg;
We should be able to see fairly easily whether this works for existing
use cases or not.
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
Andrew Sullivan wrote:
On Fri, Jan 04, 2008 at 10:26:54PM +0100, Markus Schiltknecht wrote:
I'm still puzzled about how a DBA is expected to figure out which
segments to mark.
I think that part might be hand-wavy still. But once this facility is
there, what's to prevent the current active segment (and the rest) from also
getting this mark, which would mean "the table is read only"?
Well, sure, marking *all* segments read-only is pretty easy. But that
was not quite my point.
Regards
Markus