[PATCHES] Post-special page storage TDE support
Hi -hackers,
An additional piece that I am working on for improving infra for TDE
features is allowing the storage of additional per-page data. Rather
than hard-code the idea of a specific struct, this is utilizing a new,
more dynamic structure to associate page offsets with a particular
feature that may or may not be present in a given cluster. I am
calling this generic structure a PageFeature/PageFeatureSet (better
names welcome), which is defined for a cluster at initdb/bootstrap
time, and reserves a given amount of trailing space on the Page which
is then parceled out to the consumers of said space.
While the immediate need that this feature fills is storage of
encryption tags for XTS-based encryption on the pages themselves, this
can also be used for any optional features; as an example I have
implemented expanded checksum support (both 32- and 64-bit), as well
as a self-descriptive "wasted space" feature, which just allocates
trailing space from the page (obviously intended as illustration
only).
There are 6 commits in this series:
0001 - adds `reserved_page_space` global, making various size
calculations and limits dynamic, adjusting access methods to offset
special space, and ensuring that we can safely reserve allocated space
from the end of pages.
0002 - test suite stability fixes; the change in the number of tuples
per page breaks some ordering assumptions in existing tests
0003 - the "PageFeatures" commit, the meat of this feature (see
following description)
0004 - page_checksum32 feature - store the full 32-bit checksum across
the existing pd_checksum field as well as 2 bytes from
reserved_page_space. This is more of a demo of what could be done
here than a practical feature.
0005 - wasted space PageFeature - just use up space. An additional
feature we can turn on/off to see how multiple features interact.
Only for illustration.
0006 - 64-bit checksums - fully allocated from reserved_page_space.
Using an MIT-licensed 64-bit checksum, but if we determined we'd want
to do this we'd probably roll our own.
From the commit message for PageFeatures:
Page features are a standardized way of assigning and using dynamic
space usage from the tail end of a disk page. These features are set
at cluster init time (so configured via `initdb` and initialized via
the bootstrap process) and affect all disk pages.

A PageFeatureSet is effectively a bitflag of all configured features,
each of which has a fixed size. If not using any PageFeatures, the
storage overhead of this is 0.
Rather than using a variable-location struct, an implementation of a
PageFeature is responsible for an offset and a length in the page.
The current API returns only a pointer to the page location for the
implementation to manage, and no further checks are done to ensure
that only the expected memory is accessed.

Access to the underlying memory is synonymous with determining whether
a given cluster is using an underlying PageFeature, so code paths can
do something like:
    char *loc;

    if ((loc = ClusterGetPageFeatureOffset(page, PF_MY_FEATURE_ID)))
    {
        // ipso facto this feature is enabled in this cluster *and*
        // we know the memory address
        ...
    }
Since this is direct memory access to the underlying Page, ensure the
buffer is pinned. Explicit locking (assuming you stay in your lane)
should only need to guard against access from other backends of this
type when using shared buffers, so it will be use-case dependent.
This does have a runtime overhead due to moving some offset
calculations from compile time to
runtime. It is thought that the utility of this feature will outweigh
the costs here.
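The runtime cost referred to above is the offset bookkeeping: with a
bitflag of fixed-size features, each feature's chunk is carved off the
page tail in feature order. A minimal standalone sketch of that
arithmetic (the names, feature list, and sizes are all hypothetical
illustrations, not the actual patch API):

```c
/* Illustrative sketch of how per-feature offsets could be parceled out
 * from the page tail; names and sizes are hypothetical, not the patch API. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ 8192

typedef uint16_t PageFeatureSet;	/* bitflag of enabled features */

enum { PF_CHECKSUM64 = 0, PF_ENCRYPTION_TAG = 1, PF_MAX };

/* each feature reserves a fixed number of bytes */
static const size_t feature_sizes[PF_MAX] = {8, 16};

/* total trailing space reserved for the enabled feature set */
static size_t
reserved_space(PageFeatureSet fs)
{
	size_t		total = 0;

	for (int f = 0; f < PF_MAX; f++)
		if (fs & (1 << f))
			total += feature_sizes[f];
	return total;
}

/* offset of a feature's chunk from the page start, or -1 if disabled;
 * chunks are carved off the tail in feature-number order */
static ptrdiff_t
feature_offset(PageFeatureSet fs, int feature)
{
	size_t		tail = BLCKSZ;

	if (!(fs & (1 << feature)))
		return -1;
	for (int f = 0; f < PF_MAX; f++)
	{
		if (!(fs & (1 << f)))
			continue;
		tail -= feature_sizes[f];
		if (f == feature)
			return (ptrdiff_t) tail;
	}
	return -1;
}
```

This also shows why the storage overhead is zero when no features are
configured: reserved_space(0) is 0, and no page space is set aside.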
Candidates for page features include 32-bit or 64-bit checksums,
encryption tags, or additional
per-page metadata.
While we are not currently getting rid of the pd_checksum field, this
mechanism could be used to
free up those 16 bits for some other purpose. One such purpose might
be to mirror the cluster-wide
PageFeatureSet, currently also a uint16, which would mean the entirety
of this scheme could be
reflected in a given page, opening up per-relation or even per-page
setting/metadata here. (We'd
presumably need to snag a pd_flags bit to interpret pd_checksum that
way, but it would be an
interesting use.)
Discussion is welcome and encouraged!
Thanks,
David
Attachments:
- 0001-Add-reserved_page_space-to-Page-structure.patch
- 0002-Make-the-output-of-select_views-test-stable.patch
- 0003-Add-cluster-wide-Page-Features.patch
- 0004-Add-page_checksums32-page-feature.patch
- 0005-A-second-page-feature-just-to-allocate-more-space.patch
- 0006-Add-64-bit-checksum-page-feature.patch
Hi,
On Mon, Oct 24, 2022 at 12:55:53PM -0500, David Christensen wrote:
> Explicitly locking (assuming you stay in your lane) should only need
> to guard against access from other backends of this type if using
> shared buffers, so will be use-case dependent.

I'm not sure what you mean here?
> This does have a runtime overhead due to moving some offset
> calculations from compile time to runtime. It is thought that the
> utility of this feature will outweigh the costs here.

Have you done some benchmarking to give an idea of how much overhead
we're talking about?
> Candidates for page features include 32-bit or 64-bit checksums,
> encryption tags, or additional per-page metadata.
>
> While we are not currently getting rid of the pd_checksum field, this
> mechanism could be used to free up that 16 bits for some other
> purpose.
IIUC there's a hard requirement of initdb-time initialization, as there's
otherwise no guarantee that you will find enough free space in each page at
runtime. It seems like a very hard requirement for a full replacement of the
current checksum approach (even if I agree that the current implementation
limitations are far from ideal), especially since there's no technical reason
that would prevent us from dynamically enabling data-checksums without doing
all the work when the cluster is down.
>> Explicitly locking (assuming you stay in your lane) should only need
>> to guard against access from other backends of this type if using
>> shared buffers, so will be use-case dependent.
>
> I'm not sure what you mean here?
I'm mainly pointing out that the specific code that manages this
feature is the only one who has to worry about modifying said page
region.
>> This does have a runtime overhead due to moving some offset
>> calculations from compile time to runtime. It is thought that the
>> utility of this feature will outweigh the costs here.
>
> Have you done some benchmarking to give an idea of how much overhead
> we're talking about?
Not yet, but I am going to work on this. I suspect the current code
could be improved, but will try to get some sort of measurement of the
additional overhead.
>> Candidates for page features include 32-bit or 64-bit checksums,
>> encryption tags, or additional per-page metadata.
>>
>> While we are not currently getting rid of the pd_checksum field,
>> this mechanism could be used to free up that 16 bits for some other
>> purpose.
>
> IIUC there's a hard requirement of initdb-time initialization, as
> there's otherwise no guarantee that you will find enough free space
> in each page at runtime. It seems like a very hard requirement for a
> full replacement of the current checksum approach (even if I agree
> that the current implementation limitations are far from ideal),
> especially since there's no technical reason that would prevent us
> from dynamically enabling data-checksums without doing all the work
> when the cluster is down.
As implemented, that is correct; we are currently assuming this
specific feature mechanism is set at initdb time only. Checksums are
not the primary motivation here, but were something that I could use
for an immediate illustration of the feature. That said, presumably
you could define a way to set the features per-relation (say with a
template field in pg_class) which would propagate to a relation on
rewrite, so there could be ways to handle things incrementally, were
this an overall goal.
Thanks for looking,
David
Hi
On Mon, 24 Oct 2022, 19:56 David Christensen, <
david.christensen@crunchydata.com> wrote:
> Discussion is welcome and encouraged!
Did you read the related thread with related discussion from last
June, "Re: better page-level checksums" [0]? In that I argued that
space at the end of a page is already allocated for the AM, and that
reserving variable space at the end of the page for non-AM usage is
wasting the AM's performance potential.
Apart from that: Is this variable-sized 'metadata' associated with smgr
infrastructure only, or is it also available for AM features? If not; then
this is a strong -1. The amount of tasks smgr needs to do on a page is
generally much less than the amount of tasks an AM needs to do; so in my
view the AM has priority in prime page real estate, not smgr or related
infrastructure.
re: PageFeatures
I'm not sure I understand the goal, nor the reasoning. Shouldn't this be
part of the storage manager (smgr) implementation / can't this be part of
the smgr of the relation?
re: use of pd_checksum
I mentioned this in the above-mentioned thread too, in [1], that we
could use pd_checksum as an extra area marker for this
storage-specific data, which would be located between pd_upper and
pd_special.
Re: patch contents
0001:
> + specialSize = MAXALIGN(specialSize) + reserved_page_size;
This needs to be aligned, so MAXALIGN(specialSize + reserved_page_size), or
an assertion that reserved_page_size is MAXALIGNED, would be better.
> PageValidateSpecialPointer(Page page)
> {
>     Assert(page);
> -   Assert(((PageHeader) page)->pd_special <= BLCKSZ);
> +   Assert((((PageHeader) page)->pd_special - reserved_page_size) <= BLCKSZ);
This check is incorrect. With your code it would allow pd_special past the
end of the block. If you want to put the reserved_space_size effectively
inside the special area, this check should instead be:
+ Assert(((PageHeader) page)->pd_special <= (BLCKSZ - reserved_page_size));

Or, equally valid

+ Assert((((PageHeader) page)->pd_special + reserved_page_size) <= BLCKSZ);
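To make the difference concrete, here is a standalone rendering of
both assertion forms (plain C with constants, not actual PostgreSQL
code): the patch's subtraction form accepts a pd_special that runs
past the usable area, while the corrected form rejects it.

```c
/* Demonstrates why Assert(pd_special - reserved_page_size <= BLCKSZ)
 * is too lax: a pd_special overlapping the reserved tail still passes.
 * Standalone sketch; not actual PostgreSQL code. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BLCKSZ 8192

static int reserved_page_size = 8;

/* the check as written in the patch */
static bool
check_v1(uint16_t pd_special)
{
	return (pd_special - reserved_page_size) <= BLCKSZ;
}

/* the corrected check: the special area must end at or before the
 * start of the reserved tail */
static bool
check_v2(uint16_t pd_special)
{
	return (pd_special + reserved_page_size) <= BLCKSZ;
}
```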
> + * +-------------+-----+------------+-----------------+
> + * | ... tuple2 tuple1 | "special space" | "reserved" |
> + * +-------------------+------------+-----------------+
Could you fix the table display if / when you revise the patchset? It seems
to me that the corners don't line up with the column borders.
0002:
> Make the output of "select_views" test stable
>
> Changing the reserved_page_size has resulted in non-stable results
> for this test.

This makes sense; what kind of instability are we talking about? Are
there different results for runs with the same binary, or is this
across compilations?
0003 and up were not yet reviewed in depth.
Kind regards,
Matthias van de Meent
[0]: /messages/by-id/CA+TgmoaCeQ2b-BVgVfF8go8zFoceDjJq9w4AFQX7u6Acfdn2uA@mail.gmail.com
[1]: /messages/by-id/CAEze2Wi5wYinU7nYxyKe_C0DRc6uWYa8ivn5=zg63nKtHBnn8A@mail.gmail.com
Hi Matthias,
> Did you read the related thread with related discussion from last
> June, "Re: better page-level checksums" [0]? In that I argued that
> space at the end of a page is already allocated for the AM, and that
> reserving variable space at the end of the page for non-AM usage is
> wasting the AM's performance potential.
Yes, I had read parts of that thread among others, but have given it a
re-read. I can see the point you're making here, and agree that if we
can allocate between pd_special and pd_upper that could make sense. I
am a little unclear as to what performance impacts for the AM there
would be if this additional space were ahead or behind the page
special area; it seems like if this is something that needs to live on
the page *somewhere* just being aligned correctly would be sufficient
from the AM's standpoint. Considering that I am trying to make this
have zero storage impact if these features are not active, the impact
on a cluster with no additional features would be moot from a storage
perspective, no?
> Apart from that: Is this variable-sized 'metadata' associated with
> smgr infrastructure only, or is it also available for AM features?
> If not; then this is a strong -1. The amount of tasks smgr needs to
> do on a page is generally much less than the amount of tasks an AM
> needs to do; so in my view the AM has priority in prime page real
> estate, not smgr or related infrastructure.
I will confess to a slightly wobbly understanding of the delineation
of responsibility here. I was under the impression that by modifying
any consumer of PageHeaderData this would be sufficient to cover all
AMs for the types of cluster-wide options we'd be concerned about (say
extended checksums, multiple page encryption schemes, or other
per-page information we haven't yet anticipated). Reading smgr/README
and the various access/*/README has not made the distinction clear to
me yet.
> re: PageFeatures
>
> I'm not sure I understand the goal, nor the reasoning. Shouldn't this
> be part of the storage manager (smgr) implementation / can't this be
> part of the smgr of the relation?
For at least the feature cases I'm anticipating, this would apply to
any disk page that may have user data, set (at least initially) at
initdb time, so should apply to any pages in the cluster, regardless
of AM.
> re: use of pd_checksum
>
> I mentioned this in the above-mentioned thread too, in [1], that we
> could use pd_checksum as an extra area marker for this
> storage-specific data, which would be located between pd_upper and
> pd_special.
I do think that we could indeed use this as an additional in-page
pointer, but at least for this version was keeping things
backwards-compatible. Peter G (I think) also made some good points
about how to include the various status bits on the page somehow in
terms of making a page completely self-contained.
> Re: patch contents
>
> 0001:
>
>> + specialSize = MAXALIGN(specialSize) + reserved_page_size;
>
> This needs to be aligned, so MAXALIGN(specialSize +
> reserved_page_size), or an assertion that reserved_page_size is
> MAXALIGNED, would be better.
It is currently aligned via the space calculation return value but
agree that folding it into an assert or reworking it explicitly is
clearer.
>> PageValidateSpecialPointer(Page page)
>> {
>>     Assert(page);
>> -   Assert(((PageHeader) page)->pd_special <= BLCKSZ);
>> +   Assert((((PageHeader) page)->pd_special - reserved_page_size) <= BLCKSZ);
>
> This check is incorrect. With your code it would allow pd_special
> past the end of the block. If you want to put the reserved_space_size
> effectively inside the special area, this check should instead be:
>
> + Assert(((PageHeader) page)->pd_special <= (BLCKSZ - reserved_page_size));
>
> Or, equally valid
>
> + Assert((((PageHeader) page)->pd_special + reserved_page_size) <= BLCKSZ);
Yup, I think I inverted my logic there; thanks.
>> + * +-------------+-----+------------+-----------------+
>> + * | ... tuple2 tuple1 | "special space" | "reserved" |
>> + * +-------------------+------------+-----------------+
>
> Could you fix the table display if / when you revise the patchset? It
> seems to me that the corners don't line up with the column borders.
Sure thing.
> 0002:
>
>> Make the output of "select_views" test stable
>>
>> Changing the reserved_page_size has resulted in non-stable results
>> for this test.
>
> This makes sense, what kind of instability are we talking about? Are
> there different results for runs with the same binary, or is this
> across compilations?
When running with the same compilation/initdb settings, the test
results are stable, but differ depending what options you chose, so
`make installcheck` output will fail when testing a cluster with
different options vs upstream HEAD without these patches, etc.
> 0003 and up were not yet reviewed in depth.
Thanks, I appreciate the feedback so far.
On Sat, 29 Oct 2022 at 00:25, David Christensen
<david.christensen@crunchydata.com> wrote:
Hi Matthias,
>> Did you read the related thread with related discussion from last
>> June, "Re: better page-level checksums" [0]? In that I argued that
>> space at the end of a page is already allocated for the AM, and
>> that reserving variable space at the end of the page for non-AM
>> usage is wasting the AM's performance potential.
>
> Yes, I had read parts of that thread among others, but have given it
> a re-read. I can see the point you're making here, and agree that if
> we can allocate between pd_special and pd_upper that could make
> sense. I am a little unclear as to what performance impacts for the
> AM there would be if this additional space were ahead or behind the
> page special area; it seems like if this is something that needs to
> live on the page *somewhere* just being aligned correctly would be
> sufficient from the AM's standpoint.
It would be sufficient, but it is definitely suboptimal. See
https://commitfest.postgresql.org/40/3543/ for a patch that is being
held back by putting stuff behind the special area.
I don't really care much about the storage layout on-disk, but I do
care that AMs have efficient access to their data. For the page
header, line pointers, and special area, that is currently guaranteed
by the current page layout. However, for the special area, that
currently guaranteed offset of (BLCKSZ -
MAXALIGN(sizeof(IndexOpaque))) will get broken as there would be more
space in the special area than the AM would be expecting. Right now,
our index AMs are doing pointer chasing during special area lookups
for no good reason, but with the patch it would be required. I don't
like that at all.
I understand that it is a requirement to store this reserved space in
a fixed place on the on-disk page (you must know where the checksum is
at a static place on the page, otherwise you'd potentially
mis-validate a page), but that requirement is not there for in-memory
storage. I think it's a small price to pay to swap the fields around
during R/W operations - the largest size of special area is currently
24 bytes, and the proposals I've seen for this extra storage area
would not need it to be actually filled with data whilst the page is
being used by the AM (checksum could be zeroed in in-memory
operations, and it'd get set during writeback; same with all other
fields I can imagine the storage system using).
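As a sketch of the swap idea (purely illustrative, with hypothetical
helper names; not PostgreSQL code): the AM's special area stays at its
traditional fixed tail offset while the page is in memory, and only
trades places with the reserved storage area at write-out / read-in
time.

```c
/* Illustrative sketch: keep the special area at the fixed in-memory
 * tail offset, and rearrange only during R/W operations so the
 * reserved storage area lands at the fixed on-disk tail. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

/* writeback: slide the special area down so the storage system's
 * reserved bytes land at the fixed on-disk tail */
static void
to_disk_layout(uint8_t *page, size_t special_size, size_t reserved_size,
			   const uint8_t *reserved_data)
{
	memmove(page + BLCKSZ - reserved_size - special_size,
			page + BLCKSZ - special_size, special_size);
	memcpy(page + BLCKSZ - reserved_size, reserved_data, reserved_size);
}

/* read-in: capture the reserved bytes, then put the special area back
 * at its fixed in-memory tail offset */
static void
to_memory_layout(uint8_t *page, size_t special_size, size_t reserved_size,
				 uint8_t *reserved_out)
{
	memcpy(reserved_out, page + BLCKSZ - reserved_size, reserved_size);
	memmove(page + BLCKSZ - special_size,
			page + BLCKSZ - reserved_size - special_size, special_size);
}
```

Since special areas are at most a couple dozen bytes today, each swap
is a small memmove, paid once per physical I/O rather than once per
page access.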
> Considering that I am trying to make this have zero storage impact if
> these features are not active, the impact on a cluster with no
> additional features would be moot from a storage perspective, no?
The issue is that I'd like to eliminate the redirection from the page
header in the hot path. Currently, we can do that, and pd_special
would be little more than a hint to the smgr and pd_linp code that
that area is special and reserved for this access method's private
data, so that it is not freed. If you stick something extra in there,
it's not special for the AM's private data, and the AM won't be able
to use pd_special for similar uses as pd_linp+pd_lower. I'd rather
have the storage system use its own not-special area, choreographed by
e.g. a reuse of pd_checksum for one more page offset. Swapping the
fields around between on-disk and in-memory doesn't need to be an
issue, as special areas are rarely very large.
Every index type we support utilizes the special area. Wouldn't those
in-memory operations have priority on this useful space, as opposed to
a storage system that maybe will be used in new clusters, and even
then only during R/W operations to disk (each at most once for N
memory operations)?
>> Apart from that: Is this variable-sized 'metadata' associated with
>> smgr infrastructure only, or is it also available for AM features?
>> If not; then this is a strong -1. The amount of tasks smgr needs to
>> do on a page is generally much less than the amount of tasks an AM
>> needs to do; so in my view the AM has priority in prime page real
>> estate, not smgr or related infrastructure.
>
> I will confess to a slightly wobbly understanding of the delineation
> of responsibility here. I was under the impression that by modifying
> any consumer of PageHeaderData this would be sufficient to cover all
> AMs for the types of cluster-wide options we'd be concerned about
> (say extended checksums, multiple page encryption schemes, or other
> per-page information we haven't yet anticipated). Reading smgr/README
> and the various access/*/README has not made the distinction clear to
> me yet.
pd_special has (historically) been reserved for access methods'
page-level private data. If you add to this area, shouldn't that be
space that the AM should be able to hook into as well? Or are all
those features limited to the storage system only; i.e. the storage
system decides what's best for the AM's page handling w.r.t. physical
storage?
>> re: PageFeatures
>>
>> I'm not sure I understand the goal, nor the reasoning. Shouldn't
>> this be part of the storage manager (smgr) implementation / can't
>> this be part of the smgr of the relation?
>
> For at least the feature cases I'm anticipating, this would apply to
> any disk page that may have user data, set (at least initially) at
> initdb time, so should apply to any pages in the cluster, regardless
> of AM.
OK, so having a storage manager for each supported set of features is
not planned for this. Understood.
>> re: use of pd_checksum
>>
>> I mentioned this in the above-mentioned thread too, in [1], that we
>> could use pd_checksum as an extra area marker for this
>> storage-specific data, which would be located between pd_upper and
>> pd_special.
>
> I do think that we could indeed use this as an additional in-page
> pointer, but at least for this version was keeping things
> backwards-compatible. Peter G (I think) also made some good points
> about how to include the various status bits on the page somehow in
> terms of making a page completely self-contained.
I think that adding page header bits would suffice for backwards
compatibility if we'd want to reuse pd_checksum. A new
PD_CHECKSUM_REUSED_FOR_STORAGE would suffice here; it would be unset
in normal (pre-patch, or without these fancy new features) clusters.
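A sketch of how such a flag bit could gate the reinterpretation of
pd_checksum (the flag name comes from the suggestion above, but its
bit value and the struct here are hypothetical):

```c
/* Hypothetical sketch: a pd_flags bit deciding whether pd_checksum
 * holds a checksum or an offset delimiting a storage-specific area
 * between pd_upper and pd_special. Not actual PostgreSQL code. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PD_CHECKSUM_REUSED_FOR_STORAGE 0x0008	/* hypothetical bit value */

typedef struct
{
	uint16_t	pd_checksum;	/* checksum, or storage-area offset */
	uint16_t	pd_flags;
} MiniPageHeader;

/* pre-patch clusters never set the bit, so they keep the old reading */
static bool
has_storage_area(const MiniPageHeader *p)
{
	return (p->pd_flags & PD_CHECKSUM_REUSED_FOR_STORAGE) != 0;
}

static uint16_t
storage_area_offset(const MiniPageHeader *p)
{
	return has_storage_area(p) ? p->pd_checksum : 0;
}
```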
Kind regards,
Matthias van de Meent
PS. sorry for the rant. I hope my arguments are clear why I dislike
the storage area being placed behind the special area in memory.
Per some offline discussion with Stephen and incorporating some of the
feedback I've gotten I'm including the following changes/revisions:
1. Change the signature of any macros that rely on a dynamic component
to look like a function so you can more easily determine in-code
whether something is truly a constant/compile time calculation or a
runtime one.
2. We use a new page flag to indicate whether "extended page features"
are enabled on the given page. If it is set then we look for the
1-byte trailer containing the bitmap of enabled features. We allow
space for 7 page features and reserve the final high bit for future
use/change of interpretation to accommodate more.
3. Consolidate the extended checksums into a 56-bit checksum that
immediately precedes the 1-byte flag. Choice of 64-bit checksum is
arbitrary just based on some MIT-licensed code I found, so just
considering this proof of concept, not necessarily promoting that
specific calculation. (I think I included some additional checksum
variants from the earlier revision for ease of testing various
approaches.)
4. Ensure the whole area is MAXALIGNed, and fix a few bugs that were
pointed out in this thread.
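Under the layout described in points 2 and 3, the final 8 bytes of a
page hold a 7-byte (56-bit) checksum followed by the 1-byte feature
trailer, with the trailer's high bit held in reserve. A hedged,
standalone sketch of that packing (names and byte order are
illustrative, not the patch's actual code):

```c
/* Illustrative packing of the final 8 bytes of a page: 7 checksum
 * bytes, then the 1-byte feature-flag trailer with its high bit
 * reserved. Byte order chosen arbitrarily for the sketch. */
#include <assert.h>
#include <stdint.h>

#define PF_TRAILER_RESERVED_BIT 0x80	/* high bit kept for future use */

static void
write_page_tail(uint8_t *tail8, uint64_t checksum56, uint8_t flags)
{
	for (int i = 0; i < 7; i++)	/* 7 checksum bytes first */
		tail8[i] = (uint8_t) (checksum56 >> (8 * i));
	tail8[7] = flags & ~PF_TRAILER_RESERVED_BIT; /* flag byte is last */
}

static uint8_t
read_page_flags(const uint8_t *tail8)
{
	return tail8[7] & ~PF_TRAILER_RESERVED_BIT;
}

static uint64_t
read_page_checksum56(const uint8_t *tail8)
{
	uint64_t	cs = 0;

	for (int i = 0; i < 7; i++)
		cs |= ((uint64_t) tail8[i]) << (8 * i);
	return cs;
}
```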
Patches are:
1. make select_views stable, a prerequisite for anything that changes
the number of tuples per page
2. add reserved_page_size handling and rework existing code to account
for this additional space usage
3. main PageFeatures-related code; introduce that abstraction layer,
along with the trailing byte on the page with the enabled features for
this specific page. We also add an additional param to PageInit()
with the page features active on this page; currently all call sites
are using the cluster-wide cluster_page_features as the parameter, so
all pages share what is stored in the control file based on initdb
options. However, routines which query page features look at the
actual page itself, so in fact we are able to enable features
piecemeal at the page/relation level if we so desire, or turn them off
for specific types of pages, say. This patch also includes the
additional pd_flags bit to enable that interpretation.
4. Actual extended checksums PageFeature. Rather than two separate
implementations as in the previous patch series, we are using 56 bits
of a 64-bit checksum, stored as the high 7 bytes of the final 8 in the
page where this is enabled.
5. wasted_space PageFeature just to demo multiple features in play.
Thanks,
David
Attachments:
- v2-0001-Make-the-output-of-select_views-test-stable.patch
- v2-0002-Add-reserved_page_space-to-Page-structure.patch
- v2-0003-Add-pagefeat-plus-new-page-status-bit-and-page-fe.patch
- v2-0004-Add-extended-page-checksums-feature.patch
- v2-0005-A-second-page-feature-just-to-allocate-more-space.patch
Looking into some CF bot failures which didn't show up locally. Will
send a v3 when resolved.
So here is a v3, incorporating additional bug fixes and some design
revisions. I have narrowed this down to 3 patches: I fixed the bugs
that were leading to the instability of the specific test file, so I
am dropping that patch, as well as removing the useless POC "wasted
space" feature.
The following pieces are left:
0001 - adjust the codebase to utilize the "reserved_page_space"
variable for all offsets rather than assuming compile-time constants.
This allows us to effectively allocate a fixed chunk of storage from
the end of the page and have everything still work on this cluster.
0002 - add the Page Feature abstraction. This allows you to utilize
this chunk of storage, as well as query for feature use at the page
level.
0003 - the first page feature, 64-bit checksums (soon to be
renumbered when GCM storage for TDE is introduced, though the two
features are designed to be incompatible). This includes an
arbitrarily chosen 64-bit checksum implementation, so we probably will
need to write our own or ensure that we have something
license-compatible.
This is rebased and current as of today and passes all CI tests, so it
should be in a good place to start looking at.
Best,
David
Attachments:
- v3-0001-Add-reserved_page_space-to-Page-structure.patch
- v3-0002-Add-Page-Features-optional-per-page-storage-alloc.patch
- v3-0003-Add-page-feature-for-64-bit-checksums.patch
Refreshing this with HEAD as of today, v4.
Attachments:
- v4-0001-Add-reserved_page_space-to-Page-structure.patch
- v4-0002-Add-Page-Features-optional-per-page-storage-alloc.patch
- v4-0003-Add-page-feature-for-64-bit-checksums.patch
Greetings,
* David Christensen (david.christensen@crunchydata.com) wrote:
> Refreshing this with HEAD as of today, v4.
Thanks for updating this!
> Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
>
> This space is reserved for extended data on the Page structure which
> will be ultimately used for encrypted data, extended checksums, and
> potentially other things. This data appears at the end of the Page,
> after any `pd_special` area, and will be calculated at runtime based
> on specific ControlFile features.
>
> No effort is made to ensure this is backwards-compatible with
> existing clusters for `pg_upgrade`, as we will require logical
> replication to move data into a cluster with different settings
> here.
This initial patch, at least, does maintain pg_upgrade as the
reserved_page_size (maybe not a great name?) is set to 0, right?
Basically this is just introducing the concept of a reserved_page_size
and adjusting all of the code that currently uses BLCKSZ or
PageGetPageSize() to account for this extra space.
Looking at the changes to bufpage.h, in particular ...
> diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
> @@ -19,6 +19,14 @@
>  #include "storage/item.h"
>  #include "storage/off.h"
>
> +extern PGDLLIMPORT int reserved_page_size;
> +
> +#define SizeOfPageReservedSpace() reserved_page_size
> +#define MaxSizeOfPageReservedSpace 0
> +
> +/* strict upper bound on the amount of space occupied we have reserved on
> + * pages in this cluster */
This will eventually be calculated based on what features are supported
concurrently?
> @@ -36,10 +44,10 @@
>   * |             v pd_upper                           |
>   * +-------------+------------------------------------+
>   * |             | tupleN ...                         |
> - * +-------------+------------------+-----------------+
> - * | ... tuple3 tuple2 tuple1 | "special space" |
> - * +--------------------------------+-----------------+
> - *                                  ^ pd_special
> + * +-------------+-----+------------+----+------------+
> + * | ... tuple2 tuple1 | "special space" | "reserved" |
> + * +-------------------+------------+----+------------+
> + *                     ^ pd_special      ^ reserved_page_space
Right, adds a dynamic amount of space 'post-special area'.
> @@ -73,6 +81,8 @@
>   * stored as the page trailer.  an access method should always
>   * initialize its pages with PageInit and then set its own opaque
>   * fields.
> + *
> + * XXX - update more comments here about reserved_page_space
>   */
Would be good to do. ;)
> @@ -325,7 +335,7 @@
>  static inline void
>  PageValidateSpecialPointer(Page page)
>  {
>  	Assert(page);
> -	Assert(((PageHeader) page)->pd_special <= BLCKSZ);
> +	Assert((((PageHeader) page)->pd_special + reserved_page_size) <= BLCKSZ);
>  	Assert(((PageHeader) page)->pd_special >= SizeOfPageHeaderData);
>  }
This is just one usage ... but seems like maybe we should be using
PageGetPageSize() here instead of BLCKSZ, and that more-or-less
throughout? Nearly everywhere we're using BLCKSZ today to give us that
compile-time advantage of a fixed block size is going to lose that
advantage anyway thanks to reserved_page_size being run-time. Now, one
up-side to this is that it'd also get us closer to being able to support
dynamic block sizes concurrently which would be quite interesting. That
is, a special tablespace with a 32KB block size while the rest are the
traditional 8KB. This would likely require multiple shared buffer
pools, of course...
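For reference, PageGetPageSize already derives the size from the page
itself rather than from BLCKSZ: pd_pagesize_version stores the page
size (always a multiple of 256) or-ed with the layout version, so the
low byte is free for the version. A simplified standalone rendering of
that logic (the real macros live in bufpage.h):

```c
/* Simplified rendering of PageGetPageSize / the pd_pagesize_version
 * encoding; standalone sketch, not the actual bufpage.h macros. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
	uint16_t	pd_pagesize_version;	/* size | layout version */
} MiniHeader;

static size_t
page_get_page_size(const MiniHeader *p)
{
	/* sizes are multiples of 256, so masking off the low byte works */
	return p->pd_pagesize_version & (uint16_t) 0xFF00;
}

static int
page_get_layout_version(const MiniHeader *p)
{
	return p->pd_pagesize_version & 0x00FF;
}
```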
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c index 9a302ddc30..a93cd9df9f 100644 --- a/src/backend/storage/page/bufpage.c +++ b/src/backend/storage/page/bufpage.c @@ -26,6 +26,8 @@ /* GUC variable */ bool ignore_checksum_failure = false;+int reserved_page_size = 0; /* how much page space to reserve for extended unencrypted metadata */ +/* ----------------------------------------------------------------
* Page support functions
@@ -43,7 +45,7 @@ PageInit(Page page, Size pageSize, Size specialSize)
{
	PageHeader	p = (PageHeader) page;

-	specialSize = MAXALIGN(specialSize);
+	specialSize = MAXALIGN(specialSize) + reserved_page_size;
Rather than make it part of specialSize, I would think we'd be better
off just treating them independently. Eg, the later pd_upper setting
would be done by:
p->pd_upper = pageSize - specialSize - reserved_page_size;
etc.
@@ -186,7 +188,7 @@ PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags)
  * one that is both unused and deallocated.
  *
  * If flag PAI_IS_HEAP is set, we enforce that there can't be more than
- * MaxHeapTuplesPerPage line pointers on the page.
+ * MaxHeapTuplesPerPage() line pointers on the page.
Making MaxHeapTuplesPerPage() runtime dynamic is a requirement for
supporting multiple page sizes concurrently ... but I'm not sure it's
actually required for the reserved_page_size idea as currently
considered. The reason is that with 8K or larger pages, the amount of
space we're already throwing away is at least 20 bytes, if I did my math
right. If we constrain reserved_page_size to be 20 bytes or less, as I
believe we're currently thinking we won't need that much, then we could
perhaps keep MaxHeapTuplesPerPage as a compile-time constant.
On the other hand, to the extent that we want to consider having
variable page sizes in the future, perhaps we do want to change this.
If so, the approach broadly looks reasonable to me, but I'd suggest we
make that a separate patch from the introduction of reserved_page_size.
@@ -211,7 +213,7 @@ PageAddItemExtended(Page page,
 	if (phdr->pd_lower < SizeOfPageHeaderData ||
 		phdr->pd_lower > phdr->pd_upper ||
 		phdr->pd_upper > phdr->pd_special ||
-		phdr->pd_special > BLCKSZ)
+		phdr->pd_special + reserved_page_size > BLCKSZ)
 		ereport(PANIC,
 				(errcode(ERRCODE_DATA_CORRUPTED),
 				 errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
Probably should add reserved_page_size to that errmsg output? Also,
this check of pointers seems to be done multiple times- maybe it should
be moved into a #define or similar?
@@ -723,7 +725,7 @@ PageRepairFragmentation(Page page)
 	if (pd_lower < SizeOfPageHeaderData ||
 		pd_lower > pd_upper ||
 		pd_upper > pd_special ||
-		pd_special > BLCKSZ ||
+		pd_special + reserved_page_size > BLCKSZ ||
 		pd_special != MAXALIGN(pd_special))
 		ereport(ERROR,
 				(errcode(ERRCODE_DATA_CORRUPTED),
This ends up being the same as above ...
@@ -1066,7 +1068,7 @@ PageIndexTupleDelete(Page page, OffsetNumber offnum)
 	if (phdr->pd_lower < SizeOfPageHeaderData ||
 		phdr->pd_lower > phdr->pd_upper ||
 		phdr->pd_upper > phdr->pd_special ||
-		phdr->pd_special > BLCKSZ ||
+		phdr->pd_special + reserved_page_size > BLCKSZ ||
 		phdr->pd_special != MAXALIGN(phdr->pd_special))
 		ereport(ERROR,
 				(errcode(ERRCODE_DATA_CORRUPTED),
And here ...
@@ -1201,7 +1203,7 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
 	if (pd_lower < SizeOfPageHeaderData ||
 		pd_lower > pd_upper ||
 		pd_upper > pd_special ||
-		pd_special > BLCKSZ ||
+		pd_special + reserved_page_size > BLCKSZ ||
 		pd_special != MAXALIGN(pd_special))
 		ereport(ERROR,
 				(errcode(ERRCODE_DATA_CORRUPTED),
And here ...
@@ -1307,7 +1309,7 @@ PageIndexTupleDeleteNoCompact(Page page, OffsetNumber offnum)
 	if (phdr->pd_lower < SizeOfPageHeaderData ||
 		phdr->pd_lower > phdr->pd_upper ||
 		phdr->pd_upper > phdr->pd_special ||
-		phdr->pd_special > BLCKSZ ||
+		phdr->pd_special + reserved_page_size > BLCKSZ ||
 		phdr->pd_special != MAXALIGN(phdr->pd_special))
 		ereport(ERROR,
 				(errcode(ERRCODE_DATA_CORRUPTED),
And here ...
@@ -1419,7 +1421,7 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 	if (phdr->pd_lower < SizeOfPageHeaderData ||
 		phdr->pd_lower > phdr->pd_upper ||
 		phdr->pd_upper > phdr->pd_special ||
-		phdr->pd_special > BLCKSZ ||
+		phdr->pd_special + reserved_page_size > BLCKSZ ||
 		phdr->pd_special != MAXALIGN(phdr->pd_special))
 		ereport(ERROR,
 				(errcode(ERRCODE_DATA_CORRUPTED),
And here ...
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 6979aff727..060c4ab3e3 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -489,12 +489,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	/*
 	 * Size Bloom filter based on estimated number of tuples in index,
 	 * while conservatively assuming that each block must contain at least
-	 * MaxTIDsPerBTreePage / 3 "plain" tuples -- see
+	 * MaxTIDsPerBTreePage() / 3 "plain" tuples -- see
 	 * bt_posting_plain_tuple() for definition, and details of how posting
 	 * list tuples are handled.
 	 */
 	total_pages = RelationGetNumberOfBlocks(rel);
-	total_elems = Max(total_pages * (MaxTIDsPerBTreePage / 3),
+	total_elems = Max(total_pages * (MaxTIDsPerBTreePage() / 3),
 					  (int64) state->rel->rd_rel->reltuples);
 
 	/* Generate a random seed to avoid repetition */
 	seed = pg_prng_uint64(&pg_global_prng_state);
Making MaxTIDsPerBTreePage dynamic looks to be required as it doesn't
end up with any 'leftover' space, from what I can tell. Again, though,
perhaps this should be split out as an independent patch from the rest.
That is- we can change the higher-level functions to be dynamic in the
initial patches, and then eventually we'll get down to making the
lower-level functions dynamic.
diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h
index efdf9415d1..8ebabdd7ee 100644
--- a/contrib/bloom/bloom.h
+++ b/contrib/bloom/bloom.h
@@ -131,7 +131,7 @@ typedef struct BloomMetaPageData
 #define BLOOM_MAGICK_NUMBER (0xDBAC0DED)
 
 /* Number of blocks numbers fit in BloomMetaPageData */
-#define BloomMetaBlockN		(sizeof(FreeBlockNumberArray) / sizeof(BlockNumber))
+#define BloomMetaBlockN()	((sizeof(FreeBlockNumberArray) - SizeOfPageReservedSpace()) / sizeof(BlockNumber))
 
 #define BloomPageGetMeta(page)	((BloomMetaPageData *) PageGetContents(page))
@@ -151,6 +151,7 @@ typedef struct BloomState
 #define BloomPageGetFreeSpace(state, page) \
 	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+	 - SizeOfPageReservedSpace() \
 	 - BloomPageGetMaxOffset(page) * (state)->sizeOfBloomTuple \
 	 - MAXALIGN(sizeof(BloomPageOpaqueData)))
This formulation (or something close to it) tends to happen quite a bit:
(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - SizeOfPageReservedSpace() ...
This is basically asking for "amount of usable space" where the
resulting 'usable space' either includes line pointers and tuples or
similar, or doesn't. Perhaps we should break this down into two
patches- one which provides a function to return usable space on a page,
and then the patch to add reserved_page_size can simply adjust that
instead of changing the very, very many places we have this formulation.
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index d935ed8fbd..d3d74a9d28 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -430,10 +430,10 @@ BloomFillMetapage(Relation index, Page metaPage)
 	 */
 	BloomInitPage(metaPage, BLOOM_META);
 	metadata = BloomPageGetMeta(metaPage);
-	memset(metadata, 0, sizeof(BloomMetaPageData));
+	memset(metadata, 0, sizeof(BloomMetaPageData) - SizeOfPageReservedSpace());
This doesn't seem quite right? The reserved space is off at the end of
the page and this is 0'ing the space immediately after the page header,
if I'm following correctly, and only to the size of BloomMetaPageData...
 	metadata->magickNumber = BLOOM_MAGICK_NUMBER;
 	metadata->opts = *opts;
-	((PageHeader) metaPage)->pd_lower += sizeof(BloomMetaPageData);
+	((PageHeader) metaPage)->pd_lower += sizeof(BloomMetaPageData) - SizeOfPageReservedSpace();
Not quite following what's going on here either.
diff --git a/contrib/bloom/blvacuum.c b/contrib/bloom/blvacuum.c
@@ -116,7 +116,7 @@ blbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	if (BloomPageGetMaxOffset(page) != 0 &&
 		BloomPageGetFreeSpace(&state, page) >= state.sizeOfBloomTuple &&
-		countPage < BloomMetaBlockN)
+		countPage < BloomMetaBlockN())
 		notFullPage[countPage++] = blkno;
Looks to be another opportunity to have a separate patch making this
change first before actually changing the lower-level #define's.
diff --git a/src/backend/access/brin/brin_tuple.c b/src/backend/access/brin/brin_tuple.c
@@ -217,7 +217,7 @@ brin_form_tuple(BrinDesc *brdesc, BlockNumber blkno, BrinMemTuple *tuple,
 	 * datatype, try to compress it in-line.
 	 */
 	if (!VARATT_IS_EXTENDED(DatumGetPointer(value)) &&
-		VARSIZE(DatumGetPointer(value)) > TOAST_INDEX_TARGET &&
+		VARSIZE(DatumGetPointer(value)) > TOAST_INDEX_TARGET() &&
 		(atttype->typstorage == TYPSTORAGE_EXTENDED ||
 		 atttype->typstorage == TYPSTORAGE_MAIN))
 	{
Probably could be another patch but also if we're going to change
TOAST_INDEX_TARGET to be a function we should probably not have it named
in all-CAPS.
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
@@ -535,7 +535,7 @@ dataBeginPlaceToPageLeaf(GinBtree btree, Buffer buf, GinBtreeStack *stack,
 		 * a single byte, and we can use all the free space on the old page as
 		 * well as the new page.  For simplicity, ignore segment overhead etc.
 		 */
-		maxitems = Min(maxitems, freespace + GinDataPageMaxDataSize);
+		maxitems = Min(maxitems, freespace + GinDataPageMaxDataSize());
 	}
 	else
 	{
Ditto.
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
@@ -38,8 +38,8 @@
 /* GUC parameter */
 int			gin_pending_list_limit = 0;
 
-#define GIN_PAGE_FREESIZE \
-	( BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - MAXALIGN(sizeof(GinPageOpaqueData)) )
+#define GIN_PAGE_FREESIZE() \
+	( BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - MAXALIGN(sizeof(GinPageOpaqueData)) - SizeOfPageReservedSpace() )
Another case of BLCKSZ - MAXALIGN(SizeOfPageHeaderData) -
SizeOfPageReservedSpace() ...
@@ -450,7 +450,7 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
 	 * ginInsertCleanup() should not be called inside our CRIT_SECTION.
 	 */
 	cleanupSize = GinGetPendingListCleanupSize(index);
-	if (metadata->nPendingPages * GIN_PAGE_FREESIZE > cleanupSize * 1024L)
+	if (metadata->nPendingPages * GIN_PAGE_FREESIZE() > cleanupSize * 1024L)
 		needCleanup = true;
Also shouldn't be all-CAPS.
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 43b67893d9..5babbb457a 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -156,7 +156,7 @@ _bt_findsplitloc(Relation rel,
 	/* Total free space available on a btree page, after fixed overhead */
 	leftspace = rightspace =
-		PageGetPageSize(origpage) - SizeOfPageHeaderData -
+		PageGetPageSize(origpage) - SizeOfPageHeaderData - SizeOfPageReservedSpace() -
 		MAXALIGN(sizeof(BTPageOpaqueData));
Also here ... though a bit interesting that this uses PageGetPageSize()
instead of BLCKSZ.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 011ec18015..022b5eee4e 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -154,3 +154,4 @@ int64		VacuumPageDirty = 0;
 int			VacuumCostBalance = 0;	/* working state for vacuum */
bool VacuumCostActive = false;
+
Unnecessary whitespace hunk ?
Thanks!
Stephen
On Fri, May 12, 2023 at 7:48 PM Stephen Frost <sfrost@snowman.net> wrote:
Greetings,
* David Christensen (david.christensen@crunchydata.com) wrote:
Refreshing this with HEAD as of today, v4.
Thanks for updating this!
Thanks for the patience in my response here.
Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
This space is reserved for extended data on the Page structure which will be ultimately used for
encrypted data, extended checksums, and potentially other things. This data appears at the end of
the Page, after any `pd_special` area, and will be calculated at runtime based on specific
ControlFile features.

No effort is made to ensure this is backwards-compatible with existing clusters for `pg_upgrade`, as
we will require logical replication to move data into a cluster with different settings here.

This initial patch, at least, does maintain pg_upgrade as the
reserved_page_size (maybe not a great name?) is set to 0, right?
Basically this is just introducing the concept of a reserved_page_size
and adjusting all of the code that currently uses BLCKSZ or
PageGetPageSize() to account for this extra space.
Correct; a reserved_page_size of 0 would be the same page format as
currently exists, so you could use pg_upgrade with no page features
and be binary compatible with existing clusters.
Looking at the changes to bufpage.h, in particular ...
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
@@ -19,6 +19,14 @@
#include "storage/item.h"
 #include "storage/off.h"
 
+extern PGDLLIMPORT int reserved_page_size;
+
+#define SizeOfPageReservedSpace() reserved_page_size
+#define MaxSizeOfPageReservedSpace 0
+
+/* strict upper bound on the amount of space occupied we have reserved on
+ * pages in this cluster */

This will eventually be calculated based on what features are supported
concurrently?
Correct; these are fleshed out in later patches.
@@ -36,10 +44,10 @@
  * |            v pd_upper                            |
  * +-------------+------------------------------------+
  * |             | tupleN ...                         |
- * +-------------+------------------+-----------------+
- * | ... tuple3 tuple2 tuple1       | "special space" |
- * +--------------------------------+-----------------+
- *                                  ^ pd_special
+ * +-------------+-----+------------+----+------------+
+ * | ... tuple2 tuple1 | "special space" | "reserved" |
+ * +-------------------+------------+----+------------+
+ *                     ^ pd_special      ^ reserved_page_space

Right, adds a dynamic amount of space 'post-special area'.
Dynamic as in "fixed at initdb time" instead of compile time. However,
things are coded in such a way that the page feature bitmap is stored
on a given page, so different pages could have different
reserved_page_size depending on use case/code path. (Basically
preserving future flexibility while minimizing code changes here.) We
could utilize different features depending on what type of page it is,
say, or have different relations or tablespaces with different page
feature defaults.
@@ -73,6 +81,8 @@
  * stored as the page trailer.  an access method should always
  * initialize its pages with PageInit and then set its own opaque
  * fields.
+ *
+ * XXX - update more comments here about reserved_page_space
  */

Would be good to do. ;)
Next revision... :D
@@ -325,7 +335,7 @@
 static inline void
 PageValidateSpecialPointer(Page page)
 {
 	Assert(page);
-	Assert(((PageHeader) page)->pd_special <= BLCKSZ);
+	Assert((((PageHeader) page)->pd_special + reserved_page_size) <= BLCKSZ);
 	Assert(((PageHeader) page)->pd_special >= SizeOfPageHeaderData);
 }

This is just one usage ... but seems like maybe we should be using
PageGetPageSize() here instead of BLCKSZ, and that more-or-less
throughout? Nearly everywhere we're using BLCKSZ today to give us that
compile-time advantage of a fixed block size is going to lose that
advantage anyway thanks to reserved_page_size being run-time. Now, one
up-side to this is that it'd also get us closer to being able to support
dynamic block sizes concurrently which would be quite interesting. That
is, a special tablespace with a 32KB block size while the rest are the
traditional 8KB. This would likely require multiple shared buffer
pools, of course...
I think multiple shared-buffer pools is a ways off; but sure, this
would support this sort of use case as well. I am working on a new
patch for this series (probably the first one in the series) which
will actually just abstract away all existing compile-time usages of
BLCKSZ. This will be a start in that direction and also make the
reserved_page_size patch a bit more reasonable to review.
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 9a302ddc30..a93cd9df9f 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -26,6 +26,8 @@
 /* GUC variable */
 bool		ignore_checksum_failure = false;
+int			reserved_page_size = 0;	/* how much page space to reserve for extended unencrypted metadata */
+
 /* ----------------------------------------------------------------
* Page support functions
@@ -43,7 +45,7 @@ PageInit(Page page, Size pageSize, Size specialSize)
{
	PageHeader	p = (PageHeader) page;

-	specialSize = MAXALIGN(specialSize);
+	specialSize = MAXALIGN(specialSize) + reserved_page_size;

Rather than make it part of specialSize, I would think we'd be better
off just treating them independently. Eg, the later pd_upper setting
would be done by:

p->pd_upper = pageSize - specialSize - reserved_page_size;
etc.
I can see that there's a mild readability benefit, but really the
effect is local to PageInit(), so ¯\_(ツ)_/¯... happy to make that
change though.
@@ -186,7 +188,7 @@ PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags)
  * one that is both unused and deallocated.
  *
  * If flag PAI_IS_HEAP is set, we enforce that there can't be more than
- * MaxHeapTuplesPerPage line pointers on the page.
+ * MaxHeapTuplesPerPage() line pointers on the page.

Making MaxHeapTuplesPerPage() runtime dynamic is a requirement for
supporting multiple page sizes concurrently ... but I'm not sure it's
actually required for the reserved_page_size idea as currently
considered. The reason is that with 8K or larger pages, the amount of
space we're already throwing away is at least 20 bytes, if I did my math
right. If we constrain reserved_page_size to be 20 bytes or less, as I
believe we're currently thinking we won't need that much, then we could
perhaps keep MaxHeapTuplesPerPage as a compile-time constant.
In this version we don't have that explicit constraint. In practice I
don't know that we have many more than 20 bytes, at least for the
first few features, but I don't think we can count on that forever
going forward. At some point we're going to have to parameterize
these, so might as well do it in this pass, since how else would you
know that this magic value has been exceeded?
On the other hand, to the extent that we want to consider having
variable page sizes in the future, perhaps we do want to change this.
If so, the approach broadly looks reasonable to me, but I'd suggest we
make that a separate patch from the introduction of reserved_page_size.
The variable blocksize patch I'm working on includes some of this, so
this will be in the next revision.
@@ -211,7 +213,7 @@ PageAddItemExtended(Page page,
 	if (phdr->pd_lower < SizeOfPageHeaderData ||
 		phdr->pd_lower > phdr->pd_upper ||
 		phdr->pd_upper > phdr->pd_special ||
-		phdr->pd_special > BLCKSZ)
+		phdr->pd_special + reserved_page_size > BLCKSZ)
 		ereport(PANIC,
 				(errcode(ERRCODE_DATA_CORRUPTED),
 				 errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",

Probably should add reserved_page_size to that errmsg output? Also,
this check of pointers seems to be done multiple times- maybe it should
be moved into a #define or similar?
Sure, can change; agreed it'd be good to have. I just modified the
existing call sites and didn't attempt to change too much else.
[snipped other instances...]
Making MaxTIDsPerBTreePage dynamic looks to be required as it doesn't
end up with any 'leftover' space, from what I can tell. Again, though,
perhaps this should be split out as an independent patch from the rest.
That is- we can change the higher-level functions to be dynamic in the
initial patches, and then eventually we'll get down to making the
lower-level functions dynamic.
Same; should be accounted for in the next variable blocksize patch.
It does have a cascading effect though, so hard to make the high-level
functions dynamic but not the lower-level ones. What is the benefit in
this case for separating those two?
diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h
index efdf9415d1..8ebabdd7ee 100644
--- a/contrib/bloom/bloom.h
+++ b/contrib/bloom/bloom.h
@@ -131,7 +131,7 @@ typedef struct BloomMetaPageData
 #define BLOOM_MAGICK_NUMBER (0xDBAC0DED)
 
 /* Number of blocks numbers fit in BloomMetaPageData */
-#define BloomMetaBlockN		(sizeof(FreeBlockNumberArray) / sizeof(BlockNumber))
+#define BloomMetaBlockN()	((sizeof(FreeBlockNumberArray) - SizeOfPageReservedSpace()) / sizeof(BlockNumber))
 
 #define BloomPageGetMeta(page)	((BloomMetaPageData *) PageGetContents(page))
@@ -151,6 +151,7 @@ typedef struct BloomState
 #define BloomPageGetFreeSpace(state, page) \
 	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+	 - SizeOfPageReservedSpace() \
 	 - BloomPageGetMaxOffset(page) * (state)->sizeOfBloomTuple \
 	 - MAXALIGN(sizeof(BloomPageOpaqueData)))

This formulation (or something close to it) tends to happen quite a bit:
(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - SizeOfPageReservedSpace() ...
This is basically asking for "amount of usable space" where the
resulting 'usable space' either includes line pointers and tuples or
similar, or doesn't. Perhaps we should break this down into two
patches- one which provides a function to return usable space on a page,
and then the patch to add reserved_page_size can simply adjust that
instead of changing the very, very many places we have this formulation.
Yeah, I can make this a computed expression; agreed it's pretty common
to have the usable space on the page so really any AM shouldn't know
or care about the details of either header or footer. Since we
already have PageGetContents() I will probably name it
PageGetContentsSize(). The AM can own everything from the pointer
returned by PageGetContents() through said size, allowing for both the
header and reserved_page_size in said computation.
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index d935ed8fbd..d3d74a9d28 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -430,10 +430,10 @@ BloomFillMetapage(Relation index, Page metaPage)
 	 */
 	BloomInitPage(metaPage, BLOOM_META);
 	metadata = BloomPageGetMeta(metaPage);
-	memset(metadata, 0, sizeof(BloomMetaPageData));
+	memset(metadata, 0, sizeof(BloomMetaPageData) - SizeOfPageReservedSpace());

This doesn't seem quite right? The reserved space is off at the end of
the page and this is 0'ing the space immediately after the page header,
if I'm following correctly, and only to the size of BloomMetaPageData...
I think you're correct with that analysis. BloomInitPage() would have
(probably?) had a zero'd page, so the under-sized memset would have gone
unnoticed in practice, but it's still good to fix.
 	metadata->magickNumber = BLOOM_MAGICK_NUMBER;
 	metadata->opts = *opts;
-	((PageHeader) metaPage)->pd_lower += sizeof(BloomMetaPageData);
+	((PageHeader) metaPage)->pd_lower += sizeof(BloomMetaPageData) - SizeOfPageReservedSpace();

Not quite following what's going on here either.
Heh, not sure either. Not sure if there was a reason or a mechanical
replacement, but will look when I do next revisions.
diff --git a/contrib/bloom/blvacuum.c b/contrib/bloom/blvacuum.c
@@ -116,7 +116,7 @@ blbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	if (BloomPageGetMaxOffset(page) != 0 &&
 		BloomPageGetFreeSpace(&state, page) >= state.sizeOfBloomTuple &&
-		countPage < BloomMetaBlockN)
+		countPage < BloomMetaBlockN())
 		notFullPage[countPage++] = blkno;

Looks to be another opportunity to have a separate patch making this
change first before actually changing the lower-level #define's.
#include <stdpatch/variable-blocksize>
diff --git a/src/backend/access/brin/brin_tuple.c b/src/backend/access/brin/brin_tuple.c
@@ -217,7 +217,7 @@ brin_form_tuple(BrinDesc *brdesc, BlockNumber blkno, BrinMemTuple *tuple,
 	 * datatype, try to compress it in-line.
 	 */
 	if (!VARATT_IS_EXTENDED(DatumGetPointer(value)) &&
-		VARSIZE(DatumGetPointer(value)) > TOAST_INDEX_TARGET &&
+		VARSIZE(DatumGetPointer(value)) > TOAST_INDEX_TARGET() &&
 		(atttype->typstorage == TYPSTORAGE_EXTENDED ||
 		 atttype->typstorage == TYPSTORAGE_MAIN))
 	{

Probably could be another patch but also if we're going to change
TOAST_INDEX_TARGET to be a function we should probably not have it named
in all-CAPS.
Okay, can make those style changes as well; agreed that all-caps naming should be reserved for constants.
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 43b67893d9..5babbb457a 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -156,7 +156,7 @@ _bt_findsplitloc(Relation rel,
 	/* Total free space available on a btree page, after fixed overhead */
 	leftspace = rightspace =
-		PageGetPageSize(origpage) - SizeOfPageHeaderData -
+		PageGetPageSize(origpage) - SizeOfPageHeaderData - SizeOfPageReservedSpace() -
 		MAXALIGN(sizeof(BTPageOpaqueData));

Also here ... though a bit interesting that this uses PageGetPageSize()
instead of BLCKSZ.
Yeah, a few little exceptions. Variable blocksize patch introduces
those every place it can, and ClusterBlockSize() anywhere it can't.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 011ec18015..022b5eee4e 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -154,3 +154,4 @@ int64		VacuumPageDirty = 0;
 int			VacuumCostBalance = 0;	/* working state for vacuum */
bool VacuumCostActive = false;
+

Unnecessary whitespace hunk ?
Will clean up.
Thanks for the review,
David
Hi,
On 2023-05-09 17:08:26 -0500, David Christensen wrote:
From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
From: David Christensen <david@pgguru.net>
Date: Tue, 9 May 2023 16:56:15 -0500
Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure

This space is reserved for extended data on the Page structure which will be ultimately used for
encrypted data, extended checksums, and potentially other things. This data appears at the end of
the Page, after any `pd_special` area, and will be calculated at runtime based on specific
ControlFile features.

No effort is made to ensure this is backwards-compatible with existing clusters for `pg_upgrade`, as
we will require logical replication to move data into a cluster with
different settings here.
The first part of the last paragraph makes it sound like pg_upgrade won't be
supported across this commit, rather than just between different settings...
I think as a whole this is not an insane idea. A few comments:
- IMO the patch touches many places it shouldn't need to touch, because of
essentially renaming a lot of existing macro names to *Limit,
necessitating modifying a lot of users. I think instead the few places that
care about the runtime limit should be modified.
As-is the patch would cause a lot of fallout in extensions that just do
things like defining an on-stack array of Datums or such - even though all
they'd need is to change the define to the *Limit one.
Even leaving extensions aside, it just makes reviewing (and I'm sure
maintaining) the patch very tedious.
- I'm a bit worried about how the extra special page will be managed - if
there are multiple features that want to use it, who gets to put their data
at what offset?
After writing this I saw that 0002 tries to address this - but I don't like
the design. It introduces runtime overhead that seems likely to be visible.
- Checking for features using PageGetFeatureOffset() seems the wrong design to
me - instead of a branch for some feature being disabled, perfectly
predictable for the CPU, we need to do an external function call every time
to figure out that, yes, checksums are *still* disabled.
- Recomputing offsets every time in PageGetFeatureOffset() seems too
expensive. The offsets can't change while running as PageGetFeatureOffset()
have enough information to distinguish between different kinds of relations
- so why do we need to recompute offsets on every single page? I'd instead
add a distinct offset variable for each feature.
- Modifying every single PageInit() call doesn't make sense to me. That'll
just create a lot of breakage for - as far as I can tell - no win.
- Why is it worth sacrificing space on every page to indicate which features
were enabled? I think there'd need to be some convincing reasons for
introducing such overhead.
- Is it really useful to encode the set of features enabled in a cluster with
a bitmask? That pretty much precludes utilizing extra page space in
extensions. We could instead just have an extra cluster-wide file that
defines a mapping of offset to feature.
Greetings,
Andres Freund
Greetings,
* Andres Freund (andres@anarazel.de) wrote:
On 2023-05-09 17:08:26 -0500, David Christensen wrote:
From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
From: David Christensen <david@pgguru.net>
Date: Tue, 9 May 2023 16:56:15 -0500
Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure

This space is reserved for extended data on the Page structure which will be ultimately used for
encrypted data, extended checksums, and potentially other things. This data appears at the end of
the Page, after any `pd_special` area, and will be calculated at runtime based on specific
ControlFile features.

No effort is made to ensure this is backwards-compatible with existing clusters for `pg_upgrade`, as
we will require logical replication to move data into a cluster with
different settings here.

The first part of the last paragraph makes it sound like pg_upgrade won't be
supported across this commit, rather than just between different settings...

I think as a whole this is not an insane idea. A few comments:
Thanks for all the feedback!
- Why is it worth sacrificing space on every page to indicate which features
were enabled? I think there'd need to be some convincing reasons for
introducing such overhead.
In conversations with folks (my memory specifically is a discussion with
Peter G, added to CC, and my apologies to Peter if I'm misremembering)
there was a pretty strong push that a page should be able to 'stand
alone' and not depend on something else (eg: pg_control, or whatever) to
provide the info needed to interpret the page. For my part, I don't
have a particularly strong feeling on that, but that's what lead to this
design.
Getting a consensus on if that's a requirement or not would definitely
be really helpful.
Thanks,
Stephen
On Wed, Nov 8, 2023 at 8:04 AM Stephen Frost <sfrost@snowman.net> wrote:
Greetings,
* Andres Freund (andres@anarazel.de) wrote:
On 2023-05-09 17:08:26 -0500, David Christensen wrote:
From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
From: David Christensen <david@pgguru.net>
Date: Tue, 9 May 2023 16:56:15 -0500
Subject: [PATCH v4 1/3] Add reserved_page_space to Page structureThis space is reserved for extended data on the Page structure which
will be ultimately used for
encrypted data, extended checksums, and potentially other things.
This data appears at the end of
the Page, after any `pd_special` area, and will be calculated at
runtime based on specific
ControlFile features.
No effort is made to ensure this is backwards-compatible with existing
clusters for `pg_upgrade`, as
we will require logical replication to move data into a cluster with
different settings here.

The first part of the last paragraph makes it sound like pg_upgrade
won't be
supported across this commit, rather than just between different
settings...
Yeah, that's vague, but you picked up on what I meant.
I think as a whole this is not an insane idea. A few comments:
Thanks for all the feedback!
- Why is it worth sacrificing space on every page to indicate which
features
were enabled? I think there'd need to be some convincing reasons for
introducing such overhead.

In conversations with folks (my memory specifically is a discussion with
Peter G, added to CC, and my apologies to Peter if I'm misremembering)
there was a pretty strong push that a page should be able to 'stand
alone' and not depend on something else (eg: pg_control, or whatever) to
provide the info needed to interpret the page. For my part, I don't
have a particularly strong feeling on that, but that's what lead to this
design.
Unsurprisingly, I agree that it's useful to keep these features on the
page itself; from a forensic standpoint it seems much easier to
interpret what is happening on a page, and it would also allow you to
have different features on a given page or type of page depending on
need. The initial patch utilizes pg_control to store the cluster page
features, but there's no reason it couldn't be dependent on fork/page
type or stored in pg_tablespace to utilize different features.
Thanks,
David
Greetings,
On Wed, Nov 8, 2023 at 20:55 David Christensen <
david.christensen@crunchydata.com> wrote:
On Wed, Nov 8, 2023 at 8:04 AM Stephen Frost <sfrost@snowman.net> wrote:
* Andres Freund (andres@anarazel.de) wrote:
On 2023-05-09 17:08:26 -0500, David Christensen wrote:
From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
From: David Christensen <david@pgguru.net>
Date: Tue, 9 May 2023 16:56:15 -0500
Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure

This space is reserved for extended data on the Page structure which
will be ultimately used for encrypted data, extended checksums, and
potentially other things. This data appears at the end of the Page,
after any `pd_special` area, and will be calculated at runtime based on
specific ControlFile features.
No effort is made to ensure this is backwards-compatible with existing
clusters for `pg_upgrade`, as we will require logical replication to
move data into a cluster with different settings here.

The first part of the last paragraph makes it sound like pg_upgrade
won't be supported across this commit, rather than just between
different settings...
Yeah, that's vague, but you picked up on what I meant.
I think as a whole this is not an insane idea. A few comments:
Thanks for all the feedback!
- Why is it worth sacrificing space on every page to indicate which
features were enabled? I think there'd need to be some convincing
reasons for introducing such overhead.

In conversations with folks (my memory specifically is a discussion with
Peter G, added to CC, and my apologies to Peter if I'm misremembering)
there was a pretty strong push that a page should be able to 'stand
alone' and not depend on something else (eg: pg_control, or whatever) to
provide the info needed to be able to interpret the page. For my part, I
don't have a particularly strong feeling on that, but that's what led to
this design.

Unsurprisingly, I agree that it's useful to keep these features on the
page itself; from a forensic standpoint it seems much easier to
interpret what is happening on a page, and it would also allow you to
have different features on a given page or type of page depending on
need. The initial patch utilizes pg_control to store the cluster page
features, but there's no reason it couldn't be dependent on fork/page
type or stored in pg_tablespace to utilize different features.
When it comes to authenticated encryption, it’s also the case that it’s
unclear what value the checksum field has, if any… it’s certainly not
directly needed as a checksum, as the auth tag is much better for the
purpose of seeing if the page has been changed in some way. It’s also not
big enough to serve as an auth tag per NIST guidelines regarding the size
of the authenticated data vs. the size of the tag. Using it to indicate
what features are enabled on the page seems pretty useful, as David notes.
Thanks,
Stephen
On Wed, Nov 8, 2023 at 6:04 AM Stephen Frost <sfrost@snowman.net> wrote:
In conversations with folks (my memory specifically is a discussion with
Peter G, added to CC, and my apologies to Peter if I'm misremembering)
there was a pretty strong push that a page should be able to 'stand
alone' and not depend on something else (eg: pg_control, or whatever) to
provide the info needed to be able to interpret the page. For my part, I
don't have a particularly strong feeling on that, but that's what led to
this design.
The term that I have used in the past is "self-contained". Meaning
capable of being decoded more or less as-is, without any metadata, by
tools like pg_filedump.
Any design in this area should try to make things as easy to debug as
possible, for the obvious reason: encrypted data that somehow becomes
corrupt is bound to be a nightmare to debug. (Besides, we already
support tools like pg_filedump, so this isn't a new principle.)
--
Peter Geoghegan
On Tue, Nov 7, 2023 at 6:20 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-05-09 17:08:26 -0500, David Christensen wrote:
From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
From: David Christensen <david@pgguru.net>
Date: Tue, 9 May 2023 16:56:15 -0500
Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure

This space is reserved for extended data on the Page structure which
will be ultimately used for encrypted data, extended checksums, and
potentially other things. This data appears at the end of the Page,
after any `pd_special` area, and will be calculated at runtime based on
specific ControlFile features.
No effort is made to ensure this is backwards-compatible with existing
clusters for `pg_upgrade`, as we will require logical replication to
move data into a cluster with different settings here.

The first part of the last paragraph makes it sound like pg_upgrade
won't be supported across this commit, rather than just between
different settings...
Thanks for the review.
I think as a whole this is not an insane idea. A few comments:
- IMO the patch touches many places it shouldn't need to touch, because
of essentially renaming a lot of existing macro names to *Limit,
necessitating modifying a lot of users. I think instead the few places
that care about the runtime limit should be modified.

As-is the patch would cause a lot of fallout in extensions that just do
things like defining an on-stack array of Datums or such - even though
all they'd need is to change the define to the *Limit one.

Even leaving extensions aside, it makes reviewing (and I'm sure
maintaining) the patch very tedious.
You make a good point, and I think you're right that we could teach the
places that care about runtime vs compile-time differences about the
changes while leaving other callers alone. The *Limit ones were
introduced since we need constant values here from the Calc...() macros,
but we could try keeping the existing *Limit with the old name and
switching things around. I suspect there will be the same amount of code
churn, but less mechanical.
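To illustrate the shape being discussed (simplified stand-in numbers and
names, not the real PostgreSQL header sizes or macros), the idea is that
the compile-time macro stays as a hard upper bound, safe for stack
arrays, while the runtime value shrinks by whatever the cluster
reserves:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; these names and values are illustrative, not
 * the actual PostgreSQL definitions. */
#define BLCKSZ            8192
#define PAGE_HEADER_SIZE  24
#define LINE_POINTER_SIZE 4
#define MIN_TUPLE_SIZE    32

/*
 * Compile-time limit: the most tuples any page could hold when
 * reserved_page_space is zero.  Safe as a stack-array bound in
 * extensions, since the runtime value can only be smaller.
 */
#define MaxTuplesPerPageLimit \
    ((BLCKSZ - PAGE_HEADER_SIZE) / (LINE_POINTER_SIZE + MIN_TUPLE_SIZE))

/* set once at startup (from pg_control in the patch's design) */
static size_t reserved_page_space = 0;

/* runtime limit, reduced by the trailing space this cluster reserves */
static int
max_tuples_per_page(void)
{
    return (int) ((BLCKSZ - PAGE_HEADER_SIZE - reserved_page_space)
                  / (LINE_POINTER_SIZE + MIN_TUPLE_SIZE));
}
```

Only the few call sites that size real page content would need the
runtime function; everything else keeps using the constant.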
- I'm a bit worried about how the extra special page will be managed -
if there are multiple features that want to use it, who gets to put
their data at what offset?

After writing this I saw that 0002 tries to address this - but I don't
like the design. It introduces runtime overhead that seems likely to be
visible.
Agreed this could be optimized.
- Checking for features using PageGetFeatureOffset() seems the wrong
design to me - instead of a branch for some feature being disabled,
perfectly predictable for the CPU, we need to do an external function
call every time to figure out that yes, checksums are *still* disabled.
This is probably not a supported approach (it felt a little icky), but
I'd played around with const pointers to structs of const elements,
where the initial value of a global var was populated early on (so set
once and never changed post-init), and the compiler didn't complain and
things seemed to work ok; not sure if this approach might help balance
the early mutability and constant lookup needs:

typedef struct PageFeatureOffsets
{
    const Size feature0offset;
    const Size feature1offset;
    ...
} PageFeatureOffsets;

static PageFeatureOffsets offsets = {0};
const PageFeatureOffsets *exposedOffsets = &offsets;

/* called once at startup, before any readers exist; note that writing
 * through a const-stripping cast is what makes this feel unsupported */
void
InitOffsets(void)
{
    *((Size *) &offsets.feature0offset) = ...;
    *((Size *) &offsets.feature1offset) = ...;
    ...
}
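For what it's worth, a self-contained version of that sketch (a
hypothetical two-feature layout with made-up offset values) does compile
and run, though modifying a const-qualified member through a cast is
formally undefined behavior in C, even if common compilers accept it:

```c
#include <assert.h>
#include <stddef.h>

typedef size_t Size;

typedef struct PageFeatureOffsets
{
    const Size feature0offset;
    const Size feature1offset;
} PageFeatureOffsets;

/* zero-initialized at load time; readers only ever see the const view */
static PageFeatureOffsets offsets = {0, 0};
const PageFeatureOffsets *exposedOffsets = &offsets;

/*
 * One-time initialization, run before any readers exist.  Casting const
 * away to store into a const-qualified member is undefined behavior per
 * the C standard - the "icky" part - though it works in practice.
 */
static void
InitOffsets(void)
{
    *((Size *) &offsets.feature0offset) = 8184; /* e.g. BLCKSZ - 8 */
    *((Size *) &offsets.feature1offset) = 8180; /* e.g. BLCKSZ - 12 */
}
```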
- Recomputing offsets every time in PageGetFeatureOffset() seems too
expensive. The offsets can't change while running, as
PageGetFeatureOffset() doesn't have enough information to distinguish
between different kinds of relations
Yes, this was a simple approach for ease of implementation; there is
certainly a way to precompute a lookup table from the page feature bitmask
into the offsets themselves or otherwise precompute, turn from function
call into inline/macro, etc.
- so why do we need to recompute offsets on every single page? I'd
instead
add a distinct offset variable for each feature.
This would work iff there is a single page feature set across all pages in
the cluster; I'm not sure we don't want more flexibility here.
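The precompute idea mentioned above might look something like this
(hypothetical names and feature sizes, assuming a single cluster-wide
feature set): build a per-feature offset array once at startup, so the
hot-path lookup collapses to an array load:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ            8192
#define MAX_PAGE_FEATURES 3

/* Illustrative per-feature sizes, e.g. an 8-byte tag for feature 0. */
static const size_t feature_sizes[MAX_PAGE_FEATURES] = {8, 4, 16};

/* 0 means "feature not enabled in this cluster" */
static size_t feature_offsets[MAX_PAGE_FEATURES];
static size_t reserved_page_space;

/* run once at startup from the cluster's feature bitmask */
static void
init_feature_offsets(uint16_t cluster_mask)
{
    size_t off = BLCKSZ;

    for (int f = 0; f < MAX_PAGE_FEATURES; f++)
    {
        if (cluster_mask & (1u << f))
        {
            off -= feature_sizes[f];
            feature_offsets[f] = off;
        }
    }
    reserved_page_space = BLCKSZ - off;
}

/* no recomputation and no external function call on the hot path */
static inline size_t
page_feature_offset(int feature)
{
    return feature_offsets[feature];
}
```

A per-fork or per-page-type variant would just mean one such array per
distinct feature set rather than one global.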
- Modifying every single PageInit() call doesn't make sense to me. That'll
just create a lot of breakage for - as far as I can tell - no win.
This was a placeholder to allow different features depending on page type;
to keep things simple for now I just used the same values here, but we
could move this inside PageInit() instead (again, assuming single feature
set per cluster).
- Why is it worth sacrificing space on every page to indicate which
features
were enabled? I think there'd need to be some convincing reasons for
introducing such overhead.
The point here is if we can use either GCM authtag or stronger checksums
then we've gained the ability to authenticate the page contents at the cost
of reassigning those bits, in a way that would support variable
permutations of features for different relations or page types, if so
desired. A single global setting here both eliminates that possibility as
well as requires external data in order to fully interpret pages.
- Is it really useful to encode the set of features enabled in a cluster
with
a bitmask? That pretty much precludes utilizing extra page space in
extensions. We could instead just have an extra cluster-wide file that
defines a mapping of offset to feature.
Given the current design, yes we do need that, which does make it harder
to allocate/use from an extension. Due to needing to have consistent
offsets for a given feature set (however represented on a page), the
implementation load going forward as-is involves ensuring that a given
bit always maps to the same offset in the page regardless of additional
features available in the future. So the 0'th bit, if enabled, would
always map to the 8-byte chunk at the end of the page, the 1st bit would
correspond to some amount of space prior to that, etc. I'm not sure how
to get that property without some sort of bitmap or otherwise indexed
operation.
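The bit-to-offset rule described above can be written down directly
(illustrative sizes, not from the patch): walk the mask from bit 0
upward, carving each enabled feature's space off the end of the page, so
a feature's offset depends only on lower-numbered bits and defining new
(higher) feature bits later never moves existing features' data:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ 8192

/* Illustrative sizes: bit 0 claims the trailing 8-byte chunk, as in the
 * description above; higher bits allocate space just before it. */
static const size_t feature_sizes[] = {8, 4, 16};

/*
 * Offset of an enabled feature under the stable bit->offset mapping:
 * each enabled bit, from 0 upward, takes its space from the current end
 * of the usable page.  Adding future feature bits above 'feature'
 * cannot change the result.
 */
static size_t
feature_offset_for_mask(uint16_t mask, int feature)
{
    size_t off = BLCKSZ;

    for (int f = 0; f <= feature; f++)
        if (mask & (1u << f))
            off -= feature_sizes[f];
    return off;
}
```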
I get what you're saying as far as the more global approach, and while that
does lend itself to some nice properties in terms of extensibility, some of
the features (GCM tags in particular) need to be able to control the page
offset at a consistent location so we can decode the rest of the page
without knowing anything else.
Additionally, since the reserved space/page features are configured at
initdb time I am unclear how a given extension would even be able to stake
a claim here. ...though if we consider this a two-part problem, one of
space reservation and one of space usage, that part could be handled via
allocating more than the minimum in the reserved_page_space and allowing
unallocated page space to be claimed later via some sort of additional
functions/other hook. That opens up other questions though, tracking
whether said space has ever been initialized and what to do when first
accessing existing/new pages as one example.
Best,
David
Hi,
On 2023-11-08 18:47:56 -0800, Peter Geoghegan wrote:
On Wed, Nov 8, 2023 at 6:04 AM Stephen Frost <sfrost@snowman.net> wrote:
In conversations with folks (my memory specifically is a discussion with
Peter G, added to CC, and my apologies to Peter if I'm misremembering)
there was a pretty strong push that a page should be able to 'stand
alone' and not depend on something else (eg: pg_control, or whatever) to
provide the info needed to be able to interpret the page. For my part, I
don't have a particularly strong feeling on that, but that's what led to
this design.

The term that I have used in the past is "self-contained". Meaning
capable of being decoded more or less as-is, without any metadata, by
tools like pg_filedump.
I'm not finding that very convincing - without cluster wide data, like keys, a
tool like pg_filedump isn't going to be able to do much with encrypted
pages. Given the need to look at some global data, figuring out the offset at
which data starts based on a value in pg_control isn't meaningfully worse than
having the data on each page.
Storing redundant data in each page header, when we've wanted space in
the page header for plenty of other things, just doesn't seem a good use
of said space.
Greetings,
Andres Freund
On Mon, Nov 13, 2023 at 2:27 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-11-08 18:47:56 -0800, Peter Geoghegan wrote:
On Wed, Nov 8, 2023 at 6:04 AM Stephen Frost <sfrost@snowman.net> wrote:
In conversations with folks (my memory specifically is a discussion with
Peter G, added to CC, and my apologies to Peter if I'm misremembering)
there was a pretty strong push that a page should be able to 'stand
alone' and not depend on something else (eg: pg_control, or whatever) to
provide the info needed to be able to interpret the page. For my part, I
don't have a particularly strong feeling on that, but that's what led to
this design.

The term that I have used in the past is "self-contained". Meaning
capable of being decoded more or less as-is, without any metadata, by
tools like pg_filedump.

I'm not finding that very convincing - without cluster wide data, like
keys, a tool like pg_filedump isn't going to be able to do much with
encrypted pages. Given the need to look at some global data, figuring
out the offset at which data starts based on a value in pg_control isn't
meaningfully worse than having the data on each page.

Storing redundant data in each page header, when we've wanted space in
the page header for plenty of other things, just doesn't seem a good use
of said space.
This scheme would open up space per page that would now be available for
plenty of other things; the encoding in the header and the corresponding
available space in the footer would seem to open up quite a few options
now, no?