MaxOffsetNumber for Table AMs

Started by Jeff Davisalmost 5 years ago88 messageshackers
Jump to latest
#1Jeff Davis
pgsql@j-davis.com

The notion of TID is based on pages and line pointers, which makes
sense for heapam, but that's not likely to make sense for a custom
table AM.

The obvious answer is to make a simple mapping between a TID and
whatever makes sense to the AM (for the sake of discussion, let's say a
plain row number).

The most natural thing would be to say that we have 48 bits, so it can
just be a 48-bit number. Of course, there are some restrictions on
valid values that complicate this:

* InvalidBlockNumber of 0xFFFFFFFF. Not a problem.
* InvalidOffsetNumber of 0. Not a problem.
* MaxOffsetNumber of 2048. Does this limit really apply to table AMs?
It just seems like it's used when scanning heap or index pages for
stack-allocated arrays. For a table AM it appears to waste 5 bits.
* ginpostinglist.c:itemptr_to_uint64() seems to think 2047 is the max
offset number. Is this a bug?

As a table AM author, I'd like to know what the real limits are so that
I can use whatever bits are available to map from TID to row number and
back, without worrying that something will break in the future. A few
possibilities:

1. Keep MaxOffsetNumber as 2048 and fix itemptr_to_uint64().
2. Change MaxOffsetNumber to 2047. This seems likely to break
extensions that rely on it.
3. Define MaxOffsetNumber as 65536 and introduce a new
MaxItemsPerPage as 2048 for the stack-allocated arrays. We'd still need
to fix itemptr_to_uint64().

Thoughts?

Regards,
Jeff Davis

#2Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Jeff Davis (#1)
Re: MaxOffsetNumber for Table AMs

On Fri, 30 Apr 2021, 09:46 Jeff Davis, <pgsql@j-davis.com> wrote:

The notion of TID is based on pages and line pointers, which makes
sense for heapam, but that's not likely to make sense for a custom
table AM.

The obvious answer is to make a simple mapping between a TID and
whatever makes sense to the AM (for the sake of discussion, let's say a
plain row number).

The most natural thing would be to say that we have 48 bits, so it can
just be a 48-bit number. Of course, there are some restrictions on
valid values that complicate this:

* InvalidBlockNumber of 0xFFFFFFFF. Not a problem.
* InvalidOffsetNumber of 0. Not a problem.
* MaxOffsetNumber of 2048. Does this limit really apply to table AMs?

MaxOffsetNumber is not per se 2048. It is BLCKSZ / sizeof(ItemIdData),
which is only 2048 for a 8kiB BLCKSZ. As we support BLCKSZ up to 32kiB,
MaxOffsetNumber can be as large as 8196.

Other than that, I believe you've also missed the special offset numbers
specified in itemptr.h (SpecTokenOffsetNumber and MovedPartitionsOffsetNumber).
I am not well enough aware of the usage of these OffsetNumber values, but
those might also be limiting the values any tableAM can use for their TIDs.

It just seems like it's used when scanning heap or index pages for

stack-allocated arrays. For a table AM it appears to waste 5 bits.

MaxOffsetNumber is used for postgres' Page layout, of which the
MaxOffsetNumber is defined as how many item pointers could exist on a page,
and AFAIK should be used for postgres' Page layout only. No thing can or
should change that. If any code asserts limitations to the ip_posid of
table tuples that could also not be tableam tuples, then I believe that is
probably a mistake in postgres, and that should be fixed.

* ginpostinglist.c:itemptr_to_uint64() seems to think 2047 is the max

offset number. Is this a bug?

No. The documentation for that function explicitly mentions that these item
pointers are optimized for storage when using the heap tableam, and that
that code will be changed once there exist tableAMs with different TID /
ip_posid constraints (see the comment on lines 32 and 33 of that file).

Note that the limiting number that itemptr_to_uint64 should mention for bit
calculations is actually MaxHeaptuplesPerPage, which is about one seventh
of MaxOffsetNumber. The resulting number of bits reserved is not a
miscalculation though, because MaxHeaptuplesPerPage (for 32kiB BLCKSZ)
requires the mentioned 11 bits, and adapting bit swizzling for multiple
page sizes was apparently not considered worth the effort.

As a table AM author, I'd like to know what the real limits are so that

I can use whatever bits are available to map from TID to row number and
back, without worrying that something will break in the future. A few
possibilities:

1. Keep MaxOffsetNumber as 2048 and fix itemptr_to_uint64().

I believe that this is the right way when there exist tableAMs that use
those upper 5 bits.

2. Change MaxOffsetNumber to 2047. This seems likely to break
extensions that rely on it.

If you're going to change MaxOffsetNumber, I believe that it's better to
change it to ((BLCKSZ - sizeof(PageHeaderData)) / sizeof(ItemIdData)), as
that is the maximum amount of ItemIds you could put on a Page that has no
page opaque.

3. Define MaxOffsetNumber as 65536 and introduce a new

MaxItemsPerPage as 2048 for the stack-allocated arrays. We'd still need
to fix itemptr_to_uint64().

I believe that that breaks more things than otherwise required. ip_posid is
already limited to uint16, so I see no reason to add a constant that would
assert that the value of any uint16 is less its max value plus one.

With regards,

Matthias van de Meent

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Davis (#1)
Re: MaxOffsetNumber for Table AMs

Jeff Davis <pgsql@j-davis.com> writes:

The notion of TID is based on pages and line pointers, which makes
sense for heapam, but that's not likely to make sense for a custom
table AM.
The obvious answer is to make a simple mapping between a TID and
whatever makes sense to the AM (for the sake of discussion, let's say a
plain row number).

I'm inclined to think that when we get around to doing something
about this, we need to make a much bigger change than just poking
at the margins of type tid.

My thought at the moment is that all APIs above the AM level ought
to be redefined to use uint64 for tuple identifiers. heapam and
related index AMs could map block + offset into that in some
convenient way, and other AMs could do what they like.

Andres seems to feel that we should try to allow variable-width
tupleids, but I'm afraid that the cost/benefit ratio for that
would be pretty poor.

regards, tom lane

In reply to: Tom Lane (#3)
Re: MaxOffsetNumber for Table AMs

On Fri, Apr 30, 2021 at 8:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

My thought at the moment is that all APIs above the AM level ought
to be redefined to use uint64 for tuple identifiers. heapam and
related index AMs could map block + offset into that in some
convenient way, and other AMs could do what they like.

Andres seems to feel that we should try to allow variable-width
tupleids, but I'm afraid that the cost/benefit ratio for that
would be pretty poor.

I agree. It'll be easier for a new table AM to be developed with that
constraint than it will be to retrofit it to every index AM. It
probably wouldn't be that hard to make nbtree deduplication and GIN
posting list compression work with uint64 TIDs. But variable-width
TIDs are a very different matter.

Compatibility with index AMs is more than a matter of switching out
the tuple identifier -- if you invent something that has totally
different performance characteristics for index AMs, then it's likely
to break tacit assumptions about the cost of certain things. For
example, index tuple deletion probably relies on the fact that there
just isn't that many table blocks to visit (to get an XID for recovery
conflict purposes) in practice due to various locality-related
effects. Plus deduplication ameliorates problems with version churn in
indexes -- presumably the same problems will exist when any new table
AM is used, and so it'll be useful to address the same problems in the
same way.

I agree that it might well be useful to make TIDs fully logical (not
"physiological" -- physical across blocks, logical within blocks) for
some new table AM. Even then, it would still definitely be a good idea
to make these logical TID values correlate with the physical table
structure in some way. Plus TIDs should still be fixed size. If a new
table AM can't do it that way then that certainly needs to be
justified -- it's unreasonable to imagine that it simply isn't the
table AM's problem to solve.

--
Peter Geoghegan

#5Jeff Davis
pgsql@j-davis.com
In reply to: Tom Lane (#3)
Re: MaxOffsetNumber for Table AMs

On Fri, 2021-04-30 at 11:06 -0400, Tom Lane wrote:

My thought at the moment is that all APIs above the AM level ought
to be redefined to use uint64 for tuple identifiers.

One challenge might be reliance on InvalidOffsetNumber as a special
value in a number of places (e.g. bitmap index scans). That doesn't
seem like a major problem though.

heapam and
related index AMs could map block + offset into that in some
convenient way, and other AMs could do what they like.

Do you mean that indexes would be expected to hold a uint64, a 48-bit
int (that directly maps to a uint64), or still hold an ItemPointerData?

Regards,
Jeff Davis

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Davis (#5)
Re: MaxOffsetNumber for Table AMs

Jeff Davis <pgsql@j-davis.com> writes:

On Fri, 2021-04-30 at 11:06 -0400, Tom Lane wrote:

My thought at the moment is that all APIs above the AM level ought
to be redefined to use uint64 for tuple identifiers.

Do you mean that indexes would be expected to hold a uint64, a 48-bit
int (that directly maps to a uint64), or still hold an ItemPointerData?

ISTM that would be up to the index AM. We'd need some interlocks on
which index AMs could be used with which table AMs in any case, I think.

It'd likely not be hard for existing index AMs to be repurposed to store
"any 48-bit TID", but extending them to full 64-bit TIDs may be
impractical.

I think the hard part may really be in places like tidbitmap.c, which
one would wish to be AM-independent, but right now it's quite specific
to heap-style TIDs. Maybe we can think of a way to parameterize it.

regards, tom lane

#7Jeff Davis
pgsql@j-davis.com
In reply to: Matthias van de Meent (#2)
Re: MaxOffsetNumber for Table AMs

On Fri, 2021-04-30 at 12:04 +0200, Matthias van de Meent wrote:

Other than that, I believe you've also missed the special offset
numbers specified in itemptr.h (SpecTokenOffsetNumber and
MovedPartitionsOffsetNumber). I am not well enough aware of the usage
of these OffsetNumber values, but those might also be limiting the
values any tableAM can use for their TIDs.

Yes, thank you. If it is treated specially in a heap tuple, it can't be
a regular TID.

It just seems like it's used when scanning heap or index pages for
stack-allocated arrays. For a table AM it appears to waste 5 bits.

MaxOffsetNumber is used for postgres' Page layout, of which the
MaxOffsetNumber is defined as how many item pointers could exist on a
page, and AFAIK should be used for postgres' Page layout only. No
thing can or should change that. If any code asserts limitations to
the ip_posid of table tuples that could also not be tableam tuples,
then I believe that is probably a mistake in postgres, and that
should be fixed.

A name that would better fit your definition would be something like
"MaxItemsPerPage".

The name "MaxOffsetNumber" implies that any number past that must be
either invalid or special. But it seems like you are saying that if I
use an offset number of 5000 in my table AM, then that's fine and
should be treated like a normal TID.

No. The documentation for that function explicitly mentions that
these item pointers are optimized for storage when using the heap
tableam, and that that code will be changed once there exist tableAMs
with different TID / ip_posid constraints (see the comment on lines
32 and 33 of that file).

Thank you.

I'm a table AM author, and I'd like to use whatever the real range of
TIDs is. Does that mean it's time to change that code in
ginpostinglist.c now?

1. Keep MaxOffsetNumber as 2048 and fix itemptr_to_uint64().

I believe that this is the right way when there exist tableAMs that
use those upper 5 bits.

Does that mean we should declare the valid range of offsets to be
between 1 and 0xfffc (inclusive)?

I'm trying to use some mapping now that's somewhat stable so that I
don't have to worry that something will break later, and then require
reindexing all tables with my table AM.

Regards,
Jeff Davis

#8Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#3)
Re: MaxOffsetNumber for Table AMs

On Fri, Apr 30, 2021 at 11:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

My thought at the moment is that all APIs above the AM level ought
to be redefined to use uint64 for tuple identifiers. heapam and
related index AMs could map block + offset into that in some
convenient way, and other AMs could do what they like.

Andres seems to feel that we should try to allow variable-width
tupleids, but I'm afraid that the cost/benefit ratio for that
would be pretty poor.

There are two major reasons why I want variable-width tuple IDs. One
is global indexes, where you need as many bits as the AMs implementing
the partitions need, plus some extra bits to identify which partition
is relevant for a particular tuple. No fixed number of bits that you
make available can ever be sufficient here, because a global index
always needs to have extra bits compared to a partition-local index;
if you let the partition-local index use more bits, the global index
now needs even more space. The other is indirect indexes, work Álvaro
proposed a few years ago, where the index entry points to the primary
key value rather than a TID. The space needs for that are based on the
type of the primary key column. This proposal solves neither of those
problems.

Another problem in this general area is that there is quite a bit of
code that thinks a TID is specifically a block number and an offset,
like the Bitmap Index/Heap Scan code, for example. But making tuple
IDs wider doesn't help with that problem either.

What problem do you think this proposal does solve?

--
Robert Haas
EDB: http://www.enterprisedb.com

#9Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#4)
Re: MaxOffsetNumber for Table AMs

On Fri, 2021-04-30 at 08:36 -0700, Peter Geoghegan wrote:

Compatibility with index AMs is more than a matter of switching out
the tuple identifier -- if you invent something that has totally
different performance characteristics for index AMs, then it's likely
to break tacit assumptions about the cost of certain things.

I think it would be reasonable to document and expect that table AMs
offer some locality of access for tuples with similar IDs. Do you think
we need something stronger than that?

Plus deduplication ameliorates problems with version churn
in
indexes -- presumably the same problems will exist when any new table
AM is used, and so it'll be useful to address the same problems in
the
same way.

I got lost after "presumably the same problems", can you explain?

I agree that it might well be useful to make TIDs fully logical (not
"physiological" -- physical across blocks, logical within blocks) for
some new table AM. Even then, it would still definitely be a good
idea
to make these logical TID values correlate with the physical table
structure in some way.

Agreed.

Regards,
Jeff Davis

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#8)
Re: MaxOffsetNumber for Table AMs

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Apr 30, 2021 at 11:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andres seems to feel that we should try to allow variable-width
tupleids, but I'm afraid that the cost/benefit ratio for that
would be pretty poor.

There are two major reasons why I want variable-width tuple IDs. One
is global indexes, where you need as many bits as the AMs implementing
the partitions need, plus some extra bits to identify which partition
is relevant for a particular tuple. No fixed number of bits that you
make available can ever be sufficient here,

I agree that global indexes need more bits, but it doesn't necessarily
follow that we must have variable-width TIDs. We could for example
say that "real" TIDs are only 48 bits and index AMs that want to be
usable as global indexes must be capable of handling 64-bit TIDs,
leaving 16 bits for partition ID. A more forward-looking definition
would require global index AMs to store 96 bits (partition OID plus
64-bit TID). Either way would be far simpler for every moving part
involved than going over to full varlena TIDs.

Another problem in this general area is that there is quite a bit of
code that thinks a TID is specifically a block number and an offset,
like the Bitmap Index/Heap Scan code, for example. But making tuple
IDs wider doesn't help with that problem either.

Agreed, that's an area that will need a lot of thought for anything that
we do here. But varlena TIDs surely do not make that easier to fix.

What problem do you think this proposal does solve?

Accommodating table AMs that want more than 48 bits for a TID.
We're already starting to run up against the fact that that's not
enough bits for plausible use-cases. 64 bits may someday in the far
future not be enough either, but I think that's a very long way off.

regards, tom lane

In reply to: Tom Lane (#10)
Re: MaxOffsetNumber for Table AMs

On Fri, Apr 30, 2021 at 10:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

There are two major reasons why I want variable-width tuple IDs. One
is global indexes, where you need as many bits as the AMs implementing
the partitions need, plus some extra bits to identify which partition
is relevant for a particular tuple. No fixed number of bits that you
make available can ever be sufficient here,

I agree that global indexes need more bits, but it doesn't necessarily
follow that we must have variable-width TIDs. We could for example
say that "real" TIDs are only 48 bits and index AMs that want to be
usable as global indexes must be capable of handling 64-bit TIDs,
leaving 16 bits for partition ID. A more forward-looking definition
would require global index AMs to store 96 bits (partition OID plus
64-bit TID). Either way would be far simpler for every moving part
involved than going over to full varlena TIDs.

The question of how the on-disk format on indexes needs to be changed
to accomodate global indexes seems like an entirely separate question
to how we go about expanding or redefining TIDs.

Global indexes should work by adding an extra column that is somewhat
like a TID, that may even have its own pg_attribute entry. It's much
more natural to make the partition number a separate column IMV --
nbtree suffix truncation and deduplication can work in about the same
way as before. Plus you'll need to do predicate pushdown using the
partition identifier in some scenarios anyway. You can make the
partition identifier variable-width without imposing the cost and
complexity of variable-width TIDs on index AMs.

I believe that the main reason why there have been so few problems
with any of the nbtree work in the past few releases is that it
avoided certain kinds of special cases. Any special cases in the
on-disk format and in the space accounting used when choosing a split
point ought to be avoided at all costs. We can probably afford to add
a lot of complexity to make global indexes work, but it ought to be
contained to cases that actually use global indexes in an obvious way.

--
Peter Geoghegan

#12Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#10)
Re: MaxOffsetNumber for Table AMs

On Fri, Apr 30, 2021 at 1:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I agree that global indexes need more bits, but it doesn't necessarily
follow that we must have variable-width TIDs. We could for example
say that "real" TIDs are only 48 bits and index AMs that want to be
usable as global indexes must be capable of handling 64-bit TIDs,
leaving 16 bits for partition ID. A more forward-looking definition
would require global index AMs to store 96 bits (partition OID plus
64-bit TID). Either way would be far simpler for every moving part
involved than going over to full varlena TIDs.

16 bits is not much for a partition identifier. We've already had
complaints about INNER_VAR being too small, so apparently there are
people who want to use really large numbers of partitions. But even if
we imagine a hypothetical world where nobody uses more than a couple
thousand partitions at once, it's very reasonable to want to avoid
recycling partition identifiers so that detaching a partition can be
O(1), and there's no way that's going to be viable if the whole
address space is only 16 bits, because with time series data people
are going to be continually creating new partitions and dropping old
ones. I would guess that it probably is viable with 32 bits, but we'd
have to have a mapping layer rather than using the OID directly to
avoid wraparound collisions.

Now this problem can be avoided by just requiring the AM to store more
bits, exactly as you say. I suspect 96 bits is large enough for all of
the practical use cases people have, or at least within spitting
distance. But it strikes me as relatively inefficient to say that
we're always going to store 96 bits for every TID. I certainly don't
think we want to break on-disk compatibility and widen every existing
btree index by changing all the 6-byte TIDs they're storing now to
store 12 bytes TIDs that are at least half zero bytes, so I think
we're bound to end up with at least two options: 6 and 12. But
variable-width would be a lot nicer. You could store small TIDs and
small partition identifiers very compactly, and only use the full
number of bytes when the situation demands it.

What problem do you think this proposal does solve?

Accommodating table AMs that want more than 48 bits for a TID.
We're already starting to run up against the fact that that's not
enough bits for plausible use-cases. 64 bits may someday in the far
future not be enough either, but I think that's a very long way off.

Do people actually want to store more than 2^48 rows in a table, or is
this more about the division of a TID into a block number and an item
number?

--
Robert Haas
EDB: http://www.enterprisedb.com

#13Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#8)
Re: MaxOffsetNumber for Table AMs

On Fri, 2021-04-30 at 12:51 -0400, Robert Haas wrote:

There are two major reasons why I want variable-width tuple IDs. One
is global indexes, where you need as many bits as the AMs
implementing
the partitions need, plus some extra bits to identify which partition
is relevant for a particular tuple. No fixed number of bits that you
make available can ever be sufficient here, because a global index
always needs to have extra bits compared to a partition-local index;
if you let the partition-local index use more bits, the global index
now needs even more space. The other is indirect indexes, work Álvaro
proposed a few years ago, where the index entry points to the primary
key value rather than a TID. The space needs for that are based on
the
type of the primary key column. This proposal solves neither of those
problems.

The particular problem I have now is that Table AMs seem to support
indexes just fine, but TIDs are under-specified so I don't know what I
really have to work with. BlockNumber seems well-specified as
0..0XFFFFFFFE (inclusive), but I don't know what the valid range of
OffsetNumber is for the purposes of a TableAM.

Part of changing to uint64 would be specifying the TIDs in a way that I
could rely on in the future.

The problems you mention above are above the table AM layer, so they
seem orthogonal. There would still need to be an ID that table AMs can
use to fetch a tuple from a particular physical table.

In the future we may support primary unique indexes at the table AM
layer, which would get more interesting. I can see an argument for a
TID being an arbitrary datum in that case, but I haven't really
considered the design implications. Is this what you are suggesting?

Regards,
Jeff Davis

#14Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#11)
Re: MaxOffsetNumber for Table AMs

On Fri, Apr 30, 2021 at 1:28 PM Peter Geoghegan <pg@bowt.ie> wrote:

Global indexes should work by adding an extra column that is somewhat
like a TID, that may even have its own pg_attribute entry. It's much
more natural to make the partition number a separate column IMV --
nbtree suffix truncation and deduplication can work in about the same
way as before. Plus you'll need to do predicate pushdown using the
partition identifier in some scenarios anyway. You can make the
partition identifier variable-width without imposing the cost and
complexity of variable-width TIDs on index AMs.

I agree up to a point but ... are you imagining that the TID continues
to have its own special place in the page, while the partition
identifier is stored more like a regular tuple column? Because it
seems to me that it would be better to try to eliminate the
special-storage case, just like we did for OIDs. If you want a 6-byte
TID, put a 6-byte column in the tuple for it. If you also want a
partition identifier, put an extra column in the tuple for that. If
you want a wider TID or a varlena TID, well, put columns for that into
the tuple instead of the 6-byte column you'd normally put. This seems
extremely flexible and a lot more aesthetically appealing than what we
have today.

--
Robert Haas
EDB: http://www.enterprisedb.com

In reply to: Jeff Davis (#9)
Re: MaxOffsetNumber for Table AMs

On Fri, Apr 30, 2021 at 10:04 AM Jeff Davis <pgsql@j-davis.com> wrote:

On Fri, 2021-04-30 at 08:36 -0700, Peter Geoghegan wrote:

Compatibility with index AMs is more than a matter of switching out
the tuple identifier -- if you invent something that has totally
different performance characteristics for index AMs, then it's likely
to break tacit assumptions about the cost of certain things.

I think it would be reasonable to document and expect that table AMs
offer some locality of access for tuples with similar IDs. Do you think
we need something stronger than that?

I don't know. This conversation is still too abstract for me to be
able to take a firm position. ISTM that we tend to talk about the
table AM in excessively abstract terms. It would be a lot easier if we
had clear fixed goals for a small number of additional table AMs.

Plus deduplication ameliorates problems with version churn
in
indexes -- presumably the same problems will exist when any new table
AM is used, and so it'll be useful to address the same problems in
the
same way.

I got lost after "presumably the same problems", can you explain?

Well, there are now two things in nbtree that specifically help with
version churn caused by non-HOT updates: deduplication, and bottom-up
index deletion (especially the latter). Presumably any new table AM
will have something like non-HOT updates. Though they may rarely be a
problem (say because the new table AM isn't really for OLTP), whenever
they are a problem they'll be a very big problem. It seems like a good
idea to have the same index AM level protections against accumulating
version-churn index tuples in an unbounded way.

More generally, it seems like a good idea to try to make new table AMs
reasonably close to heapam insofar as possible. The reality is that
everything evolved around heapam, and that that's likely to matter in
all kinds of ways that nobody fully understands just yet. We have a
somewhat abstract table AM interface, which is good, but that doesn't
mean that table AMs can be designed as if it was a green field
situation.

--
Peter Geoghegan

#16Jeff Davis
pgsql@j-davis.com
In reply to: Tom Lane (#6)
Re: MaxOffsetNumber for Table AMs

On Fri, 2021-04-30 at 12:35 -0400, Tom Lane wrote:

ISTM that would be up to the index AM. We'd need some interlocks on
which index AMs could be used with which table AMs in any case, I
think.

I'm not sure why? It seems like we should be able to come up with
something that's generic enough.

I think the hard part may really be in places like tidbitmap.c, which
one would wish to be AM-independent, but right now it's quite
specific
to heap-style TIDs. Maybe we can think of a way to parameterize it.

For my particular AM, being able to have a parameterized granularity
might be nice, but not required.

Regards,
Jeff Davis

#17Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#13)
Re: MaxOffsetNumber for Table AMs

On Fri, Apr 30, 2021 at 1:37 PM Jeff Davis <pgsql@j-davis.com> wrote:

The particular problem I have now is that Table AMs seem to support
indexes just fine, but TIDs are under-specified so I don't know what I
really have to work with. BlockNumber seems well-specified as
0..0XFFFFFFFE (inclusive), but I don't know what the valid range of
OffsetNumber is for the purposes of a TableAM.

I agree that this is a problem.

Part of changing to uint64 would be specifying the TIDs in a way that I
could rely on in the future.

I mean, from my perspective, the problem here is that the abstraction
layer is leaky and things outside of the table AM layer know what heap
is doing under the hood, and rely on it. If we could refactor the
abstraction to be less leaky, it would be clearer what assumptions
table AM authors can make. If we can't, any specification doesn't seem
worth much.

In the future we may support primary unique indexes at the table AM
layer, which would get more interesting. I can see an argument for a
TID being an arbitrary datum in that case, but I haven't really
considered the design implications. Is this what you are suggesting?

I think that would be the best long-term plan. I guess I have two
distinguishable concerns. One is that I want to be able to have
indexes with a payload that's not just a 6-byte TID; e.g. adding a
partition identifier to support global indexes, or replacing the
6-byte TID with a primary key reference to support indirect indexes.
The other concern is to be able to have table AMs that use arbitrary
methods to identify a tuple. For example, if somebody implemented an
index-organized table, the "TID" would really be the primary key.

Even though these are distinguishable concerns, they basically point
in the same direction as far as index layout is concerned. The
implications for the table AM layer are somewhat different in the two
cases, but both argue that some places that are now talking about TIDs
should be changed to talk about Datums or something of that sort.

--
Robert Haas
EDB: http://www.enterprisedb.com

In reply to: Robert Haas (#14)
Re: MaxOffsetNumber for Table AMs

On Fri, Apr 30, 2021 at 10:39 AM Robert Haas <robertmhaas@gmail.com> wrote:

I agree up to a point but ... are you imagining that the TID continues
to have its own special place in the page, while the partition
identifier is stored more like a regular tuple column? Because it
seems to me that it would be better to try to eliminate the
special-storage case, just like we did for OIDs.

I agree in principle, but making that work well is very hard in
practice because of the format of IndexTuple -- which bleeds into
everything. That TID is special is probably a natural consequence of
the fact that we don't have an offset-based format of the kind you see
in other DB systems -- systems that don't emphasize extensibility. We
cannot jump to a hypothetical TID attribute inexpensively inside code
like _bt_compare() because we don't have a cheap way to jump straight
to the datum for any attribute. So we just store TID in IndexTuple
directly instead. Imagine how much more expensive VACUUM would be if
it had to grovel through the IndexTuple format.

I wonder how the same useful performance characteristics can be
maintained with a variable-width TID design. If you solve the problem
by changing IndexTuple, then you are kind of obligated to not use
varlena headers to keep the on-disk size manageable. Who knows where
it all ends?

--
Peter Geoghegan

In reply to: Robert Haas (#17)
Re: MaxOffsetNumber for Table AMs

On Fri, Apr 30, 2021 at 10:56 AM Robert Haas <robertmhaas@gmail.com> wrote:

I think that would be the best long-term plan. I guess I have two
distinguishable concerns. One is that I want to be able to have
indexes with a payload that's not just a 6-byte TID; e.g. adding a
partition identifier to support global indexes, or replacing the
6-byte TID with a primary key reference to support indirect indexes.
The other concern is to be able to have table AMs that use arbitrary
methods to identify a tuple. For example, if somebody implemented an
index-organized table, the "TID" would really be the primary key.

Even though these are distinguishable concerns, they basically point
in the same direction as far as index layout is concerned. The
implications for the table AM layer are somewhat different in the two
cases, but both argue that some places that are now talking about TIDs
should be changed to talk about Datums or something of that sort.

I don't know how it's possible to do any of this without first
addressing what the table AM does in cases where heapam currently does
a non-HOT update. You obviously cannot have the equivalent of
duplicate TIDs when your new table AM runs into these scenarios. So
what do you do instead? How do you make your clustered index/IoT style
identifiers (i.e. your strictly logical TID-like identifiers) deal
with that case?

ISTM that this is by far the biggest issue with generalizing the table
AM for use by a tableam (that isn't already very much like heapam). I
am always surprised to be the one that has to point it out during
these discussions. It's a huge issue.

--
Peter Geoghegan

#20Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#18)
Re: MaxOffsetNumber for Table AMs

On Fri, Apr 30, 2021 at 2:05 PM Peter Geoghegan <pg@bowt.ie> wrote:

I agree in principle, but making that work well is very hard in
practice because of the format of IndexTuple -- which bleeds into
everything. That TID is special is probably a natural consequence of
the fact that we don't have an offset-based format of the kind you see
in other DB systems -- systems that don't emphasize extensibility. We
cannot jump to a hypothetical TID attribute inexpensively inside code
like _bt_compare() because we don't have a cheap way to jump straight
to the datum for any attribute. So we just store TID in IndexTuple
directly instead. Imagine how much more expensive VACUUM would be if
it had to grovel through the IndexTuple format.

I can't imagine that, so maybe you want to enlighten me? I see that
there's a potential problem there, and I'm glad you pointed it out
because I hadn't thought about it previously ... but if you always put
the column or columns that VACUUM would need first, it's not obvious
to me that it would be all that expensive. Deforming the tuple to a
sufficient degree to extract the first column, which would even be
fixed-width, shouldn't take much work.

I wonder how the same useful performance characteristics can be
maintained with a variable-width TID design. If you solve the problem
by changing IndexTuple, then you are kind of obligated to not use
varlena headers to keep the on-disk size manageable. Who knows where
it all ends?

What's wrong with varlena headers? It would end up being a 1-byte
header in practically every case, and no variable-width representation
can do without a length word of some sort. I'm not saying varlena is
as efficient as some new design could hypothetically be, but it
doesn't seem like it'd be a big enough problem to stress about. If you
used a variable-width representation for integers, you might actually
save bytes in a lot of cases. An awful lot of the TIDs people store in
practice probably contain several zero bytes, and if we make them
wider, that's going to be even more true.

--
Robert Haas
EDB: http://www.enterprisedb.com

#21Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#15)
#22Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#17)
In reply to: Robert Haas (#20)
#24Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#19)
In reply to: Robert Haas (#24)
#26Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#25)
#27Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#25)
In reply to: Jeff Davis (#27)
In reply to: Robert Haas (#26)
#30Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#28)
#31Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#29)
In reply to: Robert Haas (#30)
In reply to: Robert Haas (#31)
#34Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#32)
In reply to: Robert Haas (#34)
#36Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#35)
#37Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Peter Geoghegan (#35)
In reply to: Robert Haas (#36)
#39Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#35)
In reply to: Matthias van de Meent (#37)
#41Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#36)
#42Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#38)
In reply to: Jeff Davis (#39)
#44Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Peter Geoghegan (#43)
In reply to: Matthias van de Meent (#44)
#46Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#16)
In reply to: Jeff Davis (#46)
#48Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#47)
In reply to: Jeff Davis (#48)
#50Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#49)
#51Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#23)
#52Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#41)
#53Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#47)
#54Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#52)
In reply to: Jeff Davis (#53)
In reply to: Andres Freund (#51)
#57Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#56)
#58Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#55)
In reply to: Andres Freund (#57)
#60Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#59)
In reply to: Robert Haas (#60)
#62Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#61)
#63Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#60)
In reply to: Robert Haas (#62)
#65Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#61)
#66Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#63)
In reply to: Robert Haas (#66)
#68Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#64)
#69Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#67)
In reply to: Robert Haas (#68)
#71Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#66)
#72Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#69)
In reply to: Jeff Davis (#69)
In reply to: Andres Freund (#72)
#75Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#71)
In reply to: Jeff Davis (#75)
#77Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Peter Geoghegan (#64)
In reply to: Matthias van de Meent (#77)
#79Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Peter Geoghegan (#78)
In reply to: Matthias van de Meent (#79)
#81Robert Haas
robertmhaas@gmail.com
In reply to: Matthias van de Meent (#77)
#82Hannu Krosing
hannu@tm.ee
In reply to: Peter Geoghegan (#74)
#83Hannu Krosing
hannu@tm.ee
In reply to: Robert Haas (#81)
#84Jeff Davis
pgsql@j-davis.com
In reply to: Hannu Krosing (#82)
#85Hannu Krosing
hannu@tm.ee
In reply to: Jeff Davis (#84)
#86Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#84)
#87Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Peter Geoghegan (#80)
In reply to: Matthias van de Meent (#87)