Status of the table access method work

Started by Andres Freund · almost 7 years ago · 21 messages
#1Andres Freund
andres@anarazel.de

Hi,

In this email I want to give a brief status update of the table access
method work - I assume that most of you sensibly haven't followed it
into all nooks and crannies.

I want to thank Haribabu, Alvaro, Alexander, David, Dmitry and all the
others that collaborated on making tableam happen. It was/is a huge
project.

I think what's in v12 - I don't know of any non-cleanup / bugfix work
pending for 12 - is a pretty reasonable initial set of features. It
allows reimplementing a heap-like storage without any core modifications
(except WAL logging, see below); it is not sufficient to implement a
good index-oriented table AM. It does not allow storing the catalog in
a non-heap table.

The tableam interface itself doesn't care much about how the AM
internally stores data. Most of the API (sequential scans, index
lookups, insert/update/delete) doesn't know about blocks, and only
indirectly & optionally about buffers (via BulkInsertState). There are a
few callbacks / functions that do care about blocks, because it wasn't
clear how to remove the dependency, or doing so would have been too
much work. These currently are:

- ANALYZE integration - currently the sampling logic is tied to blocks.
- index build range scans - the range is defined as blocks
- planner relation size estimate - but that could trivially just be
filled with size-in-bytes / BLCKSZ in the callback.
- the (optional) bitmap heap scan API - that's fairly intrinsically
block based. An AM could just internally subdivide TIDs in a different
way, but I don't think a bitmap scan like we have would e.g. make a
lot of sense for an index oriented table without any sort of stable
tid.
- the sample scan API - tsmapi.h is block based, so the tableam.h API is
as well.

I think none of these are limiting in a particularly bad way.

The most constraining factor for storage, I think, is that currently the
API relies on ItemPointerData style TIDs in a number of places (i.e. a 6
byte tuple identifier). One can implement scans, and inserts into
index-less tables without providing that, but no updates, deletes etc.
One reason for that is that it would just have required more changes to
the executor etc. to allow for wider identifiers, but the primary reason
is that indexes currently simply don't support anything else.

I think this is, by far, the biggest limitation of the API. If one
e.g. wanted to implement a practical index-organized table, the 6 byte
limit obviously would become a problem very quickly. I suspect
that we're going to want to get rid of that limitation in indexes before
long for other reasons too, to allow global indexes (which'd need to
encode the partition somewhere).

With regards to storing the rows themselves, the second biggest
limitation is a limitation that is not actually a part of tableam
itself: WAL. Many tableam's would want to use WAL, but we only have
extensible WAL as part of generic_xlog.h. While that's useful to allow
prototyping etc, it's imo not efficient enough to build a competitive
storage engine for OLTP (OLAP probably much less of a problem). I don't
know what the best approach here is - allowing "well known" extensions
to register rmgr entries would be the easiest solution, but it's
certainly a bit crummy.

Currently there's some, fairly minor, requirement that TIDs are actually
unique when not using a snapshot qualifier. That's currently only
relevant for GetTupleForTrigger(), AfterTriggerSaveEvent() and
EvalPlanQualFetchRowMarks(), which use SnapshotAny. That prevents AMs
from implementing in-place updates (thus a problem e.g. for zheap).
I've a patch that fixes that, but it's too hacky for v12 - there's not
always a convenient snapshot to fetch a row (e.g. in
GetTupleForTrigger() after EPQ the row isn't visible to
es_snapshot).

A second set of limitations is around making more of tableam
optional. Right now it e.g. is not possible to have an AM that doesn't
implement insert/update/delete. Obviously an AM can just throw an error
in the relevant callbacks, but I think it'd be better if we made those
callbacks optional, and threw errors at parse-analysis time (both to
make the errors consistent, and to ensure it's consistently thrown,
rather than only when e.g. an UPDATE actually finds a row to update).

Currently foreign keys are allowed between tables of different types of
AM. I am wondering whether we ought to allow AMs to forbid being
referenced. If e.g. an AM has lower consistency guarantees than the AM
of the table referencing it, it might be preferable to forbid
that. OTOH, I guess such an AM could just require UNLOGGED to be used.

Another restriction is actually related to UNLOGGED - currently the
UNLOGGED processing after crashes works by recognizing init forks by
file name. But what if e.g. the storage isn't inside postgres files? Not
sure if we actually can do anything good about that.

The last issue I know about is that nodeBitmapHeapscan.c and
nodeIndexOnlyscan.c currently access the visibility map directly. That
means if an AM doesn't use the VM, it's never going to use the
optimized path. And conversely, if the AM uses the VM, it needs to
internally map TIDs in a way compatible with heap. I strongly suspect
that we're going to have to fix this quite soon.

It'd be a pretty significant amount of work to allow storing catalogs in
a non-heap table. One difficulty is that there's just a lot of direct
accesses to catalog via heapam.h APIs - while a significant amount of
work to "fix" that, it's probably not very hard for each individual
site. There's a few places that rely on heap internals (checking xmin
for invalidation and the like). I think the biggest issue however would
be the catalog bootstrapping - to be able to read pg_am, we obviously
need to go through relcache.c's bootstrapping, and that only works
because we hardcode what those tables look like. I personally don't
think it's a particularly important issue to work on, nor am I convinced
that there'd be buy-in to make the necessary extensive changes.

Greetings,

Andres Freund

#2Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andres Freund (#1)
Re: Status of the table access method work

On 05/04/2019 23:25, Andres Freund wrote:

I think what's in v12 - I don't know of any non-cleanup / bugfix work
pending for 12 - is a pretty reasonable initial set of features.

Hooray!

- the (optional) bitmap heap scan API - that's fairly intrinsically
block based. An AM could just internally subdivide TIDs in a different
way, but I don't think a bitmap scan like we have would e.g. make a
lot of sense for an index oriented table without any sort of stable
tid.

If an AM doesn't implement the bitmap heap scan API, what happens?
Bitmap scans are disabled?

Even if an AM isn't block-oriented, the bitmap heap scan API still makes
sense as long as there's some correlation between TIDs and physical
location. The only really broken thing about that currently is the
prefetching: nodeBitmapHeapScan.c calls PrefetchBuffer() directly with
the TID's block numbers. It would be pretty straightforward to wrap that
in a callback, so that the AM could do something different.

Or move even more of the logic to the AM, so that the AM would get the
whole TIDBitmap in table_beginscan_bm(). It could then implement the
fetching and prefetching as it sees fit.

I don't think it's urgent, though. We can cross that bridge when we get
there, with the first AM that needs that flexibility.

The most constraining factor for storage, I think, is that currently the
API relies on ItemPointerData style TIDs in a number of places (i.e. a 6
byte tuple identifier).

I think 48 bits would be just about enough, but it's even more limited
than you might think at the moment. There are a few places that assume that
the offsetnumber <= MaxHeapTuplesPerPage. See ginpostinglist.c, and
MAX_TUPLES_PER_PAGE in tidbitmap.c. Also, offsetnumber can't be 0,
because that makes the ItemPointer invalid, which is inconvenient if you
tried to use ItemPointer as just an arbitrary 48-bit integer.

- Heikki

#3Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#2)
Re: Status of the table access method work

Hi

On 2019-04-08 14:53:53 +0300, Heikki Linnakangas wrote:

On 05/04/2019 23:25, Andres Freund wrote:

- the (optional) bitmap heap scan API - that's fairly intrinsically
block based. An AM could just internally subdivide TIDs in a different
way, but I don't think a bitmap scan like we have would e.g. make a
lot of sense for an index oriented table without any sort of stable
tid.

If an AM doesn't implement the bitmap heap scan API, what happens? Bitmap
scans are disabled?

Yea, the planner doesn't consider them. It just masks the index's
amhasgetbitmap. Seems to be the most reasonable thing to do?

Even if an AM isn't block-oriented, the bitmap heap scan API still makes
sense as long as there's some correlation between TIDs and physical
location.

Yea, it could be a non-linear mapping. But I'm honestly not sure how
many non-block oriented AMs with such a correlation there are - I mean
you're not going to have that in say an IOT. And it'd be trivial to just
"fake" a block mapping for an in-memory AM.

The only really broken thing about that currently is the
prefetching: nodeBitmapHeapScan.c calls PrefetchBuffer() directly with the
TID's block numbers. It would be pretty straightforward to wrap that in a
callback, so that the AM could do something different.

That, and the VM_ALL_VISIBLE() checks both in nodeBitmapHeapscan.c and
nodeIndexonlyscan.c.

Or move even more of the logic to the AM, so that the AM would get the whole
TIDBitmap in table_beginscan_bm(). It could then implement the fetching and
prefetching as it sees fit.

I don't think it's urgent, though. We can cross that bridge when we get
there, with the first AM that needs that flexibility.

Yea, it seemed nontrivial (not really hard, just not obvious), and
the implied code duplication scared me away.

The most constraining factor for storage, I think, is that currently the
API relies on ItemPointerData style TIDs in a number of places (i.e. a 6
byte tuple identifier).

I think 48 bits would be just about enough

I don't think that's really true. Consider e.g. implementing an index
oriented table - there's no way you can efficiently implement one with
that small a key. You basically need a helper index just to have
efficient and small enough tids. And given that we're also going to
need wider tids for global indexes, I suspect we're just going to have
to bite into the sour apple and make tids variable width.

, but it's even more limited than
you might at the moment. There are a few places that assume that the
offsetnumber <= MaxHeapTuplesPerPage. See ginpostinglist.c, and
MAX_TUPLES_PER_PAGE in tidbitmap.c. Also, offsetnumber can't be 0, because
that makes the ItemPointer invalid, which is inconvenient if you tried to
use ItemPointer as just an arbitrary 48-bit integer.

Good point.

Thanks for looking (and playing, in the other thread)!

Greetings,

Andres Freund

#4Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Andres Freund (#1)
Re: Status of the table access method work

On Sat, Apr 6, 2019 at 7:25 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

In this email I want to give a brief status update of the table access
method work - I assume that most of you sensibly haven't followed it
into all nooks and crannies.

I want to thank Haribabu, Alvaro, Alexander, David, Dmitry and all the
others that collaborated on making tableam happen. It was/is a huge
project.

A big thank you, Andres, for your enormous efforts on this patch.
Without your involvement, it couldn't have made it into v12.

With regards to storing the rows themselves, the second biggest
limitation is a limitation that is not actually a part of tableam
itself: WAL. Many tableam's would want to use WAL, but we only have
extensible WAL as part of generic_xlog.h. While that's useful to allow
prototyping etc, it's imo not efficient enough to build a competitive
storage engine for OLTP (OLAP probably much less of a problem). I don't
know what the best approach here is - allowing "well known" extensions
to register rmgr entries would be the easiest solution, but it's
certainly a bit crummy.

I had the same doubt when I looked into some of the UNDO patches,
where they try to modify the core code to add UNDO-specific WAL types.
Different AMs may need different sets of operations to be WAL logged,
so it may be better for the AMs to register their own types?

Regards,
Haribabu Kommi
Fujitsu Australia

#5Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andres Freund (#3)
Re: Status of the table access method work

On 08/04/2019 19:38, Andres Freund wrote:

On 2019-04-08 14:53:53 +0300, Heikki Linnakangas wrote:

On 05/04/2019 23:25, Andres Freund wrote:

- the (optional) bitmap heap scan API - that's fairly intrinsically
block based. An AM could just internally subdivide TIDs in a different
way, but I don't think a bitmap scan like we have would e.g. make a
lot of sense for an index oriented table without any sort of stable
tid.

If an AM doesn't implement the bitmap heap scan API, what happens? Bitmap
scans are disabled?

Yea, the planner doesn't consider them. It just masks the index's
amhasgetbitmap. Seems to be the most reasonable thing to do?

Yep.

Even if an AM isn't block-oriented, the bitmap heap scan API still makes
sense as long as there's some correlation between TIDs and physical
location.

Yea, it could be a non-linear mapping. But I'm honestly not sure how
many non-block oriented AMs with such a correlation there are - I mean
you're not going to have that in say an IOT. And it'd be trivial to just
"fake" a block mapping for an in-memory AM.

Now that Ashwin conveniently posted the ZedStore prototype we started to
hack on [1], I'll point to that as an example :-). It stores data in a
B-tree (or rather, multiple B-trees) on TIDs. So there's a very high
correlation between TIDs and physical locality, but it's not block oriented.

Another example would be the "LZ4 Compressed Storage Manager" that
Nikolai envisioned recently [2]. Before we came up with the idea of
using b-trees in ZedStore, we were actually thinking of something very
similar to that. Although that one perhaps still counts as
"block-oriented" as far as the bitmap heap scan API is concerned, as it
still deals with blocks, they're just mapped to different physical
locations.

I'm not sure how an Index-Organized-Table would work, but I think it
would want to just get the whole bitmap, and figure out the best order
to fetch the rows by itself.

PS. Seems that having a table AM API has opened the floodgates for
storage ideas. Nice!

- Heikki

[1]: /messages/by-id/CALfoeiuF-m5jg51mJUPm5GN8u396o5sA2AF5N97vTRAEDYac7w@mail.gmail.com
[2]: /messages/by-id/11996861554042351@iva4-dd95b404a60b.qloud-c.yandex.net

#6Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Heikki Linnakangas (#5)
Re: Status of the table access method work

On Tue, Apr 9, 2019 at 4:12 AM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

On Sat, Apr 6, 2019 at 7:25 AM Andres Freund <andres@anarazel.de> wrote:

With regards to storing the rows themselves, the second biggest
limitation is a limitation that is not actually a part of tableam
itself: WAL. Many tableam's would want to use WAL, but we only have
extensible WAL as part of generic_xlog.h. While that's useful to allow
prototyping etc, it's imo not efficient enough to build a competitive
storage engine for OLTP (OLAP probably much less of a problem). I don't
know what the best approach here is - allowing "well known" extensions
to register rmgr entries would be the easiest solution, but it's
certainly a bit crummy.

I had the same doubt when I looked into some of the UNDO patches,
where they try to modify the core code to add UNDO-specific WAL types.
Different AMs may need different sets of operations to be WAL logged,
so it may be better for the AMs to register their own types?

I'm also curious about that. As far as I can see, the main objection
against that was that the recovery process would then depend on an
extension, which could violate reliability. But I wonder if this
argument is still valid for AMs, since all the data kind of depends on
the AM, not only recovery.

Btw, can someone elaborate why exactly generic_xlog is not efficient enough?
I've gone through the corresponding thread; it looks like generic WAL
records are bigger than normal ones - is that the only reason?

#7Thomas Munro
thomas.munro@gmail.com
In reply to: Haribabu Kommi (#4)
Re: Status of the table access method work

On Tue, Apr 9, 2019 at 2:12 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

I had the same doubt when I looked into some of the UNDO patches,
where they try to modify the core code to add UNDO-specific WAL types.
Different AMs may need different sets of operations to be WAL logged,
so it may be better for the AMs to register their own types?

In the current undo proposal, the undo subsystem itself needs an rmgr
ID for WAL-logging of some low level undo log management records (ie
its own record types), but then any undo-aware AM would also need to
have its own rmgr ID for its own universe of WAL records (its own
types, meaningful to it alone), and that same rmgr ID is used also for
its undo records (which themselves have specific types). That is, in
rmgrlist.h, an undo-aware AM would register not only its redo function
(called for each WAL record in recovery) but also its undo function
(called when a transaction rolls back, if your transaction generated any
undo records). Which raises the question of how a hypothetical
undo-aware AM could deal with undo records if it's using generic WAL
records. I haven't thought about that. A couple of ideas we've
bounced around to allow extensions to work with specific WAL records:
(1) a community-wide registry of rmgr IDs (basically, just allocate
the IDs for all known extensions in a header in the tree, like IANA),
or (2) a per-cluster registry scheme where an extension, identified by
its name, would be permanently allocated an rmgr number and library +
callback functions for the lifetime of that cluster. Or something
like that.

--
Thomas Munro
https://enterprisedb.com

#8Andres Freund
andres@anarazel.de
In reply to: Dmitry Dolgov (#6)
Re: Status of the table access method work

Hi,

On 2019-04-09 11:17:29 +0200, Dmitry Dolgov wrote:

I'm also curious about that. As far as I can see the main objection against
that was that in this case the recovery process will depend on an extension,
which could violate reliability.

I don't think that's a primary concern - although it is one. The mapping
from types of records to the handler function needs to be accessible at
a very early stage, when the cluster isn't yet in a consistent state. So
we can't just go and look into pg_am, look up a handler function, etc.
- crash recovery happens long before that is possible. Nor do we
want the mapping of 'rmgr id' -> 'extension' to be defined in the config
file, that's way too likely to be wrong. So there needs to be a
different type of mapping, accessible outside the catalog. I suspect we'd
have to end up with something very roughly like the relmapper
infrastructure. A tertiary problem is then how to identify extensions
in that mapping - although I suspect just using any library name that
can be passed to load_library() will be OK.

But I wonder if this argument is still valid for AMs, since all the
data kind of depends on the AM, not only recovery.

I don't buy that argument. If you have an AM that registers, using a new
facility, replay routines, and then it errors out / crashes during
those, there's no way to get the cluster back into a consistent
state. So it's not just the one table in that AM that's gone, it's the
entire cluster that's impacted.

Btw, can someone elaborate why exactly generic_xlog is not efficient enough?
I've gone through the corresponding thread; it looks like generic WAL
records are bigger than normal ones - is that the only reason?

That's one big reason. But also, you just can't do much more than "write
this block into that file" during recovery with generic records. A lot
of our replay routines intentionally do more complicated tasks.

Greetings,

Andres Freund

#9Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andres Freund (#1)
Re: Status of the table access method work

On Fri, Apr 5, 2019 at 11:25 PM Andres Freund <andres@anarazel.de> wrote:

I want to thank Haribabu, Alvaro, Alexander, David, Dmitry and all the
others that collaborated on making tableam happen. It was/is a huge
project.

Thank you so much for bringing this project to commit! Excellent work!

Your explanation of existing limitations looks very good and
convincing. But I think there is one you didn't mention. We require
new table AMs to basically save old "contract" between heap and
indexes. We have "all or nothing" choice during updates. So, storage
can either insert to none of indexes or insert to all of them
(HOT-like update). I think any storage, which is going to address
"write amplification" point raised by Uber, needs to break this
"contract".

For example, zheap promises to implement delete-marking indexes, but
that's not yet published. And it's not clear to me that this approach
is the best among the alternatives. With the delete-marking approach
you need to update the index tuples corresponding to old values of
updated fields. But in addition, it's not trivial to delete an index
tuple. In order to do that, you need to both locate the index tuple
and know that its index value isn't present in the undo chain. So,
another index lookup is likely required during purging of the undo
chain. Thus, we basically need two random index lookups for every
deleted index tuple. Also, it becomes more complex to look up the
appropriate heap tuple during an index scan. Then you need to check
not only visibility, but also the matching index value (here we need
to adjust the index_fetch_tuple interface), because it might happen
that the version visible to you has a different index value. That may
lead to O(N^2) performance while accessing a single row with N
versions (MySQL InnoDB has this problem).

An alternative idea is to have MVCC-aware indexes. This approach looks
more attractive to me. In this approach you basically need xmin and
xmax fields in index tuples. On insertion of an index tuple you fill
its xmin. On update, the previous index tuple is marked with xmax.
After that, outdated index tuples can be deleted in a lazy manner when
page space is required. So, only one random access is required per
deleted index tuple. With this approach, fetching a single row is
O(N). Also, index-only scans become very easy and don't even need a
visibility map. The only problem here is the extra space requirement
for index tuples. But I think this is a well-isolated problem, which
is easy to attack. For instance, some visibility information could be
evicted to an undo chain (like zheap does for its tuples). Also, we
can have a special bit for "all visible" index tuples. With the "all
visible" bit set, a tuple can get rid of the visibility fields. We can
do this for index tuples because, if an index tuple requires extra
space, we can split the page, unlike the heap, where tuples are fixed
in pages and xmax needs to be updated in-place.

I understand that delete-marking indexes have some advantages, and
some people find them more appealing. But my point is that we
shouldn't build one of these approaches into the API unless we have
concrete proof that one approach strongly overcomes the other. It
would be better to have our table AM API flexible enough to implement
both. I can imagine proper encapsulation here bringing more of the
interaction with indexes to the table AM side.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#10Andres Freund
andres@anarazel.de
In reply to: Alexander Korotkov (#9)
Re: Status of the table access method work

Hi,

On 2019-04-10 20:14:17 +0300, Alexander Korotkov wrote:

Your explanation of the existing limitations looks very good and
convincing. But I think there is one you didn't mention. We require
new table AMs to basically preserve the old "contract" between heap and
indexes. We have an "all or nothing" choice during updates. So, a
storage engine can either insert into none of the indexes or insert
into all of them (HOT-like update).

I think that's a problem, and yea, I should have mentioned it. I'd
earlier thought about it and then forgot.

I however don't think we should design the interface for this before we
have at least one AM that's actually in decent-ish shape that needs
it. I seriously doubt we'll get the interface right enough.

Note: I'm *extremely* *extremely* doubtful that moving the full executor
invocations for expression indices etc into the tableam is a sane thing
to do. It's possible to convince me there's no alternative, but it'll be
really hard.

I suspect the right direction will be more going in a direction of
computing new index tuples for expression indexes before tableam gets
involved. If we do that right, we can also implement the stuff that
1c53c4dec3985512f7f2f53c9d76a5295cd0a2dd reverted in a proper way.

I think any storage engine that is going to address the "write
amplification" point raised by Uber needs to break this "contract".

FWIW, I don't think there's much point in using Uber as a justification
for anything here. Their analysis was so deeply flawed and motivated by
non-technical reasons that it should just be ignored.

Greetings,

Andres Freund

#11Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andres Freund (#10)
Re: Status of the table access method work

On Wed, Apr 10, 2019 at 8:32 PM Andres Freund <andres@anarazel.de> wrote:

On 2019-04-10 20:14:17 +0300, Alexander Korotkov wrote:

Your explanation of the existing limitations looks very good and
convincing. But I think there is one you didn't mention. We require
new table AMs to basically preserve the old "contract" between heap and
indexes. We have an "all or nothing" choice during updates. So, a
storage engine can either insert into none of the indexes or insert
into all of them (HOT-like update).

I think that's a problem, and yea, I should have mentioned it. I'd
earlier thought about it and then forgot.

I however don't think we should design the interface for this before we
have at least one AM that's actually in decent-ish shape that needs
it. I seriously doubt we'll get the interface right enough.

Note: I'm *extremely* *extremely* doubtful that moving the full executor
invocations for expression indices etc into the tableam is a sane thing
to do. It's possible to convince me there's no alternative, but it'll be
really hard.

I suspect the right direction will be more going in a direction of
computing new index tuples for expression indexes before tableam gets
involved. If we do that right, we can also implement the stuff that
1c53c4dec3985512f7f2f53c9d76a5295cd0a2dd reverted in a proper way.

Probably we can invent a few modes a table AM might work in:
calculation of all new index tuples; calculation of new and old index
tuples for updated fields; calculation of all new and old index
tuples; and so on. And then index tuples would be calculated either in
advance or by callback.

I think any storage engine that is going to address the "write
amplification" point raised by Uber needs to break this "contract".

FWIW, I don't think there's much point in using Uber as a justification
for anything here. Their analysis was so deeply flawed and motivated by
non-technical reasons that it should just be ignored.

Yeah, Uber is just a buzzword here. But the problem that an update of a
single indexed field leads to insertions into every index is well known
among PostgreSQL users.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#12Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andres Freund (#10)
Re: Status of the table access method work

On Wed, Apr 10, 2019 at 8:32 PM Andres Freund <andres@anarazel.de> wrote:

On 2019-04-10 20:14:17 +0300, Alexander Korotkov wrote:

Your explanation of the existing limitations looks very good and
convincing. But I think there is one you didn't mention. We require
new table AMs to basically preserve the old "contract" between heap and
indexes. We have an "all or nothing" choice during updates. So, a
storage engine can either insert into none of the indexes or insert
into all of them (HOT-like update).

I think that's a problem, and yea, I should have mentioned it. I'd
earlier thought about it and then forgot.

I however don't think we should design the interface for this before we
have at least one AM that's actually in decent-ish shape that needs
it. I seriously doubt we'll get the interface right enough.

Sure.

My point is that once we get the first table AM which needs this, say
zheap, we shouldn't make it like this:

TM_Result (*tuple_update) (Relation rel, ... bool *update_indexes,
bool *delete_marking);

but rather try to design proper encapsulation of the logic inside the table AM.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#13Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Alexander Korotkov (#12)
2 attachment(s)
Re: Status of the table access method work

On Fri, Apr 5, 2019 at 10:25 PM Andres Freund <andres@anarazel.de> wrote:

A second set of limitations is around making more of tableam
optional. Right now it e.g. is not possible to have an AM that doesn't
implement insert/update/delete. Obviously an AM can just throw an error
in the relevant callbacks, but I think it'd be better if we made those
callbacks optional, and threw errors at parse-analysis time (both to
make the errors consistent, and to ensure it's consistently thrown,
rather than only when e.g. an UPDATE actually finds a row to update).

Agree, but I guess some of tableam should still be mandatory, and then I
wonder where to draw the line between the callbacks that are optional and
those that are not. E.g. it looks like it can be relatively straightforward
(ignoring `create table as` and some other stuff) to make
insert/update/delete optional with messages at analysis time, but for
others, like the parallel scan related ones, it's probably not.

Attachments:

0001-Optional-tableam.patch (application/octet-stream)
From 6ab1e806ab0708a0b139a985be81dd090cbf992f Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Wed, 17 Apr 2019 17:48:43 +0200
Subject: [PATCH 1/2] Optional tableam

Make insert/update/delete table am optional
---
 src/backend/access/table/tableamapi.c | 12 ------------
 src/backend/commands/copy.c           |  8 ++++++++
 src/backend/parser/analyze.c          | 29 +++++++++++++++++++++++++++++
 src/backend/tcop/utility.c            |  1 +
 src/include/access/tableam.h          | 34 ++++++++++++++++++++++++++++++++++
 5 files changed, 72 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index bfd713f3af..0a58123a0a 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -63,18 +63,6 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->tuple_fetch_row_version != NULL);
 	Assert(routine->tuple_satisfies_snapshot != NULL);
 
-	Assert(routine->tuple_insert != NULL);
-
-	/*
-	 * Could be made optional, but would require throwing error during
-	 * parse-analysis.
-	 */
-	Assert(routine->tuple_insert_speculative != NULL);
-	Assert(routine->tuple_complete_speculative != NULL);
-
-	Assert(routine->multi_insert != NULL);
-	Assert(routine->tuple_delete != NULL);
-	Assert(routine->tuple_update != NULL);
 	Assert(routine->tuple_lock != NULL);
 
 	Assert(routine->relation_set_new_filenode != NULL);
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index c39218f8db..36e2dbf1b8 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -41,6 +41,7 @@
 #include "miscadmin.h"
 #include "optimizer/optimizer.h"
 #include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
 #include "parser/parse_coerce.h"
 #include "parser/parse_collate.h"
 #include "parser/parse_expr.h"
@@ -901,6 +902,13 @@ DoCopy(ParseState *pstate, const CopyStmt *stmt,
 											NULL, false, false);
 		rte->requiredPerms = (is_from ? ACL_INSERT : ACL_SELECT);
 
+		if (is_from && !table_support_multi_insert(rel))
+			ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("Table access method doesn't support the operation"),
+						 parser_errposition(pstate,
+											exprLocation((Node *) stmt))));
+
 		if (stmt->whereClause)
 		{
 			/* add rte to column namespace  */
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index 400558b552..b320ff1b26 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -25,6 +25,7 @@
 #include "postgres.h"
 
 #include "access/sysattr.h"
+#include "access/tableam.h"
 #include "catalog/pg_type.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
@@ -416,6 +417,13 @@ transformDeleteStmt(ParseState *pstate, DeleteStmt *stmt)
 										 true,
 										 ACL_DELETE);
 
+	if (!table_support_delete(pstate->p_target_relation))
+		ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("Table access method doesn't support the operation"),
+					 parser_errposition(pstate,
+										exprLocation((Node *) stmt))));
+
 	/* grab the namespace item made by setTargetTable */
 	nsitem = (ParseNamespaceItem *) llast(pstate->p_namespace);
 
@@ -553,6 +561,20 @@ transformInsertStmt(ParseState *pstate, InsertStmt *stmt)
 	qry->resultRelation = setTargetTable(pstate, stmt->relation,
 										 false, false, targetPerms);
 
+	if (!table_support_insert(pstate->p_target_relation))
+		ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("Table access method doesn't support the operation"),
+					 parser_errposition(pstate,
+										exprLocation((Node *) stmt))));
+
+	if (!table_support_speculative(pstate->p_target_relation) && isOnConflictUpdate)
+		ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("Table access method doesn't support the operation"),
+					 parser_errposition(pstate,
+										exprLocation((Node *) stmt))));
+
 	/* Validate stmt->cols list, or build default list if no list given */
 	icolumns = checkInsertTargets(pstate, stmt->cols, &attrnos);
 	Assert(list_length(icolumns) == list_length(attrnos));
@@ -2237,6 +2259,13 @@ transformUpdateStmt(ParseState *pstate, UpdateStmt *stmt)
 										 true,
 										 ACL_UPDATE);
 
+	if (!table_support_update(pstate->p_target_relation))
+		ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("Table access method doesn't support the operation"),
+					 parser_errposition(pstate,
+										exprLocation((Node *) stmt))));
+
 	/* grab the namespace item made by setTargetTable */
 	nsitem = (ParseNamespaceItem *) llast(pstate->p_namespace);
 
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index edf24c438c..b90f047e6e 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -18,6 +18,7 @@
 
 #include "access/htup_details.h"
 #include "access/reloptions.h"
+#include "access/tableam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog.h"
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 6fbfcb96c9..d11cade4a0 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1050,6 +1050,12 @@ table_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
 								  bistate);
 }
 
+static inline bool
+table_support_insert(Relation rel)
+{
+	return rel->rd_tableam == NULL || rel->rd_tableam->tuple_insert != NULL;
+}
+
 /*
  * Perform a "speculative insertion". These can be backed out afterwards
  * without aborting the whole transaction.  Other sessions can wait for the
@@ -1082,6 +1088,14 @@ table_complete_speculative(Relation rel, TupleTableSlot *slot,
 												succeeded);
 }
 
+static inline bool
+table_support_speculative(Relation rel)
+{
+	return rel->rd_tableam == NULL ||
+		   (rel->rd_tableam->tuple_insert_speculative != NULL &&
+			rel->rd_tableam->tuple_complete_speculative != NULL);
+}
+
 /*
  * Insert multiple tuples into a table.
  *
@@ -1104,6 +1118,14 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
 								  cid, options, bistate);
 }
 
+static inline bool
+table_support_multi_insert(Relation rel)
+{
+	return rel->rd_tableam == NULL ||
+		   (rel->rd_tableam->multi_insert != NULL &&
+			rel->rd_tableam->finish_bulk_insert != NULL);
+}
+
 /*
  * Delete a tuple.
  *
@@ -1140,6 +1162,12 @@ table_delete(Relation rel, ItemPointer tid, CommandId cid,
 										 wait, tmfd, changingPart);
 }
 
+static inline bool
+table_support_delete(Relation rel)
+{
+	return rel->rd_tableam == NULL || rel->rd_tableam->tuple_delete != NULL;
+}
+
 /*
  * Update a tuple.
  *
@@ -1186,6 +1214,12 @@ table_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
 										 lockmode, update_indexes);
 }
 
+static inline bool
+table_support_update(Relation rel)
+{
+	return rel->rd_tableam == NULL || rel->rd_tableam->tuple_update != NULL;
+}
+
 /*
  * Lock a tuple in the specified mode.
  *
-- 
2.16.4

0002-Toytable-with-missing-am.patch (application/octet-stream)
From 0bf1ad97cafd922d9bf4e57b33dae3ff95eacdef Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Wed, 17 Apr 2019 18:17:07 +0200
Subject: [PATCH 2/2] Toytable with missing am

Add tests for a toytable am to check optional methods
---
 src/test/modules/toytable/Makefile              |  25 ++
 src/test/modules/toytable/expected/toytable.out |  25 ++
 src/test/modules/toytable/sql/toytable.sql      |  13 +
 src/test/modules/toytable/toytable--1.0.sql     |  12 +
 src/test/modules/toytable/toytable.control      |   4 +
 src/test/modules/toytable/toytableam.c          | 524 ++++++++++++++++++++++++
 6 files changed, 603 insertions(+)
 create mode 100644 src/test/modules/toytable/Makefile
 create mode 100644 src/test/modules/toytable/expected/toytable.out
 create mode 100644 src/test/modules/toytable/sql/toytable.sql
 create mode 100644 src/test/modules/toytable/toytable--1.0.sql
 create mode 100644 src/test/modules/toytable/toytable.control
 create mode 100644 src/test/modules/toytable/toytableam.c

diff --git a/src/test/modules/toytable/Makefile b/src/test/modules/toytable/Makefile
new file mode 100644
index 0000000000..142ef2d23e
--- /dev/null
+++ b/src/test/modules/toytable/Makefile
@@ -0,0 +1,25 @@
+# src/test/modules/toytable/Makefile
+
+MODULE_big = toytable
+OBJS = toytableam.o $(WIN32RES)
+PGFILEDESC = "A dummy implementation of the table AM API"
+
+EXTENSION = toytable
+DATA = toytable--1.0.sql
+
+REGRESS = toytable
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/toytable
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+OBJS = toytableam.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/test/modules/toytable/expected/toytable.out b/src/test/modules/toytable/expected/toytable.out
new file mode 100644
index 0000000000..5421217c8d
--- /dev/null
+++ b/src/test/modules/toytable/expected/toytable.out
@@ -0,0 +1,25 @@
+CREATE EXTENSION toytable;
+create table toytab (i int4, j int4, k int4) using toytable;
+select * from toytab;
+ i  | j  | k  
+----+----+----
+  1 |  1 |  1
+  2 |  2 |  2
+  3 |  3 |  3
+  4 |  4 |  4
+  5 |  5 |  5
+  6 |  6 |  6
+  7 |  7 |  7
+  8 |  8 |  8
+  9 |  9 |  9
+ 10 | 10 | 10
+(10 rows)
+
+delete from toytab;
+ERROR:  Table access method doesn't support the operation
+insert into toytab values(1, 2, 3);
+ERROR:  Table access method doesn't support the operation
+update toytab set i = 4;
+ERROR:  Table access method doesn't support the operation
+copy toytab from stdin;
+ERROR:  Table access method doesn't support the operation
diff --git a/src/test/modules/toytable/sql/toytable.sql b/src/test/modules/toytable/sql/toytable.sql
new file mode 100644
index 0000000000..68662c7f41
--- /dev/null
+++ b/src/test/modules/toytable/sql/toytable.sql
@@ -0,0 +1,13 @@
+CREATE EXTENSION toytable;
+
+create table toytab (i int4, j int4, k int4) using toytable;
+
+select * from toytab;
+
+delete from toytab;
+
+insert into toytab values(1, 2, 3);
+
+update toytab set i = 4;
+
+copy toytab from stdin;
diff --git a/src/test/modules/toytable/toytable--1.0.sql b/src/test/modules/toytable/toytable--1.0.sql
new file mode 100644
index 0000000000..52085d27f4
--- /dev/null
+++ b/src/test/modules/toytable/toytable--1.0.sql
@@ -0,0 +1,12 @@
+/* src/test/modules/toyam/toyam--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION toytable" to load this file. \quit
+
+CREATE FUNCTION toytableam_handler(internal)
+RETURNS pg_catalog.table_am_handler STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE ACCESS METHOD toytable TYPE TABLE HANDLER toytableam_handler;
+
+
diff --git a/src/test/modules/toytable/toytable.control b/src/test/modules/toytable/toytable.control
new file mode 100644
index 0000000000..8f613e58d6
--- /dev/null
+++ b/src/test/modules/toytable/toytable.control
@@ -0,0 +1,4 @@
+comment = 'Dummy implementation of table AM api'
+default_version = '1.0'
+module_pathname = '$libdir/toytable'
+relocatable = true
diff --git a/src/test/modules/toytable/toytableam.c b/src/test/modules/toytable/toytableam.c
new file mode 100644
index 0000000000..52ff233e96
--- /dev/null
+++ b/src/test/modules/toytable/toytableam.c
@@ -0,0 +1,524 @@
+/*-------------------------------------------------------------------------
+ *
+ * toyam_handler.c
+ *	  a toy table access method code
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/toytable/toyam_handler.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "miscadmin.h"
+
+#include "access/multixact.h"
+#include "access/relscan.h"
+#include "access/tableam.h"
+#include "catalog/catalog.h"
+#include "catalog/storage.h"
+#include "catalog/index.h"
+#include "catalog/pg_type.h"
+#include "executor/executor.h"
+#include "utils/builtins.h"
+#include "utils/rel.h"
+#include "storage/bufmgr.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(toytableam_handler);
+
+typedef struct
+{
+	TableScanDescData scan;
+
+	int			tupidx;
+} ToyScanDescData;
+typedef ToyScanDescData *ToyScanDesc;
+
+static const TupleTableSlotOps *
+toyam_slot_callbacks(Relation relation)
+{
+	return &TTSOpsVirtual;
+}
+
+static TableScanDesc toyam_scan_begin(Relation rel,
+							 Snapshot snapshot,
+							 int nkeys, struct ScanKeyData *key,
+							 ParallelTableScanDesc pscan,
+							 bool allow_strat,
+							 bool allow_sync,
+							 bool allow_pagemode,
+							 bool is_bitmapscan,
+							 bool is_samplescan,
+							 bool temp_snap)
+{
+	ToyScanDesc tscan;
+
+	tscan = palloc0(sizeof(ToyScanDescData));
+	tscan->scan.rs_rd = rel;
+	tscan->scan.rs_snapshot = snapshot;
+	tscan->scan.rs_nkeys = nkeys;
+	tscan->scan.rs_bitmapscan = is_bitmapscan;
+	tscan->scan.rs_samplescan = is_samplescan;
+	tscan->scan.rs_allow_strat = allow_strat;
+	tscan->scan.rs_allow_sync = allow_sync;
+	tscan->scan.rs_temp_snap = temp_snap;
+	tscan->scan.rs_parallel = pscan;
+
+	tscan->tupidx = 0;
+
+	return &tscan->scan;
+}
+
+static void
+toyam_scan_end(TableScanDesc scan)
+{
+	pfree(scan);
+}
+
+static void
+toyam_scan_rescan(TableScanDesc scan, struct ScanKeyData *key,
+				  bool set_params, bool allow_strat,
+				  bool allow_sync, bool allow_pagemode)
+{
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("function %s not implemented yet", __func__)));
+}
+
+static bool
+toyam_scan_getnextslot(TableScanDesc scan,
+					   ScanDirection direction,
+					   TupleTableSlot *slot)
+{
+	ToyScanDesc tscan = (ToyScanDesc) scan;
+
+	slot->tts_nvalid = 0;
+	slot->tts_flags |= TTS_FLAG_EMPTY;
+
+	/*
+	 * Return a constant 10 rows. Every int4 attribute gets
+	 * a running count, everything else is NULL.
+	 */
+	if (tscan->tupidx < 10)
+	{
+		TupleDesc desc = RelationGetDescr(tscan->scan.rs_rd);
+
+		tscan->tupidx++;
+
+		for (AttrNumber attno = 1; attno <= desc->natts; attno++)
+		{
+			Form_pg_attribute att = &desc->attrs[attno - 1];
+			Datum		d;
+			bool		isnull;
+
+			if (att->atttypid == INT4OID)
+			{
+				d = tscan->tupidx;
+				isnull = false;
+			}
+			else
+			{
+				d = (Datum) 0;
+				isnull = true;
+			}
+
+			slot->tts_values[attno - 1] = d;
+			slot->tts_isnull[attno - 1] = isnull;
+		}
+
+		ItemPointerSet(&slot->tts_tid, 1, tscan->tupidx);
+		slot->tts_nvalid = slot->tts_tupleDescriptor->natts;
+		slot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+		return true;
+	}
+	else
+		return false;
+}
+
+static Size
+toyam_parallelscan_estimate(Relation rel)
+{
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("function %s not implemented yet", __func__)));
+}
+
+static Size
+toyam_parallelscan_initialize(Relation rel,
+							  ParallelTableScanDesc pscan)
+{
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("function %s not implemented yet", __func__)));
+}
+
+static void
+toyam_parallelscan_reinitialize(Relation rel,
+								ParallelTableScanDesc pscan)
+{
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("function %s not implemented yet", __func__)));
+}
+
+static struct IndexFetchTableData *
+toyam_index_fetch_begin(Relation rel)
+{
+	IndexFetchTableData *tfetch = palloc0(sizeof(IndexFetchTableData));
+
+	tfetch->rel = rel;
+
+	return tfetch;
+}
+
+static void
+toyam_index_fetch_reset(struct IndexFetchTableData *data)
+{
+}
+
+static void
+toyam_index_fetch_end(struct IndexFetchTableData *data)
+{
+	pfree(data);
+}
+
+static bool
+toyam_index_fetch_tuple(struct IndexFetchTableData *scan,
+						ItemPointer tid,
+						Snapshot snapshot,
+						TupleTableSlot *slot,
+						bool *call_again, bool *all_dead)
+{
+	TupleDesc desc = RelationGetDescr(scan->rel);
+	int			tupidx;
+
+	if (ItemPointerGetBlockNumber(tid) != 1)
+		return false;
+
+	tupidx = ItemPointerGetOffsetNumber(tid);
+	if (tupidx < 1 || tupidx > 10)
+		return false;
+
+	slot->tts_nvalid = 0;
+	slot->tts_flags |= TTS_FLAG_EMPTY;
+
+	/* Return same data as toyam_scan_getnextslot does */
+	for (AttrNumber attno = 1; attno <= desc->natts; attno++)
+	{
+		Form_pg_attribute att = &desc->attrs[attno - 1];
+		Datum		d;
+		bool		isnull;
+
+		if (att->atttypid == INT4OID)
+		{
+			d = tupidx;
+			isnull = false;
+		}
+		else
+		{
+			d = (Datum) 0;
+			isnull = true;
+		}
+
+		slot->tts_values[attno - 1] = d;
+		slot->tts_isnull[attno - 1] = isnull;
+	}
+
+	ItemPointerSet(&slot->tts_tid, 1, tupidx);
+	slot->tts_nvalid = slot->tts_tupleDescriptor->natts;
+	slot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+	return true;
+}
+
+static bool
+toyam_tuple_fetch_row_version(Relation rel,
+							  ItemPointer tid,
+							  Snapshot snapshot,
+							  TupleTableSlot *slot)
+{
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("function %s not implemented yet", __func__)));
+}
+
+static void
+toyam_tuple_get_latest_tid(Relation rel,
+						   Snapshot snapshot,
+						   ItemPointer tid)
+{
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("function %s not implemented yet", __func__)));
+}
+
+static bool
+toyam_tuple_satisfies_snapshot(Relation rel,
+							   TupleTableSlot *slot,
+							   Snapshot snapshot)
+{
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("function %s not implemented yet", __func__)));
+}
+
+static TransactionId
+toyam_compute_xid_horizon_for_tuples(Relation rel,
+									 ItemPointerData *items,
+									 int nitems)
+{
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("function %s not implemented yet", __func__)));
+}
+
+static TM_Result
+toyam_tuple_lock(Relation rel,
+				 ItemPointer tid,
+				 Snapshot snapshot,
+				 TupleTableSlot *slot,
+				 CommandId cid,
+				 LockTupleMode mode,
+				 LockWaitPolicy wait_policy,
+				 uint8 flags,
+				 TM_FailureData *tmfd)
+{
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("function %s not implemented yet", __func__)));
+}
+
+static void
+toyam_relation_set_new_filenode(Relation rel,
+								char persistence,
+								TransactionId *freezeXid,
+								MultiXactId *minmulti)
+{
+	*freezeXid = InvalidTransactionId;
+	*minmulti = InvalidMultiXactId;
+
+	/*
+	 * FIXME: We don't need this for anything. But index build calls
+	 * RelationGetNumberOfBlocks, from index_update_stats(), and that
+	 * fails if the underlying file doesn't exist.
+	 */
+	RelationCreateStorage(rel->rd_node, persistence);
+}
+
+static void
+toyam_relation_nontransactional_truncate(Relation rel)
+{
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("function %s not implemented yet", __func__)));
+}
+
+static void
+toyam_relation_copy_data(Relation rel, RelFileNode newrnode)
+{
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("function %s not implemented yet", __func__)));
+}
+
+static void
+toyam_relation_copy_for_cluster(Relation NewHeap,
+								Relation OldHeap,
+								Relation OldIndex,
+								bool use_sort,
+								TransactionId OldestXmin,
+								TransactionId FreezeXid,
+								MultiXactId MultiXactCutoff,
+								double *num_tuples,
+								double *tups_vacuumed,
+								double *tups_recently_dead)
+{
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("function %s not implemented yet", __func__)));
+}
+
+static void
+toyam_relation_vacuum(Relation onerel,
+					  struct VacuumParams *params,
+					  BufferAccessStrategy bstrategy)
+{
+	/* we've got nothing to do */
+}
+
+static bool
+toyam_scan_analyze_next_block(TableScanDesc scan,
+							  BlockNumber blockno,
+							  BufferAccessStrategy bstrategy)
+{
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("function %s not implemented yet", __func__)));
+}
+
+static bool
+toyam_scan_analyze_next_tuple(TableScanDesc scan,
+							  TransactionId OldestXmin,
+							  double *liverows,
+							  double *deadrows,
+							  TupleTableSlot *slot)
+{
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("function %s not implemented yet", __func__)));
+}
+
+static double
+toyam_index_build_range_scan(Relation heap_rel,
+							 Relation index_rel,
+							 struct IndexInfo *index_nfo,
+							 bool allow_sync,
+							 bool anyvisible,
+							 bool progress,
+							 BlockNumber start_blockno,
+							 BlockNumber end_blockno,
+							 IndexBuildCallback callback,
+							 void *callback_state,
+							 TableScanDesc scan)
+{
+	TupleTableSlot *slot;
+	EState     *estate;
+
+	estate = CreateExecutorState();
+	slot = table_slot_create(heap_rel, NULL);
+
+	if (!scan)
+		scan = toyam_scan_begin(heap_rel,
+								SnapshotAny,
+								0, NULL,
+								NULL,
+								false,
+								false,
+								false,
+								false,
+								false,
+								false);
+
+	while (toyam_scan_getnextslot(scan, ForwardScanDirection, slot))
+	{
+		Datum           values[INDEX_MAX_KEYS];
+		bool            isnull[INDEX_MAX_KEYS];
+		HeapTuple		heapTuple;
+
+		FormIndexDatum(index_nfo, slot, estate, values, isnull);
+
+		/* Call the AM's callback routine to process the tuple */
+		heapTuple = ExecCopySlotHeapTuple(slot);
+		heapTuple->t_self = slot->tts_tid;
+		callback(heap_rel, heapTuple, values, isnull, true,
+				 callback_state);
+		pfree(heapTuple);
+	}
+
+	toyam_scan_end(scan);
+	ExecDropSingleTupleTableSlot(slot);
+	FreeExecutorState(estate);
+
+	return 10;
+}
+
+static void
+toyam_index_validate_scan(Relation heap_rel,
+						  Relation index_rel,
+						  struct IndexInfo *index_info,
+						  Snapshot snapshot,
+						  struct ValidateIndexState *state)
+{
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("function %s not implemented yet", __func__)));
+}
+
+static void
+toyam_relation_estimate_size(Relation rel, int32 *attr_widths,
+							 BlockNumber *pages, double *tuples,
+							 double *allvisfrac)
+{
+	*pages = 1;
+	*tuples = 1;
+	*allvisfrac = 1.0;
+}
+
+static bool
+toyam_scan_sample_next_block(TableScanDesc scan,
+							 struct SampleScanState *scanstate)
+{
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("function %s not implemented yet", __func__)));
+}
+
+static bool
+toyam_scan_sample_next_tuple(TableScanDesc scan,
+					   struct SampleScanState *scanstate,
+					   TupleTableSlot *slot)
+{
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("function %s not implemented yet", __func__)));
+}
+
+static const TableAmRoutine toyam_methods = {
+	.type = T_TableAmRoutine,
+
+	.slot_callbacks = toyam_slot_callbacks,
+
+	.scan_begin = toyam_scan_begin,
+	.scan_end = toyam_scan_end,
+	.scan_rescan = toyam_scan_rescan,
+	.scan_getnextslot = toyam_scan_getnextslot,
+
+	.parallelscan_estimate = toyam_parallelscan_estimate,
+	.parallelscan_initialize = toyam_parallelscan_initialize,
+	.parallelscan_reinitialize = toyam_parallelscan_reinitialize,
+
+	.index_fetch_begin = toyam_index_fetch_begin,
+	.index_fetch_reset = toyam_index_fetch_reset,
+	.index_fetch_end = toyam_index_fetch_end,
+	.index_fetch_tuple = toyam_index_fetch_tuple,
+
+	.tuple_fetch_row_version = toyam_tuple_fetch_row_version,
+	.tuple_get_latest_tid = toyam_tuple_get_latest_tid,
+	.tuple_satisfies_snapshot = toyam_tuple_satisfies_snapshot,
+	.compute_xid_horizon_for_tuples = toyam_compute_xid_horizon_for_tuples,
+
+	.tuple_lock = toyam_tuple_lock,
+
+	.relation_set_new_filenode = toyam_relation_set_new_filenode,
+	.relation_nontransactional_truncate = toyam_relation_nontransactional_truncate,
+	.relation_copy_data = toyam_relation_copy_data,
+	.relation_copy_for_cluster = toyam_relation_copy_for_cluster,
+	.relation_vacuum = toyam_relation_vacuum,
+
+	.scan_analyze_next_block = toyam_scan_analyze_next_block,
+	.scan_analyze_next_tuple = toyam_scan_analyze_next_tuple,
+	.index_build_range_scan = toyam_index_build_range_scan,
+	.index_validate_scan = toyam_index_validate_scan,
+
+	.relation_estimate_size = toyam_relation_estimate_size,
+
+	.scan_bitmap_next_block = NULL,
+	.scan_bitmap_next_tuple = NULL,
+	.scan_sample_next_block = toyam_scan_sample_next_block,
+	.scan_sample_next_tuple = toyam_scan_sample_next_tuple,
+};
+
+Datum
+toytableam_handler(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_POINTER(&toyam_methods);
+}
-- 
2.16.4

#14Andres Freund
andres@anarazel.de
In reply to: Dmitry Dolgov (#13)
Re: Status of the table access method work

Hi!

On 2019-04-17 22:02:24 +0200, Dmitry Dolgov wrote:

On Fri, Apr 5, 2019 at 10:25 PM Andres Freund <andres@anarazel.de> wrote:

A second set of limitations is around making more of tableam
optional. Right now it e.g. is not possible to have an AM that doesn't
implement insert/update/delete. Obviously an AM can just throw an error
in the relevant callbacks, but I think it'd be better if we made those
callbacks optional, and threw errors at parse-analysis time (both to
make the errors consistent, and to ensure it's consistently thrown,
rather than only when e.g. an UPDATE actually finds a row to update).

Agree, but I guess some of tableam still should be mandatory, and then I wonder
where to put the line between those that are optional and those that are not.
E.g. it looks like it can be relatively straightforward (ignoring `create table as`
and some other stuff) to make insert/update/delete optional with messages at
analysis time, but for others, like the parallel scan callbacks, it's probably not.

Thanks for the patch! I assume you're aware, but it's probably not going
to be applied for 12...

I think most of the read-only stuff just needs to be non-optional, and
most of the DML stuff needs to be optional.

On the executor side it'd probably be good to make the sample scan
optional too, but then we also need to check for that during
parse-analysis. In contrast to bitmap scans there's no alternative way to
execute them.

Assert(routine->relation_set_new_filenode != NULL);
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index c39218f8db..36e2dbf1b8 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -41,6 +41,7 @@
#include "miscadmin.h"
#include "optimizer/optimizer.h"
#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
#include "parser/parse_coerce.h"
#include "parser/parse_collate.h"
#include "parser/parse_expr.h"
@@ -901,6 +902,13 @@ DoCopy(ParseState *pstate, const CopyStmt *stmt,
NULL, false, false);
rte->requiredPerms = (is_from ? ACL_INSERT : ACL_SELECT);
+		if (is_from && !table_support_multi_insert(rel))
+			ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("Table access method doesn't support the operation"),
+						 parser_errposition(pstate,
+											exprLocation((Node *) stmt))));

Probably should fall-back to plain inserts if multi-insert isn't
supported.

And if insert isn't supported either, we should probably talk about
that specifically? I.e. a message like
"access method \"%s\" of table \"%s\" does not support %s"
?

Without knowing at least that much about the operation it might sometimes be
very hard to figure out what's not supported.

+static inline bool
+table_support_speculative(Relation rel)
+{
+	return rel->rd_tableam == NULL ||
+		   (rel->rd_tableam->tuple_insert_speculative != NULL &&
+			rel->rd_tableam->tuple_complete_speculative != NULL);
+}

In GetTableAmRoutine() I'd assert that either both or none are defined.

+static inline bool
+table_support_multi_insert(Relation rel)
+{
+	return rel->rd_tableam == NULL ||
+		   (rel->rd_tableam->multi_insert != NULL &&
+			rel->rd_tableam->finish_bulk_insert != NULL);
+}

bulk insert already is optional...

I think there's more places that need checks like these - consider
e.g. replication and such that don't go through the full blown executor.

Greetings,

Andres Freund

#15Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Andres Freund (#14)
Re: Status of the table access method work

On Wed, Apr 17, 2019 at 10:25 PM Andres Freund <andres@anarazel.de> wrote:

I assume you're aware, but it's probably not going to be applied for 12...

Sure, the patch was mostly to express more clearly what I was thinking about :)

I think most of the read-only stuff just needs to be non-optional, and most
of the DML stuff needs to be optional.

On the executor side it'd probably be good to make the sample scan optional
too, but then we also need to check for that during parse-analysis. In
contrast to bitmap scans there's no alternative way to execute them.

Yeah, makes sense.

bulk insert already is optional...

Oh, haven't noticed.

#16Simon Riggs
simon@2ndquadrant.com
In reply to: Alexander Korotkov (#9)
Re: Status of the table access method work

On Wed, 10 Apr 2019 at 18:14, Alexander Korotkov <aekorotkov@gmail.com>
wrote:

Alternative idea is to have MVCC-aware indexes. This approach looks
more attractive for me. In this approach you basically need xmin,
xmax fields in index tuples. On insertion of index tuple you fill
it's xmin. On update, previous index tuple is marked with xmax.

+1

xmax can be provided through to index by indexam when 1) we mark killed
tuples, 2) when we do indexscan of index entry without xmax set.
xmax can be set as a hint on normal scans, or set as part of an update, as
the index chooses

After that outdated index tuples might be deleted in the lazy manner when page
space is required.

That is already done, so hardly any change there.

Also, we can have special bit for "all visible" index tuples. With "all visible"
bit set this tuple can get rid of visibility fields. We can do this
for index tuples, because if index tuple requires extra space we can
split the page, in spite of heap where tuples are fixed in pages and
xmax needs to be updated in-place.

Keeping the xmin/xmax would also be useful for historical indexes, i.e.
indexes that can be used to search for data with historic snapshots.

--
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#17Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alexander Korotkov (#9)
Re: Status of the table access method work

On 2019-Apr-10, Alexander Korotkov wrote:

Alternative idea is to have MVCC-aware indexes. This approach looks
more attractive for me. In this approach you basically need xmin,
xmax fields in index tuples.

"We liked freezing xmin so much that we had to introduce freezing for
xmax" -- rhaas dixit. And now we want to introduce freezing for
indexes?

--
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#18Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#17)
Re: Status of the table access method work

On Tue, Jun 11, 2019 at 11:47 AM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

On 2019-Apr-10, Alexander Korotkov wrote:

Alternative idea is to have MVCC-aware indexes. This approach looks
more attractive for me. In this approach you basically need xmin,
xmax fields in index tuples.

"We liked freezing xmin so much that we had to introduce freezing for
xmax" -- rhaas dixit. And now we want to introduce freezing for
indexes?

Plus it would add 8 bytes to the size of every index tuple. If you
are indexing long-ish strings it may not hurt too much, but if your
primary key is an integer, I think your index is going to get a lot
bigger.

The problem with freezing is perhaps avoidable if you store an epoch
in the page special space as part of all this. But I don't see any
way to avoid having the tuples get wider. Possibly you could include
xmin and xmax only when needed, removing xmin once the tuples are
all-visible and splitting pages if you need to make room to add an
xmax. I'm not sure how much that helps, though, because if you do a
bulk insert, you're going to have to leave room for all of the xmins
initially, and removing them later will produce free space for which
you may have little use.

I don't think that including visibility information in indexes is a
bad idea; we've thought about making zheap do this someday. But I
think that we need to use some more sophisticated approach involving,
maybe, undo pointers, or some other kind of magic, rather than just
widening the tuples. I expect that just widening the tuples would be
good enough to win for some use cases, but I think there would be
others that lose heavily.

In any case, I agree that we do not want to create more things that
need freezing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19Peter Geoghegan
In reply to: Robert Haas (#18)
Re: Status of the table access method work

On Tue, Jun 11, 2019 at 8:59 AM Robert Haas <robertmhaas@gmail.com> wrote:

I don't think that including visibility information in indexes is a
bad idea; we've thought about making zheap do this someday. But I
think that we need to use some more sophisticated approach involving,
maybe, undo pointers, or some other kind of magic, rather than just
widening the tuples. I expect that just widening the tuples would be
good enough to win for some use cases, but I think there would be
others that lose heavily.

+1. Limited visibility information would make sense (e.g. maybe a
per-tuple all-visible bit), which would have to be backed by something
like UNDO, but storing XIDs in tuples seems like a very bad idea. The
idea that something like this would have to be usable by any possible
table access method seems unworkable in general.

Sometimes it seems like the table access method work could use some
specific non-goals. Perhaps I just haven't been paying enough
attention to have noticed them.

--
Peter Geoghegan

#20Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#18)
Re: Status of the table access method work

Hi,

On 2019-06-11 11:59:36 -0400, Robert Haas wrote:

On Tue, Jun 11, 2019 at 11:47 AM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

On 2019-Apr-10, Alexander Korotkov wrote:

Alternative idea is to have MVCC-aware indexes. This approach looks
more attractive for me. In this approach you basically need xmin,
xmax fields in index tuples.

"We liked freezing xmin so much that we had to introduce freezing for
xmax" -- rhaas dixit. And now we want to introduce freezing for
indexes?

Plus it would add 8 bytes to the size of every index tuple. If you
are indexing long-ish strings it may not hurt too much, but if your
primary key is an integer, I think your index is going to get a lot
bigger.

The problem with freezing is perhaps avoidable if you store an epoch
in the page special space as part of all this. But I don't see any
way to avoid having the tuples get wider. Possibly you could include
xmin and xmax only when needed, removing xmin once the tuples are
all-visible and splitting pages if you need to make room to add an
xmax. I'm not sure how much that helps, though, because if you do a
bulk insert, you're going to have to leave room for all of the xmins
initially, and removing them later will produce free space for which
you may have little use.

I don't think that including visibility information in indexes is a
bad idea; we've thought about making zheap do this someday. But I
think that we need to use some more sophisticated approach involving,
maybe, undo pointers, or some other kind of magic, rather than just
widening the tuples. I expect that just widening the tuples would be
good enough to win for some use cases, but I think there would be
others that lose heavily.

Yea, I think there are plenty of reasons to want to do something
different than what we're doing. But I'd like to see a concrete
proposal before building an API for it...

Greetings,

Andres Freund

#21Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#20)
Re: Status of the table access method work

On Tue, Jun 11, 2019 at 12:32 PM Andres Freund <andres@anarazel.de> wrote:

Yea, I think there are plenty of reasons to want to do something
different than what we're doing. But I'd like to see a concrete
proposal before building an API for it...

I wasn't intending to propose that you should. We're just in the
brainstorming stage here I think.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company