Table AM and DDLs
Hi all,
I am quite new to PostgreSQL, so forgive me if my understanding of the code
below is wrong, and please clarify anything I have misunderstood.
I started to experiment with the table access method interface to see if it
can be used for some ideas I have.
For the experiment, I am using a simple in-memory table access method that
stores all the data as shared memory pages instead of disk pages. I know
there are other ways to get good performance, but this implementation is
good enough for my experiments since it tests a few things with the Table
AM interface that I am wondering about.
Now, the first question I looked at is whether it is possible to handle DDL
properly when you have non-standard storage: creating new storage blocks on
table creation, cleaning up when a table is dropped, and handling schema
changes on ALTER TABLE.
Creating new blocks for a table is straightforward to implement using the
`relation_set_new_filenode` callback, where you can create new memory
blocks for a relation, but I cannot find a way to clean up those blocks
when the table is dropped, nor a way to handle a change of a table's
schema.
The `relation_set_new_filenode` is indirectly called from
`heap_create_with_catalog`, but there is no corresponding callback from
`heap_drop_with_catalog`. It also seems like the intention is that the
callback should call `RelationCreateStorage` itself (makes sense, since the
access method knows about how to use the storage), so it seems natural to
add a `relation_reset_filenode` to the table AM that is called from
`heap_drop_with_catalog` for tables and add that to the heap implementation
(see the attached patch).
Altering the schema does not seem to be covered at all, but this is
something that table access methods need to know about, since they might
want to optimize the internal storage when the schema changes. I have not
been able to find any discussions around this, but it seems like a natural
thing to do with a table. Have I misunderstood how this works?
Best wishes,
Mats Kindahl
Timescale
Attachments:
0001-Add-callback-to-reset-filenode-to-table-access-metho.patch (text/x-patch)
From 354ec1b5e86c673c27d1e578c8cddfa483fac659 Mon Sep 17 00:00:00 2001
From: Mats Kindahl <mats@timescale.com>
Date: Sun, 14 Feb 2021 18:58:43 +0100
Subject: Add callback to reset filenode to table access method
This commit adds a callback to `TableAccessMethod` that is called when
the table should be scheduled for unlinking, and implements the method
for the heap access method.
---
src/backend/access/heap/heapam_handler.c | 7 +++++++
src/backend/access/table/tableamapi.c | 1 +
src/backend/catalog/heap.c | 23 ++++++++++++++++++++++-
src/include/access/tableam.h | 20 ++++++++++++++++++++
4 files changed, 50 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 4a70e20a14..287f8a47cf 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -613,6 +613,12 @@ heapam_relation_set_new_filenode(Relation rel,
smgrclose(srel);
}
+static void
+heapam_relation_reset_filenode(Relation rel)
+{
+ RelationDropStorage(rel);
+}
+
static void
heapam_relation_nontransactional_truncate(Relation rel)
{
@@ -2566,6 +2572,7 @@ static const TableAmRoutine heapam_methods = {
.index_delete_tuples = heap_index_delete_tuples,
.relation_set_new_filenode = heapam_relation_set_new_filenode,
+ .relation_reset_filenode = heapam_relation_reset_filenode,
.relation_nontransactional_truncate = heapam_relation_nontransactional_truncate,
.relation_copy_data = heapam_relation_copy_data,
.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 325ecdc122..30e070383c 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -83,6 +83,7 @@ GetTableAmRoutine(Oid amhandler)
Assert(routine->tuple_lock != NULL);
Assert(routine->relation_set_new_filenode != NULL);
+ Assert(routine->relation_reset_filenode != NULL);
Assert(routine->relation_nontransactional_truncate != NULL);
Assert(routine->relation_copy_data != NULL);
Assert(routine->relation_copy_for_cluster != NULL);
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 9abc4a1f55..e1e8b93285 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1990,7 +1990,28 @@ heap_drop_with_catalog(Oid relid)
* Schedule unlinking of the relation's physical files at commit.
*/
if (RELKIND_HAS_STORAGE(rel->rd_rel->relkind))
- RelationDropStorage(rel);
+ switch (rel->rd_rel->relkind)
+ {
+ case RELKIND_VIEW:
+ case RELKIND_COMPOSITE_TYPE:
+ case RELKIND_FOREIGN_TABLE:
+ case RELKIND_PARTITIONED_TABLE:
+ case RELKIND_PARTITIONED_INDEX:
+ Assert(false);
+ break;
+
+ case RELKIND_INDEX:
+ case RELKIND_SEQUENCE:
+ RelationDropStorage(rel);
+ break;
+
+ case RELKIND_RELATION:
+ case RELKIND_TOASTVALUE:
+ case RELKIND_MATVIEW:
+ table_relation_reset_filenode(rel);
+ break;
+
+ }
/*
* Close relcache entry, but *keep* AccessExclusiveLock on the relation
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 33bffb6815..b5bb44a470 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -550,6 +550,14 @@ typedef struct TableAmRoutine
TransactionId *freezeXid,
MultiXactId *minmulti);
+ /*
+ * This callback needs to remove all associations with the relation `rel`
+ * since the relation is being dropped.
+ *
+ * See also table_relation_reset_filenode().
+ */
+ void (*relation_reset_filenode) (Relation rel);
+
/*
* This callback needs to remove all contents from `rel`'s current
* relfilenode. No provisions for transactional behaviour need to be made.
@@ -1510,6 +1518,18 @@ table_relation_set_new_filenode(Relation rel,
freezeXid, minmulti);
}
+/*
+ * Remove all association with storage for the relation.
+ *
+ * This is used when a relation is about to be dropped and removed from the
+ * catalog.
+ */
+static inline void
+table_relation_reset_filenode(Relation rel)
+{
+ rel->rd_tableam->relation_reset_filenode(rel);
+}
+
/*
* Remove all table contents from `rel`, in a non-transactional manner.
* Non-transactional meaning that there's no need to support rollbacks. This
--
2.25.1
Hi,
On 2021-02-22 08:33:21 +0100, Mats Kindahl wrote:
I started to experiment with the table access method interface to see if it
can be used for some ideas I have.
Cool.
The `relation_set_new_filenode` is indirectly called from
`heap_create_with_catalog`, but there is no corresponding callback from
`heap_drop_with_catalog`. It also seems like the intention is that the
callback should call `RelationCreateStorage` itself (makes sense, since the
access method knows about how to use the storage), so it seems natural to
add a `relation_reset_filenode` to the table AM that is called from
`heap_drop_with_catalog` for tables and add that to the heap implementation
(see the attached patch).
I don't think that's quite right. It's not exactly obvious from the
name, but RelationDropStorage() does not actually drop storage. Instead
it *schedules* the storage to be dropped upon commit.
The reason for deferring the dropping of table storage is that DDL in
postgres is transactional. Therefore we cannot remove the storage at the
moment the DROP TABLE is executed - only when the transaction that
performed the DDL commits. Therefore just providing you with a callback
that runs in heap_drop_with_catalog() doesn't really achieve much -
you'd not have a way to execute the "actual" dropping of the relation at
the later stage.
Creating new blocks for a table is straightforward to implement by using
the `relation_set_new_filenode` callback where you can create new memory
blocks for a relation, but I cannot find a way to clean up those blocks
when the table is dropped nor a way to handle a change of the schema for a
table.
What precisely do you mean with the "handle a change of the schema" bit?
I.e. what would you like to do, and what do you think is preventing you
from it? But before you answer see my next point below.
Altering the schema does not seem to be covered at all, but this is
something that table access methods need to know about since it might want
to optimize the internal storage when the schema changes. I have not been
able to find any discussions around this, but it seems like a natural thing
to do with a table. Have I misunderstood how this works?
Due to postgres' transactional DDL you cannot really change the storage
layout of *existing data* when that DDL command is executed - the data
still needs to be interpretable in case the DDL is rolled back
(including when crashing).
Before I explain some more: Could you describe in a bit more detail what
kind of optimization you'd like to make?
Back to schema change handling:
For some schema changes postgres assumes that they can be done
"in-place", e.g. adding a column to a table.
Other changes, e.g. changing the type of a column "sufficiently", will
cause a so called table rewrite. Which means that a new relation will be
created (including a call to relation_set_new_filenode()), then that new
relation will get all the new data inserted, and then
pg_class->relfilenode for the "original" relation will be changed to the
"rewritten" table (there are two variants of this, one for rewrites due
to ALTER TABLE and a separate one for VACUUM FULL/CLUSTER).
When the transaction containing such a rewrite commits that
->relfilenode change becomes visible for everyone, and the old
relfilenode will be deleted.
This means that right now there's no easy way to store the data anywhere
but in the file referenced by pg_class.relfilenode. I don't think
anybody would object on principle to making the necessary infrastructure
changes to support storing data elsewhere - but I think it'll also not be
quite as simple as the change you suggested :(.
Greetings,
Andres Freund
On Tue, Feb 23, 2021 at 2:11 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
Hi Andres,
Thanks for the answer and sorry about the late reply.
On 2021-02-22 08:33:21 +0100, Mats Kindahl wrote:
I started to experiment with the table access method interface to see if
it can be used for some ideas I have.
Cool.
The `relation_set_new_filenode` is indirectly called from
`heap_create_with_catalog`, but there is no corresponding callback from
`heap_drop_with_catalog`. It also seems like the intention is that the
callback should call `RelationCreateStorage` itself (makes sense, since the
access method knows about how to use the storage), so it seems natural to
add a `relation_reset_filenode` to the table AM that is called from
`heap_drop_with_catalog` for tables and add that to the heap implementation
(see the attached patch).
I don't think that's quite right. It's not exactly obvious from the
name, but RelationDropStorage() does not actually drop storage. Instead
it *schedules* the storage to be dropped upon commit.

The reason for deferring the dropping of table storage is that DDL in
postgres is transactional. Therefore we cannot remove the storage at the
moment the DROP TABLE is executed - only when the transaction that
performed the DDL commits. Therefore just providing you with a callback
that runs in heap_drop_with_catalog() doesn't really achieve much -
you'd not have a way to execute the "actual" dropping of the relation at
the later stage.
Yeah, I found the chain (performDeletion -> deleteOneObject -> doDeletion
-> heap_drop_with_catalog) where the delete was just scheduled, but it
appeared as if this was the place where the "actual" delete was performed.
Looking closer, I see this was the wrong location. However, the intention
was to get a callback when the "actual" delete should happen. Before that,
the blocks are still potentially live and could be read, so they shouldn't
be recycled.
The right location seems to be the storage manager (smgr_unlink in
smgr.c), but that does not appear to be extensible. Are there any plans
to make it available so that you can implement something other than just
"magnetic disk"?
Creating new blocks for a table is straightforward to implement by using
the `relation_set_new_filenode` callback where you can create new memory
blocks for a relation, but I cannot find a way to clean up those blocks
when the table is dropped nor a way to handle a change of the schema for a
table.
What precisely do you mean with the "handle a change of the schema" bit?
I.e. what would you like to do, and what do you think is preventing you
from it? But before you answer see my next point below.

Altering the schema does not seem to be covered at all, but this is
something that table access methods need to know about, since they might
want to optimize the internal storage when the schema changes. I have not
been able to find any discussions around this, but it seems like a natural
thing to do with a table. Have I misunderstood how this works?
Due to postgres' transactional DDL you cannot really change the storage
layout of *existing data* when that DDL command is executed - the data
still needs to be interpretable in case the DDL is rolled back
(including when crashing).
No, I didn't expect this, but I expected some means to see that a schema
change is about to happen.
Before I explain some more: Could you describe in a bit more detail what
kind of optimization you'd like to make?
This is not really about any optimizations; it is more about a good API for
tables and managing storage. If a memory table can be implemented entirely
in the extension and storage managed fully, there is a lot of interesting
potential for various implementations of table backends. For this to work I
think it is necessary to be able to handle schema changes for the backend
storage in addition to scans, inserts, updates, and deletes, but I am not
sure if it is already possible in some way that I haven't discovered or if
I should just try to propose something (making the storage manager API
extensible seems like a good first attempt).
Back to schema change handling:

For some schema changes postgres assumes that they can be done
"in-place", e.g. adding a column to a table.

Other changes, e.g. changing the type of a column "sufficiently", will
cause a so called table rewrite. Which means that a new relation will be
created (including a call to relation_set_new_filenode()), then that new
relation will get all the new data inserted, and then
pg_class->relfilenode for the "original" relation will be changed to the
"rewritten" table (there are two variants of this, one for rewrites due
to ALTER TABLE and a separate one for VACUUM FULL/CLUSTER).
But that is not visible in the access method interface. If I add debug
output to the memory table, I only see a call to needs_toast_table. If
there were a new call to create a new block and some additional
information, this would be possible to handle.
I *was* expecting either a call of set_filenode with a new xact id, or
something like that, with some information so that you can locate the
planned schema change (e.g., by digging through pg_class and friends), but
I just don't see that when I add debug output.
When the transaction containing such a rewrite commits that
->relfilenode change becomes visible for everyone, and the old
relfilenode will be deleted.

This means that right now there's no easy way to store the data anywhere
but in the file referenced by pg_class.relfilenode. I don't think
anybody would object on principle to making the necessary infrastructure
changes to support storing data elsewhere - but I think it'll also not be
quite as simple as the change you suggested :(.
Seems it is not. It is fine to store table information in pg_class, but to
implement "interesting" backends there needs to be a way to handle schema
changes (among other things), and I do not see how, unless I have
misunderstood how this is expected to work.
I have a lot more questions about the table access API, but this is the
first thing.
Best wishes,
Mats Kindahl
Greetings,
Andres Freund
Hi,
On 2021-03-03 22:15:18 +0100, Mats Kindahl wrote:
On Tue, Feb 23, 2021 at 2:11 AM Andres Freund <andres@anarazel.de> wrote:
Thanks for the answer and sorry about the late reply.
Mine is even later ;)
I don't think that's quite right. It's not exactly obvious from the
name, but RelationDropStorage() does not actually drop storage. Instead
it *schedules* the storage to be dropped upon commit.

The reason for deferring the dropping of table storage is that DDL in
postgres is transactional. Therefore we cannot remove the storage at the
moment the DROP TABLE is executed - only when the transaction that
performed the DDL commits. Therefore just providing you with a callback
that runs in heap_drop_with_catalog() doesn't really achieve much -
you'd not have a way to execute the "actual" dropping of the relation at
the later stage.

Yeah, I found the chain (performDeletion -> deleteOneObject -> doDeletion
-> heap_drop_with_catalog) where the delete was just scheduled, but it
appeared as if this was the place where the "actual" delete was performed.
Looking closer, I see this was the wrong location. However, the intention
was to get a callback when the "actual" delete should happen. Before that,
the blocks are still potentially live and could be read, so they shouldn't
be recycled.

The right location seems to be the storage manager (smgr_unlink in
smgr.c), but that does not appear to be extensible. Are there any plans
to make it available so that you can implement something other than just
"magnetic disk"?
There've been patches to add new types of storage below smgr.c, but not
in a way that can be done outside of build time, as far as I recall.
Before I explain some more: Could you describe in a bit more detail what
kind of optimization you'd like to make?

This is not really about any optimizations; it is more about a good API for
tables and managing storage. If a memory table can be implemented entirely
in the extension and storage managed fully, there is a lot of interesting
potential for various implementations of table backends. For this to work I
think it is necessary to be able to handle schema changes for the backend
storage in addition to scans, inserts, updates, and deletes, but I am not
sure if it is already possible in some way that I haven't discovered or if
I should just try to propose something (making the storage manager API
extensible seems like a good first attempt).
As long as you have a compatible definition of what is acceptable "in
place" ALTER TABLE (e.g. adding new columns, changing between compatible
types), and what requires a table rewrite (e.g. an incompatible column
type change), I don't see a real problem. Except for the unlink thing
above.
Any schema change requiring a table rewrite will trigger a new relation
to be created, which in turn will involve tableam. After that you'll
just get called back to re-insert all the tuples in the original
relation.
If you want a different definition on what needs a rewrite, good luck,
it'll be a heck of a lot more work.
Due to postgres' transactional DDL you cannot really change the storage
layout of *existing data* when that DDL command is executed - the data
still needs to be interpretable in case the DDL is rolled back
(including when crashing).

No, I didn't expect this, but I expected some means to see that a schema
change is about to happen.
For anything that's not in-place you'll see a new table being created
(c.f. ATRewriteTables() calling make_new_heap()). The relfilenode
identifying the data (as opposed to the oid, identifying a relation),
will then be swapped with the current table's relfilenode via
finish_heap_swap().
Other changes, e.g. changing the type of a column "sufficiently", will
cause a so called table rewrite. Which means that a new relation will be
created (including a call to relation_set_new_filenode()), then that new
relation will get all the new data inserted, and then
pg_class->relfilenode for the "original" relation will be changed to the
"rewritten" table (there are two variants of this, one for rewrites due
to ALTER TABLE and a separate one for VACUUM FULL/CLUSTER).
But that is not visible in the access method interface. If I add debug
output to the memory table, I only see a call to needs_toast_table. If
there were a new call to create a new block and some additional
information, this would be possible to handle.
It should be. If I e.g. do
CREATE TABLE blarg(id int4 not null);
I get one call to table_relation_set_new_filenode()
#0 table_relation_set_new_filenode (rel=0x7f84c2417b70, newrnode=0x7f84c2417b70, persistence=112 'p', freezeXid=0x7ffc8c61263c, minmulti=0x7ffc8c612638)
at /home/andres/src/postgresql/src/include/access/tableam.h:1596
#1 0x000055b1901e9116 in heap_create (relname=0x7ffc8c612900 "blarg", relnamespace=2200, reltablespace=0, relid=3016410, relfilenode=3016410, accessmtd=2,
tupDesc=0x55b191d2a8c8, relkind=114 'r', relpersistence=112 'p', shared_relation=false, mapped_relation=false, allow_system_table_mods=false,
relfrozenxid=0x7ffc8c61263c, relminmxid=0x7ffc8c612638) at /home/andres/src/postgresql/src/backend/catalog/heap.c:436
#2 0x000055b1901eab28 in heap_create_with_catalog (relname=0x7ffc8c612900 "blarg", relnamespace=2200, reltablespace=0, relid=3016410, reltypeid=0,
reloftypeid=0, ownerid=10, accessmtd=2, tupdesc=0x55b191d2a8c8, cooked_constraints=0x0, relkind=114 'r', relpersistence=112 'p', shared_relation=false,
mapped_relation=false, oncommit=ONCOMMIT_NOOP, reloptions=0, use_user_acl=true, allow_system_table_mods=false, is_internal=false, relrewrite=0,
typaddress=0x0) at /home/andres/src/postgresql/src/backend/catalog/heap.c:1291
#3 0x000055b19030002a in DefineRelation (stmt=0x55b191d31478, relkind=114 'r', ownerId=10, typaddress=0x0,
queryString=0x55b191c35bc0 "CREATE TABLE blarg(id int8 not null
then when I do
ALTER TABLE blarg ALTER COLUMN id TYPE int8;
I see
#0 table_relation_set_new_filenode (rel=0x7f84c241f2a8, newrnode=0x7f84c241f2a8, persistence=112 'p', freezeXid=0x7ffc8c61275c, minmulti=0x7ffc8c612758)
at /home/andres/src/postgresql/src/include/access/tableam.h:1596
#1 0x000055b1901e9116 in heap_create (relname=0x7ffc8c612860 "pg_temp_3016404", relnamespace=2200, reltablespace=0, relid=3016407, relfilenode=3016407,
accessmtd=2, tupDesc=0x7f84c24162a0, relkind=114 'r', relpersistence=112 'p', shared_relation=false, mapped_relation=false, allow_system_table_mods=true,
relfrozenxid=0x7ffc8c61275c, relminmxid=0x7ffc8c612758) at /home/andres/src/postgresql/src/backend/catalog/heap.c:436
#2 0x000055b1901eab28 in heap_create_with_catalog (relname=0x7ffc8c612860 "pg_temp_3016404", relnamespace=2200, reltablespace=0, relid=3016407, reltypeid=0,
reloftypeid=0, ownerid=10, accessmtd=2, tupdesc=0x7f84c24162a0, cooked_constraints=0x0, relkind=114 'r', relpersistence=112 'p', shared_relation=false,
mapped_relation=false, oncommit=ONCOMMIT_NOOP, reloptions=0, use_user_acl=false, allow_system_table_mods=true, is_internal=true, relrewrite=3016404,
typaddress=0x0) at /home/andres/src/postgresql/src/backend/catalog/heap.c:1291
I *was* expecting either a call of set_filenode with a new xact id, or
something like that, and with some information so that you can locate the
schema change planned (e.g., digging through pg_class and friends), I just
don't see that when I add debug output.
You should. And it'll have the new table "schema" associated. E.g. in
the above example the new table will have
rel->rd_att->natts == 1
rel->rd_att->attrs[0].atttypid == 20 (i.e. int8)
Greetings,
Andres Freund
On Mon, Mar 22, 2021 at 12:16 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-03-03 22:15:18 +0100, Mats Kindahl wrote:
On Tue, Feb 23, 2021 at 2:11 AM Andres Freund <andres@anarazel.de>
wrote:
Thanks for the answer and sorry about the late reply.
Mine is even later ;)
:)
Seems I keep the tradition. :)
Thanks a lot for the pointers, I have some comments below both about DROP
TABLE and ALTER TABLE.
I don't think that's quite right. It's not exactly obvious from the
name, but RelationDropStorage() does not actually drop storage. Instead
it *schedules* the storage to be dropped upon commit.

The reason for deferring the dropping of table storage is that DDL in
postgres is transactional. Therefore we cannot remove the storage at the
moment the DROP TABLE is executed - only when the transaction that
performed the DDL commits. Therefore just providing you with a callback
that runs in heap_drop_with_catalog() doesn't really achieve much -
you'd not have a way to execute the "actual" dropping of the relation at
the later stage.

Yeah, I found the chain (performDeletion -> deleteOneObject -> doDeletion
-> heap_drop_with_catalog) where the delete was just scheduled, but it
appeared as if this was the place where the "actual" delete was performed.
Looking closer, I see this was the wrong location. However, the intention
was to get a callback when the "actual" delete should happen. Before that,
the blocks are still potentially live and could be read, so they shouldn't
be recycled.

The right location seems to be the storage manager (smgr_unlink in
smgr.c), but that does not appear to be extensible. Are there any plans
to make it available so that you can implement something other than just
"magnetic disk"?

There've been patches to add new types of storage below smgr.c, but not
in a way that can be done outside of build time, as far as I recall.
I have done some more research and I do not think it is necessary to extend
the storage layer. As a matter of fact, I think the patch I suggested is
the right approach: let me elaborate on why.
Let's look at how the implementation works with the heap access method (the
file heapam_handler.c) and for this case let's use CREATE TABLE, DROP
TABLE, and TRUNCATE TABLE (the last one since it is supported in the Table
AM and hence is a good reference for the comparison).
Disregarding surrounding layers, we have three layers that are important
here:
1. Heap catalog layer (not sure what to call it, but it's the
src/backend/catalog/heap.c file)
2. AM layer (the src/backend/access/heap/heapam_handler.c file)
3. Storage layer (the src/backend/catalog/storage.c file) "code to
create and destroy physical storage for relations".
Looking at TRUNCATE TABLE, we have the following calls through these
layers:
1. In the heap catalog layer we have a call of heap_truncate_one_rel
which calls the table AM layer.
2. In the Table AM layer heapam_relation_nontransactional_truncate will
just call the storage layer to truncate the storage.
3. The storage layer gets called through RelationTruncate, which will
truncate the actual files.
Looking at CREATE TABLE, we have a similar pattern:
1. In the heap catalog layer heap_create_with_catalog is called, which
in turn calls heap_create, which will create the actual relcache entry and
also call the table AM layer if it is a relation, materialized view, or
TOAST table.
2. In the Table AM layer, heapam_relation_set_new_filenode is called
which will record the transaction identifiers and call the storage layer to
create the underlying storage.
3. In the storage layer, RelationCreateStorage will create the necessary
storage, but also register the table for deletion if the transaction is
aborted.
Note here that the storage layer remembers the table for deletion by saving
it in pendingDeletes, which is local to the storage layer.
Looking at DROP TABLE, we have a similar pattern, but one step is missing:
1. In the heap catalog layer the function heap_drop_with_catalog is
called, which releases the system cache and calls the storage layer to drop
the relation.
2. In the storage layer, the function RelationDropStorage is called,
which will record the table to be dropped in pendingDeletes.
When committing (or aborting) the transaction, there are two calls that are
interesting, in this order:
1. CallXactCallbacks which calls registered callbacks
2. smgrDoPendingDeletes, which calls the storage layer directly to
perform the actual deletion, if necessary.
Now, suppose that we want to replace the storage layer with a different
one. It is straightforward to replace it by implementing the Table AM
methods above, but we are missing a callback on dropping the table. If we
have that, we can record the table-to-be-dropped in a similar manner to how
the heap AM does it and register a transaction callback using
RegisterXactCallback.
Before I explain some more: Could you describe in a bit more detail
what kind of optimization you'd like to make?

This is not really about any optimizations; it is more about a good API for
tables and managing storage. If a memory table can be implemented entirely
in the extension and storage managed fully, there is a lot of interesting
potential for various implementations of table backends. For this to work I
think it is necessary to be able to handle schema changes for the backend
storage in addition to scans, inserts, updates, and deletes, but I am not
sure if it is already possible in some way that I haven't discovered or if
I should just try to propose something (making the storage manager API
extensible seems like a good first attempt).

As long as you have a compatible definition of what is acceptable "in
place" ALTER TABLE (e.g. adding new columns, changing between compatible
types), and what requires a table rewrite (e.g. an incompatible column
type change), I don't see a real problem. Except for the unlink thing
above.

Any schema change requiring a table rewrite will trigger a new relation
to be created, which in turn will involve tableam. After that you'll
just get called back to re-insert all the tuples in the original
relation.

If you want a different definition on what needs a rewrite, good luck,
it'll be a heck of a lot more work.
No, this should work fine.
Due to postgres' transactional DDL you cannot really change the storage
layout of *existing data* when that DDL command is executed - the data
still needs to be interpretable in case the DDL is rolled back
(including when crashing).

No, I didn't expect this, but I expected some means to see that a schema
change is about to happen.
For anything that's not in-place you'll see a new table being created
(c.f. ATRewriteTables() calling make_new_heap()). The relfilenode
identifying the data (as opposed to the oid, identifying a relation),
will then be swapped with the current table's relfilenode via
finish_heap_swap().

Other changes, e.g. changing the type of a column "sufficiently", will
cause a so called table rewrite. Which means that a new relation will be
created (including a call to relation_set_new_filenode()), then that new
relation will get all the new data inserted, and then
pg_class->relfilenode for the "original" relation will be changed to the
"rewritten" table (there are two variants of this, one for rewrites due
to ALTER TABLE and a separate one for VACUUM FULL/CLUSTER).

But that is not visible in the access method interface. If I add debug
output to the memory table, I only see a call to needs_toast_table. If
there were a new call to create a new block and some additional
information, this would be possible to handle.
It should be. If I e.g. do
CREATE TABLE blarg(id int4 not null);
I get one call to table_relation_set_new_filenode()#0 table_relation_set_new_filenode (rel=0x7f84c2417b70,
newrnode=0x7f84c2417b70, persistence=112 'p', freezeXid=0x7ffc8c61263c,
minmulti=0x7ffc8c612638)
at /home/andres/src/postgresql/src/include/access/tableam.h:1596
#1 0x000055b1901e9116 in heap_create (relname=0x7ffc8c612900 "blarg",
relnamespace=2200, reltablespace=0, relid=3016410, relfilenode=3016410,
accessmtd=2,
tupDesc=0x55b191d2a8c8, relkind=114 'r', relpersistence=112 'p',
shared_relation=false, mapped_relation=false,
allow_system_table_mods=false,
relfrozenxid=0x7ffc8c61263c, relminmxid=0x7ffc8c612638) at
/home/andres/src/postgresql/src/backend/catalog/heap.c:436
#2 0x000055b1901eab28 in heap_create_with_catalog (relname=0x7ffc8c612900
"blarg", relnamespace=2200, reltablespace=0, relid=3016410, reltypeid=0,
reloftypeid=0, ownerid=10, accessmtd=2, tupdesc=0x55b191d2a8c8,
cooked_constraints=0x0, relkind=114 'r', relpersistence=112 'p',
shared_relation=false,
mapped_relation=false, oncommit=ONCOMMIT_NOOP, reloptions=0,
use_user_acl=true, allow_system_table_mods=false, is_internal=false,
relrewrite=0,
typaddress=0x0) at
/home/andres/src/postgresql/src/backend/catalog/heap.c:1291
#3 0x000055b19030002a in DefineRelation (stmt=0x55b191d31478, relkind=114
'r', ownerId=10, typaddress=0x0,
    queryString=0x55b191c35bc0 "CREATE TABLE blarg(id int8 not null

then when I do
ALTER TABLE blarg ALTER COLUMN id TYPE int8;
I see
#0 table_relation_set_new_filenode (rel=0x7f84c241f2a8,
newrnode=0x7f84c241f2a8, persistence=112 'p', freezeXid=0x7ffc8c61275c,
minmulti=0x7ffc8c612758)
at /home/andres/src/postgresql/src/include/access/tableam.h:1596
#1 0x000055b1901e9116 in heap_create (relname=0x7ffc8c612860
"pg_temp_3016404", relnamespace=2200, reltablespace=0, relid=3016407,
relfilenode=3016407,
accessmtd=2, tupDesc=0x7f84c24162a0, relkind=114 'r',
relpersistence=112 'p', shared_relation=false, mapped_relation=false,
allow_system_table_mods=true,
relfrozenxid=0x7ffc8c61275c, relminmxid=0x7ffc8c612758) at
/home/andres/src/postgresql/src/backend/catalog/heap.c:436
#2 0x000055b1901eab28 in heap_create_with_catalog (relname=0x7ffc8c612860
"pg_temp_3016404", relnamespace=2200, reltablespace=0, relid=3016407,
reltypeid=0,
reloftypeid=0, ownerid=10, accessmtd=2, tupdesc=0x7f84c24162a0,
cooked_constraints=0x0, relkind=114 'r', relpersistence=112 'p',
shared_relation=false,
mapped_relation=false, oncommit=ONCOMMIT_NOOP, reloptions=0,
use_user_acl=false, allow_system_table_mods=true, is_internal=true,
relrewrite=3016404,
typaddress=0x0) at
/home/andres/src/postgresql/src/backend/catalog/heap.c:1291

I *was* expecting either a call of set_filenode with a new xact id, or
something like that, and with some information so that you can locate the
planned schema change (e.g., digging through pg_class and friends); I just
don't see that when I add debug output.
You should. And it'll have the new table "schema" associated. E.g. in
the above example the new table will have
rel->rd_att->natts == 1
rel->rd_att->attrs[0].atttypid == 20 (i.e. int8)
I didn't get a callback because I did ADD COLUMN and that works
differently: it does not call set_filenode until you either try to insert
something or run a vacuum. Thanks for the pointers, it helped a lot. I need
to look over the code a little more.
Thanks,
Mats Kindahl
Greetings,
Andres Freund
On 05.04.2021 22:57, Mats Kindahl wrote:
Now, suppose that we want to replace the storage layer with a
different one. It is straightforward to replace it by implementing the
Table AM methods above, but we are missing a callback on dropping the
table. If we have that, we can record the table-to-be-dropped in a
similar manner to how the heap AM does it and register a transaction
callback using RegisterXactCallback.
This explanation makes sense, and the suggested patch makes it easier to
replace the storage layer with a different one.
Some other places might become problematic if we're trying to implement
fully memory-based tables. For example, the heap_create_with_catalog ->
GetNewRelFilenode -> access() call that directly checks the existence of
a file bypassing the smgr layer. But I think that adding a symmetric
callback to the tableam layer can be a good start for further experiments.
Some nitpicks:
+ /*
+ * This callback needs to remove all associations with the relation `rel`
+ * since the relation is being dropped.
+ *
+ * See also table_relation_reset_filenode().
+ */
"Remove all associations" sounds vague, maybe something like "schedule
the relation files to be deleted at transaction commit"?
+ void (*relation_reset_filenode) (Relation rel);
This line uses spaces instead of tabs.
For the reference, there is a recent patch that makes the smgr layer
itself pluggable:
/messages/by-id/1dc080496f58ce5375778baed0c0fbcc@postgrespro.ru
--
Alexander Kuzmenkov
https://www.timescale.com/
Hi hackers,
As a matter of fact, I think the patch I suggested is the right approach:
let me elaborate on why.
[...]
It is straightforward to replace it by implementing the Table AM methods
above, but we are missing a callback on dropping the table. If we have that,
we can record the table-to-be-dropped in a similar manner to how the heap AM
does it and register a transaction callback using RegisterXactCallback.
Since no one objected in 5 months, I assume Mats made a good point. At least,
personally, I can't argue.
The patch looks good to me except for the fact that comments seem to be
inaccurate in light of the discussion. The corrected patch is attached.
I'm going to mark it as "Ready for Committer" unless anyone objects.
--
Best regards,
Aleksander Alekseev
Attachments:
v2-0001-Add-callback-to-reset-filenode-to-table-access-metho.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 9befe012a9..143941bb1b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -617,6 +617,12 @@ heapam_relation_set_new_filenode(Relation rel,
smgrclose(srel);
}
+static void
+heapam_relation_reset_filenode(Relation rel)
+{
+ RelationDropStorage(rel);
+}
+
static void
heapam_relation_nontransactional_truncate(Relation rel)
{
@@ -2572,6 +2578,7 @@ static const TableAmRoutine heapam_methods = {
.index_delete_tuples = heap_index_delete_tuples,
.relation_set_new_filenode = heapam_relation_set_new_filenode,
+ .relation_reset_filenode = heapam_relation_reset_filenode,
.relation_nontransactional_truncate = heapam_relation_nontransactional_truncate,
.relation_copy_data = heapam_relation_copy_data,
.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 325ecdc122..30e070383c 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -83,6 +83,7 @@ GetTableAmRoutine(Oid amhandler)
Assert(routine->tuple_lock != NULL);
Assert(routine->relation_set_new_filenode != NULL);
+ Assert(routine->relation_reset_filenode != NULL);
Assert(routine->relation_nontransactional_truncate != NULL);
Assert(routine->relation_copy_data != NULL);
Assert(routine->relation_copy_for_cluster != NULL);
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 83746d3fd9..2a6670cc60 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1997,7 +1997,28 @@ heap_drop_with_catalog(Oid relid)
* Schedule unlinking of the relation's physical files at commit.
*/
if (RELKIND_HAS_STORAGE(rel->rd_rel->relkind))
- RelationDropStorage(rel);
+ switch (rel->rd_rel->relkind)
+ {
+ case RELKIND_VIEW:
+ case RELKIND_COMPOSITE_TYPE:
+ case RELKIND_FOREIGN_TABLE:
+ case RELKIND_PARTITIONED_TABLE:
+ case RELKIND_PARTITIONED_INDEX:
+ Assert(false);
+ break;
+
+ case RELKIND_INDEX:
+ case RELKIND_SEQUENCE:
+ RelationDropStorage(rel);
+ break;
+
+ case RELKIND_RELATION:
+ case RELKIND_TOASTVALUE:
+ case RELKIND_MATVIEW:
+ table_relation_reset_filenode(rel);
+ break;
+
+ }
/*
* Close relcache entry, but *keep* AccessExclusiveLock on the relation
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 9f1e4a1ac9..c56ae46ffc 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -579,6 +579,14 @@ typedef struct TableAmRoutine
TransactionId *freezeXid,
MultiXactId *minmulti);
+ /*
+ * This callback needs to schedule the removal of all associations with the
+ * relation `rel` since the relation is about to be dropped.
+ *
+ * See also table_relation_reset_filenode().
+ */
+ void (*relation_reset_filenode) (Relation rel);
+
/*
* This callback needs to remove all contents from `rel`'s current
* relfilenode. No provisions for transactional behaviour need to be made.
@@ -1597,6 +1605,18 @@ table_relation_set_new_filenode(Relation rel,
freezeXid, minmulti);
}
+/*
+ * Schedule the removal of all association with storage for the relation.
+ *
+ * This is used when a relation is about to be dropped and removed from the
+ * catalog.
+ */
+static inline void
+table_relation_reset_filenode(Relation rel)
+{
+ rel->rd_tableam->relation_reset_filenode(rel);
+}
+
/*
* Remove all table contents from `rel`, in a non-transactional manner.
* Non-transactional meaning that there's no need to support rollbacks. This
Hi hackers,
I'm going to mark it as "Ready for Committer" unless anyone objects.
I updated the status of the patch.
To clarify, Alexander and I replied almost at the same time. The
drawbacks noted by Alexander are fixed in the v2 version of the patch.
--
Best regards,
Aleksander Alekseev
On 27/09/2021 14:59, Aleksander Alekseev wrote:
Hi hackers,
As a matter of fact, I think the patch I suggested is the right approach:
let me elaborate on why.
[...]
It is straightforward to replace it by implementing the Table AM methods
above, but we are missing a callback on dropping the table. If we have that,
we can record the table-to-be-dropped in a similar manner to how the heap AM
does it and register a transaction callback using RegisterXactCallback.

Since no one objected in 5 months, I assume Mats made a good point. At least,
personally, I can't argue.
I agree that having a table AM callback at relation drop would make it
more consistent with creating and truncating a relation. Then again, the
indexam API doesn't have a drop-callback either.
But what can you actually do in the callback? WAL replay of dropping the
storage needs to work without running any AM-specific code. It happens
as part of replaying a commit record. So whatever action you do in the
callback will not be executed at WAL replay. Also, because the callback
merely *schedules* things to happen at commit, it cannot generate
separate WAL records about dropping resources either.
Mats's in-memory table is an interesting example. I guess you don't even
try WAL-logging that, so it's OK that nothing happens at WAL replay. As
you said, you'd use the callback to schedule deletion of the shared memory
block and an end-of-xact callback to perform the deletion. You're
basically re-inventing a pending-deletes mechanism similar to smgr's.
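The pending-deletes pattern being described can be sketched as a standalone
model (all names here are hypothetical; this is not backend code, just an
illustration of the mechanism): dropping a table only *schedules* the removal
of its file node, and the scheduled work is performed or discarded at end of
transaction depending on whether the transaction committed.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a relation's file node identifier. */
typedef unsigned int FileNode;

typedef struct PendingDelete
{
	FileNode	node;
	bool		at_commit;		/* delete at commit (drop) or at abort (create)? */
	struct PendingDelete *next;
} PendingDelete;

static PendingDelete *pending_deletes = NULL;

/* Called when a table is dropped or created: schedule, don't delete yet. */
static void
schedule_delete(FileNode node, bool at_commit)
{
	PendingDelete *entry = malloc(sizeof(PendingDelete));

	entry->node = node;
	entry->at_commit = at_commit;
	entry->next = pending_deletes;
	pending_deletes = entry;
}

/*
 * End-of-transaction hook: perform (or discard) the scheduled deletions.
 * Returns the number of file nodes actually deleted.
 */
static int
at_eoxact(bool committed, void (*delete_node) (FileNode))
{
	int			ndeleted = 0;
	PendingDelete *entry = pending_deletes;

	while (entry != NULL)
	{
		PendingDelete *next = entry->next;

		/* Drops take effect on commit; created nodes are removed on abort. */
		if (entry->at_commit == committed)
		{
			delete_node(entry->node);
			ndeleted++;
		}
		free(entry);
		entry = next;
	}
	pending_deletes = NULL;
	return ndeleted;
}
```

An AM-side xact callback could walk such a list the way Heikki suggests
walking smgr's, but as noted below this alone does not cover redo.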
I think you could actually piggyback on smgr's pending-deletions
mechanism instead of re-inventing it. In the callback, you can call
smgrGetPendingDeletes(), and drop the shared memory segment for any
relation in that list.
- Heikki
On Wed, Feb 16, 2022 at 10:07 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 27/09/2021 14:59, Aleksander Alekseev wrote:
Hi hackers,
As a matter of fact, I think the patch I suggested is the right
approach:
let me elaborate on why.
[...]
It is straightforward to replace it by implementing the Table AM methods
above, but we are missing a callback on dropping the table. If we have that,
we can record the table-to-be-dropped in a similar manner to how the
heap AM
does it and register a transaction callback using RegisterXactCallback.
Since no one objected in 5 months, I assume Mats made a good point. At
least,
personally, I can't argue.
I agree that having a table AM callback at relation drop would make it
more consistent with creating and truncating a relation. Then again, the
indexam API doesn't have a drop-callback either.
That is actually a good point. We could add an on-drop-callback for the
indexam as well, if we add it for tableam. Haven't looked at that though,
so if you think it should be added, I can investigate.
But what can you actually do in the callback? WAL replay of dropping the
storage needs to work without running any AM-specific code. It happens
as part of replaying a commit record. So whatever action you do in the
callback will not be executed at WAL replay.
Also, because the callback
merely *schedules* things to happen at commit, it cannot generate
separate WAL records about dropping resources either.
Digressing slightly: This is actually a drawback and I have been looking
for a way to do things on recovery.
Just to have an example: if you want to have an in-memory table that is
distributed the problem will be that even though the table is empty, it
might actually be of interest to fetch the contents from another node on
recovery, but this is currently not possible since it is assumed that all
actions are present in the WAL and a memory table would not use the WAL for
this since one of the goals is to make it fast and avoid writes to disk.
This might be possible to piggyback on the first select or insert done on
the table, but it makes the system more complicated since it is easy to
miss one place where you need to do this fetching. If you always do this on
recovery it is a single place that you need to add and the system will
after that be in a predictable state.
In addition, if it were to be added to the first access of the table, it
would add execution time to this first operation, but most users would
assume that all such work is done at recovery and that the database is
"warm" after recovery.
However, this is slightly outside the discussion for this proposed change,
so we can ignore it for now.
Mats's in-memory table is an interesting example. I guess you don't even
try WAL-logging that, so it's OK that nothing happens at WAL replay. As
you said, the callback to schedule deletion of the shared memory block
and use an end-of-xact callback to perform the deletion. You're
basically re-inventing a pending-deletes mechanism similar to smgr's.
I think you could actually piggyback on smgr's pending-deletions
mechanism instead of re-inventing it. In the callback, you can call
smgrGetPendingDeletes(), and drop the shared memory segment for any
relation in that list.
Hmm... it is a good point that smgrGetPendingDeletes() can be used in the
commit callback for something as simple as a memory table when all the data
is local. It should also work well with a distributed optimistic storage
engine when you certify the transaction at commit time. What will happen
then is that the actual "drop command" will be sent out at commit time
rather than when the command is actually executed.
My main interest is, however, to have an API that works for all kinds of
storage engines, not just limited to local storage but also supporting
distributed storage systems and also being able to interact with existing
implementations. There are a few reasons why getting a notification when
the table is dropped rather than when the commit is done is beneficial.
1. In a distributed storage engine you might want to distribute changes
speculatively when they happen so that the commit, once it occurs, will be
fast. By sending out the action early, you allow work to start
independently of the current machine, which will improve parallelization.
2. In a distributed storage engine, the order of the statements received
remotely makes a difference. For example, if you want to use a distributed
locking scheme for your distributed storage, you are currently forced to
implement an optimistic scheme while in reality you might want to
distribute the drop and lock the table exclusively on all remote nodes
(this is already what PostgreSQL does, locking the table on a drop). I do
realize that distributed transactions are not that simple and there are
other problems associated with this, but this would still introduce an
unnecessary restriction on what you can do.
3. A problem with optimistic protocols in general is that they drop in
performance when you have a lot of writes. It is simply the case that other
smaller transactions will constantly force a long-running transaction to be
aborted. This also means that there is a risk that a transaction that drops
a table will have to be aborted out of necessity since other transactions
are updating the table. In a distributed system there will be more of
those, so the odds of aborting "more important" transactions (in the sense
of needing stronger locks) are higher.
4. A smaller issue is that right now the storage manager (smgr) and the
transaction system are quite tightly coupled, which makes it more difficult
to make the storage system "pluggable". I think that not requiring the use
of pendingDeletes would move one step in the direction of removing this
coupling, but I am not entirely sure here.
It is likely that many of these problems can be worked around by placing
restrictions on how DDLs can be used in transactions, but that would create
unnecessary restrictions for the end-user. It might also be possible to
find implementation workarounds by placing code at strategic points in the
implementation, but this is error-prone and the risk of making an error is
higher.
Best wishes,
Mats Kindahl
- Heikki
Hi,
On 2021-04-05 21:57:12 +0200, Mats Kindahl wrote:
2. In the storage layer, the function RelationDropStorage is called,
which will record the table to be dropped in the pendingDeletes

When committing (or aborting) the transaction, there are two calls that are
interesting, in this order:

1. CallXactCallbacks which calls registered callbacks
2. smgrDoPendingDeletes, which calls the storage layer directly to
perform the actual deletion, if necessary.

Now, suppose that we want to replace the storage layer with a different
one. It is straightforward to replace it by implementing the Table AM
methods above, but we are missing a callback on dropping the table. If we
have that, we can record the table-to-be-dropped in a similar manner to how
the heap AM does it and register a transaction callback using
RegisterXactCallback.
I don't think implementing dropping relation data at-commit/rollback using
xact callbacks can be correct. The dropping needs to be integrated with the
commit / abort records, so it is redone during crash recovery - that's not
possible with xact callbacks.
To me it still seems fundamentally the wrong direction to implement a "drop
relation callback" tableam callback.
Greetings,
Andres Freund
Hi,
The CF entry for this patch doesn't currently apply and there has been a bunch
of feedback on the approach. Mats, are you actually waiting for further
feedback right now?
Greetings,
Andres Freund
On Sun, Oct 02, 2022 at 09:53:01AM -0700, Andres Freund wrote:
The CF entry for this patch doesn't currently apply and there has been a bunch
of feedback on the approach. Mats, are you actually waiting for further
feedback right now?
Okay, for now this has been marked as RwF.
--
Michael
Hello all,
I think the discussion went a little sideways, so let me recap what I'm
suggesting:
1. I mentioned that there is a missing callback when the filenode is
unlinked and this is particularly evident when dropping a table.
2. It was correctly pointed out to me that an implementor needs to ensure
that dropping a table is transactional.
3. I argued that the callback is still correct and outlined how this can
be handled by a table access method using xact callbacks, if necessary.
I see huge potential in the table access method and would like to do my
part in helping it in succeeding. I noted that the API is biased in how the
existing heap implementation works and is also very focused on
implementations of "local" storage engines. For more novel architectures
(for example, various sorts of distributed architectures) and to be easier
to work with, I think that the API can be improved in a few places. This is
a first step in the direction of making the API both easier to use as well
as enabling more novel use-cases.
Writing implementations with ease is more about having the right callbacks
in the right places and providing the right information, but in some cases
it is just not possible to implement efficient functionality with the
current interface. I think it can be useful to separate these two kinds of
enhancements when discussing the API, but I think both are important for
the table access methods to be practically usable and to leverage the full
power of this concept.
On Tue, Aug 2, 2022 at 1:44 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-04-05 21:57:12 +0200, Mats Kindahl wrote:
2. In the storage layer, the function RelationDropStorage is called,
which will record the table to be dropped in the pendingDeletes

When committing (or aborting) the transaction, there are two calls that are
interesting, in this order:
1. CallXactCallbacks which calls registered callbacks
2. smgrDoPendingDeletes, which calls the storage layer directly to
perform the actual deletion, if necessary.

Now, suppose that we want to replace the storage layer with a different
one. It is straightforward to replace it by implementing the Table AM
methods above, but we are missing a callback on dropping the table. If we
have that, we can record the table-to-be-dropped in a similar manner tohow
the heap AM does it and register a transaction callback using
RegisterXactCallback.

I don't think implementing dropping relation data at-commit/rollback using
xact callbacks can be correct. The dropping needs to be integrated with the
commit / abort records, so it is redone during crash recovery - that's not
possible with xact callbacks.
Yes, but this patch is about making the extension aware that a file node is
being unlinked.
To me it still seems fundamentally the wrong direction to implement a "drop
relation callback" tableam callback.
This is not really a "drop table" callback, it is just the most obvious
case where this is missing. So, just to recap the situation as it looks
right now.
Here is (transactional) truncate table:
1. Allocate a new file node in the same tablespace as the table
2. Add the file node to the list of pending node to delete
3. Overwrite the existing file node in the relation with the new one
4. Call table_relation_set_new_filenode to tell extension that there is
a new filenode for the relation
Here is drop table:
1. Add the existing file node to the list of pending deletes
2. Remove the table from the catalogs
For an extension writer, the disappearance of the old file node is
"invisible" since there is no callback about this, but it is very clear
when a new file node is allocated. In addition to being inconsistent, it
adds an extra burden on the extension writer. To notice that a file node
has been unlinked you can register a transaction handler and investigate
the pending list at commit or abort time. Even though possible, there are
two problems with this: 1) the table access method is notified "late" in
the transaction that the file node is going away, and 2) it is
unnecessarily complicated to register a transaction handler only for
inspecting this.
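The proposed callback dispatch can be illustrated with a standalone miniature
(names here are invented for illustration, not the actual tableam API): the
reset callback is optional, so the wrapper only invokes it when the AM defines
it, which is what lets existing AMs ignore the notification entirely.

```c
#include <stddef.h>

/*
 * Hypothetical miniature of the proposed table AM dispatch: an optional
 * callback that is invoked when a relation's file node is unlinked.
 */
typedef struct MiniTableAmRoutine
{
	void		(*relation_reset_filelocator) (unsigned int filenode);
} MiniTableAmRoutine;

static int	resets_seen = 0;
static unsigned int last_reset = 0;

static void
inmem_reset_filelocator(unsigned int filenode)
{
	/* An in-memory AM would release its shared memory block here. */
	resets_seen++;
	last_reset = filenode;
}

/* An AM that tracks internal resources per file node. */
static const MiniTableAmRoutine inmem_am = {
	.relation_reset_filelocator = inmem_reset_filelocator,
};

/* An AM that does not care about unlink notifications. */
static const MiniTableAmRoutine plain_am = {
	.relation_reset_filelocator = NULL,
};

/* Wrapper mirroring the NULL check: undefined callbacks are simply skipped. */
static void
relation_reset_filelocator(const MiniTableAmRoutine *am, unsigned int filenode)
{
	if (am->relation_reset_filelocator != NULL)
		am->relation_reset_filelocator(filenode);
}
```

With this shape, drop and truncate both notify the AM at the point where the
old file node disappears, instead of leaving the AM to discover it later.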
Telling the access method that the filenode is unlinked by adding a
callback is by far the best solution since it does not affect existing
extensions and will give the table access methods opportunities to act on
it immediately.
I have attached an updated patch that changes the names of the callbacks
since there was a name change. I had also missed the case of unlinking a
file node when tables were truncated, so I have added a callback for this
as well.
Best wishes,
Mats Kindahl
Greetings,
Andres Freund
Attachments:
0001-Add-callback-for-unlinking-file-node-of-table.patch (text/x-patch)
From 5023c5d9deea7008e4b74f02564262714fb8c6b6 Mon Sep 17 00:00:00 2001
From: Mats Kindahl <mats@timescale.com>
Date: Mon, 31 Oct 2022 14:27:03 +0100
Subject: [PATCH] Add callback for unlinking file node of table
When dropping or truncating a table, the old file locator is unlinked
and a new file locator created for the relation. This is, however, done
without notifying the table access method that the unlinking is taking
place. This becomes a problem if the table access method has an
internal resource associated with the file locator that also needs to
be unlinked or cleaned up.
This commit adds a callback to the table access method for resetting the
file locator when it is dropped, and calls it when unlinking the file
locator so that the table access method can unlink the resource
internally, if necessary. The callback can be undefined (that is, NULL)
if the table access method does not require notification that a resource
is unlinked.
---
src/backend/access/heap/heapam_handler.c | 7 +++++++
src/backend/catalog/heap.c | 22 ++++++++++++++++++++-
src/backend/utils/cache/relcache.c | 7 ++++++-
src/include/access/tableam.h | 25 ++++++++++++++++++++++++
4 files changed, 59 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 41f1ca65d0..d67b1dcfbf 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -615,6 +615,12 @@ heapam_relation_set_new_filelocator(Relation rel,
smgrclose(srel);
}
+static void
+heapam_relation_reset_filelocator(Relation rel)
+{
+ RelationDropStorage(rel);
+}
+
static void
heapam_relation_nontransactional_truncate(Relation rel)
{
@@ -2566,6 +2572,7 @@ static const TableAmRoutine heapam_methods = {
.index_delete_tuples = heap_index_delete_tuples,
.relation_set_new_filelocator = heapam_relation_set_new_filelocator,
+ .relation_reset_filelocator = heapam_relation_reset_filelocator,
.relation_nontransactional_truncate = heapam_relation_nontransactional_truncate,
.relation_copy_data = heapam_relation_copy_data,
.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 5b49cc5a09..1d1379424f 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1850,7 +1850,27 @@ heap_drop_with_catalog(Oid relid)
* Schedule unlinking of the relation's physical files at commit.
*/
if (RELKIND_HAS_STORAGE(rel->rd_rel->relkind))
- RelationDropStorage(rel);
+ switch (rel->rd_rel->relkind)
+ {
+ case RELKIND_VIEW:
+ case RELKIND_COMPOSITE_TYPE:
+ case RELKIND_FOREIGN_TABLE:
+ case RELKIND_PARTITIONED_TABLE:
+ case RELKIND_PARTITIONED_INDEX:
+ Assert(false);
+ break;
+
+ case RELKIND_INDEX:
+ case RELKIND_SEQUENCE:
+ RelationDropStorage(rel);
+ break;
+
+ case RELKIND_RELATION:
+ case RELKIND_TOASTVALUE:
+ case RELKIND_MATVIEW:
+ table_relation_reset_filelocator(rel);
+ break;
+ }
/* ensure that stats are dropped if transaction commits */
pgstat_drop_relation(rel);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index bd6cd4e47b..d39c57b855 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3778,9 +3778,14 @@ RelationSetNewRelfilenumber(Relation relation, char persistence)
smgrdounlinkall(&srel, 1, false);
smgrclose(srel);
}
+ if (RELKIND_HAS_TABLE_AM(relation->rd_rel->relkind))
+ {
+ table_relation_reset_filelocator(relation);
+ }
else
{
- /* Not a binary upgrade, so just schedule it to happen later. */
+ /* Not a binary upgrade and not a relation with a table access method,
+ * so just schedule it to happen later. */
RelationDropStorage(relation);
}
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e45d73eae3..5141f9b3b7 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -581,6 +581,15 @@ typedef struct TableAmRoutine
TransactionId *freezeXid,
MultiXactId *minmulti);
+ /*
+ * This callback needs to schedule the removal of all associations with
+ * the relation `rel` since the relation is about to be dropped or the
+ * storage recycled for other reasons.
+ *
+ * See also table_relation_reset_filelocator().
+ */
+ void (*relation_reset_filelocator) (Relation rel);
+
/*
* This callback needs to remove all contents from `rel`'s current
* relfilelocator. No provisions for transactional behaviour need to be
@@ -1600,6 +1609,22 @@ table_relation_set_new_filelocator(Relation rel,
minmulti);
}
+/*
+ * Schedule the removal of all association with storage for the relation.
+ *
+ * This is used when a relation is about to be dropped and removed from the
+ * catalog.
+ *
+ * Since not all access methods require this callback, we do not require the
+ * reset callback to be defined and only call it if it is actually available.
+ */
+static inline void
+table_relation_reset_filelocator(Relation rel)
+{
+ if (rel->rd_tableam)
+ rel->rd_tableam->relation_reset_filelocator(rel);
+}
+
/*
* Remove all table contents from `rel`, in a non-transactional manner.
* Non-transactional meaning that there's no need to support rollbacks. This
--
2.34.1
Hi,
On 2022-11-16 14:49:59 +0100, Mats Kindahl wrote:
I think the discussion went a little sideways, so let me recap what I'm
suggesting:

1. I mentioned that there is a missing callback when the filenode is
unlinked and this is particularly evident when dropping a table.
2. It was correctly pointed out to me that an implementor needs to ensure
that dropping a table is transactional.
3. I argued that the callback is still correct and outlined how this can
be handled by a table access method using xact callbacks, if necessary.
I still think 3) isn't a solution to 2). The main issue is that xact callbacks
don't address that storage has to be removed during redo as well. For that
dropping storage has to be integrated with commit / abort records.
I don't think a custom WAL rmgr addresses this either - it has to be
integrated with the commit record, and the record has to be replayed as a whole.
I see huge potential in the table access method and would like to do my
part in helping it in succeeding. I noted that the API is biased in how the
existing heap implementation works and is also very focused on
implementations of "local" storage engines. For more novel architectures
(for example, various sorts of distributed architectures) and to be easier
to work with, I think that the API can be improved in a few places. This is
a first step in the direction of making the API both easier to use as well
as enabling more novel use-cases.
I agree with that - there's lots more work to be done and the evolution from
not having the abstraction clearly shows. Some of the deficiencies are easy to
fix, but others are there because there's no quick solution to them.
To me it still seems fundamentally the wrong direction to implement a "drop
relation callback" tableam callback.This is not really a "drop table" callback, it is just the most obvious
case where this is missing. So, just to recap the situation as it looks
right now.

Here is (transactional) truncate table:
1. Allocate a new file node in the same tablespace as the table
2. Add the file node to the list of pending node to delete
3. Overwrite the existing file node in the relation with the new one
4. Call table_relation_set_new_filenode to tell extension that there is
a new filenode for the relation

Here is drop table:
1. Add the existing file node to the list of pending deletes
2. Remove the table from the catalogs

For an extension writer, the disappearance of the old file node is
"invisible" since there is no callback about this, but it is very clear
when a new file node is allocated. In addition to being inconsistent, it
adds an extra burden on the extension writer. To notice that a file node
has been unlinked you can register a transaction handler and investigate
the pending list at commit or abort time. Even though possible, there are
two problems with this: 1) the table access method is notified "late" in
the transaction that the file node is going away, and 2) it is
unnecessarily complicated to register a transaction handler only for
inspecting this.
Using a transaction callback solely also doesn't address the redo issue...
I think to make this viable for filenodes that don't look like md.c's, you'd
have to add a way to make commit/abort records extensible by custom
rmgrs. Then the replay of a commit/abort could call the custom rmgr for the
"sub-record" during replay.
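The extensible-commit-record idea can be sketched as a standalone model
(the flat layout and all names are hypothetical, not the actual WAL format):
the commit record carries a sequence of (rmgr_id, len, payload) sub-records,
and replay walks the buffer and dispatches each payload to the registered
rmgr, so AM-specific drop work is redone during crash recovery.

```c
#include <string.h>

#define MAX_RMGRS 4

/* Replay handler for one custom rmgr's sub-record payload. */
typedef void (*SubRecordReplay) (const char *payload, int len);

static SubRecordReplay rmgr_replay[MAX_RMGRS];

/* Append one (rmgr_id, len, payload) entry; returns the new buffer offset. */
static int
append_subrecord(char *buf, int off, int rmgr_id, const char *payload, int len)
{
	memcpy(buf + off, &rmgr_id, sizeof(int));
	memcpy(buf + off + sizeof(int), &len, sizeof(int));
	memcpy(buf + off + 2 * sizeof(int), payload, len);
	return off + 2 * (int) sizeof(int) + len;
}

/* Replay of the commit record: hand each sub-record to its rmgr. */
static void
replay_commit(const char *buf, int total)
{
	int			off = 0;

	while (off < total)
	{
		int			rmgr_id;
		int			len;

		memcpy(&rmgr_id, buf + off, sizeof(int));
		memcpy(&len, buf + off + sizeof(int), sizeof(int));
		rmgr_replay[rmgr_id] (buf + off + 2 * sizeof(int), len);
		off += 2 * (int) sizeof(int) + len;
	}
}
```

The point of the model is that the AM's cleanup is driven by the commit
record itself during redo, which is exactly what an xact callback cannot do.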
Telling the access method that the filenode is unlinked by adding a
callback is by far the best solution since it does not affect existing
extensions and will give the table access methods opportunities to act on
it immediately.
I'm loathe to add a callback that I don't think can be used correctly without
further changes.
Greetings,
Andres Freund