Extensible storage manager API - SMGR hook Redux
Hi hackers,
At Neon, we've been working on removing the file system dependency
from PostgreSQL and replacing it with a distributed storage layer. For
now, we've seen most success in this by replacing the implementation
of the smgr API, but it did require some core modifications like those
proposed early last year by Anastasia [0]/messages/by-id/CAP4vRV6JKXyFfEOf=n+v5RGsZywAQ3CTM8ESWvgq+S87Tmgx_g@mail.gmail.com.
As mentioned in the previous thread, there are several reasons why you
would want to use a non-default storage manager: storage-level
compression, encryption, and disk limit quotas [0]/messages/by-id/CAP4vRV6JKXyFfEOf=n+v5RGsZywAQ3CTM8ESWvgq+S87Tmgx_g@mail.gmail.com; offloading of cold
relation data was also mentioned [1]/messages/by-id/D365F19F-BC3E-4F96-A91E-8DB13049749E@yandex-team.ru.
In the thread on Anastasia's patch, Yura Sokolov mentioned that
instead of a hook-based smgr extension, a registration-based smgr
would be preferred, with integration into namespaces. Please find
attached an as of yet incomplete patch that starts to do that.
The patch is yet incomplete (as it isn't derived from Anastasia's
patch), but I would like comments on this regardless, as this is a
fairly fundamental component of PostgreSQL that is being modified, and
it is often better to get comments early in the development cycle. One
significant issue that I've seen so far are that catcache is not
guaranteed to be available in all backends that need to do smgr
operations, and I've not yet found a good solution.
Changes compared to HEAD:
- smgrsw is now dynamically allocated and grows as new storage
managers are loaded (during shared_preload_libraries)
- CREATE TABLESPACE has new optional syntax USING smgrname (option [, ...])
- tablespace storage is (planned) fully managed by smgr through some
new smgr apis
Changes compared to Anastasia's patch:
- extensions do not get to hook and replace the api of the smgr code
directly - they are hidden behind the smgr registry.
Successes:
- 0001 passes tests (make check-world)
- 0002 builds without warnings (make)
TODO:
- fix dependency failures when catcache is unavailable
- tablespace redo is currently broken with 0002
- fix tests for 0002
- ensure that pg_dump etc. works with the new tablespace storage manager options
Looking forward to any comments, suggestions and reviews.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech/)
[0]: /messages/by-id/CAP4vRV6JKXyFfEOf=n+v5RGsZywAQ3CTM8ESWvgq+S87Tmgx_g@mail.gmail.com
[1]: /messages/by-id/D365F19F-BC3E-4F96-A91E-8DB13049749E@yandex-team.ru
Attachments:
v1-0001-Expose-f_smgr-to-extensions-for-manual-implementa.patchapplication/octet-stream; name=v1-0001-Expose-f_smgr-to-extensions-for-manual-implementa.patchDownload+242-127
v1-0002-Prototype-Allow-tablespaces-to-specify-which-SMGR.patchapplication/octet-stream; name=v1-0002-Prototype-Allow-tablespaces-to-specify-which-SMGR.patchDownload+407-179
Hi,
On 2023-06-30 14:26:44 +0200, Matthias van de Meent wrote:
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 4c49393fc5..8685b9fde6 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -1002,6 +1002,11 @@ PostmasterMain(int argc, char *argv[]) */ ApplyLauncherRegister();+ /* + * Register built-in managers that are not part of static arrays + */ + register_builtin_dynamic_managers(); + /* * process any libraries that should be preloaded at postmaster start */
That doesn't strike me as a good place to initialize this, we'll need it in
multiple places that way. How about putting it into BaseInit()?
-static const f_smgr smgrsw[] = { +static f_smgr *smgrsw;
This adds another level of indirection. I would rather limit the number of
registerable smgrs than do that.
+SMgrId +smgr_register(const f_smgr *smgr, Size smgrrelation_size) +{
+ MemoryContextSwitchTo(old); + + pg_compiler_barrier();
Huh, what's that about?
@@ -59,14 +63,8 @@ typedef struct SMgrRelationData * Fields below here are intended to be private to smgr.c and its * submodules. Do not touch them from elsewhere. */ - int smgr_which; /* storage manager selector */ - - /* - * for md.c; per-fork arrays of the number of open segments - * (md_num_open_segs) and the segments themselves (md_seg_fds). - */ - int md_num_open_segs[MAX_FORKNUM + 1]; - struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1]; + SMgrId smgr_which; /* storage manager selector */ + int smgrrelation_size; /* size of this struct, incl. smgr-specific data */
It looked to me like you determined this globally - why do we need it in every
entry then?
Greetings,
Andres Freund
Subject: [PATCH v1 1/2] Expose f_smgr to extensions for manual implementation
From what I can see, all the md* APIs that were exposed in md.h can now
be made static in md.c. The only other references to those APIs were in
smgr.c.
Subject: [PATCH v1 2/2] Prototype: Allow tablespaces to specify which SMGR
they use
-typedef uint8 SMgrId; +/* + * volatile ID of the smgr. Across various configurations IDs may vary, + * true identity is the name of each smgr. + */ +typedef int SMgrId;-#define MaxSMgrId UINT8_MAX +#define MaxSMgrId INT_MAX
In a future revision of this patch, seems worthwhile to just start as
int instead of a uint8 to avoid this song and dance. Maybe int8 instead
of int?
+static SMgrId recent_smgrid = -1;
You could use InvalidSmgrId here.
+void smgrvalidatetspopts(const char *smgrname, List *opts) +{ + SMgrId smgrid = get_smgr_by_name(smgrname, false); + + smgrsw[smgrid].smgr_validate_tspopts(opts); +} + +void smgrcreatetsp(const char *smgrname, Oid tsp, List *opts, bool isredo) +{ + SMgrId smgrid = get_smgr_by_name(smgrname, false); + + smgrsw[smgrid].smgr_create_tsp(tsp, opts, isredo); +} + +void smgrdroptsp(const char *smgrname, Oid tsp, bool isredo) +{ + SMgrId smgrid = get_smgr_by_name(smgrname, false); + + smgrsw[smgrid].smgr_drop_tsp(tsp, isredo); +}
Do you not need to check if smgrid is the InvalidSmgrId? I didn't see
any other validation anywhere.
+ char *smgr; + List *smgropts; /* list of DefElem nodes */
smgrname would probably work better alongside tablespacename in that
struct.
@@ -221,7 +229,7 @@ mdexists(SMgrRelation reln, ForkNumber forknum)
if (!InRecovery)
mdclose(reln, forknum);- return (mdopenfork(reln, forknum, EXTENSION_RETURN_NULL) != NULL); + return (mdopenfork(mdreln, forknum, EXTENSION_RETURN_NULL) != NULL); }
Was this a victim of a bad rebase? Seems like it belongs in the previous
patch.
+void mddroptsp(Oid tsp, bool isredo) +{ + +}
Some functions in this file have the return type on the previous line.
This is a pretty slick patchset. Excited to read more dicussion and how
it evolves.
--
Tristan Partin
Neon (https://neon.tech)
Found these warnings while compiling while only 0001 is applied.
[1166/2337] Compiling C object src/backend/postgres_lib.a.p/storage_smgr_md.c.o
../src/backend/storage/smgr/md.c: In function ‘mdexists’:
../src/backend/storage/smgr/md.c:224:28: warning: passing argument 1 of ‘mdopenfork’ from incompatible pointer type [-Wincompatible-pointer-types]
224 | return (mdopenfork(reln, forknum, EXTENSION_RETURN_NULL) != NULL);
| ^~~~
| |
| SMgrRelation {aka SMgrRelationData *}
../src/backend/storage/smgr/md.c:167:43: note: expected ‘MdSMgrRelation’ {aka ‘MdSMgrRelationData *’} but argument is of type ‘SMgrRelation’ {aka ‘SMgrRelationData *’}
167 | static MdfdVec *mdopenfork(MdSMgrRelation reln, ForkNumber forknum, int behavior);
| ~~~~~~~~~~~~~~~^~~~
../src/backend/storage/smgr/md.c: In function ‘mdcreate’:
../src/backend/storage/smgr/md.c:287:40: warning: passing argument 1 of ‘register_dirty_segment’ from incompatible pointer type [-Wincompatible-pointer-types]
287 | register_dirty_segment(reln, forknum, mdfd);
| ^~~~
| |
| SMgrRelation {aka SMgrRelationData *}
../src/backend/storage/smgr/md.c:168:51: note: expected ‘MdSMgrRelation’ {aka ‘MdSMgrRelationData *’} but argument is of type ‘SMgrRelation’ {aka ‘SMgrRelationData *’}
168 | static void register_dirty_segment(MdSMgrRelation reln, ForkNumber forknum,
Here is a diff to be applied to 0001 which fixes the warnings that get
generated when compiling. I did see that one of the warnings gets fixed
0002 (the mdexists() one). I am assuming that change was just missed
while rebasing the patchset or something. I did not see a fix for
mdcreate() in 0002 however.
--
Tristan Partin
Neon (https://neon.tech)
Attachments:
smgr.difftext/x-patch; charset=utf-8; name=smgr.diffDownload+4-2
Sorry for double-posting, I accidentally replied to Matthias, not the
mailing list :(
---------- Forwarded message ---------
From: Kirill Reshke <reshkekirill@gmail.com>
Date: Mon, 4 Dec 2023 at 19:46
Subject: Re: Extensible storage manager API - SMGR hook Redux
To: Matthias van de Meent <boekewurm+postgres@gmail.com>
Hi!
On Fri, 30 Jun 2023 at 15:27, Matthias van de Meent <
boekewurm+postgres@gmail.com> wrote:
Hi hackers,
At Neon, we've been working on removing the file system dependency
from PostgreSQL and replacing it with a distributed storage layer. For
now, we've seen most success in this by replacing the implementation
of the smgr API, but it did require some core modifications like those
proposed early last year by Anastasia [0].As mentioned in the previous thread, there are several reasons why you
would want to use a non-default storage manager: storage-level
compression, encryption, and disk limit quotas [0]; offloading of cold
relation data was also mentioned [1].In the thread on Anastasia's patch, Yura Sokolov mentioned that
instead of a hook-based smgr extension, a registration-based smgr
would be preferred, with integration into namespaces. Please find
attached an as of yet incomplete patch that starts to do that.The patch is yet incomplete (as it isn't derived from Anastasia's
patch), but I would like comments on this regardless, as this is a
fairly fundamental component of PostgreSQL that is being modified, and
it is often better to get comments early in the development cycle. One
significant issue that I've seen so far are that catcache is not
guaranteed to be available in all backends that need to do smgr
operations, and I've not yet found a good solution.Changes compared to HEAD:
- smgrsw is now dynamically allocated and grows as new storage
managers are loaded (during shared_preload_libraries)
- CREATE TABLESPACE has new optional syntax USING smgrname (option [, ...])
- tablespace storage is (planned) fully managed by smgr through some
new smgr apisChanges compared to Anastasia's patch:
- extensions do not get to hook and replace the api of the smgr code
directly - they are hidden behind the smgr registry.Successes:
- 0001 passes tests (make check-world)
- 0002 builds without warnings (make)TODO:
- fix dependency failures when catcache is unavailable
- tablespace redo is currently broken with 0002
- fix tests for 0002
- ensure that pg_dump etc. works with the new tablespace storage manager
optionsLooking forward to any comments, suggestions and reviews.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech/)[0]
/messages/by-id/CAP4vRV6JKXyFfEOf=n+v5RGsZywAQ3CTM8ESWvgq+S87Tmgx_g@mail.gmail.com
[1]
/messages/by-id/D365F19F-BC3E-4F96-A91E-8DB13049749E@yandex-team.ru
So, 0002 patch uses the `get_tablespace` function, which searches Catalog
to tablespace SMGR id. I wonder how `smgr_redo` would work with it?
Is it possible to query the system catalog during crash recovery? As far as
i understand the answer is "no", correct me if I'm wrong.
Furthermore, why do we only allow tablespace to have its own SMGR
implementation, can we have per-relation SMGR? Maybe we can do it in a way
similar to custom RMGR (meaning, write SMGR OID into WAL and use it in
crash recovery etc.)?
Import Notes
Reply to msg id not found: CALdSSPg5SvrJLPperW8SkSJhdMfwKrid9GBn_syCYkJP-ZNu5g@mail.gmail.com
On Mon, 4 Dec 2023 at 17:51, Kirill Reshke <reshkekirill@gmail.com> wrote:
So, 0002 patch uses the `get_tablespace` function, which searches Catalog to tablespace SMGR id. I wonder how `smgr_redo` would work with it?
That's a very good point I hadn't considered in detail yet. Quite
clearly, the current code is wrong in assuming that the catalog is
accessible, and it should probably be stored in a way similar to
pg_filenode.map in a file managed outside the buffer pool.
Is it possible to query the system catalog during crash recovery? As far as i understand the answer is "no", correct me if I'm wrong.
Yes, you're correct, we can't access buffers like this during
recovery. That's going to need some more effort.
Furthermore, why do we only allow tablespace to have its own SMGR implementation, can we have per-relation SMGR? Maybe we can do it in a way similar to custom RMGR (meaning, write SMGR OID into WAL and use it in crash recovery etc.)?
AMs (and by extension, their RMGRs) that use Postgres' buffer pool
have control over how they want to layout their blocks and files, but
generally don't care about where those blocks and files are located,
as long as they _can_ be retrieved.
Tablespaces, however, describe 'drives' or 'storage pools' in which
the tables/relations are stored, which to me seems to be the more
logical place to configure the SMGR abstraction of how and where to
store the actual data, as SMGRs manage the low-level relation block IO
(= file accesses), and tablespaces manage where files are stored.
Note that nothing prevents you from using one tablespace (thus
different SMGR) per relation, apart from bloated catalogs and the
superuser permissions required for creating those tablespaces. It'd be
difficult to manage, but not impossible.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On Mon, 4 Dec 2023 at 22:21, Matthias van de Meent <
boekewurm+postgres@gmail.com> wrote:
On Mon, 4 Dec 2023 at 17:51, Kirill Reshke <reshkekirill@gmail.com> wrote:
So, 0002 patch uses the `get_tablespace` function, which searches
Catalog to tablespace SMGR id. I wonder how `smgr_redo` would work with it?
That's a very good point I hadn't considered in detail yet. Quite
clearly, the current code is wrong in assuming that the catalog is
accessible, and it should probably be stored in a way similar to
pg_filenode.map in a file managed outside the buffer pool.Hmm, pg_filenode.map is a nice idea. So, simply maintain TableSpaceOId ->
smgr id mapping in a separate file and update the whole file on any
changes, right?
Looks reasonable to me, but it is clear that this solution can be really
slow in some patterns, like if we create many-many tablespaces(the way you
suggested it in the per-relation SMGR feature). Maybe we can store data in
files somehow separately, and only update one chunk per operation.
Anyway, if we use a `pg_filenode.map` - like solution, we need to reuse its
code infrasture, right? For example, it seems that code that calculates
checksums can be reused.
So, we need to refactor code here, define something like FileMap API maybe.
Or is it not really worth it? We can just write similar code twice.
On Mon, 4 Dec 2023 at 22:03, Kirill Reshke <reshkekirill@gmail.com> wrote:
On Mon, 4 Dec 2023 at 22:21, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
On Mon, 4 Dec 2023 at 17:51, Kirill Reshke <reshkekirill@gmail.com> wrote:
So, 0002 patch uses the `get_tablespace` function, which searches Catalog to tablespace SMGR id. I wonder how `smgr_redo` would work with it?
That's a very good point I hadn't considered in detail yet. Quite
clearly, the current code is wrong in assuming that the catalog is
accessible, and it should probably be stored in a way similar to
pg_filenode.map in a file managed outside the buffer pool.Hmm, pg_filenode.map is a nice idea. So, simply maintain TableSpaceOId -> smgr id mapping in a separate file and update the whole file on any changes, right?
Looks reasonable to me, but it is clear that this solution can be really slow in some patterns, like if we create many-many tablespaces(the way you suggested it in the per-relation SMGR feature). Maybe we can store data in files somehow separately, and only update one chunk per operation.
Yes, but that's a later issue... I'm not sure many-many tablespaces is
actually a good thing. There are already very few reasons to store
tables in more than just the default tablespace. For temporary
relations, there is indeed a guc to automatically put them into one
tablespace; and I can see a similar thing being useful for temporary
relations, too. Then there I can see high-performant local disks vs
lower-performant (but cheaper) local disks also as something
reasonable. But that only gets us to ~6 tablespaces, assuming separate
tablespaces for each combination of (normal, temp, unlogged) * (fast,
cheap). I'm not sure there are many other reasons to add tablespaces,
let alone making one for each table.
Note that you can select which tablespace a table is stored in, so I
see very little reason to actually do something about large numbers of
tablespaces being prohibitively expensive performance-wise.
Why do you want to have a whole new storage configuration for each of
your relations?
Anyway, if we use a `pg_filenode.map` - like solution, we need to reuse its code infrasture, right? For example, it seems that code that calculates checksums can be reused.
So, we need to refactor code here, define something like FileMap API maybe. Or is it not really worth it? We can just write similar code twice.
I'm not sure about that. I really doubt we'll need things that are
that similar: right now, the tablespace->smgr mapping could be
considered to be implied by the symlinks in /pg_tblspc/. Non-MD
tablespaces could add a file <oid>.tblspc that detail their
configuration, which would also fix the issue of spcoid->smgr mapping.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Thought I would show off what is possible with this patchset.
Heikki, a couple of months ago in our internal Slack, said:
[I would like] a debugging tool that checks that we're not missing any
fsyncs. I bumped into a few missing fsync bugs with unlogged tables
lately and a tool like that would've been very helpful.
My task was to create such a tool, and off I went. I started with the
storage manager extension patch that Matthias sent to the list last
year[0]/messages/by-id/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK=PCNWEMrA@mail.gmail.com.
Andres, in another thread[1]/messages/by-id/20220127182838.ba3434dp2pe5vcia@alap3.anarazel.de, said:
I've been thinking that we need a validation layer for fsyncs, it's too hard
to get right without testing, and crash testing is not likel enough to catch
problems quickly / resource intensive.My thought was that we could keep a shared hash table of all files created /
dirtied at the fd layer, with the filename as key and the value the current
LSN. We'd delete files from it when they're fsynced. At checkpoints we go
through the list and see if there's any files from before the redo that aren't
yet fsynced. All of this in assert builds only, of course.
I took this idea and ran with it. I call it the fsync_checker™️. It is an
extension that prints relations that haven't been fsynced prior to
a CHECKPOINT. Note that this idea doesn't work in practice because
relations might not be fsynced, but they might be WAL-logged, like in
the case of createdb. See log_smgrcreate(). I can't think of an easy way
to solve this problem looking at the codebase as it stands.
Here is a description of the patches:
0001:
This is essentially just the patch that Matthias posted earlier, but
rebased and adjusted a little bit so storage managers can "inherit" from
other storage managers.
0002:
This is an extension of 0001, which allows for extensions to set
a global storage manager. This is pretty hacky, and if it was going to
be pulled into mainline, it would need some better protection. For
instance, only one extension should be able to set the global storage
manager. We wouldn't want extensions stepping over each other, etc.
0003:
Adds a hook for extensions to inspect a checkpoint before it actually
occurs. The purpose for the fsync_checker is so that I can iterate over
all the relations the extension tracks to find files that haven't been
synced prior to the completion of the checkpoint.
0004:
This is the actual fsync_checker extension itself. It must be preloaded.
Hopefully this is a good illustration of how the initial patch could be
used, even though it isn't perfect.
[0]: /messages/by-id/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK=PCNWEMrA@mail.gmail.com
[1]: /messages/by-id/20220127182838.ba3434dp2pe5vcia@alap3.anarazel.de
--
Tristan Partin
Neon (https://neon.tech)
Attachments:
v1-0001-Expose-f_smgr-to-extensions-for-manual-implementa.patchtext/x-patch; charset=utf-8; name=v1-0001-Expose-f_smgr-to-extensions-for-manual-implementa.patchDownload+252-132
v1-0002-Allow-extensions-to-override-the-global-storage-m.patchtext/x-patch; charset=utf-8; name=v1-0002-Allow-extensions-to-override-the-global-storage-m.patchDownload+11-3
v1-0003-Add-checkpoint_create_hook.patchtext/x-patch; charset=utf-8; name=v1-0003-Add-checkpoint_create_hook.patchDownload+9-1
v1-0004-Add-contrib-fsync_checker.patchtext/x-patch; charset=utf-8; name=v1-0004-Add-contrib-fsync_checker.patchDownload+278-1
Hi,
Thought I would show off what is possible with this patchset.
[...]
Just wanted to let you know that cfbot doesn't seem to be entirely
happy with the patch [1]http://cfbot.cputube.org/. Please consider submitting an updated
version.
Best regards,
Aleksander Alekseev (wearing co-CFM hat)
Hi,
I reviewed the discussion and took a look at the patch sets. It seems
like many things are combined here. Based on the subject, I initially
thought it aimed to provide the infrastructure to easily extend
storage managers. This would allow anyone to create their own storage
managers using this infrastructure. While it addresses this, it also
includes additional features like fsync_checker, which I believe
should be a separate feature. Even though it might use the same
infrastructure, it appears to be a different functionality. I think we
should focus solely on providing the infrastructure here.
We need to decide on our approach—whether to use a hook-based method
or a registration-based method—and I believe this requires further
discussion.
The hook-based approach is simple and works well for anyone writing
their own storage manager. However, it has its limitations as we can
either use the default storage manager or a custom-built one for all
the work load, but we cannot choose between multiple storage managers.
On the other hand, the registration-based approach allows choosing
between multiple storage managers based on the workload, though it
requires a lot of changes.
Are we planning to support other storage managers in PostgreSQL in the
near future? If not, it is better to go with the hook-based approach.
Otherwise, the registration-based approach is preferable as it offers
more flexibility to users and enhances PostgreSQL’s functionality.
Could you please share your thoughts on this? Also, let me know if
this topic has already been discussed and if any conclusions were
reached.
Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
Show quoted text
On Sat, Jan 13, 2024 at 1:27 AM Tristan Partin <tristan@neon.tech> wrote:
Thought I would show off what is possible with this patchset.
Heikki, a couple of months ago in our internal Slack, said:
[I would like] a debugging tool that checks that we're not missing any
fsyncs. I bumped into a few missing fsync bugs with unlogged tables
lately and a tool like that would've been very helpful.My task was to create such a tool, and off I went. I started with the
storage manager extension patch that Matthias sent to the list last
year[0].Andres, in another thread[1], said:
I've been thinking that we need a validation layer for fsyncs, it's too hard
to get right without testing, and crash testing is not likel enough to catch
problems quickly / resource intensive.My thought was that we could keep a shared hash table of all files created /
dirtied at the fd layer, with the filename as key and the value the current
LSN. We'd delete files from it when they're fsynced. At checkpoints we go
through the list and see if there's any files from before the redo that aren't
yet fsynced. All of this in assert builds only, of course.I took this idea and ran with it. I call it the fsync_checker™️. It is an
extension that prints relations that haven't been fsynced prior to
a CHECKPOINT. Note that this idea doesn't work in practice because
relations might not be fsynced, but they might be WAL-logged, like in
the case of createdb. See log_smgrcreate(). I can't think of an easy way
to solve this problem looking at the codebase as it stands.Here is a description of the patches:
0001:
This is essentially just the patch that Matthias posted earlier, but
rebased and adjusted a little bit so storage managers can "inherit" from
other storage managers.0002:
This is an extension of 0001, which allows for extensions to set
a global storage manager. This is pretty hacky, and if it was going to
be pulled into mainline, it would need some better protection. For
instance, only one extension should be able to set the global storage
manager. We wouldn't want extensions stepping over each other, etc.0003:
Adds a hook for extensions to inspect a checkpoint before it actually
occurs. The purpose for the fsync_checker is so that I can iterate over
all the relations the extension tracks to find files that haven't been
synced prior to the completion of the checkpoint.0004:
This is the actual fsync_checker extension itself. It must be preloaded.
Hopefully this is a good illustration of how the initial patch could be
used, even though it isn't perfect.[0]: /messages/by-id/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK=PCNWEMrA@mail.gmail.com
[1]: /messages/by-id/20220127182838.ba3434dp2pe5vcia@alap3.anarazel.de--
Tristan Partin
Neon (https://neon.tech)
Thank you for your detailed review and insights. I share your view that a
registration-based approach for custom storage managers (smgr) is more
versatile. This method allows for the implementation of custom table access
methods, facilitating the use of various storage services (such as file
services or object storage), different file organization formats (files in
one directory or many sub directories), and flexible deletion logic
(direct deletion or mark-and-sweep).
While I acknowledge that the registration-based approach requires more
modifications, I believe the benefits in terms of extensibility and
functionality are significant. I have seen similar explorations in the
implementation of AO tables in Greenplum, where a dedicated smgr was
created due to its distinct file organization compared to heap tables.
I look forward to further discussions on this topic
Nitin Jadhav <nitinjadhavpostgres@gmail.com> 于2024年12月10日周二 11:25写道:
Show quoted text
Hi,
I reviewed the discussion and took a look at the patch sets. It seems
like many things are combined here. Based on the subject, I initially
thought it aimed to provide the infrastructure to easily extend
storage managers. This would allow anyone to create their own storage
managers using this infrastructure. While it addresses this, it also
includes additional features like fsync_checker, which I believe
should be a separate feature. Even though it might use the same
infrastructure, it appears to be a different functionality. I think we
should focus solely on providing the infrastructure here.We need to decide on our approach—whether to use a hook-based method
or a registration-based method—and I believe this requires further
discussion.The hook-based approach is simple and works well for anyone writing
their own storage manager. However, it has its limitations as we can
either use the default storage manager or a custom-built one for all
the work load, but we cannot choose between multiple storage managers.
On the other hand, the registration-based approach allows choosing
between multiple storage managers based on the workload, though it
requires a lot of changes.Are we planning to support other storage managers in PostgreSQL in the
near future? If not, it is better to go with the hook-based approach.
Otherwise, the registration-based approach is preferable as it offers
more flexibility to users and enhances PostgreSQL’s functionality.Could you please share your thoughts on this? Also, let me know if
this topic has already been discussed and if any conclusions were
reached.Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
MicrosoftOn Sat, Jan 13, 2024 at 1:27 AM Tristan Partin <tristan@neon.tech> wrote:
Thought I would show off what is possible with this patchset.
Heikki, a couple of months ago in our internal Slack, said:
[I would like] a debugging tool that checks that we're not missing any
fsyncs. I bumped into a few missing fsync bugs with unlogged tables
lately and a tool like that would've been very helpful.My task was to create such a tool, and off I went. I started with the
storage manager extension patch that Matthias sent to the list last
year[0].Andres, in another thread[1], said:
I've been thinking that we need a validation layer for fsyncs, it's
too hard
to get right without testing, and crash testing is not likel enough to
catch
problems quickly / resource intensive.
My thought was that we could keep a shared hash table of all files
created /
dirtied at the fd layer, with the filename as key and the value the
current
LSN. We'd delete files from it when they're fsynced. At checkpoints we
go
through the list and see if there's any files from before the redo
that aren't
yet fsynced. All of this in assert builds only, of course.
I took this idea and ran with it. I call it the fsync_checker™️. It is an
extension that prints relations that haven't been fsynced prior to
a CHECKPOINT. Note that this idea doesn't work in practice because
relations might not be fsynced, but they might be WAL-logged, like in
the case of createdb. See log_smgrcreate(). I can't think of an easy way
to solve this problem looking at the codebase as it stands.Here is a description of the patches:
0001:
This is essentially just the patch that Matthias posted earlier, but
rebased and adjusted a little bit so storage managers can "inherit" from
other storage managers.0002:
This is an extension of 0001, which allows for extensions to set
a global storage manager. This is pretty hacky, and if it was going to
be pulled into mainline, it would need some better protection. For
instance, only one extension should be able to set the global storage
manager. We wouldn't want extensions stepping over each other, etc.0003:
Adds a hook for extensions to inspect a checkpoint before it actually
occurs. The purpose for the fsync_checker is so that I can iterate over
all the relations the extension tracks to find files that haven't been
synced prior to the completion of the checkpoint.0004:
This is the actual fsync_checker extension itself. It must be preloaded.
Hopefully this is a good illustration of how the initial patch could be
used, even though it isn't perfect.[0]:
/messages/by-id/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK=PCNWEMrA@mail.gmail.com
[1]:
/messages/by-id/20220127182838.ba3434dp2pe5vcia@alap3.anarazel.de
--
Tristan Partin
Neon (https://neon.tech)
Hi!
We at Percona are very interested in this patch for our transparent data
encryption extension. So we would love to collaborate with you, and
anyone else interested, on making the SMGR extensible.
I have attached rebased and a bit cleaned up versions of Tristan's
patches plus a couple of patches we have been working on in-house
(mainly my colleague Zsolt). I also have some questions which I would
like to discuss.
0001-0004
The same patches as Tristan posted but rebased and cleaned up a bit to
better follow the code style. I also removed a couple of dead variables
which seemed like left overs.
0005
Since we support having both encrypted and unencrypted relations we use
the RelFileLocator to look up if a relation is encrypted. And to
preserve that information when smgrcreate() creates a new relfile for a
relation we pass along the old RelFileLocator.
For our use case it is possible that we could solve this in other ways.
For example if we decide to go with configuring the SMGR per schema this
will probably not be necessary at all.
0006
The patch introduces the concept of "chaining" SMGRs where we have tail
(e.g. md or a theoretical Ceph SMGR) and modifier (e.g. TDE or the
fsync_checker). Something like this would be useful for our case since
it would be nice to be able to use the same encryption code for md and
for some other potential replacement for md which uses some kind object
storage for example.
As a bonus this allowed us to make the functions implementing md static.
It is currently controlled via a GUC, smgr_chain, but this will of
course depend on how we decide to implement configuring which SMGR to use.
Questions
- What is up with the barrier when loading SMGRs? That does not seem
necessary or am I missing something? I believe Andres also spotted this.
- How should we configure which SMGR to use for each relation? People
have talked about doing it per tablespace or using hooks and we have a
patch which uses a GUC for this. I have personally not researched these
options enough to have an opinion yet.
- Is our idea about chaining SMGRs useful? In its current form or some
variant inspired by it?
- We need to benchmark this to make sure we do not introduce too much
overhead, especially for people who just want to use md. I saw for
example that Andres had some complaint about extra indirection which we
may have to address.
Andreas
Attachments:
v3-0001-Expose-f_smgr-to-extensions-for-manual-implementa.patchtext/x-patch; charset=UTF-8; name=v3-0001-Expose-f_smgr-to-extensions-for-manual-implementa.patchDownload+266-142
v3-0002-Allow-extensions-to-override-the-global-storage-m.patchtext/x-patch; charset=UTF-8; name=v3-0002-Allow-extensions-to-override-the-global-storage-m.patchDownload+7-2
v3-0003-Add-checkpoint_create_hook.patchtext/x-patch; charset=UTF-8; name=v3-0003-Add-checkpoint_create_hook.patchDownload+9-1
v3-0004-Add-contrib-fsync_checker.patchtext/x-patch; charset=UTF-8; name=v3-0004-Add-contrib-fsync_checker.patchDownload+282-1
v3-0005-Refactor-smgr-API-mdcreate-needs-the-old-relfilel.patchtext/x-patch; charset=UTF-8; name=v3-0005-Refactor-smgr-API-mdcreate-needs-the-old-relfilel.patchDownload+29-26
v3-0006-SMGR-GUC-variable-and-chaining.patchtext/x-patch; charset=UTF-8; name=v3-0006-SMGR-GUC-variable-and-chaining.patchDownload+448-162
On 9/21/24 8:24 PM, Nitin Jadhav wrote:
I reviewed the discussion and took a look at the patch sets. It seems
like many things are combined here. Based on the subject, I initially
thought it aimed to provide the infrastructure to easily extend
storage managers. This would allow anyone to create their own storage
managers using this infrastructure. While it addresses this, it also
includes additional features like fsync_checker, which I believe
should be a separate feature. Even though it might use the same
infrastructure, it appears to be a different functionality. I think we
should focus solely on providing the infrastructure here.
I personally think that it is fine that there are patches which provide
a PoC implementation of the new API. It is hard to verify if an API is
correct if there are zero alternative implementations. And there is also
a case to be made for having one of them in contrib just to make it hard
for us to break the API for external users.
That said I have not made up my mind yet if this is a good extension for
contrib.
We need to decide on our approach—whether to use a hook-based method
or a registration-based method—and I believe this requires further
discussion.
100% agreed.
The hook-based approach is simple and works well for anyone writing
their own storage manager. However, it has its limitations as we can
either use the default storage manager or a custom-built one for all
the work load, but we cannot choose between multiple storage managers.
On the other hand, the registration-based approach allows choosing
between multiple storage managers based on the workload, though it
requires a lot of changes.Are we planning to support other storage managers in PostgreSQL in the
near future? If not, it is better to go with the hook-based approach.
Otherwise, the registration-based approach is preferable as it offers
more flexibility to users and enhances PostgreSQL’s functionality.Could you please share your thoughts on this? Also, let me know if
this topic has already been discussed and if any conclusions were
reached.
I do not think there is any plan for core to support multiple storage
managers, but there are open source thrid party extensions which plan to
implement this API once it has been merged.
Andreas
Hi,
Here is a rebased version of it to make the CI happy. I plan to work
more on this next week but am happy with any feedback on what is already
there.
Andreas
Attachments:
v4-0006-SMGR-GUC-variable-and-chaining.patchtext/x-patch; charset=UTF-8; name=v4-0006-SMGR-GUC-variable-and-chaining.patchDownload+448-162
v4-0005-Refactor-smgr-API-mdcreate-needs-the-old-relfilel.patchtext/x-patch; charset=UTF-8; name=v4-0005-Refactor-smgr-API-mdcreate-needs-the-old-relfilel.patchDownload+29-26
v4-0004-Add-contrib-fsync_checker.patchtext/x-patch; charset=UTF-8; name=v4-0004-Add-contrib-fsync_checker.patchDownload+280-1
v4-0003-Add-checkpoint_create_hook.patchtext/x-patch; charset=UTF-8; name=v4-0003-Add-checkpoint_create_hook.patchDownload+9-1
v4-0002-Allow-extensions-to-override-the-global-storage-m.patchtext/x-patch; charset=UTF-8; name=v4-0002-Allow-extensions-to-override-the-global-storage-m.patchDownload+7-2
v4-0001-Expose-f_smgr-to-extensions-for-manual-implementa.patchtext/x-patch; charset=UTF-8; name=v4-0001-Expose-f_smgr-to-extensions-for-manual-implementa.patchDownload+266-142
On Fri, 7 Mar 2025 at 16:52, Andreas Karlsson <andreas@proxel.se> wrote:
Hi,
Hi!
Here is a rebased version of it to make the CI happy.
Looks like CI is still unhappy with this change[0]https://cirrus-ci.com/task/6466113875214336 -- Best regards, Kirill Reshke
0001:
+ +SMgrId +smgr_register(const f_smgr *smgr, Size smgrrelation_size)
...
+ Assert(smgr->smgr_open != NULL); + Assert(smgr->smgr_close != NULL); + Assert(smgr->smgr_create != NULL); + Assert(smgr->smgr_exists != NULL); + Assert(smgr->smgr_unlink != NULL); + Assert(smgr->smgr_extend != NULL); + Assert(smgr->smgr_zeroextend != NULL); + Assert(smgr->smgr_prefetch != NULL); + Assert(smgr->smgr_readv != NULL); + Assert(smgr->smgr_writev != NULL); + Assert(smgr->smgr_writeback != NULL); + Assert(smgr->smgr_nblocks != NULL); + Assert(smgr->smgr_truncate != NULL); + Assert(smgr->smgr_immedsync != NULL);
Are we sure we need to force extension authors to implement prefetch?
Also, do we intentionally skip Assert on smgr_registersync and
smgr_init here? I am not questioning smgr_shutdown here, as I can see
it is NULL for md implementation.
0002:
should we merge this with 0001?
0003: Looks mature, no comments.
0004:
It's a bit strange to place fsync_checker under contrib, huh? Like,
you will never use it in production. Maybe src/test/modules is a
better place?
0005:
We are missing rationale for this change in the commit message.
I didn't look at the 0006 modifications. Later, I'll try to take another look.
[0]: https://cirrus-ci.com/task/6466113875214336 -- Best regards, Kirill Reshke
--
Best regards,
Kirill Reshke
On Fri, 7 Mar 2025 at 17:22, Andreas Karlsson <andreas@proxel.se> wrote:
Hi,
Here is a rebased version of it to make the CI happy. I plan to work
more on this next week but am happy with any feedback on what is already
there.
I noticed that Kirill's comments from [1]/messages/by-id/CALdSSPimrJWeex1RbvVXoGCROLiC6VgKUdEE0pUcib=GNYo58g@mail.gmail.com are not yet addressed, I
have changed the commitfest entry status to "Waiting on Author. Please
address them and change the status to "Needs Review".
[1]: /messages/by-id/CALdSSPimrJWeex1RbvVXoGCROLiC6VgKUdEE0pUcib=GNYo58g@mail.gmail.com
Regards,
Vignesh
Hello!
I rebased the patch and addressed all comments.
Are we sure we need to force extension authors to implement prefetch?
Also, do we intentionally skip Assert on smgr_registersync and
smgr_init here? I am not questioning smgr_shutdown here, as I can see
it is NULL for md implementation.
Added the new asserts, also for the new startreadv.
I don't see a problem with requiring implementing prefetch, the
implementing function can be empty.
0002:
should we merge this with 0001?
Done!
0004:
It's a bit strange to place fsync_checker under contrib, huh? Like,
you will never use it in production. Maybe src/test/modules is a
better place?
I moved it to the test folder. In a later version I'll also modify it
to something more suited for actual testing - for now it's just a
simple example use.
0005:
We are missing rationale for this change in the commit message.
I added a more detailed commit message, which hopefully explains why
this change could be useful.
Attachments:
v5-0002-Add-checkpoint_create_hook.patchapplication/octet-stream; name=v5-0002-Add-checkpoint_create_hook.patchDownload+9-1
v5-0001-Expose-f_smgr-to-extensions-for-manual-implementa.patchapplication/octet-stream; name=v5-0001-Expose-f_smgr-to-extensions-for-manual-implementa.patchDownload+285-152
v5-0004-Refactor-smgr-API-mdcreate-needs-the-old-relfilel.patchapplication/octet-stream; name=v5-0004-Refactor-smgr-API-mdcreate-needs-the-old-relfilel.patchDownload+516-204
v5-0003-Add-src-test-modules-fsync_checker.patchapplication/octet-stream; name=v5-0003-Add-src-test-modules-fsync_checker.patchDownload+281-1
Hello!
I made a mistake during my previous rebase, I fixed that in the new v6 version.
I also added one more patch, which makes AIO callbacks (for the
startreadv functions) extensible. Without that the "read" part of the
SMGR extensibility isn't that useful, as most read calls already use
the new AIO infrastructure.
To showcase another open source use case for this extension other than
pg_tde, I also created a new simple extension, pg_smgrstat, which
provides file level IO statistics using this API:
https://github.com/dutow/pg_smgrstat/
I'm also experimenting with another extension on top of these
statistics, which implements automatic offloading to slower storage
and back, based on the usage patterns of the file. I'll share that
along with later patch updates when it's ready.