Skip unregistered custom kinds on stats load

Started by Sami Imseih5 months ago13 messages

samimseih@gmail.com

5 months ago

Hi,

While working on something nearby, I observed that if a custom
pgstats kind is not registered while the on-disk pgstats.stat file
contains entries for this custom kind, it results in all of the stats,
including the built-in ones, being discarded.

This can happen if you initially register this custom kind, cleanly
shut down the database, then start it up without the kind being
registered (i.e. not including the extension that loads this custom
kind in shared_preload_libraries).

This can be observed in 001_stat.pl in the injection_points tests,
where the last $node->start shows a corrupted pgstat.stat file
message because "injection_points_fixed" and "injection_points"
were removed from shared_preload_libraries after a clean shutdown.

```
2025-10-20 18:38:48.664 UTC startup[93662] LOG: database system was
shut down at 2025-10-20 18:38:48 UTC
2025-10-20 18:38:48.665 UTC startup[93662] WARNING: could not find
information of kind 26 for entry of type F
2025-10-20 18:38:48.665 UTC startup[93662] LOG: corrupted statistics
file "pg_stat/pgstat.stat"
2025-10-20 18:38:48.666 UTC postmaster[93643] LOG: database system is
ready to accept connections
2025-10-20 18:38:48.750 UTC postmaster[93643] LOG: received immediate
shutdown request
2025-10-20 18:38:48.757 UTC postmaster[93643] LOG: database system is shut down
```

While I do not think this is a bug, I also do not think this is not the
ideal behavior. Unlike built-in stats, custom stats are not guaranteed
to be registered, and completely resetting all stats in that edge case
seems too strict.

I think this can be better handled by tracking the lengths of the
kinds in the pgstat.stat file when saving, and re-reading the lengths
on startup. This would allow skipping past any unrecognized custom
data. We need to save the lengths since we have no way to know how
much to skip when re-reading. The data can be saved at the start of
the file, after the FORMAT, so the extra space required is minimal.

With the attached patch, this is the warning message instead.

```
2025-10-20 18:25:51.763 UTC startup[47833] LOG: database system was
shut down at 2025-10-20 18:25:51 UTC
2025-10-20 18:25:51.764 UTC startup[47833] WARNING: found unregistered
custom stats kind 26 for entry of type F, skipping
2025-10-20 18:25:51.765 UTC postmaster[47814] LOG: database system is
ready to accept connections
2025-10-20 18:25:51.848 UTC postmaster[47814] LOG: received immediate
shutdown request
2025-10-20 18:25:51.855 UTC postmaster[47814] LOG: database system is shut down
```

Thoughts?

--
Sami Imseih
Amazon Web Services (AWS)

Michael Paquier

michael@paquier.xyz

5 months ago

In reply to: Sami Imseih (#1)

Re: Skip unregistered custom kinds on stats load

On Mon, Oct 20, 2025 at 01:39:37PM -0500, Sami Imseih wrote:

I think this can be better handled by tracking the lengths of the
kinds in the pgstat.stat file when saving, and re-reading the lengths
on startup. This would allow skipping past any unrecognized custom
data. We need to save the lengths since we have no way to know how
much to skip when re-reading. The data can be saved at the start of
the file, after the FORMAT, so the extra space required is minimal.

+    /* Write the lengths of all stats kinds */
+    for (PgStat_Kind kind = PGSTAT_KIND_MIN; kind <= PGSTAT_KIND_MAX; kind++)
+    {
+        size_t        kind_len = 0;
+
+        const PgStat_KindInfo *kind_info = pgstat_get_kind_info(kind);
+
+        if (kind_info)
+            kind_len = pgstat_get_entry_len(kind);
+
+        kind_lengths[kind - 1] = kind_len;
+        write_chunk_s(fpout, &kind_len);
+    }

The fun does not stop there. This also means that the data tracked in
the file becomes incorrect if a server tries to reuse the same ID
across two restarts with a different kind attached to them. The point
is that giving up is kind of always the safest bet and acts as a
sanity measure, giving priority to the availability of the server.

Perhaps we should document better all these properties, meaning that
keeping a file compatible requires the stats kind to be loaded, even
if it means replacing a past one with an "idle" kind, that could
discard past stats that are not supported anymore so as the same stats
kind can be reused in the future (that could be also settable as a GUC
defined by the library loaded with shared_preload_libraries from where
the custom stats kind originates).
--
Michael

Sami Imseih

samimseih@gmail.com

5 months ago

In reply to: Michael Paquier (#2)

Re: Skip unregistered custom kinds on stats load

The fun does not stop there. This also means that the data tracked in
the file becomes incorrect if a server tries to reuse the same ID
across two restarts with a different kind attached to them. The point
is that giving up is kind of always the safest bet and acts as a
sanity measure, giving priority to the availability of the server.

hmm, that is a different case, right? When we restart with a
different kind ( different struct ) but the same kind ID, we
would find the kind when reading but it may have a different
length, and in that case we will fail reading the entry, and
actually we will have a legitimate corrupt stats file in that
case. Here is what I see in that case.

```
2025-10-20 19:41:14.365 CDT [36636] WARNING: could not read entry of type
2025-10-20 19:41:14.365 CDT [36636] LOG: corrupted statistics file
"pg_stat/pgstat.stat"
```

The more worrying case is if the struct of this other kind has
the same length, and the data ends up being read back
into the wrong fields.

[0]: https://github.com/postgres/postgres/blob/master/src/backend/utils/activity/pgstat.c#L1851-L1868

--
Sami

Sami Imseih

samimseih@gmail.com

5 months ago

In reply to: Sami Imseih (#3)

Re: Skip unregistered custom kinds on stats load

I hit send too fast. Link [0] referenced above was just pointing
out the part of the code that reads a chunk of data using the
length of the kind entry.

--
Sami

Michael Paquier

michael@paquier.xyz

5 months ago

In reply to: Sami Imseih (#3)

Re: Skip unregistered custom kinds on stats load

On Mon, Oct 20, 2025 at 07:54:09PM -0500, Sami Imseih wrote:

The more worrying case is if the struct of this other kind has
the same length, and the data ends up being read back
into the wrong fields.

More to the point. If I were to put defenses when reading back a
stats file that includes custom stats kinds before reading the file
entries and checking if its contents are OK to be used by the server,
validating the size of the on-disk entries is not sufficient. We
should write down and cross-check more data when reading them back
with what has been loaded by the postmaster:
- Name of custom callback, that should match.
- Its reserved ID.
- Written size on disk, to ensure that what is read from the file is
OK to load.
- Likely if it is a fixed_amount or not, not really mandatory IMO.

Then, based on if these contents match with what has been loaded from
s_p_l, we could issue a harder failure that prevents startup and not
throw about the stats file. If one needs to ditch some stats, they
would still need to load a dummy version of the custom stats kind they
want to drop, dropping the entries on read. That's still an existing
requirement, or have a tool that's able to read, filter and rewrite an
existing stats file (refactoring pgstat_shmem.c to expose some of its
routines for this purpose has been on my TODO list of things for some
time).

The existing implementation is a choice of simplicity, prioritizing
availability, FWIW, to keep the contents of the on-disk file simpler.
Being able for an extension developer to drop stats in an existing
file would still need to be handled on the developer side. So I am
not really convinced that this is something we need to treat that as a
problem in search of a solution.
--
Michael

Sami Imseih

samimseih@gmail.com

5 months ago

In reply to: Michael Paquier (#5)

Re: Skip unregistered custom kinds on stats load

The more worrying case is if the struct of this other kind has
the same length, and the data ends up being read back
into the wrong fields.

More to the point. If I were to put defenses when reading back a
stats file that includes custom stats kinds before reading the file
entries and checking if its contents are OK to be used by the server,
validating the size of the on-disk entries is not sufficient. We
should write down and cross-check more data when reading them back
with what has been loaded by the postmaster:
- Name of custom callback, that should match.
- Its reserved ID.
- Written size on disk, to ensure that what is read from the file is
OK to load.
- Likely if it is a fixed_amount or not, not really mandatory IMO.

The proposal is not about validating that the data being read back
belongs to the same registered kind. It is about ensuring that if a
kind is not registered, we simply skip that portion of the
pgstat.stat file. It's a very narrow case, perhaps too narrow to
be worthwhile.

Then, based on if these contents match with what has been loaded from
s_p_l, we could issue a harder failure that prevents startup and not
throw about the stats file. If one needs to ditch some stats, they
would still need to load a dummy version of the custom stats kind they
want to drop, dropping the entries on read. That's still an existing
requirement, or have a tool that's able to read, filter and rewrite an
existing stats file (refactoring pgstat_shmem.c to expose some of its
routines for this purpose has been on my TODO list of things for some
time).

Hmm, since the pgstat.stat file is managed by core and not by the
extension, I think this should be handled semi-automatically by
pgstats. Even with the checks you mention above, validating that we
are indeed loading the same kind will never be totally foolproof. I
think we can validate kinds better by adding some MAGIC NUMBER that is
set by the extension during registration and tracked in pgstats.stat.
If the kind ID
and MAGIC NUMBER (for example, FORMAT ID) match, then we have more of
a guarantee that this is the same extension. Also, this is probably a
good idea to support extension upgrades.

--
Sami

Michael Paquier

michael@paquier.xyz

5 months ago

In reply to: Sami Imseih (#6)

Re: Skip unregistered custom kinds on stats load

On Tue, Oct 21, 2025 at 01:26:30PM -0500, Sami Imseih wrote:

Hmm, since the pgstat.stat file is managed by core and not by the
extension, I think this should be handled semi-automatically by
pgstats. Even with the checks you mention above, validating that we
are indeed loading the same kind will never be totally foolproof. I
think we can validate kinds better by adding some MAGIC NUMBER that is
set by the extension during registration and tracked in pgstats.stat.
If the kind ID
and MAGIC NUMBER (for example, FORMAT ID) match, then we have more of
a guarantee that this is the same extension. Also, this is probably a
good idea to support extension upgrades.

Probably. Still, it does not feel like something we need to push
strongly in a given direction, because we may pay in flexibility
compared to what's on HEAD.

With a single kind ID forcing how large a fixed-sized entry or one
variable-sized entry should be based on what kind of data is loaded,
versioning something based on a single kind is impossible (or the same
size is used, leading to different contents translated). Of course,
one could use multiple kind IDs if they want to enforce different
stats versions in a single major version of the backend. To me, this
property comes down that the version of the pgstats file cannot change
in a single major version, implying that the same rule is also pushed
to the custom kinds: if you begin to use a kind ID in a given major
version, do not change the stats format, limitting new changes to a
new major version. Or an upgrade format path to use is something an
extension needs to handle on its own. I'd bet that stats upgrade
requirements for the same major version are not going to be that
common, the future may tell a different story depending on the use
cases that pop.
--
Michael

Alvaro Herrera

alvherre@2ndquadrant.com

5 months ago

In reply to: Michael Paquier (#5)

Re: Skip unregistered custom kinds on stats load

On 2025-Oct-21, Michael Paquier wrote:

The existing implementation is a choice of simplicity, prioritizing
availability, FWIW, to keep the contents of the on-disk file simpler.

FWIW I don't necessarily agree with what Sami is proposing, but (if I
understand the situation correctly) I think throwing away the whole
contents of stats when a module is removed is the opposite of
availability, as it will lead to autovacuum lacking data and thus bloat
problems. This is of course always problematic after a crash (unless
somebody has fixed this already?) but at least it is supposed to be a
known issue. With these custom stat kind thingies, nobody knows that
this problem exists.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Por suerte hoy explotó el califont porque si no me habría muerto
de aburrido" (Papelucho)

Michael Paquier

michael@paquier.xyz

5 months ago

In reply to: Alvaro Herrera (#8)

Re: Skip unregistered custom kinds on stats load

On Wed, Oct 22, 2025 at 08:44:27AM +0300, Alvaro Herrera wrote:

FWIW I don't necessarily agree with what Sami is proposing, but (if I
understand the situation correctly) I think throwing away the whole
contents of stats when a module is removed is the opposite of
availability, as it will lead to autovacuum lacking data and thus bloat
problems. This is of course always problematic after a crash (unless
somebody has fixed this already?) but at least it is supposed to be a
known issue.

A first part here is something that Bertrand Drouvot has been working
on: being able to rebuild the table stats during WAL replay requires
us to move the stats to be relfilenode-based, so as the startup could
know where to recreate some of the numbers lost during crash recovery
from WAL, giving autovacuum a safety net.

With these custom stat kind thingies, nobody knows that this problem
exists.

Noted.
--
Michael

#10

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

5 months ago

In reply to: Michael Paquier (#9)

Re: Skip unregistered custom kinds on stats load

Hi,

On Wed, Oct 22, 2025 at 02:51:31PM +0900, Michael Paquier wrote:

A first part here is something that Bertrand Drouvot has been working
on: being able to rebuild the table stats during WAL replay requires
us to move the stats to be relfilenode-based, so as the startup could
know where to recreate some of the numbers lost during crash recovery
from WAL, giving autovacuum a safety net.

Yeah, I resumed working on it recently and should be able to share a patch
next week (based on what we discussed in [1]/messages/by-id/aN07p71KNFR2HdaD@ip-10-97-1-34.eu-west-3.compute.internal).

Next week patch will be the first step i.e replace (i.e get rid) PGSTAT_KIND_RELATION
by a brand new PGSTAT_KIND_RELFILENODE and move all the existing stats that are
currently under the PGSTAT_KIND_RELATION to this new PGSTAT_KIND_RELFILENODE.

It will do this by keeping the pg_stat_all_tables|indexes and pg_statio_all_tables|indexes
on top of the PGSTAT_KIND_RELFILENODE and ensure that a relation rewrite keeps
those stats.

Next step will be that, once done, we could work from there to add new stats (add
writes counters (heap_blks_written and friends) and ensure that some counters
(n_dead_tup and friends) are replicated).

[1]: /messages/by-id/aN07p71KNFR2HdaD@ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#11

Sami Imseih

samimseih@gmail.com

5 months ago

In reply to: Michael Paquier (#7)

Re: Skip unregistered custom kinds on stats load

On Tue, Oct 21, 2025 at 01:26:30PM -0500, Sami Imseih wrote:

Hmm, since the pgstat.stat file is managed by core and not by the
extension, I think this should be handled semi-automatically by
pgstats. Even with the checks you mention above, validating that we
are indeed loading the same kind will never be totally foolproof. I
think we can validate kinds better by adding some MAGIC NUMBER that is
set by the extension during registration and tracked in pgstats.stat.
If the kind ID
and MAGIC NUMBER (for example, FORMAT ID) match, then we have more of
a guarantee that this is the same extension. Also, this is probably a
good idea to support extension upgrades.

Probably. Still, it does not feel like something we need to push
strongly in a given direction, because we may pay in flexibility
compared to what's on HEAD.

With a single kind ID forcing how large a fixed-sized entry or one
variable-sized entry should be based on what kind of data is loaded,
versioning something based on a single kind is impossible (or the same
size is used, leading to different contents translated). Of course,
one could use multiple kind IDs if they want to enforce different
stats versions in a single major version of the backend. To me, this
property comes down that the version of the pgstats file cannot change
in a single major version, implying that the same rule is also pushed
to the custom kinds: if you begin to use a kind ID in a given major
version, do not change the stats format, limitting new changes to a
new major version. Or an upgrade format path to use is something an
extension needs to handle on its own. I'd bet that stats upgrade
requirements for the same major version are not going to be that
common, the future may tell a different story depending on the use
cases that pop.

OK, fair enough. I don't have enough compelling use-cases to push
back harder on this. Initially, my focus was on the fact that unlike
built-in stats, custom stats may not have a registered kind and
disposing of all the stats in that case did not seem like the appropriate
behavior, or least too strict of a behavior.

But also, stats themselves are for the most part are consumed
in a way that can tolerate stats being reset, because most of the
time a user is taking deltas of stats between snapshots.

The autovacuum case is special, and the work that Bertrand is
doing will be beneficial to ensure that a/v has the correct stats
after crash recovery.

Sami

#12

Michael Paquier

michael@paquier.xyz

5 months ago

In reply to: Sami Imseih (#11)

Re: Skip unregistered custom kinds on stats load

On Wed, Oct 22, 2025 at 03:57:15PM -0500, Sami Imseih wrote:

OK, fair enough. I don't have enough compelling use-cases to push
back harder on this. Initially, my focus was on the fact that unlike
built-in stats, custom stats may not have a registered kind and
disposing of all the stats in that case did not seem like the appropriate
behavior, or least too strict of a behavior.

Two things that may be useful here, as long as I have this problem in
mind..

First, could it could be useful to have an in-core function able to
remove from shared memory all the entries related to a specific stats
kind, taking in input a kind ID? At least for variable-sized data in
the shared hashtable, that could become useful.

A second potential improvement would be to bypass the write of a
fixed-numbered stats kind if we detect that nothing needs to be
written because there is no data, after a reset or if we detect it as
full of zeroes. The format of the file with PGSTAT_FILE_ENTRY_FIXED
makes this flexibility possible, and we could extend the snapshot_cb
and pgstat_build_snapshot_fixed() so as they return a boolean flag to
let the upper layer decide if some data should be written or not.

If you combine the first system function with resets and the tweak for
the fixed-numbered stats, then one could reset/remove the stats, drop
a custom kind from s_p_l, and still preserve the rest of the stats.
That's more steps than doing all that automatically, but that feels
also a bit more bullet-proof if one wants to "upgrade" a stats kind by
reusing the same kind ID across two restarts.
--
Michael

#13

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

4 months ago

In reply to: Bertrand Drouvot (#10)

Re: Skip unregistered custom kinds on stats load

Hi,

On Wed, Oct 22, 2025 at 02:41:42PM +0000, Bertrand Drouvot wrote:

Hi,

On Wed, Oct 22, 2025 at 02:51:31PM +0900, Michael Paquier wrote:

A first part here is something that Bertrand Drouvot has been working
on: being able to rebuild the table stats during WAL replay requires
us to move the stats to be relfilenode-based, so as the startup could
know where to recreate some of the numbers lost during crash recovery
from WAL, giving autovacuum a safety net.

Yeah, I resumed working on it recently and should be able to share a patch
next week (based on what we discussed in [1]).

FWIW I changed the design so expect some delay. Will share the design decision
with the patch(es) once ready.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Skip unregistered custom kinds on stats load

Attachments: