Vacuumlo improvements
I'd like to discuss a behavior in the vacuumlo utility that can lead
to silent data loss when large object references are stored in columns
whose type is a domain over `oid` or `lo`. While fully stated in the
docs, we have observed users getting some surprises when they are
trying to do routine maintenance. I'll attach a very simple repro
that displays the behavior using
a fairly routine use of a domain. [1]attached vacuumlo_repro.txt
DESCRIPTION
---------------------
The vacuumlo documentation [2]https://www.postgresql.org/docs/current/vacuumlo.html states:
"Only types with these names are considered; in particular, domains
over them are not considered."
While the behavior is documented, in the field, the consequence is
severe: if a user creates a domain over `oid` (e.g., for semantic
clarity or to add constraints) and uses that domain-typed column to
store large object OIDs, vacuumlo will treat those LOs as orphaned and
delete them. The referenced data is silently destroyed.
This is particularly dangerous because:
- Domains over base types are a popular PostgreSQL practice
- There is no warning or diagnostic output from vacuumlo when it skips
domain-typed columns
- The --dry-run / -n flag will show these LOs as "would be removed"
but nothing indicates *why* they appear orphaned
- The data loss is irreversible
SUGGESTION
---------------------
Ideally, vacuumlo could be improved to:
- Resolve domain types back to their base types when scanning columns
(using pg_type.typbasetype), or
- At least emit a WARNING when it encounters columns with domains over
oid/lo that it is skipping, so the user is aware.
I don't currently have a patch attached, but wanted to shine light on
the issue given the silent data-loss risk. I know LO's are a sensitive
topic with discussions that wander towards deprecation, but they seem
to be here to stay and are very commonly used in the field.
At minimum, I can submit a documentation improvement to make the
data-loss risk more prominent. The current parenthetical note is easy
to miss.
[1]: attached vacuumlo_repro.txt
[2]: https://www.postgresql.org/docs/current/vacuumlo.html
Thanks,
Shawn
Attachments:
vacuumlo_repro.txttext/plain; charset=US-ASCII; name=vacuumlo_repro.txtDownload
On Tue, May 12, 2026 at 11:34:10AM -0600, Shawn McCoy wrote:
Ideally, vacuumlo could be improved to:
- Resolve domain types back to their base types when scanning columns
(using pg_type.typbasetype), or
- At least emit a WARNING when it encounters columns with domains over
oid/lo that it is skipping, so the user is aware.
Commit 64c604898e added the note about domains to the docs. Unfortunately,
neither that nor the corresponding thread [0]/messages/by-id/BAY164-W265A089BD32F8901A686C9FF430@phx.gbl offer any clues as to why
vacuumlo doesn't resolve domains. The commit history for vacuumlo has been
pretty quiet for a long time, so maybe it's just been overlooked.
At minimum, I can submit a documentation improvement to make the
data-loss risk more prominent. The current parenthetical note is easy
to miss.
Improving the documentation seems reasonable, too. Another thing we could
explore is allowing users to specify which tables/columns refer to LOs,
perhaps with a user-provided query. One wrinkle is that dblink allows
specifying multiple databases, and presumably each database will be a
little different.
Separately, do you know whether users are using lo_manage() at all? And if
not, why?
[0]: /messages/by-id/BAY164-W265A089BD32F8901A686C9FF430@phx.gbl
--
nathan
Ideally, vacuumlo could be improved to:
- Resolve domain types back to their base types when scanning columns
(using pg_type.typbasetype), or
- At least emit a WARNING when it encounters columns with domains over
oid/lo that it is skipping, so the user is aware.Commit 64c604898e added the note about domains to the docs. Unfortunately,
neither that nor the corresponding thread [0] offer any clues as to why
vacuumlo doesn't resolve domains. The commit history for vacuumlo has been
pretty quiet for a long time, so maybe it's just been overlooked.At minimum, I can submit a documentation improvement to make the
data-loss risk more prominent. The current parenthetical note is easy
to miss.Improving the documentation seems reasonable, too.
+1 to documentation that calls out the risk of data-loss.
Another thing we could explore is allowing users to specify which
tables/columns refer to LOs,
perhaps with a user-provided query. One wrinkle is that dblink allows
specifying multiple databases, and presumably each database will be a
little different.
Separately, do you know whether users are using lo_manage() at all? And if
not, why?
I think recommending the use of the LO extension [1]https://www.postgresql.org/docs/current/lo.html in the core large object
documentation is a good start. Ideally, a user should not have to run vacuumlo.
Using the LO extension, a user can use lo_manage for simple types ( or domain
over simple types ) or if they have a more complex situation, like a composite
type holding an LO, they can use a custom trigger.
In the case of TRUNCATE, since per-row triggers don't fire. But even
that can be handled with a statement level BEFORE TRUNCATE trigger that scans
and unlinks. vacuumlo then becomes a cleanup tool for legacy schemas, not a
routine requirement.
All to say, we should be steering the users towards this extension with more
recommendations, perhaps.
[1]: https://www.postgresql.org/docs/current/lo.html
--
Sami Imseih
Amazon Web Services (AWS)
On Tue, May 12, 2026 at 10:00:15PM -0500, Nathan Bossart wrote:
Commit 64c604898e added the note about domains to the docs. Unfortunately,
neither that nor the corresponding thread [0] offer any clues as to why
vacuumlo doesn't resolve domains. The commit history for vacuumlo has been
pretty quiet for a long time, so maybe it's just been overlooked.
It seems to be relatively easy to teach vacuumlo to handle domains over
oid. Note that you need a recursive query because you can have domains
over domains. Please test it out. I noticed that vacuumlo's tests are
pretty sad, so this might be a good opportunity to change that.
--
nathan
Attachments:
v1-0001-teach-vacuumlo-to-handle-domains-over-oid.patchtext/plain; charset=us-asciiDownload+13-2
Commit 64c604898e added the note about domains to the docs. Unfortunately,
neither that nor the corresponding thread [0] offer any clues as to why
vacuumlo doesn't resolve domains. The commit history for vacuumlo has been
pretty quiet for a long time, so maybe it's just been overlooked.It seems to be relatively easy to teach vacuumlo to handle domains over
oid. Note that you need a recursive query because you can have domains
over domains. Please test it out.
I think there is value in expanding the vacuumlo search capability for
LOs and OIDs.
We can also detect LOs and OIDs stored in composite types, or OID[] and LO[].
All these are detectable from the catalog.
The one complexity will be we will need vacuumlo to generate more complex
expressions for deleting the data.
DELETE FROM t WHERE lo IN (SELECT ("data")."lo_ref" FROM t_lo);
But, this will be more comprehensive and can cover all potential ways
an OID or LO can be used.
What do you think?
Please test it out. I noticed that vacuumlo's tests are
pretty sad, so this might be a good opportunity to change that.
More tests will be needed for sure.
But with all this done, I am not sure how much this moves the needle. It may
somewhat, but it's hard to tell how much.
I know I have seen users store LO references in text or other types,
so I think we still need the documentation enhancement to call out
the "data loss" potential.
I also think it will be good for the LO documentation [1]https://www.postgresql.org/docs/current/lo.html to nudge the users
to think about using the LO extension, as is done with the vacuumlo [2]https://www.postgresql.org/docs/current/vacuumlo.html
documentation.
[1]: https://www.postgresql.org/docs/current/lo.html
[2]: https://www.postgresql.org/docs/current/vacuumlo.html
--
Sami
On Fri, May 15, 2026 at 02:36:22PM -0500, Sami Imseih wrote:
I think there is value in expanding the vacuumlo search capability for
LOs and OIDs. We can also detect LOs and OIDs stored in composite types,
or OID[] and LO[]. All these are detectable from the catalog.The one complexity will be we will need vacuumlo to generate more complex
expressions for deleting the data.DELETE FROM t WHERE lo IN (SELECT ("data")."lo_ref" FROM t_lo);
But, this will be more comprehensive and can cover all potential ways
an OID or LO can be used.What do you think?
It seems worth exploring.
But with all this done, I am not sure how much this moves the needle. It may
somewhat, but it's hard to tell how much. I know I have seen users store
LO references in text or other types, so I think we still need the
documentation enhancement to call out the "data loss" potential.I also think it will be good for the LO documentation [1] to nudge the users
to think about using the LO extension, as is done with the vacuumlo [2]
documentation.
Yeah, I think we'll have to do some combination of 1) improving vacuumlo,
2) improving the documentation to warn users about things vacuumlo doesn't
catch, and 3) improving the documentation to nudge users toward the lo
extension..
--
nathan