Show various offset arrays for heap WAL records

Started by Andres Freundover 3 years ago49 messageshackers
Jump to latest
#1Andres Freund
andres@anarazel.de

Hi,

A couple times when investigating data corruption issues, the last time just
yesterday in [1]/messages/by-id/CANtu0ojby3eBdMXfs4QmS+K1avBc7NcRq_Ot5bnzrbwM+uQ55w@mail.gmail.com, I needed to see the offsets affected by PRUNE and VACUUM
records. As that's probably not just me, I think we should make that change
in-tree.

The attached patch adds details to XLOG_HEAP2_PRUNE, XLOG_HEAP2_VACUUM,
XLOG_HEAP2_FREEZE_PAGE.

The biggest issue I have with the patch is that it's very hard to figure out
what punctuation to use where ;). The existing code is very inconsistent.

I chose to include infomask[2] for the different freeze plans mainly because
it looks odd to see different plans without a visible reason. But I'm not sure
that's the right choice.

Greetings,

Andres Freund

[1]: /messages/by-id/CANtu0ojby3eBdMXfs4QmS+K1avBc7NcRq_Ot5bnzrbwM+uQ55w@mail.gmail.com

Attachments:

v3-0001-heapdesc-Include-information-about-offset-arrays.patchtext/x-diff; charset=us-asciiDownload+94-1
In reply to: Andres Freund (#1)
Re: Show various offset arrays for heap WAL records

On Mon, Jan 9, 2023 at 1:58 PM Andres Freund <andres@anarazel.de> wrote:

A couple times when investigating data corruption issues, the last time just
yesterday in [1], I needed to see the offsets affected by PRUNE and VACUUM
records. As that's probably not just me, I think we should make that change
in-tree.

I remember how useful this was when we were investigating that early
bug in 14, that turned out to be in parallel VACUUM. So I'm all in
favor of it.

The attached patch adds details to XLOG_HEAP2_PRUNE, XLOG_HEAP2_VACUUM,
XLOG_HEAP2_FREEZE_PAGE.

I'm bound to end up doing the same in index access methods. Might make
sense for the utility routines to live somewhere more centralized, at
least when code reuse is likely. Practically every index AM has WAL
records that include a sorted page offset number array, just like
these ones. It's a very standard thing, obviously.

I chose to include infomask[2] for the different freeze plans mainly because
it looks odd to see different plans without a visible reason. But I'm not sure
that's the right choice.

I don't think that it is particularly necessary to do so in order for
the output to make sense -- pg_waldump is inherently a tool for
experts. What it comes down to for me is whether or not this
information is sufficiently useful to display, and/or can be (or needs
to be) controlled via some kind of verbosity knob.

I think that it easily could be useful, and I also think that it
easily could be a bit annoying. How hard would it be to invent a
general mechanism to control the verbosity of what we'll show for each
WAL record?

--
Peter Geoghegan

#3Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#2)
Re: Show various offset arrays for heap WAL records

Hi,

On 2023-01-09 19:59:42 -0800, Peter Geoghegan wrote:

On Mon, Jan 9, 2023 at 1:58 PM Andres Freund <andres@anarazel.de> wrote:

A couple times when investigating data corruption issues, the last time just
yesterday in [1], I needed to see the offsets affected by PRUNE and VACUUM
records. As that's probably not just me, I think we should make that change
in-tree.

I remember how useful this was when we were investigating that early
bug in 14, that turned out to be in parallel VACUUM. So I'm all in
favor of it.

Cool.

The attached patch adds details to XLOG_HEAP2_PRUNE, XLOG_HEAP2_VACUUM,
XLOG_HEAP2_FREEZE_PAGE.

I'm bound to end up doing the same in index access methods. Might make
sense for the utility routines to live somewhere more centralized, at
least when code reuse is likely. Practically every index AM has WAL
records that include a sorted page offset number array, just like
these ones. It's a very standard thing, obviously.

Hm, there doesn't seem to be a great location for them today. I guess we could
add something like src/include/access/rmgrdesc_utils.h? And put the
implementation in src/backend/access/rmgrdesc/rmgrdesc_utils.c? I first was
thinking of just rmgrdesc.[ch], but custom rmgrs added
src/bin/pg_waldump/rmgrdesc.[ch] ...

I chose to include infomask[2] for the different freeze plans mainly because
it looks odd to see different plans without a visible reason. But I'm not sure
that's the right choice.

I don't think that it is particularly necessary to do so in order for
the output to make sense -- pg_waldump is inherently a tool for
experts. What it comes down to for me is whether or not this
information is sufficiently useful to display, and/or can be (or needs
to be) controlled via some kind of verbosity knob.

It seemed useful enough to me, but I likely also stare more at this stuff than
most. Compared to the list of offsets it's not that much content.

How hard would it be to invent a general mechanism to control the verbosity
of what we'll show for each WAL record?

Nontrivial, I'm afraid. We don't pass any relevant parameters to rm_desc:
void (*rm_desc) (StringInfo buf, XLogReaderState *record);

so we'd need to patch all of them. That might be worth doing at some point,
but I don't want to tackle it right now.

Greetings,

Andres Freund

In reply to: Andres Freund (#3)
Re: Show various offset arrays for heap WAL records

On Tue, Jan 10, 2023 at 11:35 AM Andres Freund <andres@anarazel.de> wrote:

Nontrivial, I'm afraid. We don't pass any relevant parameters to rm_desc:
void (*rm_desc) (StringInfo buf, XLogReaderState *record);

so we'd need to patch all of them. That might be worth doing at some point,
but I don't want to tackle it right now.

Okay. Let's just get the basics in soon, then.

I would like to have a similar capability for index access methods,
but mostly just for investigating performance. Whenever we've really
needed something like this for debugging it seems to have been a
heapam thing, just because there's a lot more that can go wrong with
pruning, which is spread across many different places.

--
Peter Geoghegan

#5Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#4)
Re: Show various offset arrays for heap WAL records

Hi,

On 2023-01-11 14:53:54 -0800, Peter Geoghegan wrote:

On Tue, Jan 10, 2023 at 11:35 AM Andres Freund <andres@anarazel.de> wrote:

Nontrivial, I'm afraid. We don't pass any relevant parameters to rm_desc:
void (*rm_desc) (StringInfo buf, XLogReaderState *record);

so we'd need to patch all of them. That might be worth doing at some point,
but I don't want to tackle it right now.

Okay. Let's just get the basics in soon, then.

I would like to have a similar capability for index access methods,
but mostly just for investigating performance. Whenever we've really
needed something like this for debugging it seems to have been a
heapam thing, just because there's a lot more that can go wrong with
pruning, which is spread across many different places.

What are your thoughts about the place for the helper functions? You're ok
with rmgrdesc_utils.[ch]?

Greetings,

Andres Freund

In reply to: Andres Freund (#5)
Re: Show various offset arrays for heap WAL records

On Wed, Jan 11, 2023 at 3:00 PM Andres Freund <andres@anarazel.de> wrote:

What are your thoughts about the place for the helper functions? You're ok
with rmgrdesc_utils.[ch]?

Yeah, that seems okay.

We may well need to put more stuff in that file. We're overdue a big
overhaul of the rmgr output, so that everybody uses the same format
for everything. We made some progress on that for 16 already, by
standardizing on the name snapshotConflictHorizon, but a lot of
annoying inconsistencies still remain. Like the punctuation issue you
mentioned.

Ideally we'd be able to make the output more easy to manipulate via
the SQL interface from pg_walinspect, or perhaps via scripting. That
would require some rules that are imposed top-down, so that consumers
of the data can make certain general assumptions. But that's fairly
natural. It's not like there is just inherently a great deal of
diversity that we need to be considered. For example, the WAL records
used by each individual index access method are all very similar. In
fact the most important index AM WAL records used by each index AM
(e.g. insert, delete, vacuum) have virtually the same format as each
other already.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#6)
Re: Show various offset arrays for heap WAL records

On Wed, Jan 11, 2023 at 3:11 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Jan 11, 2023 at 3:00 PM Andres Freund <andres@anarazel.de> wrote:

What are your thoughts about the place for the helper functions? You're ok
with rmgrdesc_utils.[ch]?

Yeah, that seems okay.

BTW, while playing around with this patch today, I noticed that it
won't display the number of elements in each offset array directly.
Perhaps it's worth including that, too?

--
Peter Geoghegan

#8Melanie Plageman
melanieplageman@gmail.com
In reply to: Peter Geoghegan (#7)
Re: Show various offset arrays for heap WAL records

Hi,

I have taken a stab at doing some of the tasks listed in this email.

I have made the new files rmgr_utils.c/h.

I have come up with a standard format that I like for the output and
used it in all the heap record types.

Examples below:

snapshotConflictHorizon: 2184, nplans: 2, plans [ { xmax: 0, infomask:
2816, infomask2: 2, ntuples: 5, offsets: [ 10, 11, 12, 18, 71 ] }, {
xmax: 0, infomask: 11008, infomask2: 2, ntuples: 2, offsets: [ 72, 73
] } ]

snapshotConflictHorizon: 2199, nredirected: 4, ndead: 0, nunused: 4,
redirected: [ 1->38, 2->39, 3->40, 4->41 ], dead: [], unused: [ 24,
25, 26, 27, 37 ]

I started documenting it in the rmgr_utils.h header file in a comment,
however it may be worth a README?

I haven't polished this description of the format (or added examples,
etc) or used it in the btree-related functions because I assume the
format and helper function API will need more discussion.

This is still a rough draft, as I anticipate changes will be requested.
I would split it into multiple patches, etc. But I am looking for
feedback on the suggested format and the array formatting helper
function API.

Perhaps there should also be example output of the offset arrays in
pgwalinspect docs?

I've changed the array format helper functions that Andres added to be a
single function with an additional layer of indirection so that any
record with an array can use it regardless of type and format of the
individual elements. The signature is based somewhat off of qsort_r()
and allows the user to pass a function with the the desired format of
the elements.

On a semi-unrelated note, I think it might be nice to have a comment in
heapam_xlog.h about what the infobits fields actually are and why they
exist -- e.g. we only need a subset of infomask[2] bits in these
records.
I put a random comment in the code where I think it should go.
I will delete it later, of course.

On Mon, Jan 9, 2023 at 11:00 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Jan 9, 2023 at 1:58 PM Andres Freund <andres@anarazel.de> wrote:

The attached patch adds details to XLOG_HEAP2_PRUNE, XLOG_HEAP2_VACUUM,
XLOG_HEAP2_FREEZE_PAGE.

I'm bound to end up doing the same in index access methods. Might make
sense for the utility routines to live somewhere more centralized, at
least when code reuse is likely. Practically every index AM has WAL
records that include a sorted page offset number array, just like
these ones. It's a very standard thing, obviously.

I plan to add these if the format and API I suggested seems like the
right direction.

On Tue, Jan 10, 2023 at 2:35 PM Andres Freund <andres@anarazel.de> wrote:

I chose to include infomask[2] for the different freeze plans mainly because
it looks odd to see different plans without a visible reason. But I'm not sure
that's the right choice.

I don't think that it is particularly necessary to do so in order for
the output to make sense -- pg_waldump is inherently a tool for
experts. What it comes down to for me is whether or not this
information is sufficiently useful to display, and/or can be (or needs
to be) controlled via some kind of verbosity knob.

It seemed useful enough to me, but I likely also stare more at this stuff than
most. Compared to the list of offsets it's not that much content.

Personally, I like having the infomasks for the freeze plans. If we
someday have a more structured input to rmgr_desc, we could then easily
have them in their own column and use functions like
heap_tuple_infomask_flags() on them.

How hard would it be to invent a general mechanism to control the verbosity
of what we'll show for each WAL record?

Nontrivial, I'm afraid. We don't pass any relevant parameters to rm_desc:
void (*rm_desc) (StringInfo buf, XLogReaderState *record);

so we'd need to patch all of them. That might be worth doing at some point,
but I don't want to tackle it right now.

In terms of a more structured format, it seems like it would make the
most sense to pass a JSON or composite datatype structure to rm_desc
instead of that StringInfo.

I would also like to see functions like XLogRecGetBlockRefInfo() pass
something more useful than a stringinfo buffer so that we could easily
extract out the relfilenode in pgwalinspect.

On Mon, Jan 16, 2023 at 10:09 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Jan 11, 2023 at 3:11 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Jan 11, 2023 at 3:00 PM Andres Freund <andres@anarazel.de> wrote:

What are your thoughts about the place for the helper functions? You're ok
with rmgrdesc_utils.[ch]?

Yeah, that seems okay.

BTW, while playing around with this patch today, I noticed that it
won't display the number of elements in each offset array directly.
Perhaps it's worth including that, too?

I believe I have addressed this in the attached patch.

- Melanie

Attachments:

v1-0001-Add-rmgr_desc-utilities.patchapplication/octet-stream; name=v1-0001-Add-rmgr_desc-utilities.patchDownload+251-32
#9Robert Haas
robertmhaas@gmail.com
In reply to: Melanie Plageman (#8)
Re: Show various offset arrays for heap WAL records

On Fri, Jan 27, 2023 at 12:24 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I believe I have addressed this in the attached patch.

I'm not sure what's best in terms of formatting details but I
definitely like the idea of making pg_waldump show more details. I'd
even like to have a way to extract the tuple data, when it's
operations on tuples and we have those tuples in the payload. That'd
be a lot more verbose than what you are doing here, though, and I'm
not saying you should go do it right now or anything like that.

--
Robert Haas
EDB: http://www.enterprisedb.com

In reply to: Melanie Plageman (#8)
Re: Show various offset arrays for heap WAL records

On Fri, Jan 27, 2023 at 9:24 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I have taken a stab at doing some of the tasks listed in this email.

Cool.

I have made the new files rmgr_utils.c/h.

I have come up with a standard format that I like for the output and
used it in all the heap record types.

Examples below:

That seems like a reasonable approach.

I started documenting it in the rmgr_utils.h header file in a comment,
however it may be worth a README?

I haven't polished this description of the format (or added examples,
etc) or used it in the btree-related functions because I assume the
format and helper function API will need more discussion.

I think that standardization is good, but ISTM that we need clarity on
what the scope is -- what is *not* being standardized? It may or may
not be useful to call the end result an API. Or it may not make sense
to do so in the first committed version, even though we may ultimately
end up as something that deserves to be called an API. The obligation
to not break tools that are scraping the output in whatever way seems
kind of onerous right now -- just not having any gratuitous
inconsistencies (e.g., fixing totally inconsistent punctuation, making
the names for fields across WAL records consistent when they serve
exactly the same purpose) would be a big improvement.

As I mentioned in passing already, I actually don't think that the
B-Tree WAL records are all that special, as far as this stuff goes.
For example, the DELETE Btree record type is very similar to heapam's
PRUNE record type, and practically identical to Btree's VACUUM record
type. All of these record types use the same basic conventions, like a
snapshotConflictHorizon field for recovery conflicts (which is
generated in a very similar way during original execution, and
processed in precisely the same way during REDO), and arrays of page
offset numbers sorted in ascending order.

There are some remaining details where things from an index AM WAL
record aren't directly analogous (or pretty much identical) to some
other heapam WAL records, such as the way that the DELETE Btree record
type deals with deleting a subset of TIDs from a posting list index
tuple (generated by B-Tree deduplication). But even these exceptions
don't require all that much discussion. You could either choose to
only display the array of deleted index tuple page offset numbers, as
well as the similar array of "updated" index tuple page offset numbers
from xl_btree_delete, in which case you just display two arrays of
page offset numbers, in the same standard way. You may or may not want
to also show each individual xl_btree_update entry -- doing so would
be kinda like showing the details of individual freeze plans, except
that you'd probably display something very similar to the page offset
number display here too (even though these aren't page offset numbers,
they're 0-based offsets into the posting list's item pointer data
array).

BTW, there is also a tendency for non-btree index AM WAL records to be
fairly similar or even near-identical to the B-Tree WAL records. While
Hash indexes are very different to B-Tree indexes at a high level, it
is nevertheless the case that xl_hash_vacuum_one_page is directly
based on xl_btree_delete/xl_btree_vacuum, and that xl_hash_insert is
directly based on xl_btree_insert. There are some other WAL record
types that are completely different across hash and B-Tree, which is a
reflection of the fact that the index grows using a totally different
approach in each AM -- but that doesn't seem like something that
throws up any roadblocks for you (these can all be displayed as simple
structs anyway).

Speaking with my B-Tree hat on, I'd just be happy to be able to see
both of the page offset number arrays (the deleted + updated offset
number arrays from xl_btree_delete/xl_btree_vacuum), without also
being able to\ see output for each individual xl_btree_update
item-pointer-array-offset arrays -- just seeing that much is already a
huge improvement. That's why I'm a bit hesitant to use the term API
just yet, because an obligation to be consistent in whatever way seems
like it might block incremental progress.

Perhaps there should also be example output of the offset arrays in
pgwalinspect docs?

That would definitely make sense.

I've changed the array format helper functions that Andres added to be a
single function with an additional layer of indirection so that any
record with an array can use it regardless of type and format of the
individual elements. The signature is based somewhat off of qsort_r()
and allows the user to pass a function with the the desired format of
the elements.

That's handy.

Personally, I like having the infomasks for the freeze plans. If we
someday have a more structured input to rmgr_desc, we could then easily
have them in their own column and use functions like
heap_tuple_infomask_flags() on them.

I agree, in general, though long term the best approach is one that
has a configurable level of verbosity, with some kind of roughly
uniform definition of verbosity (kinda like DEBUG1 - DEBUG5, though
probably with only 2 or 3 distinct levels).

Obviously what you're doing here will lead to a significant increase
in the verbosity of the output for affected WAL records. I don't feel
too bad about that, though. It's really an existing problem, and one
that should be fixed either way. You kind of have to deal with this
already, by having a good psql pager, since record types such as
COMMIT_PREPARED, INVALIDATIONS, and RUNNING_XACTS are already very
verbose in roughly the same way. You only need to have one of these
record types output by a function like pg_get_wal_records_info() to
get absurdly wide output -- it hardly matters that most individual WAL
record types have terse output at that point.

How hard would it be to invent a general mechanism to control the verbosity
of what we'll show for each WAL record?

Nontrivial, I'm afraid. We don't pass any relevant parameters to rm_desc:
void (*rm_desc) (StringInfo buf, XLogReaderState *record);

so we'd need to patch all of them. That might be worth doing at some point,
but I don't want to tackle it right now.

In terms of a more structured format, it seems like it would make the
most sense to pass a JSON or composite datatype structure to rm_desc
instead of that StringInfo.

I would also like to see functions like XLogRecGetBlockRefInfo() pass
something more useful than a stringinfo buffer so that we could easily
extract out the relfilenode in pgwalinspect.

That does seem particularly important. It's a pain to do this from
SQL. In general I'm okay with focussing on pg_walinspect over
pg_waldump, since it'll become more important over time. Obviously
pg_waldump needs to still work, but I think it's okay to care less
about pg_waldump usability.

BTW, while playing around with this patch today, I noticed that it
won't display the number of elements in each offset array directly.
Perhaps it's worth including that, too?

I believe I have addressed this in the attached patch.

Thanks for taking care of that.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#10)
Re: Show various offset arrays for heap WAL records

On Tue, Jan 31, 2023 at 1:52 PM Peter Geoghegan <pg@bowt.ie> wrote:

I would also like to see functions like XLogRecGetBlockRefInfo() pass
something more useful than a stringinfo buffer so that we could easily
extract out the relfilenode in pgwalinspect.

That does seem particularly important. It's a pain to do this from
SQL. In general I'm okay with focussing on pg_walinspect over
pg_waldump, since it'll become more important over time. Obviously
pg_waldump needs to still work, but I think it's okay to care less
about pg_waldump usability.

I just realized why you mentioned XLogRecGetBlockRefInfo() -- it
probably shouldn't even be used by pg_walinspect at all (just by
pg_waldump). Using something like XLogRecGetBlockRefInfo() within
pg_walinspect misses out on the opportunity to output information in a
more descriptive tuple format, with real data types. It's not just the
relfilenode, either -- it's the block numbers themselves. And the fork
number.

In other words, I suspect that this is out of scope for this patch,
strictly speaking. We simply shouldn't be using
XLogRecGetBlockRefInfo() in pg_walinspect in the first place. Rather,
pg_walinspect should be calling some other function that ultimately
allows the user to work with (say) an array of int8 from SQL for the
block numbers. There is no great reason not to, AFAICT, since this
information is completely generic -- it's not like the rmgr-specific
output from GetRmgr(), where fine grained type information is just a
nice-to-have, with usability issues of its own (on account of the
details being record type specific).

I've been managing this problem within my own custom pg_walinspect
queries by using my own custom ICU collation. I use ICU's natural sort
order to order based on block_ref, or based on a substring()
expression that extracts something interesting from block_ref, such as
relfilenode. You can create a custom collation for this like so, per
the docs:

CREATE COLLATION IF NOT EXISTS numeric (provider = icu, locale =
'en-u-kn-true');

Obviously this hack of mine works, but hardly anybody else would be
willing to take the time to figure something like this out. Plus it's
error prone when it doesn't really have to be. And it suggests that
the block_ref field isn't record type generic -- that's sort of
misleading IMV.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#10)
Re: Show various offset arrays for heap WAL records

On Tue, Jan 31, 2023 at 1:52 PM Peter Geoghegan <pg@bowt.ie> wrote:

Obviously what you're doing here will lead to a significant increase
in the verbosity of the output for affected WAL records. I don't feel
too bad about that, though. It's really an existing problem, and one
that should be fixed either way. You kind of have to deal with this
already, by having a good psql pager, since record types such as
COMMIT_PREPARED, INVALIDATIONS, and RUNNING_XACTS are already very
verbose in roughly the same way. You only need to have one of these
record types output by a function like pg_get_wal_records_info() to
get absurdly wide output -- it hardly matters that most individual WAL
record types have terse output at that point.

Actually the really wide output comes from COMMIT records. After I run
the regression tests, and execute some of my own custom pg_walinspect
queries, I see that some individual COMMIT records have a
length(description) of over 10,000 bytes/characters. There is even one
particular COMMIT record whose length(description) is about 46,000
bytes/characters. So *ludicrously* verbose GetRmgr() strings are not
uncommon today. The worst case (or even particularly bad cases) won't
be made any worse by this patch, because there are obviously limits on
the width of the arrays that it outputs details descriptions of, that
don't apply to these COMMIT records.

--
Peter Geoghegan

#13Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#12)
Re: Show various offset arrays for heap WAL records

On Tue, Jan 31, 2023 at 6:20 PM Peter Geoghegan <pg@bowt.ie> wrote:

Actually the really wide output comes from COMMIT records. After I run
the regression tests, and execute some of my own custom pg_walinspect
queries, I see that some individual COMMIT records have a
length(description) of over 10,000 bytes/characters. There is even one
particular COMMIT record whose length(description) is about 46,000
bytes/characters. So *ludicrously* verbose GetRmgr() strings are not
uncommon today. The worst case (or even particularly bad cases) won't
be made any worse by this patch, because there are obviously limits on
the width of the arrays that it outputs details descriptions of, that
don't apply to these COMMIT records.

If we're dumping a lot of details out of each WAL record, we might
want to switch to a multi-line format of some kind. No one enjoys a
460-character wide line, let alone 46000.

--
Robert Haas
EDB: http://www.enterprisedb.com

In reply to: Robert Haas (#13)
Re: Show various offset arrays for heap WAL records

On Wed, Feb 1, 2023 at 5:20 AM Robert Haas <robertmhaas@gmail.com> wrote:

If we're dumping a lot of details out of each WAL record, we might
want to switch to a multi-line format of some kind. No one enjoys a
460-character wide line, let alone 46000.

I generally prefer it when I can use psql without using expanded table
format mode, and without having to use a pager. Of course that isn't
always possible, but it often is. I just don't think that that's going
to become feasible with pg_walinspect queries any time soon, since it
really requires a comprehensive strategy to deal with the issue of
verbosity.

It seems practically mandatory to use a pager when running
pg_walinspect queries in psql right now -- pspg is good for this. I
really can't use expanded table mode here, since it obscures the
relationship between adjoining records. I'm usually looking through
rows/records in LSN order, and want to be able to easily compare the
LSNs (or other details) of groups of adjoining records.

--
Peter Geoghegan

#15Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#14)
Re: Show various offset arrays for heap WAL records

On Wed, Feb 1, 2023 at 12:47 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Feb 1, 2023 at 5:20 AM Robert Haas <robertmhaas@gmail.com> wrote:

If we're dumping a lot of details out of each WAL record, we might
want to switch to a multi-line format of some kind. No one enjoys a
460-character wide line, let alone 46000.

I generally prefer it when I can use psql without using expanded table
format mode, and without having to use a pager. Of course that isn't
always possible, but it often is. I just don't think that that's going
to become feasible with pg_walinspect queries any time soon, since it
really requires a comprehensive strategy to deal with the issue of
verbosity.

Well, if we're thinking of making the output a lot more verbose, it
seems like we should at least do a bit of brainstorming about what
that strategy could be.

--
Robert Haas
EDB: http://www.enterprisedb.com

#16Melanie Plageman
melanieplageman@gmail.com
In reply to: Peter Geoghegan (#11)
Re: Show various offset arrays for heap WAL records

On Tue, Jan 31, 2023 at 5:48 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Jan 31, 2023 at 1:52 PM Peter Geoghegan <pg@bowt.ie> wrote:

I would also like to see functions like XLogRecGetBlockRefInfo() pass
something more useful than a stringinfo buffer so that we could easily
extract out the relfilenode in pgwalinspect.

That does seem particularly important. It's a pain to do this from
SQL. In general I'm okay with focussing on pg_walinspect over
pg_waldump, since it'll become more important over time. Obviously
pg_waldump needs to still work, but I think it's okay to care less
about pg_waldump usability.

I just realized why you mentioned XLogRecGetBlockRefInfo() -- it
probably shouldn't even be used by pg_walinspect at all (just by
pg_waldump). Using something like XLogRecGetBlockRefInfo() within
pg_walinspect misses out on the opportunity to output information in a
more descriptive tuple format, with real data types. It's not just the
relfilenode, either -- it's the block numbers themselves. And the fork
number.

In other words, I suspect that this is out of scope for this patch,
strictly speaking. We simply shouldn't be using
XLogRecGetBlockRefInfo() in pg_walinspect in the first place. Rather,
pg_walinspect should be calling some other function that ultimately
allows the user to work with (say) an array of int8 from SQL for the
block numbers. There is no great reason not to, AFAICT, since this
information is completely generic -- it's not like the rmgr-specific
output from GetRmgr(), where fine grained type information is just a
nice-to-have, with usability issues of its own (on account of the
details being record type specific).

Something like the attached?

start_lsn | 0/19823390
end_lsn | 0/19824360
prev_lsn | 0/19821358
xid | 1355
resource_manager | Heap
record_type | UPDATE
record_length | 4021
main_data_length | 14
fpi_length | 3948
description | off 11 xmax 1355 flags 0x00 ; new off 109 xmax 0
block_ref |
[0:1][0:8]={{0,1663,5,17033,0,442,460,4244,0},{1,1663,5,17033,0,0,0,0,0}}

It is a bit annoying not to have information about what each block_ref
item in the array represents (previously in the string), so maybe the
format in the attached shouldn't be a replacement for what is already
displayed by pg_get_wal_records_info() and friends.

It could instead be a new function which returns information in this
format -- perhaps tuples with separate columns for each labeled block
ref field denormalized to repeat the wal record info for every block?

The one piece of information I didn't include in the new block_ref
columns is the compression type (since it is a string). Since I used the
forknum value instead of the forknum name, maybe it is defensible to
also provide a documented int value for the compression type and make
that an int too?

- Melanie

Attachments:

v1-0001-Return-block_ref-details-as-an-array.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Return-block_ref-details-as-an-array.patchDownload+114-28
#17Peter Eisentraut
peter_e@gmx.net
In reply to: Melanie Plageman (#16)
Re: Show various offset arrays for heap WAL records

On 01.03.23 17:11, Melanie Plageman wrote:

diff --git a/contrib/pg_walinspect/pg_walinspect--1.0.sql b/contrib/pg_walinspect/pg_walinspect--1.0.sql
index 08b3dd5556..eb8ff82dd8 100644
--- a/contrib/pg_walinspect/pg_walinspect--1.0.sql
+++ b/contrib/pg_walinspect/pg_walinspect--1.0.sql
@@ -17,7 +17,7 @@ CREATE FUNCTION pg_get_wal_record_info(IN in_lsn pg_lsn,
OUT main_data_length int4,
OUT fpi_length int4,
OUT description text,
-    OUT block_ref text
+    OUT block_ref int4[][]
)
AS 'MODULE_PATHNAME', 'pg_get_wal_record_info'
LANGUAGE C STRICT PARALLEL SAFE;

A change like this would require a new extension version and an upgrade
script.

I suppose it's ok to postpone that work while the actual meat of the
patch is still being worked out, but I figured I'd mention it in case it
wasn't considered yet.

#18Melanie Plageman
melanieplageman@gmail.com
In reply to: Peter Geoghegan (#10)
Re: Show various offset arrays for heap WAL records

Thanks for the various perspectives and feedback.

Attached v2 has additional info for xl_btree_vacuum and xl_btree_delete.

I've quoted various emails by various senders below and replied.

On Fri, Jan 27, 2023 at 3:02 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jan 27, 2023 at 12:24 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I believe I have addressed this in the attached patch.

I'm not sure what's best in terms of formatting details but I
definitely like the idea of making pg_waldump show more details. I'd
even like to have a way to extract the tuple data, when it's
operations on tuples and we have those tuples in the payload. That'd
be a lot more verbose than what you are doing here, though, and I'm
not saying you should go do it right now or anything like that.

If I'm not mistaken, this would be quite difficult without changing
rm_desc to return some kind of self-describing data type.

On Tue, Jan 31, 2023 at 4:52 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Fri, Jan 27, 2023 at 9:24 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I started documenting it in the rmgr_utils.h header file in a comment,
however it may be worth a README?

I haven't polished this description of the format (or added examples,
etc) or used it in the btree-related functions because I assume the
format and helper function API will need more discussion.

I think that standardization is good, but ISTM that we need clarity on
what the scope is -- what is *not* being standardized? It may or may
not be useful to call the end result an API. Or it may not make sense
to do so in the first committed version, even though we may ultimately
end up as something that deserves to be called an API. The obligation
to not break tools that are scraping the output in whatever way seems
kind of onerous right now -- just not having any gratuitous
inconsistencies (e.g., fixing totally inconsistent punctuation, making
the names for fields across WAL records consistent when they serve
exactly the same purpose) would be a big improvement.

So, we can scrap any README or big comment, but are there other changes
to the code or structure you think would avoid it being seen as an
API?

As I mentioned in passing already, I actually don't think that the
B-Tree WAL records are all that special, as far as this stuff goes.
For example, the DELETE Btree record type is very similar to heapam's
PRUNE record type, and practically identical to Btree's VACUUM record
type. All of these record types use the same basic conventions, like a
snapshotConflictHorizon field for recovery conflicts (which is
generated in a very similar way during original execution, and
processed in precisely the same way during REDO), and arrays of page
offset numbers sorted in ascending order.

There are some remaining details where things from an index AM WAL
record aren't directly analogous (or pretty much identical) to some
other heapam WAL records, such as the way that the DELETE Btree record
type deals with deleting a subset of TIDs from a posting list index
tuple (generated by B-Tree deduplication). But even these exceptions
don't require all that much discussion. You could either choose to
only display the array of deleted index tuple page offset numbers, as
well as the similar array of "updated" index tuple page offset numbers
from xl_btree_delete, in which case you just display two arrays of
page offset numbers, in the same standard way. You may or may not want
to also show each individual xl_btree_update entry -- doing so would
be kinda like showing the details of individual freeze plans, except
that you'd probably display something very similar to the page offset
number display here too (even though these aren't page offset numbers,
they're 0-based offsets into the posting list's item pointer data
array).

I have added detail to xl_btree_delete and xl_btree_vacuum. I have added
the updated/deleted target offset numbers and the updated tuples
metadata.

I wondered if there was any reason to do xl_btree_dedup deduplication
intervals.

BTW, there is also a tendency for non-btree index AM WAL records to be
fairly similar or even near-identical to the B-Tree WAL records. While
Hash indexes are very different to B-Tree indexes at a high level, it
is nevertheless the case that xl_hash_vacuum_one_page is directly
based on xl_btree_delete/xl_btree_vacuum, and that xl_hash_insert is
directly based on xl_btree_insert. There are some other WAL record
types that are completely different across hash and B-Tree, which is a
reflection of the fact that the index grows using a totally different
approach in each AM -- but that doesn't seem like something that
throws up any roadblocks for you (these can all be displayed as simple
structs anyway).

I chose not to take on any other index types until I saw if this was viable.

Perhaps there should also be example output of the offset arrays in
pgwalinspect docs?

That would definitely make sense.

I wanted to include at least a minimal example for those following along
with this thread that would cause creation of one of the record types
which I have enhanced, but I had a little trouble making a reliable
example.

Below is my strategy for getting a Heap PRUNE record with redirects, but
it occasionally doesn't end up working and I wasn't sure why (I can do
more investigation if we think that having some kind of test for this is
useful).

CREATE EXTENSION pg_walinspect;
DROP TABLE IF EXISTS lsns;
CREATE TABLE lsns(name TEXT, lsn pg_lsn);

DROP TABLE IF EXISTS baz;
create table baz(a int, b int) with (autovacuum_enabled=false);
insert into baz select i, i % 3 from generate_series(1,100)i;

update baz set b = 0 where b = 1;
update baz set b = 7 where b = 0;
INSERT INTO lsns VALUES('start_lsn', (SELECT pg_current_wal_lsn()));
vacuum baz;
select count(*) from baz;
INSERT INTO lsns VALUES('end_lsn', (SELECT pg_current_wal_lsn()));
SELECT * FROM pg_get_wal_records_info((select lsn from lsns where name
= 'start_lsn'),
(select lsn from lsns where name = 'end_lsn'))
WHERE record_type LIKE 'PRUNE%' AND resource_manager = 'Heap2' LIMIT 1;

Personally, I like having the infomasks for the freeze plans. If we
someday have a more structured input to rmgr_desc, we could then easily
have them in their own column and use functions like
heap_tuple_infomask_flags() on them.

I agree, in general, though long term the best approach is one that
has a configurable level of verbosity, with some kind of roughly
uniform definition of verbosity (kinda like DEBUG1 - DEBUG5, though
probably with only 2 or 3 distinct levels).

Given this comment and Robert's concern quoted below, I am wondering if
the consensus is that a lack of verbosity control is a dealbreaker for
adding offsets or not.

On Wed, Feb 1, 2023 at 12:52 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Feb 1, 2023 at 12:47 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Feb 1, 2023 at 5:20 AM Robert Haas <robertmhaas@gmail.com> wrote:

If we're dumping a lot of details out of each WAL record, we might
want to switch to a multi-line format of some kind. No one enjoys a
460-character wide line, let alone 46000.

I generally prefer it when I can use psql without using expanded table
format mode, and without having to use a pager. Of course that isn't
always possible, but it often is. I just don't think that that's going
to become feasible with pg_walinspect queries any time soon, since it
really requires a comprehensive strategy to deal with the issue of
verbosity.

Well, if we're thinking of making the output a lot more verbose, it
seems like we should at least do a bit of brainstorming about what
that strategy could be.

In terms of strategies for controlling output verbosity, it seems
difficult to do without changing the rmgrdesc function signature. Unless
you are thinking of trying to reparse the rmgrdesc string output on the
pg_walinspect/pg_waldump side?

I think if there was a more structured output of rmgrdesc, then this
would also solve the verbosity level problem. Consumers could decide on
their verbosity level -- in various pg_walinspect function outputs, that
would probably just be column selection. For pg_waldump, I imagine that
some kind of parameter or flag would work.

Unless you are suggesting that we add a verbosity parameter to the
rmgrdesc function API now?

On Thu, Mar 2, 2023 at 3:17 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:

On 01.03.23 17:11, Melanie Plageman wrote:

diff --git a/contrib/pg_walinspect/pg_walinspect--1.0.sql b/contrib/pg_walinspect/pg_walinspect--1.0.sql
index 08b3dd5556..eb8ff82dd8 100644
--- a/contrib/pg_walinspect/pg_walinspect--1.0.sql
+++ b/contrib/pg_walinspect/pg_walinspect--1.0.sql
@@ -17,7 +17,7 @@ CREATE FUNCTION pg_get_wal_record_info(IN in_lsn pg_lsn,
OUT main_data_length int4,
OUT fpi_length int4,
OUT description text,
-    OUT block_ref text
+    OUT block_ref int4[][]
)
AS 'MODULE_PATHNAME', 'pg_get_wal_record_info'
LANGUAGE C STRICT PARALLEL SAFE;

A change like this would require a new extension version and an upgrade
script.

I suppose it's ok to postpone that work while the actual meat of the
patch is still being worked out, but I figured I'd mention it in case it
wasn't considered yet.

Thanks for letting me know. This pg_walinspect patch ended up being
discussed over in [1]/messages/by-id/CAAKRu_bORebdZmcV8V4cZBzU8M_C6tDDdbiPhCZ6i-iuSXW9TA@mail.gmail.com.

- Melanie

[1]: /messages/by-id/CAAKRu_bORebdZmcV8V4cZBzU8M_C6tDDdbiPhCZ6i-iuSXW9TA@mail.gmail.com

Attachments:

v2-0001-Add-rmgr_desc-utilities.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Add-rmgr_desc-utilities.patchDownload+250-32
v2-0002-Add-detail-to-some-btree-xlog-records.patchtext/x-patch; charset=US-ASCII; name=v2-0002-Add-detail-to-some-btree-xlog-records.patchDownload+78-13
In reply to: Melanie Plageman (#18)
Re: Show various offset arrays for heap WAL records

On Mon, Mar 13, 2023 at 4:01 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Fri, Jan 27, 2023 at 3:02 PM Robert Haas <robertmhaas@gmail.com> wrote:

I'm not sure what's best in terms of formatting details but I
definitely like the idea of making pg_waldump show more details.

If I'm not mistaken, this would be quite difficult without changing
rm_desc to return some kind of self-describing data type.

I'd say that it would depend on how far you went with it. Basic
information about the tuple wouldn't require any of that. I suggest
leaving this part out for now, though.

So, we can scrap any README or big comment, but are there other changes
to the code or structure you think would avoid it being seen as an
API?

I think that it would be good to try to build something that looks
like an API, while making zero promises about its stability -- at
least until further notice. Kind of like how there are no guarantees
about the stability of internal interfaces within the Linux kernel.

There is no reason to not take a firm position on some things now.
Things like punctuation, and symbol names for generic cross-record
symbols like snapshotConflictHorizon. Many of the differences that
exist now are wholly gratuitous -- just accidents. It would make sense
to standardize-away these clearly unnecessary variations. And to
document the new standard. I'd be surprised if anybody disagreed with
me on this point.

I have added detail to xl_btree_delete and xl_btree_vacuum. I have added
the updated/deleted target offset numbers and the updated tuples
metadata.

I wondered if there was any reason to do xl_btree_dedup deduplication
intervals.

No reason. It wouldn't be hard to cover xl_btree_dedup deduplication
intervals -- each element is a page offset number, and a corresponding
count of index tuples to merge together in the REDO routine. That's
slightly different to anything else, but not in a way that seems like
it requires very much additional effort.

I wanted to include at least a minimal example for those following along
with this thread that would cause creation of one of the record types
which I have enhanced, but I had a little trouble making a reliable
example.

Below is my strategy for getting a Heap PRUNE record with redirects, but
it occasionally doesn't end up working and I wasn't sure why (I can do
more investigation if we think that having some kind of test for this is
useful).

I'm not sure, but offhand I think that there could be a number of
annoying little implementation details that make it hard to come up
with a perfectly reliable test case. Perhaps try it while using VACUUM
VERBOSE, with the proviso that we should only expect the revised
example workflow to show a redirect record as intended when the
VERBOSE output confirms that VACUUM actually ran as expected, in
whatever way. For example, VACUUM can't have failed to acquire a
cleanup lock on a heap page due to the current phase of the moon.
VACUUM shouldn't have its "removable cutoff" held back by
who-knows-what when the test case is run, either.

Some of the tests for VACUUM use a temp table, since they conveniently
cannot have their "removable cutoff" held back -- not since commit
a7212be8. Of course, that strategy won't help you here. Getting VACUUM
to behave very predictably for testing purposes has proven tricky at
times.

I agree, in general, though long term the best approach is one that
has a configurable level of verbosity, with some kind of roughly
uniform definition of verbosity (kinda like DEBUG1 - DEBUG5, though
probably with only 2 or 3 distinct levels).

Given this comment and Robert's concern quoted below, I am wondering if
the consensus is that a lack of verbosity control is a dealbreaker for
adding offsets or not.

There are several different things that seem important to me
personally. These are in tension with each other, to a degree. These
are:

1. Like Andres, I'd really like to have some way of inspecting things
like heapam PRUNE, VACUUM, and FREEZE_PAGE records in significant
detail. These record types happen to be very important in general, and
the ability to see detailed information about the WAL record would
definitely help with some debugging scenarios. I've really missed
stuff like this while debugging serious issues under time pressure.

2. To a lesser extent I would like to see similar detailed information
for nbtree's DELETE, VACUUM, and possibly DEDUPLICATE record types.
They might also come in handy for debugging, in about the same way.

3. More manageable verbosity.

I think that it would be okay to put off coming up with a solution to
3, for now. 1 and 2 seem more important than 3.

I think if there was a more structured output of rmgrdesc, then this
would also solve the verbosity level problem. Consumers could decide on
their verbosity level -- in various pg_walinspect function outputs, that
would probably just be column selection. For pg_waldump, I imagine that
some kind of parameter or flag would work.

Unless you are suggesting that we add a verbosity parameter to the
rmgrdesc function API now?

The verbosity problem will get somewhat worse if we do just my items 1
and 2, so it would be nice if we at least had a strategy in mind that
delivers on item 3/verbosity -- though the implementation can appear
in a later release. Maybe something simple would work, like promising
to output (say) 30 characters or less in terse mode, and making no
such promise otherwise. Terse mode wouldn't just truncate the output
of verbose mode -- it would never display information that could in
principle exceed the 30 character allowance, even with records that
happen to fall under the limit.

I can't feel too bad about putting this part off. A pager like pspg is
already table stakes when using pg_walinspect in any sort of serious
way. As I said upthread, absurdly wide output is already reasonably
common in most cases.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#19)
Re: Show various offset arrays for heap WAL records

On Mon, Mar 13, 2023 at 6:41 PM Peter Geoghegan <pg@bowt.ie> wrote:

There are several different things that seem important to me
personally. These are in tension with each other, to a degree. These
are:

1. Like Andres, I'd really like to have some way of inspecting things
like heapam PRUNE, VACUUM, and FREEZE_PAGE records in significant
detail. These record types happen to be very important in general, and
the ability to see detailed information about the WAL record would
definitely help with some debugging scenarios. I've really missed
stuff like this while debugging serious issues under time pressure.

One problem that I often run into when performing analysis of VACUUM
using pg_walinspect is the issue of *who* pruned which heap page, for
any given PRUNE record. Was it VACUUM/autovacuum, or was it
opportunistic pruning? There is no way of knowing for sure right now.
You *cannot* rely on an xid of 0 as an indicator of a given PRUNE
record coming from VACUUM; it could just have been an opportunistic
prune operation that happened to take place when a SELECT query ran,
before any XID was ever allocated.

I think that we should do something like the attached, to completely
avoid this ambiguity. This patch adds a new XLOG_HEAP2 bit that's
similar to XLOG_HEAP_INIT_PAGE -- XLOG_HEAP2_BYVACUUM. This allows all
XLOG_HEAP2 record types to indicate that they took place during
VACUUM, by XOR'ing the flag with the record type/info when
XLogInsert() is called. For now this is only used by PRUNE records.
Tools like pg_walinspect will report a separate "Heap2/PRUNE+BYVACUUM"
record_type, as well as the unadorned Heap2/PRUNE record_type, which
we'll now know must have been opportunistic pruning.

The approach of using a bit in the style of the heapam init bit makes
sense to me, because the bit is available, and works in a way that is
minimally invasive. Also, one can imagine needing to resolve a similar
ambiguity in the future, when (say) opportunistic freezing is added.

I think that it makes sense to treat this within the scope of
Melanie's ongoing work to improve the instrumentation of these records
-- meaning that it's in scope for Postgres 16. Admittedly this is a
slightly creative interpretation, so if others disagree then I won't
argue. This is quite a small patch, though, which makes debugging
significantly easier. I think that there could be a great deal of
utility in being able to easily "pair up" corresponding
"Heap2/PRUNE+BYVACUUM" and "Heap2/VACUUM" records in debugging
scenarios. I can imagine linking these to "Heap2/FREEZE_PAGE" and
"Heap2/VISIBLE" records, too, since they're all closely related record
types.

--
Peter Geoghegan

Attachments:

v1-0001-Record-which-PRUNE-records-are-from-VACUUM.patchapplication/octet-stream; name=v1-0001-Record-which-PRUNE-records-are-from-VACUUM.patchDownload+26-13
In reply to: Peter Geoghegan (#20)
#22Melanie Plageman
melanieplageman@gmail.com
In reply to: Peter Geoghegan (#21)
In reply to: Melanie Plageman (#22)
#24Melanie Plageman
melanieplageman@gmail.com
In reply to: Peter Geoghegan (#23)
In reply to: Melanie Plageman (#24)
#26Melanie Plageman
melanieplageman@gmail.com
In reply to: Peter Geoghegan (#25)
In reply to: Melanie Plageman (#26)
#28Melanie Plageman
melanieplageman@gmail.com
In reply to: Peter Geoghegan (#27)
In reply to: Melanie Plageman (#28)
In reply to: Peter Geoghegan (#29)
In reply to: Peter Geoghegan (#30)
#32Melanie Plageman
melanieplageman@gmail.com
In reply to: Peter Geoghegan (#31)
In reply to: Melanie Plageman (#32)
#34Melanie Plageman
melanieplageman@gmail.com
In reply to: Peter Geoghegan (#33)
In reply to: Melanie Plageman (#34)
#36Melanie Plageman
melanieplageman@gmail.com
In reply to: Peter Geoghegan (#35)
In reply to: Melanie Plageman (#36)
In reply to: Peter Geoghegan (#37)
In reply to: Peter Geoghegan (#38)
#40Melanie Plageman
melanieplageman@gmail.com
In reply to: Peter Geoghegan (#39)
In reply to: Melanie Plageman (#40)
#42Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Peter Geoghegan (#41)
In reply to: Heikki Linnakangas (#42)
In reply to: Peter Geoghegan (#43)
#45Melanie Plageman
melanieplageman@gmail.com
In reply to: Heikki Linnakangas (#42)
#46Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Melanie Plageman (#45)
In reply to: Peter Geoghegan (#20)
#48Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Peter Geoghegan (#47)
In reply to: Heikki Linnakangas (#48)