Add pg_walinspect function with block info columns
Hi,
When using pg_walinspect, and calling functions like
pg_get_wal_records_info(), I often wish that the various information in
the block_ref column was separated out into columns so that I could
easily access them and pass them to various other functions to add
information -- like getting the relname from pg_class like this:
SELECT n.nspname, c.relname, wal_info.*
FROM pg_get_wal_records_extended_info(:start_lsn, :end_lsn) wal_info
JOIN pg_class c
ON wal_info.relfilenode = pg_relation_filenode(c.oid) AND
wal_info.reldatabase IN (0, (SELECT oid FROM pg_database
WHERE datname = current_database()))
JOIN pg_namespace n ON n.oid = c.relnamespace;
This has been mentioned in [1]/messages/by-id/CAH2-Wz=acGKoP8cZ+6Af2inoai0N5cZKCY13DaqXCwQNupK8qg@mail.gmail.com amongst other places.
So, attached is a patch with pg_get_wal_records_extended_info(). I
suspect the name is not very good. Also, it is nearly a direct copy of
pg_get_wal_fpi_infos() except for the helper called to fill in the
tuplestore, so it might be worth doing something about that.
However, I am mainly looking for feedback about whether or not others
would find this useful, and, if so, what columns they would like to see
in the returned tuplestore.
Note that I didn't include the cumulative fpi_len for all the pages
since pg_get_wal_fpi_info() now exists. I noticed that
pg_get_wal_fpi_info() doesn't list compression information (which is in
the block_ref column of pg_get_wal_records_info()). I don't know if this
is worth including in my proposed function
pg_get_wal_records_extended_info().
- Melanie
[1]: /messages/by-id/CAH2-Wz=acGKoP8cZ+6Af2inoai0N5cZKCY13DaqXCwQNupK8qg@mail.gmail.com
Attachments:
v1-0001-Add-extended-block-info-function-to-pg_walinspect.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Add-extended-block-info-function-to-pg_walinspect.patchDownload+138-1
On Wed, Mar 1, 2023 at 12:51 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
When using pg_walinspect, and calling functions like
pg_get_wal_records_info(), I often wish that the various information in
the block_ref column was separated out into columns so that I could
easily access them and pass them to various other functions to add
information -- like getting the relname from pg_class like this:SELECT n.nspname, c.relname, wal_info.*
FROM pg_get_wal_records_extended_info(:start_lsn, :end_lsn) wal_info
JOIN pg_class c
ON wal_info.relfilenode = pg_relation_filenode(c.oid) AND
wal_info.reldatabase IN (0, (SELECT oid FROM pg_database
WHERE datname = current_database()))
JOIN pg_namespace n ON n.oid = c.relnamespace;This has been mentioned in [1] amongst other places.
So, attached is a patch with pg_get_wal_records_extended_info(). I
suspect the name is not very good. Also, it is nearly a direct copy of
pg_get_wal_fpi_infos() except for the helper called to fill in the
tuplestore, so it might be worth doing something about that.However, I am mainly looking for feedback about whether or not others
would find this useful, and, if so, what columns they would like to see
in the returned tuplestore.Note that I didn't include the cumulative fpi_len for all the pages
since pg_get_wal_fpi_info() now exists. I noticed that
pg_get_wal_fpi_info() doesn't list compression information (which is in
the block_ref column of pg_get_wal_records_info()). I don't know if this
is worth including in my proposed function
pg_get_wal_records_extended_info().
Thinking about this more, it could make sense to have a function which
gives you this extended block information and has a parameter like
with_fpi which would include the information returned by
pg_get_wal_fpi_info(). It might be nice to have it still include the
information about the record itself as well.
I don't know if it would be instead of pg_get_wal_fpi_info(), though.
The way I would use this is when I want to see the record level
information but with some additional information aggregated across the
relevant blocks. For example, I could group by the record information
and relfilenode and using the query in my example above, see all the
information for the record along with the relname (when possible).
- Melanie
On Thu, Mar 02, 2023 at 11:17:05AM -0500, Melanie Plageman wrote:
Thinking about this more, it could make sense to have a function which
gives you this extended block information and has a parameter like
with_fpi which would include the information returned by
pg_get_wal_fpi_info(). It might be nice to have it still include the
information about the record itself as well.
Hmm. I am OK if you want to include more information about the
blocks, and it may be nicer to not bloat the interface with more
functions than necessary.
I don't know if it would be instead of pg_get_wal_fpi_info(), though.
The way I would use this is when I want to see the record level
information but with some additional information aggregated across the
relevant blocks. For example, I could group by the record information
and relfilenode and using the query in my example above, see all the
information for the record along with the relname (when possible).
As far as I know, a block reference could have some data or a FPW, so
it is true that pg_get_wal_fpi_info() is not extensive enough if you
want to get more information about the blocks in use for each record,
especially if there is some data, and grouping the information about
whole set of blocks into a single function call can some time.
In order to satisfy your case, why not having one function that does
everything, looping over the blocks of a single record as long as
XLogRecHasBlockRef() is satisfied, returning the FPW if the block
includes an image (or NULL if !XLogRecHasBlockImage()), as well as its
data in bytea if XLogRecGetData() gives something (?).
I am not sure that this should return anything about the record itself
except its ReadRecPtr, though, as ReadRecPtr would be enough to
cross-check with the information provided by GetWALRecordInfo() with a
join. Hence, I guess that we could update the existing FPI function
with:
- the addition of some of the flags of bimg_info, like the compression
type, if they apply, with a text[].
- the addition of bimg_len, if the block has a FPW, or NULL if none.
- the addition of apply_image, if the block has a FPW, or NULL if
none.
- the addition of the block data, if any, or NULL if there is no
data.
- an update for the FPW handling, where we would return NULL if there
is no FPW references in the block, but still return the full,
decompressed 8kB image if it is there.
--
Michael
On Thu, Mar 2, 2023 at 9:47 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
However, I am mainly looking for feedback about whether or not others
would find this useful, and, if so, what columns they would like to see
in the returned tuplestore.
IMO, pg_get_wal_records_extended_info as proposed doesn't look good to
me as it outputs most of the columns that are already given by
pg_get_wal_records_info.What I think the best way at this point is to
make it return the following:
lsn pg_lsn
block_id int8
spcOid oid
dbOid oid
relNumber oid
forkNames text
fpi bytea
fpi_info text
So, there can be multiple columns for the same record LSN, which
means, essentially (lsn, block_id) can be a unique value for the row.
If a block has FPI, fpi and fpi_info are non-null, otherwise, nulls.
If needed, this output can be joined with pg_get_wal_records_info on
lsn, to get all the record level details.What do you think? Will this
serve your purpose?
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, 6 Mar 2023 at 15:40, Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
On Thu, Mar 2, 2023 at 9:47 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:However, I am mainly looking for feedback about whether or not others
would find this useful, and, if so, what columns they would like to see
in the returned tuplestore.IMO, pg_get_wal_records_extended_info as proposed doesn't look good to
me as it outputs most of the columns that are already given by
pg_get_wal_records_info.What I think the best way at this point is to
make it return the following:
lsn pg_lsn
block_id int8
spcOid oid
dbOid oid
relNumber oid
forkNames text
fpi bytea
fpi_info text
Wouldn't it be more useful to have
type PgXlogRecordBlock
( block_id int
...
, blockimage bytea
, data bytea
)
type PgXlogRecord
( start lsn
, ...
, blocks PgXlogRecordBlock[] -- array of record's registered blocks,
c.q. DecodedBkpBlock->blocks
, main_data bytea
)
which is returned by one sql function, and then used and processed
(unnest()ed) in the relevant views? It would allow anyone to build
their own processing on pg_walinspect where they want or need it,
without us having to decide what the user wants, and without having to
associate blocks with the main xlog record data through the joining of
several (fairly expensive) xlog decoding passes.
The basic idea is to create a single entrypoint to all relevant data
from DecodedXLogRecord in SQL, not multiple.
Kind regards,
Matthias van de Meent
On Mon, Mar 06, 2023 at 04:08:28PM +0100, Matthias van de Meent wrote:
On Mon, 6 Mar 2023 at 15:40, Bharath Rupireddy
IMO, pg_get_wal_records_extended_info as proposed doesn't look good to
me as it outputs most of the columns that are already given by
pg_get_wal_records_info.What I think the best way at this point is to
make it return the following:
lsn pg_lsn
block_id int8
spcOid oid
dbOid oid
relNumber oid
forkNames text
fpi bytea
fpi_info text
I would add the length of the block data (without the hole and
compressed, as the FPI data should always be presented as
uncompressed), and the block data if any (without the block data
length as one can guess it based on the bytea data length). Note that
a block can have both a FPI and some data assigned to it, as far as I
recall.
The basic idea is to create a single entrypoint to all relevant data
from DecodedXLogRecord in SQL, not multiple.
While I would agree with this principle on simplicity's ground in
terms of minimizing the SQL interface and the pg_wal/ lookups, I
disagree about it on unsability ground, because we can avoid extra SQL
tweaks with more functions. One recent example I have in mind is
partitionfuncs.c, which can actually be achieved with a WITH RECURSIVE
on the catalogs. There are of course various degrees of complexity,
and perhaps unnest() cannot qualify as one, but having two functions
returning normalized records (one for the record information, and a
second for the block information), is a rather good balance between
usability and interface complexity, in my experience. If you have two
functions, a JOIN is enough to cross-check the block data and the
record data, while an unnest() heavily bloats the main function output
(aka byteas of FPIs in a single array).
--
Michael
At Tue, 7 Mar 2023 09:34:24 +0900, Michael Paquier <michael@paquier.xyz> wrote in
On Mon, Mar 06, 2023 at 04:08:28PM +0100, Matthias van de Meent wrote:
On Mon, 6 Mar 2023 at 15:40, Bharath Rupireddy
IMO, pg_get_wal_records_extended_info as proposed doesn't look good to
me as it outputs most of the columns that are already given by
pg_get_wal_records_info.What I think the best way at this point is to
make it return the following:
lsn pg_lsn
block_id int8
spcOid oid
dbOid oid
relNumber oid
forkNames text
fpi bytea
fpi_info textI would add the length of the block data (without the hole and
compressed, as the FPI data should always be presented as
uncompressed), and the block data if any (without the block data
length as one can guess it based on the bytea data length). Note that
a block can have both a FPI and some data assigned to it, as far as I
recall.
+1
The basic idea is to create a single entrypoint to all relevant data
from DecodedXLogRecord in SQL, not multiple.While I would agree with this principle on simplicity's ground in
terms of minimizing the SQL interface and the pg_wal/ lookups, I
disagree about it on unsability ground, because we can avoid extra SQL
tweaks with more functions. One recent example I have in mind is
partitionfuncs.c, which can actually be achieved with a WITH RECURSIVE
on the catalogs. There are of course various degrees of complexity,
and perhaps unnest() cannot qualify as one, but having two functions
returning normalized records (one for the record information, and a
second for the block information), is a rather good balance between
usability and interface complexity, in my experience. If you have two
functions, a JOIN is enough to cross-check the block data and the
record data, while an unnest() heavily bloats the main function output
(aka byteas of FPIs in a single array).
FWIW, my initial thought about the proposal was similar to Matthias,
and tried a function that would convert (for simplicity) the block_ref
string to a json object. Although this approach did work, I was not
satisfied with its limited usability and poor performance (mainly the
poor performance is due to text->json conversion, though)..
Finally, I realized that the initial discomfort I experienced stemmed
from the name of the function, which suggests that it returns
information of "records". This discomfort would disappear if the
function were instead named pg_get_wal_blockref_info() or something
similar.
Thus I'm inclined to agree with Michael's suggestion of creating a new
normalized set-returning function that returns information of
"blocks".
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Mar 07, 2023 at 11:17:45AM +0900, Kyotaro Horiguchi wrote:
Thus I'm inclined to agree with Michael's suggestion of creating a new
normalized set-returning function that returns information of
"blocks".
Just to be clear here, I am not suggesting to add a new function for
only the block information, just a rename of the existing
pg_get_wal_fpi_info() to something like pg_get_wal_block_info() that
includes both the FPI (if any or NULL if none) and the block data (if
any or NULL is none) so as all of them are governed by the same lookup
at pg_wal/. The fpi information (aka compression type) is displayed
if there is a FPI in the block.
--
Michael
At Tue, 7 Mar 2023 14:44:49 +0900, Michael Paquier <michael@paquier.xyz> wrote in
On Tue, Mar 07, 2023 at 11:17:45AM +0900, Kyotaro Horiguchi wrote:
Thus I'm inclined to agree with Michael's suggestion of creating a new
normalized set-returning function that returns information of
"blocks".Just to be clear here, I am not suggesting to add a new function for
only the block information, just a rename of the existing
pg_get_wal_fpi_info() to something like pg_get_wal_block_info() that
includes both the FPI (if any or NULL if none) and the block data (if
any or NULL is none) so as all of them are governed by the same lookup
at pg_wal/. The fpi information (aka compression type) is displayed
if there is a FPI in the block.
Ah. Yes, that expansion sounds sensible.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Mar 07, 2023 at 03:49:02PM +0900, Kyotaro Horiguchi wrote:
Ah. Yes, that expansion sounds sensible.
Okay, so, based on this idea, I have hacked on this stuff and finish
with the attached that shows block data if it exists, as well as FPI
stuff if any. bimg_info is showed as a text[] for its flags.
I guess that I'd better add a test that shows correctly a record with
some block data attached to it, on top of the existing one for FPIs..
Any suggestions? Perhaps just a heap/heap2 record?
Thoughts?
--
Michael
Attachments:
0001-Rework-pg_walinspect-to-retrieve-more-block-informat.patchtext/x-diff; charset=us-asciiDownload+145-71
At Tue, 7 Mar 2023 16:18:21 +0900, Michael Paquier <michael@paquier.xyz> wrote in
On Tue, Mar 07, 2023 at 03:49:02PM +0900, Kyotaro Horiguchi wrote:
Ah. Yes, that expansion sounds sensible.
Okay, so, based on this idea, I have hacked on this stuff and finish
with the attached that shows block data if it exists, as well as FPI
stuff if any. bimg_info is showed as a text[] for its flags.
# The naming convetion looks inconsistent between
# pg_get_wal_records_info and pg_get_wal_block_info but it's not an
# issue of this patch..
I guess that I'd better add a test that shows correctly a record with
some block data attached to it, on top of the existing one for FPIs..
Any suggestions? Perhaps just a heap/heap2 record?Thoughts?
I thought that we needed a test for block data when I saw the patch.
I don't have great idea but a single insert should work.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Mar 7, 2023 at 12:48 PM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Mar 07, 2023 at 03:49:02PM +0900, Kyotaro Horiguchi wrote:
Ah. Yes, that expansion sounds sensible.
Okay, so, based on this idea, I have hacked on this stuff and finish
with the attached that shows block data if it exists, as well as FPI
stuff if any. bimg_info is showed as a text[] for its flags.
+1.
I guess that I'd better add a test that shows correctly a record with
some block data attached to it, on top of the existing one for FPIs..
Any suggestions? Perhaps just a heap/heap2 record?Thoughts?
That would be a lot better. Not just the test, but also the
documentation can have it. Simple way to generate such a record (both
block data and FPI) is to just change the wal_level to logical in
walinspect.conf [1], see code around REGBUF_KEEP_DATA and
RelationIsLogicallyLogged in heapam.c
I had the following comments and fixed them in the attached v2 patch:
1. Still a trace of pg_get_wal_fpi_info in docs, removed it.
2. Used int4 instead of int for fpilen just to be in sync with
fpi_length of pg_get_wal_record_info.
3. Changed to be consistent and use just FPI or "F/full page".
/* FPI flags */
/* No full page image, so store NULLs for all its fields */
/* Full-page image */
/* Full page exists, so let's save it. */
* and end LSNs. This produces information about the full page images with
* to a record. Decompression is applied to the full-page images, if
4. I think we need to free raw_data, raw_page and flags as we loop
over multiple blocks (XLR_MAX_BLOCK_ID) and will leak memory which can
be a problem if we have many blocks assocated with a single WAL
record.
flags = (Datum *) palloc0(sizeof(Datum) * bitcnt);
Also, we will leak all CStringGetTextDatum memory in the block_id for loop.
Another way is to use and reset temp memory context in the for loop
over block_ids. I prefer this approach over multiple pfree()s in
block_id for loop.
5. I think it'd be good to say if the FPI is for WAL_VERIFICATION, so
I changed it to the following. Feel free to ignore this if you think
it's not required.
if (blk->apply_image)
flags[cnt++] = CStringGetTextDatum("APPLY");
else
flags[cnt++] = CStringGetTextDatum("WAL_VERIFICATION");
6. Did minor wordsmithing.
7. Added test case which shows both block data and fpi in the documentation.
8. Changed wal_level to logical in walinspect.conf to test case with block data.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v2-0001-Rework-pg_walinspect-to-retrieve-more-block-infor.patchapplication/octet-stream; name=v2-0001-Rework-pg_walinspect-to-retrieve-more-block-infor.patchDownload+189-87
On Tue, 7 Mar 2023 at 01:34, Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Mar 06, 2023 at 04:08:28PM +0100, Matthias van de Meent wrote:
On Mon, 6 Mar 2023 at 15:40, Bharath Rupireddy
IMO, pg_get_wal_records_extended_info as proposed doesn't look good to
me as it outputs most of the columns that are already given by
pg_get_wal_records_info.What I think the best way at this point is to
make it return the following:
lsn pg_lsn
block_id int8
spcOid oid
dbOid oid
relNumber oid
forkNames text
fpi bytea
fpi_info textI would add the length of the block data (without the hole and
compressed, as the FPI data should always be presented as
uncompressed), and the block data if any (without the block data
length as one can guess it based on the bytea data length). Note that
a block can have both a FPI and some data assigned to it, as far as I
recall.The basic idea is to create a single entrypoint to all relevant data
from DecodedXLogRecord in SQL, not multiple.While I would agree with this principle on simplicity's ground in
terms of minimizing the SQL interface and the pg_wal/ lookups, I
disagree about it on unsability ground, because we can avoid extra SQL
tweaks with more functions. One recent example I have in mind is
partitionfuncs.c, which can actually be achieved with a WITH RECURSIVE
on the catalogs.
Correct, but in that case the user would build the same query (or at
least with the same complexity) as what we're executing under the
hood, right?
There are of course various degrees of complexity,
and perhaps unnest() cannot qualify as one, but having two functions
returning normalized records (one for the record information, and a
second for the block information), is a rather good balance between
usability and interface complexity, in my experience.
I would agree, if it weren't for the reasons written below.
If you have two
functions, a JOIN is enough to cross-check the block data and the
record data,
Joins are expensive on large datasets; and because WAL is one of the
largest datasets in the system, why would we want to force the user to
JOIN them if we can produce the data in one pre-baked data structure
without a need to join?
while an unnest() heavily bloats the main function output
(aka byteas of FPIs in a single array).
I don't see how that would be bad. You can select a subset of columns
without much issue, which can allow you to ignore any and all bloat.
It is also not easy to imagine that we'd have arguments in the
function that determine whether it includes the largest fields (main
data, blocks, block data, and block images) or leaves them NULL so
that we need to pass less data around if the user doesn't want the
data.
Matthias van de Meent
On Tue, Mar 07, 2023 at 03:56:22PM +0530, Bharath Rupireddy wrote:
That would be a lot better. Not just the test, but also the
documentation can have it. Simple way to generate such a record (both
block data and FPI) is to just change the wal_level to logical in
walinspect.conf [1], see code around REGBUF_KEEP_DATA and
RelationIsLogicallyLogged in heapam.c
I don't agree that we need to go down to wal_level=logical for this.
The important part is to check that the non-NULL and NULL paths for
the block data and FPI data are both taken, making 4 paths to check.
So we need two tests at minimum, which would be either:
- One SQL generating no FPI with no block data and a second generating
a FPI with block data. v2 was doing that but did not cover the first
case.
- One SQL generating a FPI with no block data and a second generating
no FPI with block data.
So let's just geenrate a heap record on an UPDATE, for example, like
in the version attached.
2. Used int4 instead of int for fpilen just to be in sync with
fpi_length of pg_get_wal_record_info.
Okay.
3. Changed to be consistent and use just FPI or "F/full page".
/* FPI flags */
/* No full page image, so store NULLs for all its fields */
/* Full-page image */
/* Full page exists, so let's save it. */
* and end LSNs. This produces information about the full page images with
* to a record. Decompression is applied to the full-page images, if
Fine by me.
4. I think we need to free raw_data, raw_page and flags as we loop
over multiple blocks (XLR_MAX_BLOCK_ID) and will leak memory which can
be a problem if we have many blocks assocated with a single WAL
record.
flags = (Datum *) palloc0(sizeof(Datum) * bitcnt);
Also, we will leak all CStringGetTextDatum memory in the block_id for loop.
Another way is to use and reset temp memory context in the for loop
over block_ids. I prefer this approach over multiple pfree()s in
block_id for loop.
I disagree, this was on purpose in the last version. This version
finishes by calling AllocSetContextCreate() and MemoryContextDelete()
once per *record*, which will not be free, and we are arguing about
resetting the memory context after scanning up to XLR_MAX_BLOCK_ID
blocks, or 32 blocks which would go up to 32kB per page in the worst
case. That's not going to matter in a large scan for each record, but
the extra AllocSet*() calls could. And we basically do the same thing
on HEAD.
5. I think it'd be good to say if the FPI is for WAL_VERIFICATION, so
I changed it to the following. Feel free to ignore this if you think
it's not required.
if (blk->apply_image)
flags[cnt++] = CStringGetTextDatum("APPLY");
else
flags[cnt++] = CStringGetTextDatum("WAL_VERIFICATION");
Disagreed here as well. WAL_VERIFICATION does not map with any of the
internal flags, and actually it may be finished by not being used
at replay if the LSN of the page read if higher than what the WAL
stores.
7. Added test case which shows both block data and fpi in the
documentation.
Okay on that.
8. Changed wal_level to logical in walinspect.conf to test case with block data.
This change is not necessary, per the argument above.
Any comments?
--
Michael
Attachments:
v3-0001-Rework-pg_walinspect-to-retrieve-more-block-infor.patchtext/x-diff; charset=us-asciiDownload+186-80
On Wed, Mar 8, 2023 at 12:58 PM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Mar 07, 2023 at 03:56:22PM +0530, Bharath Rupireddy wrote:
That would be a lot better. Not just the test, but also the
documentation can have it. Simple way to generate such a record (both
block data and FPI) is to just change the wal_level to logical in
walinspect.conf [1], see code around REGBUF_KEEP_DATA and
RelationIsLogicallyLogged in heapam.cI don't agree that we need to go down to wal_level=logical for this.
The important part is to check that the non-NULL and NULL paths for
the block data and FPI data are both taken, making 4 paths to check.
So we need two tests at minimum, which would be either:
- One SQL generating no FPI with no block data and a second generating
a FPI with block data. v2 was doing that but did not cover the first
case.
- One SQL generating a FPI with no block data and a second generating
no FPI with block data.So let's just geenrate a heap record on an UPDATE, for example, like
in the version attached.
Yup, that should work too because block data gets logged [1]needs_backup = (page_lsn <= RedoRecPtr);.
4. I think we need to free raw_data, raw_page and flags as we loop
over multiple blocks (XLR_MAX_BLOCK_ID) and will leak memory which can
be a problem if we have many blocks assocated with a single WAL
record.
flags = (Datum *) palloc0(sizeof(Datum) * bitcnt);
Also, we will leak all CStringGetTextDatum memory in the block_id for loop.
Another way is to use and reset temp memory context in the for loop
over block_ids. I prefer this approach over multiple pfree()s in
block_id for loop.I disagree, this was on purpose in the last version. This version
finishes by calling AllocSetContextCreate() and MemoryContextDelete()
once per *record*, which will not be free, and we are arguing about
resetting the memory context after scanning up to XLR_MAX_BLOCK_ID
blocks, or 32 blocks which would go up to 32kB per page in the worst
case. That's not going to matter in a large scan for each record, but
the extra AllocSet*() calls could. And we basically do the same thing
on HEAD.
It's not just 32kB per page right? 32*8KB on HEAD (no block data,
flags and CStringGetTextDatum there). With the patch, the number of
pallocs for each block_id = 6 CStringGetTextDatum + BLCKSZ (8KB) +
flags (5*size of ptr) + block data_len. In the worst case, all
XLR_MAX_BLOCK_ID can have both FPIs and block data. Furthermore,
imagine if someone initialized their cluster with a higher BLCKSZ (>=
8KB), then the memory leak happens noticeably on a lower-end system.
I understand that performance is critical here but we need to ensure
memory is used wisely. Therefore, I'd still vote to free at least the
major contributors here, that is, pfree(raw_data);, pfree(raw_page);
and pfree(flags); right after they are done using. I'm sure pfree()s
don't hurt more than resetting memory context for every block_id.
Any comments?
I think we need to output block data length (blk->data_len) similar to
fpilen to save users from figuring out how to get the length of a
bytea column. This will also keep block data in sync with FPI info.
[1]: needs_backup = (page_lsn <= RedoRecPtr);
needs_backup = (page_lsn <= RedoRecPtr);
(gdb) p page_lsn
$2 = 21581544
(gdb) p RedoRecPtr
$3 = 21484808
(gdb) p needs_backup
$4 = false
(gdb)
(gdb) bt
#0 XLogRecordAssemble (rmid=10 '\n', info=64 '@',
RedoRecPtr=21484808, doPageWrites=true, fpw_lsn=0x7ffde118d640,
num_fpi=0x7ffde118d634, topxid_included=0x7ffde118d633) at xloginsert.c:582
#1 0x00005598cd9c3ef7 in XLogInsert (rmid=10 '\n', info=64 '@') at
xloginsert.c:497
#2 0x00005598cd930452 in log_heap_update (reln=0x7f4a4c7cd808,
oldbuf=136, newbuf=136, oldtup=0x7ffde118d820,
newtup=0x5598d00cb098, old_key_tuple=0x0,
all_visible_cleared=false, new_all_visible_cleared=false)
at heapam.c:8473
#3 0x00005598cd92876e in heap_update (relation=0x7f4a4c7cd808,
otid=0x7ffde118dab2, newtup=0x5598d00cb098, cid=0,
crosscheck=0x0, wait=true, tmfd=0x7ffde118db60,
lockmode=0x7ffde118da74) at heapam.c:3741
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Mar 08, 2023 at 04:01:56PM +0530, Bharath Rupireddy wrote:
I understand that performance is critical here but we need to ensure
memory is used wisely. Therefore, I'd still vote to free at least the
major contributors here, that is, pfree(raw_data);, pfree(raw_page);
and pfree(flags); right after they are done using. I'm sure pfree()s
don't hurt more than resetting memory context for every block_id.
Okay by me to have intermediate pfrees between each block scanned if
you feel strongly about it.
I think we need to output block data length (blk->data_len) similar to
fpilen to save users from figuring out how to get the length of a
bytea column. This will also keep block data in sync with FPI info.
length() works fine on bytea, so it can be used on the block data.
fpilen is a very different matter as it would be the length of a page
without a hole, or just something compressed.
--
Michael
On Wed, Mar 8, 2023 at 4:23 PM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Mar 08, 2023 at 04:01:56PM +0530, Bharath Rupireddy wrote:
I understand that performance is critical here but we need to ensure
memory is used wisely. Therefore, I'd still vote to free at least the
major contributors here, that is, pfree(raw_data);, pfree(raw_page);
and pfree(flags); right after they are done using. I'm sure pfree()s
don't hurt more than resetting memory context for every block_id.Okay by me to have intermediate pfrees between each block scanned if
you feel strongly about it.
Thanks. Attached v4 with that change.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v4-0001-Rework-pg_walinspect-to-retrieve-more-block-infor.patchapplication/x-patch; name=v4-0001-Rework-pg_walinspect-to-retrieve-more-block-infor.patchDownload+196-80
At Wed, 8 Mar 2023 20:18:06 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
On Wed, Mar 8, 2023 at 4:23 PM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Mar 08, 2023 at 04:01:56PM +0530, Bharath Rupireddy wrote:
I understand that performance is critical here but we need to ensure
memory is used wisely. Therefore, I'd still vote to free at least the
major contributors here, that is, pfree(raw_data);, pfree(raw_page);
and pfree(flags); right after they are done using. I'm sure pfree()s
don't hurt more than resetting memory context for every block_id.Okay by me to have intermediate pfrees between each block scanned if
you feel strongly about it.Thanks. Attached v4 with that change.
Although I'm not strongly opposed to pfreeing them, I'm not sure I
like the way the patch frees them. The life times of all of raw_data,
raw_page and flags are within a block. They can be freed
unconditionally after they are actually used and the scope of the
pointer variables can be properly narowed.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Mar 09, 2023 at 09:46:12AM +0900, Kyotaro Horiguchi wrote:
Although I'm not strongly opposed to pfreeing them, I'm not sure I
like the way the patch frees them. The life times of all of raw_data,
raw_page and flags are within a block. They can be freed
unconditionally after they are actually used and the scope of the
pointer variables can be properly narowed.
The thing is that you cannot keep them inside each individual blocks
because they have to be freed once their values are stored in the
tuplestore, which is why I guess Bharath has done things this way.
After sleeping on that, I tend to prefer the simplicity of v3 where we
keep track of the block and fpi data in each of their respective
blocks. It means that we lose track of them each time we go to a
different block, but the memory context reset done after each record
means that scanning through a large WAL history will not cause a leak
across the function call.
The worst scenario with v3 is a record that makes use of all the 32
blocks with a hell lot of block data in each one of them, which is
possible in theory, but very unlikely in practice except if someone
uses a custom RGMR to generate crazily-shaped WAL records. I am aware
of the fact that it is possible to generate such records if you are
really willing to do so, aka this thread:
/messages/by-id/CAEze2WgGiw+LZt+vHf8tWqB_6VxeLsMeoAuod0N=ij1q17n5pw@mail.gmail.com
In short, my choice would still be simplicity here with v3, I guess.
--
Michael
At Thu, 9 Mar 2023 10:15:39 +0900, Michael Paquier <michael@paquier.xyz> wrote in
On Thu, Mar 09, 2023 at 09:46:12AM +0900, Kyotaro Horiguchi wrote:
Although I'm not strongly opposed to pfreeing them, I'm not sure I
like the way the patch frees them. The life times of all of raw_data,
raw_page and flags are within a block. They can be freed
unconditionally after they are actually used and the scope of the
pointer variables can be properly narowed.The thing is that you cannot keep them inside each individual blocks
because they have to be freed once their values are stored in the
tuplestore, which is why I guess Bharath has done things this way.
Ugh.. Right.
After sleeping on that, I tend to prefer the simplicity of v3 where we
keep track of the block and fpi data in each of their respective
blocks. It means that we lose track of them each time we go to a
different block, but the memory context reset done after each record
means that scanning through a large WAL history will not cause a leak
across the function call.The worst scenario with v3 is a record that makes use of all the 32
blocks with a hell lot of block data in each one of them, which is
possible in theory, but very unlikely in practice except if someone
uses a custom RGMR to generate crazily-shaped WAL records. I am aware
of the fact that it is possible to generate such records if you are
really willing to do so, aka this thread:
/messages/by-id/CAEze2WgGiw+LZt+vHf8tWqB_6VxeLsMeoAuod0N=ij1q17n5pw@mail.gmail.com
I agree to the view that that "leakage" for at-most 32 blocks and
typically 0 to 2 blcoks won't be a matter.
In short, my choice would still be simplicity here with v3, I guess.
FWIW, I slightly prefer v3 for the reason I mentioned above.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center