More speedups for tuple deformation
Around this time last year I worked on a series of patches for v18 to
speed up tuple deformation. That involved about 5 separate patches,
the main 3 of which were 5983a4cff (CompactAttribute), db448ce5a
(faster offset aligning), and 58a359e58 (inline various deforming
loops). The last of those 3 changed slot_deform_heap_tuple() to add
dedicated deforming loops for !slow mode and for tuples that don't
have the HEAP_HASNULL bit set.
When I was working on that, I wondered if it might be better to
precalculate the attcacheoff rather than doing it in the deforming
loop. I've finally written some code to do this, and I'm now ready to
share some results.
0001:
This introduces a function named TupleDescFinalize(), which must be
called after a TupleDesc has been created or changed. This function
pre-calculates the attcacheoff for all fixed-width attributes and
records the attnum of the first attribute without a cached offset (the
first varlena or cstring attribute). This allows the code in the
deforming loops that was setting CompactAttribute's attcacheoff to be
removed, and allows a dedicated loop to process all attributes with an
attcacheoff before falling through to the loop that handles
non-attcacheoff attributes, which has to calculate the offset and
alignment manually. If the tuple has a NULL value before the last
attribute with a cached offset, then we can only use the attcacheoff
up to the first NULL attribute.
The expectation here is that knowing the offset beforehand is faster
than calculating it each time. Calculating the offset requires first
aligning the current offset according to the attribute's attalign
value, then once we've called fetch_att() to get the Datum value, we
need to add the length of the attribute to skip forward to the next
attribute. There's not much opportunity for instruction-level
parallelism there due to the dependency on the previous calculation.
The primary optimisation in 0001 is that it adds a dedicated tight
loop to deform as many attributes with a cache offset as possible
before breaking out that loop to deform any remaining attributes
without using any cached offset.
0002:
After thinking about 0001 for a while, I wondered if we could do
better than resorting to checking att_isnull() for every attribute
after we find the first NULL. What if the tuple has a NULL quite
early on, but no NULLs after that? It would be good to look ahead in
the tuple's NULL bitmap to identify exactly if and when the next NULL
attribute occurs and loop without checking att_isnull() until that
attribute.
Effectively, what I came up with was something like:
for (;;)
{
    for (; attnum < nextNullAttr; attnum++)
    {
        // do fetch_att() without checking for NULLs
    }
    if (attnum >= natts)
        break;
    for (; attnum < nextNullSeqEnd; attnum++)
        isnull[attnum] = true;
    next_null_until(bp, attnum, natts, &nextNullAttr, &nextNullSeqEnd);
}
The next_null_until() function looks at the NULL bitmap starting at
attnum and sets nextNullAttr to the next NULL attribute and
nextNullSeqEnd to the first non-NULL attribute after it. If there are
no more NULLs, then nextNullAttr is set to natts, which allows the
outer loop to complete.
Test #5 seems to do well with this code, but I wasn't impressed with
most of the other results. I'd have expected test #3 to improve with
this change, but it didn't.
0003:
In 0002 I added a dedicated loop that handles tuples without
HEAP_HASNULL. To see if it would make the performance any better I
made 0003, which gets rid of that dedicated loop. I hoped that
shrinking the code down might help performance. It didn't quite have
the effect I hoped for.
In each version, I experimented with having a dedicated deforming loop
which can only handle attbyval == true columns. If we know there are
no byref attributes, then fetch_att() can be inlined without the
branch that handles pointer types. That removes some branching
overhead and makes for a tighter loop with fewer instructions. When
this optimisation doesn't apply, there's a bit of extra overhead of
having to check for "attnum < firstByRefAttr".
Benchmarking:
To get an idea if doing this is a win performance-wise, I designed a
benchmark with various numbers of columns and various combinations of
fixed vs varlena types along with NULLs and no NULLs. There are 8
tests in total. For each of those 8 tests, I ran it with between 0
and 40 extra INT NOT NULL columns.
The tests are:
1. first col int not null, last col int not null
2. first col text not null, last col int not null
3. first col int null, last col int not null
4. first col text null, last col int not null
5. first col int not null, last col int null
6. first col text not null, last col int null
7. first col int null, last col int null
8. first col text null, last col int null
So, for example #1 would look like:
CREATE TABLE t1 (
    c INT NOT NULL DEFAULT 0,
    <extra 0-40 columns here>
    a INT NOT NULL,
    b INT NOT NULL DEFAULT 0
);
and #8 would be:
CREATE TABLE t1 (
    c TEXT DEFAULT NULL,
    <extra 0-40 columns here>
    a INT NOT NULL,
    b INT DEFAULT NULL
);
For each of the 8 tests, I ran with 0, 10, 20, 30 and 40 extra
columns, so 40 tests in total (8 tests * 5 for each variation of extra
columns).
Another benefit of 0001, besides using the fixed attcacheoff, is that
since we know where the first NULL attribute is, we can keep deforming
without calling att_isnull() until we get to the first NULL attribute.
Currently in master, if the tuple has the HEAP_HASNULL bit set, then
the deforming code will call att_isnull() for every attribute in the
tuple. Test #5 should highlight this (you may notice the orange bar in
the attached graphs is commonly the test with the biggest speedup).
Now, not every query is bottlenecked on tuple deforming, so to try to
maximise the amount of tuple deforming that occurs relative to other
work, the query I ran was: SELECT sum(a) FROM t1; since the "a"
column is almost last, all prior attributes need to be deformed before
"a" can be.
I've tried to make the benchmark represent a large variety of
scenarios to see if there are any performance regressions. I've
benchmarked each patch with and without OPTIMIZE_BYVAL defined (the
additional byval-only attribute deformer loop). I tried with gcc and
with clang on my Zen 2 machine and also an Apple M2. Please see the
attached graphs which show the results of the SUM(a) query on a table
with 1 million rows.
Analysing the results, it's not really that clear which patch is best.
Which version works fastest seems to depend on the hardware. The AMD
Zen 2 machine with gcc does really well with 0001+OPTIMIZE_BYVAL as it
comes out an average of 21% faster and some tests are more than 44%
faster than master, and there are no performance regressions. With
clang on the same Zen2 machine the performance isn't the same. There
are a few regressions with the 0 extra column tests. On the Apple M2
tests #1 and #5 improve massively. The other tests don't improve
nearly as much and with certain patches a few regress slightly.
Please see the attached gifs which show 6 graphs each. Top is the
results of 0001, the middle row is 0001+0002 and the bottom row
0001+0002+0003. The left column is without OPTIMIZE_BYVAL and the
right column is with. The percentage shown is the query time speedup
the patched version gives over master.
Things still to do:
* More benchmarking is needed. I've not yet completed the benchmarks
on my Zen4 machine. No Intel hardware has been tested at all. I don't
really have any good Intel hardware to test with. Maybe someone else
would like to help? Script is attached.
* I've not looked at the JIT deforming code. At the moment the code
won't even compile with LLVM enabled because I've removed the
TTS_FLAG_SLOW flag. It's possible I'll have to adjust the JIT
deforming code or consider keeping TTS_FLAG_SLOW.
I'll add this patch to the January commitfest.
David
Attachments:
v1-0001-Precalculate-CompactAttribute-s-attcacheoff.patch
amd3990x_clang_results.gif
apple_m2_results.gif
amd3990x_gcc_results.gif
v1-0002-Experimental-code-for-better-NULL-handing-in-tupl.patch
v1-0003-Experiment-without-a-dedicated-hasnulls-loop.patch
deform_test.sh.txt
On Sun, 28 Dec 2025 at 22:04, David Rowley <dgrowleyml@gmail.com> wrote:
Things still to do:
* More benchmarking is needed. I've not yet completed the benchmarks
on my Zen4 machine. No Intel hardware has been tested at all. I don't
really have any good Intel hardware to test with. Maybe someone else
would like to help? Script is attached.
Please find attached an updated set of patches. A rebase was needed,
plus 0003 had a problem with an Assert not handling the bitmap being a
NULL pointer.
I've done some more performance tests after upgrading my Zen2 machine
to use newer versions of gcc and clang. I've also tested on an Intel
machine now. All the results are attached in a spreadsheet form in the
bzip file. There's also a pg_dump of the results and
analysis_schema.sql, which has an SQL function to extract the data in
a form that's compatible with the spreadsheet's format.
I'd say things are looking generally good for 0001 without the
OPTIMIZE_BYVAL stuff, but the results I got from clang on the
AMD7945hx don't look good at all. I'll run the tests on that again
tonight. The machine is a laptop and I did run the benchmarks on
master first to establish the baseline. I want to ensure there's no
thermal throttling going on. Aside from clang on the 7945hx, there are
a few cases where there's a slight regression in the 0 extra column
tests when a NULL is present. I wonder how much we should care about
this, as 1) the regression is small; and 2) IMO, there's less chance
of there being a NULL in a table with very few columns (in this case,
the table has 3 columns).
The "AMD3990x clang 20.1.8" results in the spreadsheet also look
strange for 0001. It looks good up to 20 columns, then the performance
trend breaks for 30 and 40 columns. I don't have an explanation for
this yet.
I've also attached an updated script to run the tests and output the
results in csv format so that it can be easily imported into Postgres
for analysis or processing.
* I've not looked at the JIT deforming code. At the moment the code
won't even compile with LLVM enabled because I've removed the
TTS_FLAG_SLOW flag. It's possible I'll have to adjust the JIT
deforming code or consider keeping TTS_FLAG_SLOW.
This part turned out to be easy. The JIT deformer does not pay
attention to the TTS_FLAG_SLOW flag, it just unconditionally turns it
on to force the non-jit deformer into using slow mode. I've deleted
the code that was setting it since slow mode no longer exists in the
patched code.
David
Attachments:
deform_test.sh.txt
bench_results.tar.bz2
v2-0001-Precalculate-CompactAttribute-s-attcacheoff.patch
v2-0002-Experimental-code-for-better-NULL-handing-in-tupl.patch
v2-0003-Experiment-without-a-dedicated-hasnulls-loop.patch
On Fri, 2 Jan 2026 at 18:58, David Rowley <dgrowleyml@gmail.com> wrote:
Please find attached an updated set of patches. A rebase was needed,
plus 0003 had a problem with an Assert not handling the bitmap being a
NULL pointer.
Another rebase and updates to some newly created missing calls to
TupleDescFinalize().
I've also attached another round of benchmarks after dipping into some
Azure machines to cover my lack of any Intel benchmark results. I
think these are somewhat noisy as I opted for low core-count instances
which will have L3 shared with workloads running for other people.
This is most evident in Xeon_E5-2673 with gcc where the patched run
was nearly twice as fast as unpatched for test 2 on 20 extra columns.
If you look at the raw results from that, you can see the times are
quite unstable between the 3 runs of each test, which makes me believe
that the machine was busy with other work when that test ran on
master. The AMD3990x and M2 machines are all sitting next to me and
were otherwise idle, so they should be much more stable.
Quite a few machines have a small regression for the 0 extra column
tests. There is a small amount of extra work being done in the
deforming function to check if the attnum < the first attribute
without an attcacheoff. This mostly only affects the tests that don't
do any deforming with a cached attcacheoff, e.g. due to NULLs or
varlena types. The only way I've thought of to possibly reduce that
is to invent a new TupleTableSlotOps and pick the one that applies
when creating the TupleTableSlot. This doesn't appeal to me very much
as it requires modifying many callsites. But I do wonder if we should
try to come up with something here as technically we could use this to
eliminate alignment padding out of some MinimalTuples in some cases
where these were not directly derived from pre-formed HeapTuples. That
could allow a more compact tuple representation for sorting and
hashing, allowing us to do more with less memory in some cases.
The benchmark results also indicated that there wasn't much advantage
to the 0002+0003 patches, so I've removed those from the set. That
reduces some complexity around the benchmarks. I did still keep the
OPTIMIZE_BYVAL loop as separate results. It's not quite clear what's
best there as machines seem to vary on which they prefer.
Benchmark results attached in the bz2 file both in spreadsheet form
and the raw results pg_dumped.
David
On Jan 19, 2026, at 06:13, David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 2 Jan 2026 at 18:58, David Rowley <dgrowleyml@gmail.com> wrote:
Hi David,
I reviewed the patch and traced some basic workflows, but I haven't done a load test to compare performance differences with and without this patch; I will do that if I get some bandwidth later. Here are some review comments:
1 - tupmacs.h
```
+ /* Create a mask with all bits beyond natts's bit set to off */
+ mask = 0xFF & ((((uint8) 1) << (natts & 7)) - 1);
+ byte = (~bits[lastByte]) & mask;
```
When I read the code, I got the impression that bits[lastByte] might overflow when natts % 8 == 0, so I traced the code and realized that this function is only called when a row has null values, so when reaching here, natts % 8 != 0; otherwise it would return earlier within the for loop.
So, to spare future readers the same confusion, can we add a brief comment explaining that no overflow should happen here?
2 - After this patch, nocachegetattr() and nocache_index_getattr() strictly rely on tupleDesc->firstNonCachedOffAttr to work:
```
if (tupleDesc->firstNonCachedOffAttr >= 0)
{
    startAttr = Min(tupleDesc->firstNonCachedOffAttr - 1, firstnullattr);
    off = TupleDescCompactAttr(tupleDesc, startAttr)->attcacheoff;
}
else
{
    startAttr = 0;
    off = 0;
}
```
And tupleDesc->firstNonCachedOffAttr is only set by TupleDescFinalize(). So, suppose some code fails to call TupleDescFinalize(); looking at how a TupleDesc is created, for example CreateTemplateTupleDesc():
```
desc = (TupleDesc) palloc(offsetof(struct TupleDescData, compact_attrs) +
                          natts * sizeof(CompactAttribute) +
                          natts * sizeof(FormData_pg_attribute));

/*
 * Initialize other fields of the tupdesc.
 */
desc->natts = natts;
desc->constr = NULL;
desc->tdtypeid = RECORDOID;
desc->tdtypmod = -1;
desc->tdrefcount = -1;      /* assume not reference-counted */

return desc;
```
It's palloc and not palloc0, so desc->firstNonCachedOffAttr will initially hold a random value. If TupleDescFinalize() is ever missed, that's a bug.
From this perspective, I think we can set firstNonCachedOffAttr to -2 when in CreateTemplateTupleDesc() as well as other functions that create a TupleDesc. Then in nocachegetattr() and nocache_index_getattr(), we can Assert(desc->firstNonCachedOffAttr > -2).
3
```
+ firstNonCachedOffAttr = i + 1;
```
In TupleDescFinalize(), given firstNonCachedOffAttr = i + 1, firstNonCachedOffAttr will never be 0.
But in nocachegetattr(), it checks firstNonCachedOffAttr >= 0:
```
if (tupleDesc->firstNonCachedOffAttr >= 0)
{
    startAttr = Min(tupleDesc->firstNonCachedOffAttr - 1, firstnullattr);
    off = TupleDescCompactAttr(tupleDesc, startAttr)->attcacheoff;
}
```
This is kind of inconsistent, and may lead to some confusion for code readers.
From the meaning of the variable name "firstNonCachedOffAttr", when there is no cached attribute, firstNonCachedOffAttr feels better as 0 rather than -1. From this perspective, TupleDescFinalize() can initialize desc->firstNonCachedOffAttr to 0. And for my comment 2, we can use -1 instead of -2, so that -1 indicates TupleDescFinalize() was not called, 0 means no cached attribute, and >0 means some cached attributes.
4
```
+ * possibily could cache the first attlen == -2 attr. Worthwhile?
```
Typo: possibily -> possibly
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Mon, 19 Jan 2026 at 18:48, Chao Li <li.evan.chao@gmail.com> wrote:
I reviewed the patch and traced some basic workflows. But I haven’t done a load test to compare performance differences with and without this patch, I will do that if I get some bandwidth later. Here comes some review comments:
1 - tupmacs.h
```
+ /* Create a mask with all bits beyond natts's bit set to off */
+ mask = 0xFF & ((((uint8) 1) << (natts & 7)) - 1);
+ byte = (~bits[lastByte]) & mask;
```
When I read the code, I got the impression that bits[lastByte] might overflow when natts % 8 == 0, so I traced the code and realized that this function is only called when a row has null values, so when reaching here, natts % 8 != 0; otherwise it would return earlier within the for loop.
It certainly is possible to get to that part of the code when natts is
a multiple of 8 and the tuple contains NULLs after that (we may not be
deforming the entire tuple). The code you quoted that's setting "mask"
in that case will produce a zero mask, resulting in not finding any
NULLs. I don't quite see any risk of overflowing any of the types
here. If natts is 16 then effectively the code does 0xFF & ((1 << 0)
- 1); so no overflow. Just left shift by 0 bits and bitwise AND with
zero, resulting in the mask becoming zero.
How about if I write the comment as follows?
/*
* Create a mask with all bits beyond natts's bit set to off. The code
* below will generate a zero mask when natts & 7 == 0. When that happens
* all bytes that need to be checked were done so in the loop above. The
* code below will create an empty mask and end up returning natts. This
* has been done to avoid having to write a special case to check if we've
* covered all bytes already.
*/
In TupleDescFinalize(), given firstNonCachedOffAttr = i + 1, firstNonCachedOffAttr will never be 0.
But in nocachegetattr(), it checks firstNonCachedOffAttr >= 0:
This is kinda inconsistent, and may potentially lead to some confusion to code readers.
I've changed things around so that firstNonCachedOffAttr == 0 means
there are no attributes with cached offsets. -1 becomes uninitialised.
I've added Asserts to check for firstNonCachedOffAttr >= 0 with a
comment directing anyone who's facing debugging one of those failing
to add the missing call to TupleDescFinalize().
Thanks for reviewing.
I've attached the v4 patch, which also fixes the LLVM compiler warning
that I introduced.
David
Attachments:
v4-0001-Precalculate-CompactAttribute-s-attcacheoff.patch
On Jan 20, 2026, at 08:11, David Rowley <dgrowleyml@gmail.com> wrote:
I'm sorry I didn't express myself clearly; maybe I should have used "OOB" rather than "overflow". My real concern is an out-of-bounds read of bits[lastByte] when natts & 7 == 0.
Say natts is 16, then bits is 2 bytes long; lastByte = 16 >> 3 = 2, so bits[2] is an OOB read.
If first_null_attr() is only called when hasnulls == true, then it will never hit the OOB point, because it will return early from the "for" loop. In the current patch that is true, so the OOB should never happen.
However, I don't see any comment mentioning something like "first_null_attr() should only be called when hasnulls is true". If in future someone calls first_null_attr() in a situation where hasnulls == false, then the OOB will be triggered.
The comment you added explains that even if the OOB happens, no matter what value is held by bits[lastByte], the final result is still correct because mask is 0. That is true, but the OOB is still a concern. If the bits array happens to end exactly at the edge of a memory page, the OOB read of bits[lastByte] may trigger a segmentation fault, and valgrind may detect the OOB and complain about it.
So, my original comment was that we should at least add something to the header comment to mention "first_null_attr() should only be called when hasnulls is true". If we can add an Assert to ensure hasnulls is true, that would be even better.
But if we want first_null_attr() to be safe whether hasnulls is true or false, I think we should avoid the OOB.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Jan 20, 2026, at 12:32, Chao Li <li.evan.chao@gmail.com> wrote:
I also noticed that, when running an arbitrary SQL statement, first_null_attr() might be called with natts = 0, so maybe it could have a fast path that returns 0 directly when natts == 0.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
Hi,
On 2026-01-20 13:11:55 +1300, David Rowley wrote:
I've attached the v4 patch, which also fixes the LLVM compiler warning
that I introduced.
I wonder if it's possible to split the patch - it's big enough to be
nontrivial to review... Perhaps the finalization could be introduced
separately from the patch actually making use of it?
I wonder if we should somehow change the API of tupledesc creation, to make
old code that doesn't have TupleDescFinalize() fail to compile, instead of
just warn...
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index dcba3fb5473..2fdf5a341f6 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -174,6 +174,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
 	TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
 					   INT4OID, -1, 0);
+	TupleDescFinalize(tupledesc);

 	fctx->tupdesc = BlessTupleDesc(tupledesc);
Think it'd be worth adding an assertion to BlessTupleDesc that
TupleDescFinalize has been called, I think that'll lead to easier to
understand backtraces in a lot of cases. Particularly if you consider cases
where BlessTupleDesc() will create a tupdesc in shared memory, that could then
trigger an assertion failure in a parallel worker or such.
 /*
  * slot_deform_heap_tuple
  *		Given a TupleTableSlot, extract data from the slot's physical tuple
@@ -1122,78 +1010,167 @@
 static pg_attribute_always_inline void
 slot_deform_heap_tuple(TupleTableSlot *slot, HeapTuple tuple, uint32 *offp,
 					   int natts)
 {
+	CompactAttribute *cattr;
+	TupleDesc	tupleDesc = slot->tts_tupleDescriptor;
 	bool		hasnulls = HeapTupleHasNulls(tuple);
+	HeapTupleHeader tup = tuple->t_data;
+	bits8	   *bp;				/* ptr to null bitmap in tuple */
 	int			attnum;
+	int			firstNonCacheOffsetAttr;
+
+#ifdef OPTIMIZE_BYVAL
+	int			firstByRefAttr;
+#endif
+	int			firstNullAttr;
+	Datum	   *values;
+	bool	   *isnull;
+	char	   *tp;				/* ptr to tuple data */
 	uint32		off;			/* offset in tuple data */
-	bool		slow;			/* can we use/set attcacheoff? */
+
+	/* Did someone forget to call TupleDescFinalize()? */
+	Assert(tupleDesc->firstNonCachedOffAttr >= 0);

 	/* We can only fetch as many attributes as the tuple has. */
-	natts = Min(HeapTupleHeaderGetNatts(tuple->t_data), natts);
+	natts = Min(HeapTupleHeaderGetNatts(tup), natts);
+	attnum = slot->tts_nvalid;
+	firstNonCacheOffsetAttr = Min(tupleDesc->firstNonCachedOffAttr, natts);
+
+	if (hasnulls)
+	{
+		bp = tup->t_bits;
+		firstNullAttr = first_null_attr(bp, natts);
+		firstNonCacheOffsetAttr = Min(firstNonCacheOffsetAttr, firstNullAttr);
+	}
+	else
+	{
+		bp = NULL;
+		firstNullAttr = natts;
+	}
+
+#ifdef OPTIMIZE_BYVAL
+	firstByRefAttr = Min(firstNonCacheOffsetAttr, tupleDesc->firstByRefAttr);
+#endif
+	values = slot->tts_values;
+	isnull = slot->tts_isnull;
+	tp = (char *) tup + tup->t_hoff;
+
+#ifdef OPTIMIZE_BYVAL

 	/*
-	 * Check whether the first call for this tuple, and initialize or restore
-	 * loop state.
+	 * Many tuples have leading byval attributes, try and process as many of
+	 * those as possible with a special loop that can't handle byref types.
 	 */
-	attnum = slot->tts_nvalid;
-	if (attnum == 0)
+	if (attnum < firstByRefAttr)
+	{
+		/* Use do/while as we already know we need to loop at least once. */
+		do
+		{
+			cattr = TupleDescCompactAttr(tupleDesc, attnum);
+
+			Assert(cattr->attcacheoff >= 0);
+
+			/*
+			 * Hard code byval == true to allow the compiler to remove the
+			 * byval check when inlining fetch_att().
+			 */
Maybe add an assert for cattr->attbyval? Just to avoid a bad debugging
experience if somebody tries to extend this logic to
e.g. non-null-fixed-width-byref columns?
I also wonder if we could have assert-only crosschecking of the "real" offsets
against the cached ones?
Greetings,
Andres Freund
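The assert-only crosschecking suggested above could look roughly like the following sketch. Everything here is a simplified stand-in invented for illustration: MiniAttr is not PostgreSQL's CompactAttribute, align_up() is a toy TYPEALIGN(), and check_cached_offsets() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for CompactAttribute. */
typedef struct
{
    int attcacheoff;            /* pre-computed offset, or -1 */
    int attlen;                 /* fixed width in bytes */
    int attalignby;             /* required alignment (power of two) */
} MiniAttr;

/* Round "off" up to a power-of-two boundary, like TYPEALIGN(). */
static size_t
align_up(size_t off, size_t alignby)
{
    return (off + alignby - 1) & ~(alignby - 1);
}

/*
 * Assert-only crosscheck: recompute each fixed-width offset the slow way
 * and verify it matches the pre-calculated attcacheoff.  Stops at the
 * first attribute without a cached offset.
 */
static void
check_cached_offsets(const MiniAttr *attrs, int natts)
{
    size_t off = 0;

    for (int i = 0; i < natts; i++)
    {
        if (attrs[i].attcacheoff < 0)
            break;              /* first non-cacheable attribute */
        off = align_up(off, (size_t) attrs[i].attalignby);
        assert((size_t) attrs[i].attcacheoff == off);
        off += (size_t) attrs[i].attlen;
    }
}
```

In an assert-enabled build this could be run from the finalize step itself, making a mismatch fail close to where the cache was populated rather than deep inside deforming.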
On Wed, 21 Jan 2026 at 07:38, Andres Freund <andres@anarazel.de> wrote:
I wonder if it's possible to split the patch - it's big enough to be
nontrivial to review... Perhaps the finalization could be introduced
separately from the patch actually making use of it?
Seems reasonable. I've done that in the attached 0001, which contains
a dummy macro for TupleDescFinalize() and all the required calls to
it.
I wonder if we should somehow change the API of tupledesc creation, to make
old code that doesn't have TupleDescFinalize() fail to compile, instead of
just warn...
I don't have any ideas on how to do that. I could maybe imagine some
preprocessor magic if we always expected a CreateTupleDesc() and
TupleDescFinalize() in the same function, but the TupleDescFinalize()
may be required after any modification to the TupleDesc that could
invalidate the processing that's done within that function.
Think it'd be worth adding an assertion to BlessTupleDesc that
TupleDescFinalize has been called, I think that'll lead to easier to
understand backtraces in a lot of cases. Particularly if you consider cases
where BlessTupleDesc() will create a tupdesc in shared memory, that could then
trigger an assertion failure in a parallel worker or such.
Modified.
Maybe add an assert for cattr->attbyval? Just to avoid a bad debugging
experience if somebody tries to extend this logic to
e.g. non-null-fixed-width-byref columns?
I ended up removing the OPTIMIZE_BYVAL code in the attached. Over all
the machines I tested on, with the benchmark results I previously
shared, it seemed to cause a slowdown rather than a speedup. Perhaps
it can be refined and tried again later, but I've removed it for now
to reduce complexity.
I also wonder if we could have assert-only crosschecking of the "real" offsets
against the cached ones?
I've modified the code to do that. v5 patches attached.
Thanks for reviewing.
David
Attachments:
v5-0001-Add-empty-TupleDescFinalize-function.patch
v5-0002-Precalculate-CompactAttribute-s-attcacheoff.patch
Hi,
I haven't yet looked at the new version of the patch, but I ran your benchmark
from upthread (fwiw, I removed the sleep 10 to reduce runtimes, the results
seem stable enough anyway) on two intel machines, as you mentioned that you
saw a lot of variation in Azure.
For both I disabled turbo boost, cpu idling and pinned the backend to a single
CPU core.
There's a bit of noise on "awork3" (basically an editor and an idle browser
window), but everything is pinned to the other socket. "awork4" is entirely
idle.
Looks like overall the results are quite impressive! Some of the extra_cols=0
runs on Sapphire Rapids are a bit slower, but the losses are much smaller than
the gains in other cases.
I think it'd be good to add a few test cases of "incremental deforming" to the
benchmark. E.g. a qual that accesses column 10, but projection then deforms up
to 20. I'm a bit worried that e.g. the repeated first_null_attr()
computations could cause regressions.
Greetings,
Andres Freund
Attachments:
deform_bench.csv
On Jan 23, 2026, at 09:18, Andres Freund <andres@anarazel.de> wrote:
Hi,
I haven't yet looked at the new version of the patch, but I ran your benchmark
from upthread (fwiw, I removed the sleep 10 to reduce runtimes, the results
seem stable enough anyway) on two intel machines, as you mentioned that you
saw a lot of variation in Azure.
For both I disabled turbo boost, cpu idling and pinned the backend to a single
CPU core.
There's a bit of noise on "awork3" (basically an editor and an idle browser
window), but everything is pinned to the other socket. "awork4" is entirely
idle.
Looks like overall the results are quite impressive! Some of the extra_cols=0
runs on Sapphire Rapids are a bit slower, but the losses are much smaller than
the gains in other cases.
I think it'd be good to add a few test cases of "incremental deforming" to the
benchmark. E.g. a qual that accesses column 10, but projection then deforms up
to 20. I'm a bit worried that e.g. the repeated first_null_attr()
computations could cause regressions.
Greetings,
Andres Freund
<deform_bench.csv>
Today I ran the benchmark on my MacBook M4 against 3 versions (all without assert and with -O2):
1) Master (f9a468c664a)
2) Master + v4
3) Master + v4 + My tweak (first_null_attr immediately returns 0 when natts == 0)
Overall, v4 shows significant improvements across most configuration combinations. In the best case, v4 is about 43% faster than master.
The tweak version is only slightly faster than v4. In the best case, the tweak achieves an additional ~3.5% improvement over v4.
Note that the MacBook is my working laptop. I didn’t actively work on it while the tests were running, but it was still not fully idle, as some other applications (Email, VScode, etc.) were running in the background. That said, I suppose this is still fair for the three rounds of test runs.
See the attached Excel sheet for details.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
Attachments:
pgbench_comparison_chao_li_mac_m4.xlsx
Hi,
On 2026-01-22 20:18:21 -0500, Andres Freund wrote:
I haven't yet looked at the new version of the patch, but I ran your benchmark
from upthread (fwiw, I removed the sleep 10 to reduce runtimes, the results
seem stable enough anyway) on two intel machines, as you mentioned that you
saw a lot of variation in Azure.
For both I disabled turbo boost, cpu idling and pinned the backend to a single
CPU core.
There's a bit of noise on "awork3" (basically an editor and an idle browser
window), but everything is pinned to the other socket. "awork4" is entirely
idle.
Looks like overall the results are quite impressive! Some of the extra_cols=0
runs on Sapphire Rapids are a bit slower, but the losses are much smaller than
the gains in other cases.
I think it'd be good to add a few test cases of "incremental deforming" to the
benchmark. E.g. a qual that accesses column 10, but projection then deforms up
to 20. I'm a bit worried that e.g. the repeated first_null_attr()
computations could cause regressions.
The overhead of the aggregation etc makes it harder to see efficiency changes
in deformation speed:
I think it'd be worth replacing the SUM(a) with WHERE a < 0 (filtering all
rows), to reduce the cost of the executor dispatch.
Here's a profile of the SUM(a):
- 99.90% 0.00% postgres postgres [.] standard_ExecutorRun
- standard_ExecutorRun
- 96.83% ExecAgg
- 49.86% ExecInterpExpr
- 28.30% slot_getsomeattrs_int
tts_buffer_heap_getsomeattrs
0.67% tts_buffer_heap_getsomeattrs
+ 0.02% asm_sysvec_apic_timer_interrupt
- 37.44% fetch_input_tuple
- 31.42% ExecSeqScan
+ 20.58% heap_getnextslot
3.58% MemoryContextReset
0.52% heapgettup_pagemode
0.32% ExecStoreBufferHeapTuple
0.99% heap_getnextslot
0.79% MemoryContextReset
2.81% int4_sum
1.39% MemoryContextReset
Which takes ~93ms on average for the first generated bench.sql
- 99.88% 0.00% postgres postgres [.] standard_ExecutorRun
- standard_ExecutorRun
- 95.78% ExecSeqScanWithQual
- 57.65% ExecInterpExpr
- 29.08% slot_getsomeattrs_int
tts_buffer_heap_getsomeattrs
0.49% tts_buffer_heap_getsomeattrs
- 25.40% heap_getnextslot
+ 15.00% heapgettup_pagemode
+ 4.71% ExecStoreBufferHeapTuple
0.05% UnlockBuffer
1.80% MemoryContextReset
0.77% int4lt
0.52% heapgettup_pagemode
0.47% ExecStoreBufferHeapTuple
0.37% slot_getsomeattrs_int
2.11% heap_getnextslot
1.49% ExecInterpExpr
0.50% MemoryContextReset
Same data, but with a WHERE a < 0, takes on average ~74ms.
I wonder if it's worth writing a C helper to test deformation in a bit more
targeted way.
Looking at the profile of ExecSeqScanWithQual() made me a bit sad, turns out
that some of the generated code isn't great :(. I'll start a separate thread
about that.
Greetings,
Andres Freund
On Sat, 24 Jan 2026 at 05:33, Andres Freund <andres@anarazel.de> wrote:
I wonder if it's worth writing a C helper to test deformation in a bit more
targeted way.
Good idea. I've written a test module called "deform_bench". You can
do: "select deform_bench('tablename'::regclass, '{10,20}');" which
will deform up to attnum=10, then in a 2nd pass deform up to
attnum=20. This is in the 0003 patch. (Requires "ninja
install-test-files"). 0003 is intended for testing, not commit.
There are also 2 scripts attached, one which sets up all the tables
for the benchmark, and one to run it. This saves creating the same
tables again when trying other branches or compilers.
I've also included a slightly revised patch. I made a small change to
the first_null_attr() to get rid of the masking of higher attnums and
also now making use of __builtin_ctz to find the first NULL attnum in
the byte. For compilers that don't support that, I've included a
pg_rightmost_*zero*_pos table. I didn't want to use the pg_bitutils
table for the rightmost *one* pos as it meant having to special-case
what happens when using index 255, as that would return 0, and I want
8. I'll make the MSVC version use _BitScanForward() in the next patch.
Using __builtin_ctz() seems to help reduce the small regression I was
seeing with the 0 extra column test. It's still there, but it is very
small. It shows up more with the deform_bench module because the other
execution overheads are reduced.
Technically, the first_null_attr() function *could* contain slightly
fewer checks. It should be guaranteed that we'll find a byte not set
to 255, as there wouldn't be a bitmask there if there were no 0s. So
technically, the first for loop could be a "while (byte[bytenum] ==
0xFF) bytenum++;". I just felt that might be too dangerous, as the
code would walk off the end of the bitmask if the tuple were
corrupted in the right way.
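To illustrate the idea being described (this is a simplified sketch, not the patch's actual code): in heap null bitmaps a 0 bit marks a NULL attribute, so finding the first NULL means finding the first zero bit. The byte scan below keeps the bounds check rather than the faster-but-riskier unbounded while loop.

```c
typedef unsigned char bits8;

/*
 * Sketch: scan the null bitmap byte-wise for a byte that is not 0xFF,
 * then use __builtin_ctz() on the inverted byte to find its rightmost
 * zero bit (the first NULL attribute).  Returns natts when no attribute
 * is NULL.  Name and exact checks are illustrative only.
 */
static int
first_null_attr_sketch(const bits8 *bp, int natts)
{
    int nbytes = (natts + 7) / 8;

    for (int byteno = 0; byteno < nbytes; byteno++)
    {
        if (bp[byteno] != 0xFF)
        {
            /* invert so the first zero bit becomes the first one bit */
            int bit = __builtin_ctz((unsigned int) (bits8) ~bp[byteno]);
            int attno = byteno * 8 + bit;

            /* padding bits in the final byte are not real attributes */
            return attno < natts ? attno : natts;
        }
    }
    return natts;
}
```

The `bp[byteno] != 0xFF` test guarantees the inverted byte is non-zero, so the undefined-for-zero case of __builtin_ctz() can't be hit.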
With the reduced overhead using deform_bench, the Apple M2 results are
looking quite good. Test 5 with 20 extra columns is 128% faster than
master and averages ~25% faster than master over all tests. My results
are in the attached spreadsheet.
David
Attachments:
deform_test_setup.sh.txt
deform_test_run.sh.txt
v6-0001-Add-empty-TupleDescFinalize-function.patch
v6-0002-Precalculate-CompactAttribute-s-attcacheoff.patch
v6-0003-Introduce-deform_bench-test-module.patch
Hi,
On 2026-01-28 02:34:26 +1300, David Rowley wrote:
On Sat, 24 Jan 2026 at 05:33, Andres Freund <andres@anarazel.de> wrote:
I wonder if it's worth writing a C helper to test deformation in a bit more
targeted way.
Good idea. I've written a test module called "deform_bench". You can
do: "select deform_bench('tablename'::regclass, '{10,20}');" which
will deform up to attnum=10, then in a 2nd pass deform up to
attnum=20. This is in the 0003 patch. (Requires "ninja
install-test-files"). 0003 is intended for testing, not commit.
Nice! I am trying very hard to restrain myself from playing with it right
now, because I really need to get some other things done first...
/*
 * slot_deform_heap_tuple
 *      Given a TupleTableSlot, extract data from the slot's physical tuple
@@ -1122,78 +1010,140 @@
static pg_attribute_always_inline void
slot_deform_heap_tuple(TupleTableSlot *slot, HeapTuple tuple, uint32 *offp,
                       int natts)
{
+   CompactAttribute *cattr;
+   TupleDesc   tupleDesc = slot->tts_tupleDescriptor;
    bool        hasnulls = HeapTupleHasNulls(tuple);
+   HeapTupleHeader tup = tuple->t_data;
+   bits8      *bp;             /* ptr to null bitmap in tuple */
    int         attnum;
+   int         firstNonCacheOffsetAttr;
+   int         firstNullAttr;
+   Datum      *values;
+   bool       *isnull;
+   char       *tp;             /* ptr to tuple data */
    uint32      off;            /* offset in tuple data */
-   bool        slow;           /* can we use/set attcacheoff? */
+
+   /* Did someone forget to call TupleDescFinalize()? */
+   Assert(tupleDesc->firstNonCachedOffAttr >= 0);

    /* We can only fetch as many attributes as the tuple has. */
-   natts = Min(HeapTupleHeaderGetNatts(tuple->t_data), natts);
+   natts = Min(HeapTupleHeaderGetNatts(tup), natts);
+   attnum = slot->tts_nvalid;
+   firstNonCacheOffsetAttr = Min(tupleDesc->firstNonCachedOffAttr, natts);
FWIW, in a few experiments on my cascade lake systems, this branch (well, it
ends up as a cmov) ends up causing a surprisingly large performance
bottleneck. I don't really see a way around that, but I thought I'd mention it.
On the topic of tupleDesc->firstNonCachedOffAttr - shouldn't that be an
AttrNumber? Not that it'll make a difference perf or space wise, just for
clarity.
Hm, I guess natts isn't an AttrNumber either. Not sure why?
+   if (hasnulls)
+   {
+       bp = tup->t_bits;
+       firstNullAttr = first_null_attr(bp, natts);
+       firstNonCacheOffsetAttr = Min(firstNonCacheOffsetAttr, firstNullAttr);
+   }
+   else
+   {
+       bp = NULL;
+       firstNullAttr = natts;
+   }
+
+   values = slot->tts_values;
+   isnull = slot->tts_isnull;
+   tp = (char *) tup + tup->t_hoff;
Another stall I see is due to the t_hoff computation - which makes sense, it's
in the tuple header and none of the deforming can happen without knowing the
address. I think in the !hasnulls case, the only influence on it is
MAXALIGN(offsetof(HeapTupleHeaderData, t_bits)), so we could just hardcode
that?
Separately, sometimes - I haven't figured out when - gcc seems to think it's
smart to actually compute the `tp + cattr->attcacheoff` below using tup and
tup->t_hoff stored in registers (i.e. doing multiple adds). When the code is
generated that way, I see substantially worse performance. Have you seen
that?
+ else if (attnum == 0)
{
/* Start from the first attribute */
off = 0;
- slow = false;
}
else
{
/* Restore state from previous execution */
off = *offp;
- slow = TTS_SLOW(slot);
}
Do we actually need both of these branches? Shouldn't *offp be set to 0 in the
attnum == 0 case?
-   if (!slow)
+   for (; attnum < firstNullAttr; attnum++)
    {
[...]
+       cattr = TupleDescCompactAttr(tupleDesc, attnum);
+
+       /* align the offset for this attribute */
+       off = att_pointer_alignby(off,
+                                 cattr->attalignby,
+                                 cattr->attlen,
+                                 tp + off);
+
+       values[attnum] = fetchatt(cattr, tp + off);
+       isnull[attnum] = false;
+
+       /* move the offset beyond this attribute */
+       off = att_addlength_pointer(off, cattr->attlen, tp + off);
    }
A few thoughts / suggestions:
1) We should not update values[]/isnull[] between fetchatt() and
att_addlength_pointer(). The compiler can't figure out that no fields in
cattr or *(tp + off) are being affected by those stores.
Changing this on master improves performance quite noticeably. I see a 13%
improvement in a test with deforming 5 not-null byval columns.
2) I sometime see performance benefits due to moving the isnull[attnum] =
false; to the beginning of the loop. Which makes some sense, starting the
store earlier allows it to complete earlier, and it doesn't depend on
fetching cattr, aligning the pointer, fetching the attribute and adjusting
the offset.
3) I briefly experimented with this code, and I think we may be able to
optimize the combination of att_pointer_alignby(), fetch_att() and
att_addlength_pointer(). They all do quite related work, and for byvalue
types, we know at compile time what the alignment requirement for each of
the supported attlen is.
+   /*
+    * Now handle any remaining tuples, this time include NULL checks as we're
+    * now at the first NULL attribute.
+    */
+   for (; attnum < natts; attnum++)
    {
-       /* XXX is it worth adding a separate call when hasnulls is false? */
-       attnum = slot_deform_heap_tuple_internal(slot,
-                                                tuple,
-                                                attnum,
-                                                natts,
-                                                true, /* slow */
-                                                hasnulls,
-                                                &off,
-                                                &slow);
+       if (att_isnull(attnum, bp))
+       {
+           values[attnum] = (Datum) 0;
+           isnull[attnum] = true;
+           continue;
+       }
Have you experimented setting isnull[] in a dedicated loop if there are nulls
and then in this loop just checking isnull[attnum]? Seems like that could
perhaps be combined with the work in first_null_attr() and be more efficient
than doing an att_isnull() separately for each column.
Greetings,
Andres Freund
Thank you for looking at this again.
On Thu, 29 Jan 2026 at 05:26, Andres Freund <andres@anarazel.de> wrote:
On 2026-01-28 02:34:26 +1300, David Rowley wrote:
+ firstNonCacheOffsetAttr = Min(tupleDesc->firstNonCachedOffAttr, natts);
FWIW, in a few experiments on my cascade lake systems, this branch (well, it
ends up as a cmov) ends up causing a surprisingly large performance
bottleneck. I don't really see a way around that, but I thought I'd mention it.
Yeah, I believe this is the primary reason that I'm fighting the small
regression on the 0 extra column test. I thought it might be because
the mov has a dependency and wait on natts being calculated, which
needs to access fields in the tuple header. I wonder if there's some
reason the compiler or CPU can't defer calculating
firstNonCacheOffsetAttr until later. Maybe I should try moving it
later in the code to see if that helps.
On the topic of tupleDesc->firstNonCachedOffAttr - shouldn't that be an
AttrNumber? Not that it'll make a difference perf or space wise, just for
clarity.
Hm, I guess natts isn't an AttrNumber either. Not sure why?
I noticed that too, but took the path of least resistance and made
firstNonCachedOffAttr an int too. I did wonder why natts wasn't an
AttrNumber. If they both were AttrNumbers, I wouldn't need to make the
TupleDesc struct bigger. Right now, I've enlarged it by 8 bytes by
adding firstNonCachedOffAttr.
One problem is that a bunch of functions accept int:
CreateTemplateTupleDesc(int natts) and CreateTupleDesc(int natts,
Form_pg_attribute *attrs). Then BuildDescFromLists() sets natts based
on list_length(). Maybe CreateTemplateTupleDesc() could Assert or
throw an error if natts does not fit in 16 bits.
+   if (hasnulls)
+   {
+       bp = tup->t_bits;
+       firstNullAttr = first_null_attr(bp, natts);
+       firstNonCacheOffsetAttr = Min(firstNonCacheOffsetAttr, firstNullAttr);
+   }
+   else
+   {
+       bp = NULL;
+       firstNullAttr = natts;
+   }
+
+   values = slot->tts_values;
+   isnull = slot->tts_isnull;
+   tp = (char *) tup + tup->t_hoff;
Another stall I see is due to the t_hoff computation - which makes sense, it's
in the tuple header and none of the deforming can happen without knowing the
address. I think in the !hasnulls case, the only influence on it is
MAXALIGN(offsetof(HeapTupleHeaderData, t_bits)), so we could just hardcode
that?
hmm. I wonder why it even needs to exist. If the null bitmap is there,
you can calculate how many bytes from natts. I tried doing "tp = (char
*) tup + MAXALIGN(offsetof(HeapTupleHeaderData, t_bits));" for the
!hasnulls case and it's hard to tell if it helps. See the 0004 patch.
I'm somewhat hesitant to go against the grain here on how to calculate
where the tuple data starts.
Separately, sometimes - I haven't figured out when - gcc seems to think it's
smart to actually compute the `tp + cattr->attcacheoff` below using tup and
tup->t_hoff stored in registers (i.e. doing multiple adds). When the code is
generated that way, I see substantially worse performance. Have you seen
that?
I've not noticed that.
+ else if (attnum == 0)
{
/* Start from the first attribute */
off = 0;
- slow = false;
}
else
{
/* Restore state from previous execution */
off = *offp;
- slow = TTS_SLOW(slot);
}
Do we actually need both of these branches? Shouldn't *offp be set to 0 in the
attnum == 0 case?
I see tts_heap_clear() zeros it, so I think it should be ok. Doesn't
feel quite as robust, however.
A few thoughts / suggestions:
1) We should not update values[]/isnull[] between fetchatt() and
att_addlength_pointer(). The compiler can't figure out that no fields in
cattr or *(tp + off) are being affected by those stores.
Changing this on master improves performance quite noticeably. I see a 13%
improvement in a test with deforming 5 not-null byval columns.
That's easy enough. I've moved it up in the v7 patch before the cattr
assignment.
2) I sometime see performance benefits due to moving the isnull[attnum] =
false; to the beginning of the loop. Which makes some sense, starting the
store earlier allows it to complete earlier, and it doesn't depend on
fetching cattr, aligning the pointer, fetching the attribute and adjusting
the offset.
3) I briefly experimented with this code, and I think we may be able to
optimize the combination of att_pointer_alignby(), fetch_att() and
att_addlength_pointer(). They all do quite related work, and for byvalue
types, we know at compile time what the alignment requirement for each of
the supported attlen is.
Is this true? Isn't there some nearby discussion about AIX having
4-byte double alignment?
I've taken a go at implementing a function called
align_fetch_then_add(), which rolls all the macros into one (See
0004). I just can't see any improvements with it. Maybe I've missed
something that could be more optimal. I even ditched one of the
cases from the switch(attlen). It might be OK to do that now, as we
can check for invalid attlens for byval types when we populate the
CompactAttribute.
Have you experimented setting isnull[] in a dedicated loop if there are nulls
and then in this loop just checking isnull[attnum]? Seems like that could
perhaps be combined with the work in first_null_attr() and be more efficient
than doing an att_isnull() separately for each column.
Yes, I experimented with that quite a bit. I wasn't able to make it any
faster than setting the isnull element in the same loop as the
tts_values element. What I did try was having a dedicated tight loop
like; for (int i = attnum; i < firstNullAttr; i++) isnull[i] = false;,
but the compiler would always try to optimise that into an inlined
memset which would result in poorly performing code in cases with a
small number of columns due to the size and alignment prechecks. I had
given up on it as I was already fighting some performance regressions
for the 0 extra column test and this made those worse. However...
In the attached 0004 patch I've experimented with this again. This
time, I wrote a function that converts the null bitmap into the isnull
array using a lookup table. I spent a bit of time trying to figure out
a way to do this without the lookup table and only came up with a
method that requires AVX512 instructions. I coded that up, but it
requires building with -march=x86-64-v4, which will likely cause many
other reasons for the performance to vary.
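A minimal sketch of the lookup-table approach described above might look like the following. The names and the runtime table construction are invented for illustration; real code would presumably use a precomputed static const table, and the heap null bitmap convention (a 0 bit means NULL) is the one assumed here.

```c
#include <stdbool.h>
#include <string.h>

typedef unsigned char bits8;

/* 256 x 8 table: for each bitmap byte, the eight isnull bool values. */
static bool null_lut[256][8];

static void
build_null_lut(void)
{
    for (int b = 0; b < 256; b++)
        for (int bit = 0; bit < 8; bit++)
            null_lut[b][bit] = ((b >> bit) & 1) == 0;   /* 0 bit => NULL */
}

/*
 * Expand a heap-style null bitmap into a bool isnull[] array, eight
 * attributes per table probe, then handle the final partial byte.
 */
static void
expand_null_bitmap(const bits8 *bp, bool *isnull, int natts)
{
    int attno = 0;

    /* full bytes: copy 8 precomputed bools at once */
    for (; attno + 8 <= natts; attno += 8)
        memcpy(&isnull[attno], null_lut[bp[attno >> 3]], 8);

    /* remaining attributes in the final partial byte */
    for (; attno < natts; attno++)
        isnull[attno] = null_lut[bp[attno >> 3]][attno & 7];
}
```

The appeal is replacing a data-dependent branch per attribute with straight-line stores; the cost is touching a 2KB table, which is presumably why the results vary so much between microarchitectures.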
The machine that likes 0004 the most (using the lookup table method of
setting the isnull array) is the Apple M2. All the tests apart from
the 0 extra column test became 30-90% faster. Previously the tests
that had to do att_isnull didn't improve very much. The 0 extra column
test regressed quite a bit. 50% slower on all but test 1 and 5 (the
ones without NULLs). See the attached graph. The Zen 2 machine also
seems to quite like it, but not for the 0 extra column test; I'm
struggling to get stable performance results from that machine right
now. My Zen 4 laptop isn't a fan of it, but I'm not getting very
stable performance results from that machine either.
I'm curious to see what your Intel machines think of 0004 vs not having it.
Right now, I'm really only getting stable performance out of my Apple
M2 machine, so I'm not too sure what parts of 0004 I should include in
0002 and which ones I should throw away.
David
Attachments:
v7-0001-Add-empty-TupleDescFinalize-function.patch
m2_on_v7_with_0004.gif
v7-0002-Precalculate-CompactAttribute-s-attcacheoff.patch
v7-0003-Introduce-deform_bench-test-module.patch
v7-0004-Various-experimental-changes.patch
Hi,
On 2026-01-31 00:10:42 +1300, David Rowley wrote:
Thank you for looking at this again.
On Thu, 29 Jan 2026 at 05:26, Andres Freund <andres@anarazel.de> wrote:
On 2026-01-28 02:34:26 +1300, David Rowley wrote:
+ firstNonCacheOffsetAttr = Min(tupleDesc->firstNonCachedOffAttr, natts);
FWIW, in a few experiments on my cascade lake systems, this branch (well, it
ends up as a cmov) ends up causing a surprisingly large performance
bottleneck. I don't really see a way around that, but I thought I'd mention it.
Yeah, I believe this is the primary reason that I'm fighting the small
regression on the 0 extra column test. I thought it might be because
the mov has a dependency and wait on natts being calculated, which
needs to access fields in the tuple header.
I agree that it's related to that. You have a chain of multiple computations
that lead to a higher aggregate latency.
natts depends on HeapTupleHeaderGetNatts(tup), firstNonCachedOffsetAttr
depends on both tupleDesc->firstNonCachedOffAttr and natts.
Afaict the dependencies are like the following:
0) a memory load (for tuple->t_data), likely to be cached
1) a memory load (for HeapTupleHeaderGetNatts(tup)), likely to miss memory,
depends on 0)
2) some bit-shiftery to compute natts, depends on 1)
3) a conditional move (the Min()), depends on 2)
4) a memory load (for slot->tupdesc), no dependencies, likely cached
5) a memory load (for tupdesc->firstNonCachedOfAttr), depends on 4), cached
6) a conditional move (the Min()), depends on 3) and 5)
7) a memory load (for HeapTupleHasNulls()), likely to miss memory, but can
piggy back on the cacheline load from 1), depends on 0)
8) a conditional branch (if (hasnulls), depends on 7)
And that's without even taking the hasnulls == true case into account.
Before all of those are completely executed ("retired"), none of the
speculatively executed instructions, e.g. the contents of the if (attnum <
firstNonCacheOffsetAttr) branch, can be retired.
This is why I like the idea of keeping track of whether we can rely on NOT
NULL columns to be present (I think that means we're evaluating expressions
other than constraint checks for new rows). It allows the leading NOT NULL
fixed-width columns to be decoded without having to wait for a good chunk of
the computations above. That's a performance boon even if we later have
nullable or varlength columns.
I wonder if there's some reason the compiler or CPU can't defer calculating
firstNonCacheOffsetAttr until later. Maybe I should try moving it later in
the code to see if that helps.
I think it's the opposite, you want to have it as early as possible so it has
the lowest latency impact by the time the value is required, by starting the
first elements in the dependency chain before the rest.
On the topic of tupleDesc->firstNonCachedOffAttr - shouldn't that be an
AttrNumber? Not that it'll make a difference perf or space wise, just for
clarity.
Hm, I guess natts isn't an AttrNumber either. Not sure why?
I noticed that too, but took the path of least resistance and made
firstNonCachedOffAttr an int too.
Probably reasonable.
I did wonder why natts wasn't an AttrNumber. If they both were AttrNumbers,
I wouldn't need to make the TupleDesc struct bigger. Right now, I've
enlarged it by 8 bytes by adding firstNonCachedOffAttr.
One related thing: I noticed a speedup a few days ago when making sure that
both the slot and the descriptor are aligned to a cacheline boundary,
presumably because it reduces the number of cachelines that need to be in L1
for good performance. But the results were somewhat inconsistent.
Separately I was seeing performance changes in cases where I shouldn't really
see them on the cascade lake system. After a while I noticed that the
differences I was seeing were due to differences in how effective the uop
cache is. A bunch of searching led to information about the "JCC erratum",
which in turn can be mitigated via "-Wa,-mbranches-within-32B-boundaries". I
am not actually clear whether my CPU is directly affected by that erratum, but
nonetheless, it improved performance quite substantially.
One problem is that a bunch of functions that accept int;
CreateTemplateTupleDesc(int natts), CreateTupleDesc(int natts,
Form_pg_attribute *attrs). Then BuildDescFromLists() sets natts based on
list_length(). Maybe CreateTemplateTupleDesc() could Assert or throw an
error if natts does not fit in 16-bits.
We'd probably fail with larger values anyway, so it might be a good idea to
detect that regardless of changing the argument type.
+   if (hasnulls)
+   {
+       bp = tup->t_bits;
+       firstNullAttr = first_null_attr(bp, natts);
+       firstNonCacheOffsetAttr = Min(firstNonCacheOffsetAttr, firstNullAttr);
+   }
+   else
+   {
+       bp = NULL;
+       firstNullAttr = natts;
+   }
+
+   values = slot->tts_values;
+   isnull = slot->tts_isnull;
+   tp = (char *) tup + tup->t_hoff;
Another stall I see is due to the t_hoff computation - which makes sense, it's
in the tuple header and none of the deforming can happen without knowing the
address. I think in the !hasnulls case, the only influence on it is
MAXALIGN(offsetof(HeapTupleHeaderData, t_bits)), so we could just hardcode
that?
hmm. I wonder why it even needs to exist.
There are a lot of questions like that in our on-disk format. There's a lot of
decisions in there that make fast decoding way harder than it has to be...
If the null bitmap is there,
you can calculate how many bytes from natts. I tried doing "tp = (char
*) tup + MAXALIGN(offsetof(HeapTupleHeaderData, t_bits));" for the
!hasnulls case and it's hard to tell if it helps. See the 0004 patch.
Interestingly, it helps quite measurably here.
I'm somewhat hesitant to go against the grain here on how to calculate where
the tuple data starts.
Hm. I don't think it's particularly risky. We could even add an error branch
for that assumption being violated - as long as the other computations don't
depend on the result of the conditional branch, the impact on performance
should be quite minimal.
+ else if (attnum == 0)
{
/* Start from the first attribute */
off = 0;
- slow = false;
}
else
{
/* Restore state from previous execution */
off = *offp;
- slow = TTS_SLOW(slot);
}
Do we actually need both of these branches? Shouldn't *offp be set to 0 in the
attnum == 0 case?
I see tts_heap_clear() zeros it, so I think it should be ok. Doesn't
feel quite as robust, however.
Should be easy enough to assert that it's correct. I don't think it's very
likely we'd just forget resetting it. It's already catastrophic if we set it
wrongly.
3) I briefly experimented with this code, and I think we may be able to
optimize the combination of att_pointer_alignby(), fetch_att() and
att_addlength_pointer(). They all do quite related work, and for byvalue
types, we know at compile time what the alignment requirement for each of
the supported attlen is.
Is this true? Isn't there some nearby discussion about AIX having
4-byte double alignment?
The AIX stuff is just bonkers. Having different alignment based on where in an
aggregate a double is is just insane.
We could probably just accommodate that with a compile-time ifdef.
I've taken a go at implementing a function called align_fetch_then_add(),
which rolls all the macros into one (See 0004). I just can't see any
improvements with it. Maybe I've missed something that could be more
optimal. I did even ditch one of the cases from the switch(attlen). It might
be ok to do that now as we can check for invalid attlens for byval types
when we populate the CompactAttribute.
You could move the TYPEALIGN(attalignby, *off) calls into the attlen switch
and call it with a constant argument.
Have you experimented setting isnull[] in a dedicated loop if there are nulls
and then in this loop just checking isnull[attnum]? Seems like that could
perhaps be combined with the work in first_null_attr() and be more efficient
than doing an att_isnull() separately for each column.
Yes. I experimented with that quite a bit. I wasn't able to make it any
faster than setting the isnull element in the same loop as the
tts_values element. What I did try was having a dedicated tight loop
like: "for (int i = attnum; i < firstNullAttr; i++) isnull[i] = false;",
but the compiler would always try to optimise that into an inlined
memset which would result in poorly performing code in cases with a
small number of columns due to the size and alignment prechecks.
Yea, that kind of transformation is pretty annoying and makes little sense
here :(.
I was thinking of actually computing the value of isnull[] based on the null
bitmap (as you also try below).
In the attached 0004 patch I've experimented with this again. This
time, I wrote a function that converts the null bitmap into the isnull
array using a lookup table.
Oh. I was just thinking of something roughly like
int nullbyte_i = attnum >> 3;
for (int nullcol = attnum; nullcol < natts; nullcol += 8)
{
    bits8 nullbyte = bp[nullbyte_i++];

    for (int onebyte = 0; onebyte < 8; onebyte++)
    {
        if (nullcol < natts)
            tts_isnull[nullcol] = nullbyte & 0x01;
        nullbyte >>= 1;
        nullcol++;
    }
}
This isn't quite right, as we'd need to deal with starting to deform at an
attribute that's not % 8 = 0 (I'd probably just do the whole byte even if we'd
redo a few columns). And probably lots of other stuff.
With a bit of care the inner loop should be unrollable with all the moves as
conditional moves depending on nullcol < natts. Or such.
Or we could just make sure tts_isnull is always sized to be divisible by 8,
that'd presumably allow considerably better code to be generated.
I'd hope this would be more efficient than doing
static inline bool
att_isnull(int ATT, const bits8 *BITS)
{
return !(BITS[ATT >> 3] & (1 << (ATT & 0x07)));
}
for each column.
I spent a bit of time trying to figure out a way to do this without the
lookup table and only came up with a method that requires AVX512
instructions. I coded that up, but it requires building with
-march=x86-64-v4, which will likely cause many other reasons for the
performance to vary.
Yea, I doubt we want that... Too many tuples will be too short to benefit.
The machine that likes 0004 the most (using the lookup table method of
setting the isnull array) is the Apple M2. All the tests apart from
the 0 extra column test became 30-90% faster. Previously the tests
that had to do att_isnull didn't improve very much. The 0 extra column
test regressed quite a bit. 50% slower on all but test 1 and 5 (the
ones without NULLs). See the attached graph. The Zen2 machine also
perhaps quite likes it, but not for the 0 extra column test. I'm
struggling to get stable performance results from that machine right
now. My Zen 4 laptop isn't a fan of it, but I'm also not getting very
stable performance results from that machine either.
Hm. A 2kB lookup table in the middle of deforming is probably not great for
L1...
I'm curious to see what your Intel machines think of 0004 vs not having it.
Scheduling an experiment with it.
Greetings,
Andres Freund
Hi,
On 2026-01-30 12:11:44 -0500, Andres Freund wrote:
In the attached 0004 patch I've experimented with this again. This
time, I wrote a function that converts the null bitmap into the isnull
array using a lookup table.
I couldn't quite let go of this, thinking there had to be a non-simd way to do
this efficiently (by basically doing SWAR). And I think there is:
We can multiply one byte of the null bitmap with a carefully chosen value that
spreads each bit into the higher bytes. E.g.
0b111 * (1 << 0) = 0b 111
0b111 * (1 << 7) = 0b 11_10000000
0b111 * ((1 << 0) | (1 << 7))
= 0b 11_10000111
0b111 * (1 << 14) = 0b1_11000000_00000000
0b111 * ((1 << 0) | (1 << 7) | (1 << 14))
= 0b1_11000011_10000111
...
Then, we can fairly easily mask out the unnecessary bits.
However, as maybe apparent above, this won't work for the input 0xff, as the
carry from each byte's multiplications will "overflow its byte" and flip the
next bit to 0. But that's not hard to handle:
a) A single branch would probably be fine?
b) Compute the low 4 bits and high 4 bits of the bitmap in parallel and or
the result together. By just looking at 4 null bits, no carry overflow is
possible. Due to being branchless, that's probably better.
Something like
/*
* The bits array has 1s for values and 0s for NULLs. Bit-flip that to
* get 1s for NULLs.
*/
uint8_t nullbyte = ~bp[nullbyte_i++];
/* 8 bytes where each byte is 0 or 1 depending on whether null bitmap is set */
uint64_t isnull_8;
/*
* Multiplier ensuring that input bit 0 is reflected in output bit 0, input bit 1 at output bit 8, etc.
* Other bits also will often be set and need to be masked away.
*/
uint32_t spread_bits_32 = (1U << 0) | (1U << 7) | (1U << 14) | (1U << 21);
uint64_t mask_bits_64 = 0x0101010101010101ULL;
/* convert the lower 4 bits of null bitmap word into 32 bit int */
isnull_8 = (nullbyte & 0xf) * spread_bits_32;
/* convert the upper 4 bits of null bitmap word into 32 bit int, shift into the upper 32 bit */
isnull_8 |= ((uint64_t)((nullbyte >> 4) * spread_bits_32)) << 32;
/* mask out all the bogus bits (could also be done as a 32bit op?)*/
isnull_8 &= mask_bits_64;
memcpy(&tts_isnull[nullcol], &isnull_8, sizeof(isnull_8));
should, I think, be better than a 2kB array. Perhaps not quite as fast as the
AVX512 method, but it should be decent on most hardware...
Greetings,
Andres Freund
On Tue, Jan 27, 2026 at 8:34 PM David Rowley <dgrowleyml@gmail.com> wrote:
I've also included a slightly revised patch. I made a small change to
the first_null_attr() to get rid of the masking of higher attnums and
also now making use of __builtin_ctz to find the first NULL attnum in
the byte. For compilers that don't support that, I've included a
pg_rightmost_*zero*_pos table. I didn't want to use the pg_bitutils
table for the rightmost *one* pos as it meant having to special-case
what happens when using index 255, as that would return 0, and I want
8. I'll make the MSVC version use _BitScanForward() in the next patch.
I don't get why we'd need to special-case 255 in only one place.
+ /* Process all bytes up to just before the byte for the natts index */
+ for (bytenum = 0; bytenum < lastByte; bytenum++)
+ {
+ /* break if there's any NULL attrs (a 0 bit) */
+ if (bits[bytenum] != 0xFF)
+ break;
+ }
+
+ res = bytenum << 3;
+
+#ifdef HAVE__BUILTIN_CTZ
+ res += __builtin_ctz(~bits[bytenum]);
+#else
+ res += pg_rightmost_zero_pos[bits[bytenum]];
+#endif
If bits[bytenum] is 255, then __builtin_ctz(0) is undefined. The top
of the function says
+ * We expect that 'bits' contains at least one 0 bit somewhere in the mask,
+ * not necessarily < natts.
...in which case it should be well defined everywhere. Am I missing
something? If we need to handle the 255 case, this should work:
pg_rightmost_one_pos32(~((uint32) bits[bytenum]))
--
John Naylor
Amazon Web Services
On Sat, 31 Jan 2026 at 15:48, John Naylor <johncnaylorls@gmail.com> wrote:
+ res += __builtin_ctz(~bits[bytenum]);
If bits[bytenum] is 255, then __builtin_ctz(0) is undefined. The top
of the function says
Oops, I forgot to cast the byte to uint32 before the bitwise-not. I've
fixed locally. Still processing Andres' comments.
+ * We expect that 'bits' contains at least one 0 bit somewhere in the mask,
+ * not necessarily < natts.
...in which case it should be well defined everywhere. Am I missing
something? If we need to handle the 255 case, this should work:
pg_rightmost_one_pos32(~((uint32) bits[bytenum]))
I'd rather handle that in a single byte as the fallback path in that
function requires byte-at-a-time processing.
David
On Sat, Jan 31, 2026 at 10:45 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Sat, 31 Jan 2026 at 15:48, John Naylor <johncnaylorls@gmail.com> wrote:
pg_rightmost_one_pos32(~((uint32) bits[bytenum]))
I'd rather handle that in a single byte as the fallback path in that
function requires byte-at-a-time processing.
I spot checked some less common animals in the buildfarm and i686,
ppc64le, riscv64, loongarch64, and s390x all have the builtin. Of
these, only riscv64 seems to lack a single instruction implementation.
It's good to keep the old fallback path around just in case, but I
doubt a new fallback would get any coverage at all.
--
John Naylor
Amazon Web Services