Make tuple deformation faster
Currently, TupleDescData contains the descriptor's attributes in a
variable length array of FormData_pg_attribute allocated within the
same allocation as the TupleDescData. According to my IDE,
sizeof(FormData_pg_attribute) == 104 bytes. It's that large mainly due
to attname being 64 bytes. The TupleDescData.attrs[] array could end
up quite large on tables with many columns and that could result in
poor CPU cache hit ratios when deforming tuples.
Instead, we could make TupleDescData contain an out-of-line pointer to
the array of FormData_pg_attribute and have a much more compact
inlined array of some other struct that much more densely contains the
fields required for tuple deformation. attname and many of the other
fields are not required to deform a tuple.
I've attached a patch series which does this.
0001: Just fixes up some missing usages of TupleDescAttr(). (mostly
missed by me, apparently :-( )
0002: Adjusts the TupleDescData.attrs array to make it out of line. I
wanted to make sure nothing weird happened by doing this before doing
the bulk of the other changes to add the new struct.
0003: Adds a very compact 8-byte struct named TupleDescDeformAttr,
which can be used for tuple deformation. With it, 8 columns fit on a
single 64-byte cacheline rather than spanning 13 cachelines.
0004: Adjusts the attalign to change it from char to uint8. See below.
The 0004 patch changes the TupleDescDeformAttr.attalign to a uint8
rather than a char containing 'c', 's', 'i' or 'd'. This allows much
simpler code in the att_align_nominal() macro. What's in master is
quite a complex expression to evaluate every time we deform a column,
as it must translate: 'c' -> 1, 's' -> 2, 'i' -> 4, 'd' -> 8. If we
just store that numeric value in the struct, that macro can become a
simple TYPEALIGN() so the operation becomes simple bit masking rather
than a poorly branch-predictable series of compares and jumps.
The state of this patch series is "proof of concept". I think the
current state should be enough to get an idea of the rough amount of
code churn this change would cause and also an idea of the expected
performance of the change. It certainly isn't in a finished state.
I've not put much effort into updating comments or looking at READMEs
to see what's now outdated.
I also went with packing a bunch of additional boolean columns
from pg_attribute so they take up just 1 bit of space each in the
attflags field of the new struct. I've not tested the performance of
expanding these out to 1 bool field each. That would make the struct
bigger than 8 bytes. Having the struct be a power-of-2 size is also
beneficial as it allows fast bit-shifting to be used to get the array
element address rather than a more complex (and slower) LEA
instruction. I could try making the struct 16 bytes and see if there
are any further wins by avoiding the bitwise AND on the
TupleDescDeformAttr.attflags field.
To test the performance of this, I tried using the attached script
which creates a table where the first column is a variable length
column and the final column is an int. The query I ran to test the
performance inserted 1 million rows into this table and performed a
sum() on the final column. The attached graph shows that the query is
30% faster than master with 15 columns between the first and last
column. For fewer columns, the speedup is less. This is quite a
deform-heavy query so it's not like it speeds up every table with that
column arrangement by 30%, but certainly, some queries could see that
much gain and even more seems possible. I didn't go to a great deal of
trouble to find the most deform-heavy workload.
I'll stick this in the July CF. It would be good to get some feedback
on the idea and feedback on whether more work on this is worthwhile.
As mentioned, the 0001 patch just fixes up the missing usages of the
TupleDescAttr() macro. I see no reason not to commit this now.
Thanks
David
Attachments:
- deform_test.sh.txt (text/plain)
- faster_deforming_graph.png (image/png)
- v1-0001-Use-TupleDescAttr-macro-consistently.patch
- v1-0002-Move-TupleDesc.attrs-out-of-line.patch
- v1-0003-Use-a-compact-struct-to-deform-tuples.patch
- v1-0004-Optimize-alignment-calculations-in-tuple-form-def.patch
David Rowley <dgrowleyml@gmail.com> writes:
Currently, TupleDescData contains the descriptor's attributes in a
variable length array of FormData_pg_attribute allocated within the
same allocation as the TupleDescData. According to my IDE,
sizeof(FormData_pg_attribute) == 104 bytes. It's that large mainly due
to attname being 64 bytes. The TupleDescData.attrs[] array could end
up quite large on tables with many columns and that could result in
poor CPU cache hit ratios when deforming tuples.
...
To test the performance of this, I tried using the attached script
which creates a table where the first column is a variable length
column and the final column is an int. The query I ran to test the
performance inserted 1 million rows into this table and performed a
sum() on the final column. The attached graph shows that the query is
30% faster than master with 15 columns between the first and last
column.
Yet another wonderful optimization! I just want to know how you
found this optimization (CPU cache hits) and decided it was worth some
time, because before we invest our time to optimize something, it is
better to know that we can get some measurable improvement for the
time spent. As for this case, 30% is really a huge number, even if it
is an artificial case.
Another case: Andrew introduced NullableDatum 5 years ago and said using
it in TupleTableSlot could be CPU cache friendly. I can follow that, but
how much can it improve in an ideal case; is it possible to forecast it
somehow? I ask here because both cases are optimizing for the CPU cache.
--
Best Regards
Andy Fan
On Mon, 1 Jul 2024 at 21:17, Andy Fan <zhihuifan1213@163.com> wrote:
Yet another wonderful optimization! I just want to know how you
found this optimization (CPU cache hits) and decided it was worth some
time, because before we invest our time to optimize something, it is
better to know that we can get some measurable improvement for the
time spent. As for this case, 30% is really a huge number, even if it
is an artificial case.
Another case: Andrew introduced NullableDatum 5 years ago and said using
it in TupleTableSlot could be CPU cache friendly. I can follow that, but
how much can it improve in an ideal case; is it possible to forecast it
somehow? I ask here because both cases are optimizing for the CPU cache.
Have a look at:
perf stat --pid=<backend pid>
On my AMD Zen4 machine running the 16 extra column test from the
script in my last email, I see:
$ echo master && perf stat --pid=389510 sleep 10
master

 Performance counter stats for process id '389510':

       9990.65 msec task-clock:u              #    0.999 CPUs utilized
             0      context-switches:u        #    0.000 /sec
             0      cpu-migrations:u          #    0.000 /sec
             0      page-faults:u             #    0.000 /sec
   49407204156      cycles:u                  #    4.945 GHz
      18529494      stalled-cycles-frontend:u #    0.04% frontend cycles idle
       8505168      stalled-cycles-backend:u  #    0.02% backend cycles idle
  165442142326      instructions:u            #    3.35  insn per cycle
                                              #    0.00  stalled cycles per insn
   39409877343      branches:u                #    3.945 G/sec
     146350275      branch-misses:u           #    0.37% of all branches

  10.001012132 seconds time elapsed
$ echo patched && perf stat --pid=380216 sleep 10
patched

 Performance counter stats for process id '380216':

       9989.14 msec task-clock:u              #    0.998 CPUs utilized
             0      context-switches:u        #    0.000 /sec
             0      cpu-migrations:u          #    0.000 /sec
             0      page-faults:u             #    0.000 /sec
   49781280456      cycles:u                  #    4.984 GHz
      22922276      stalled-cycles-frontend:u #    0.05% frontend cycles idle
      24259785      stalled-cycles-backend:u  #    0.05% backend cycles idle
  213688149862      instructions:u            #    4.29  insn per cycle
                                              #    0.00  stalled cycles per insn
   44147675129      branches:u                #    4.420 G/sec
      14282567      branch-misses:u           #    0.03% of all branches

  10.005034271 seconds time elapsed
You can see the branch predictor has done a *much* better job in the
patched code vs master with about 10x fewer misses. This should have
helped contribute to the "insn per cycle" increase. 4.29 is quite
good for postgres. I often see that around 0.5. According to [1]
(relating to Zen4), "We get a ridiculous 12 NOPs per cycle out of the
micro-op cache". I'm unsure how micro-ops translate to "insn per
cycle" that's shown in perf stat. I thought 4-5 was about the maximum
pipeline size from today's era of CPUs. Maybe someone else can explain
better than I can. In simpler terms, generally, the higher the
"insn per cycle", the better; likewise, the lower the idle and
branch-miss percentages, the better. However,
you'll notice that the patched version has more front and backend
stalls. I assume this is due to performing more instructions per cycle
from improved branch prediction causing memory and instruction stalls
to occur more frequently, effectively (I think) it's just hitting the
next bottleneck(s) - memory and instruction decoding. At least, modern
CPUs should be able to out-pace RAM in many workloads, so perhaps it's
not that surprising that "backend cycles idle" has gone up due to such
a large increase in instructions per cycle due to improved branch
prediction.
It would be nice to see this tested on some modern Intel CPU. A 13th
series or 14th series, for example, or even any intel from the past 5
years would be better than nothing.
David
[1]: https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-frontend-and-execution-engine/
On Mon, 1 Jul 2024 at 10:56, David Rowley <dgrowleyml@gmail.com> wrote:
Currently, TupleDescData contains the descriptor's attributes in a
variable length array of FormData_pg_attribute allocated within the
same allocation as the TupleDescData. According to my IDE,
sizeof(FormData_pg_attribute) == 104 bytes. It's that large mainly due
to attname being 64 bytes. The TupleDescData.attrs[] array could end
up quite large on tables with many columns and that could result in
poor CPU cache hit ratios when deforming tuples.
Instead, we could make TupleDescData contain an out-of-line pointer to
the array of FormData_pg_attribute and have a much more compact
inlined array of some other struct that much more densely contains the
fields required for tuple deformation. attname and many of the other
fields are not required to deform a tuple.
+1
I've attached a patch series which does this.
0001: Just fixes up some missing usages of TupleDescAttr(). (mostly
missed by me, apparently :-( )
0002: Adjusts the TupleDescData.attrs array to make it out of line. I
wanted to make sure nothing weird happened by doing this before doing
the bulk of the other changes to add the new struct.
0003: Adds a very compact 8-byte struct named TupleDescDeformAttr,
which can be used for tuple deformation. 8 columns fits on a 64-byte
cacheline rather than 13 cachelines.
Cool, that's similar to, but even better than, my patch from 2021 over at [0].
One thing I'm slightly concerned about is that this allocates another
8 bytes for each attribute in the tuple descriptor. While that's not a
lot when compared with the ->attrs array, it's still quite a lot when
we might not care at all about this data; e.g. in temporary tuple
descriptors during execution, in intermediate planner nodes.
Did you test for performance gains (or losses) with an out-of-line
TupleDescDeformAttr array? One benefit from this would be that we
could reuse the deform array for suffix truncated TupleDescs, reuse of
which currently would require temporarily updating TupleDesc->natts
with a smaller value; but with out-of-line ->attrs and ->deform_attrs,
we could reuse these arrays between TupleDescs if one is shorter than
the other, but has otherwise fully matching attributes. I know that
btree split code would benefit from this, as it wouldn't have to
construct a full new TupleDesc when it creates a suffix-truncated
tuple during page splits.
0004: Adjusts the attalign to change it from char to uint8. See below.
The 0004 patch changes the TupleDescDeformAttr.attalign to a uint8
rather than a char containing 'c', 's', 'i' or 'd'. This allows much
simpler code in the att_align_nominal() macro. What's in master is
quite a complex expression to evaluate every time we deform a column,
as it must translate: 'c' -> 1, 's' -> 2, 'i' -> 4, 'd' -> 8. If we
just store that numeric value in the struct, that macro can become a
simple TYPEALIGN() so the operation becomes simple bit masking rather
than a poorly branch-predictable series of compares and jumps.
+1, that's something I'd missed in my patches, and is probably the
largest contributor to the speedup.
I'll stick this in the July CF. It would be good to get some feedback
on the idea and feedback on whether more work on this is worthwhile.
Do you plan to remove the ->attcacheoff catalog field from the
FormData_pg_attribute, now that (with your patch) it isn't used
anymore as a placeholder field for fast (de)forming of tuples?
Kind regards,
Matthias van de Meent
[0]: /messages/by-id/CAEze2Wh8-metSryZX_Ubj-uv6kb+2YnzHAejmEdubjhmGusBAg@mail.gmail.com
On Mon, 1 Jul 2024 at 22:07, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
Cool, that's similar to, but even better than, my patch from 2021 over at [0].
I'll have a read of that. Thanks for pointing it out.
One thing I'm slightly concerned about is that this allocates another
8 bytes for each attribute in the tuple descriptor. While that's not a
lot when compared with the ->attrs array, it's still quite a lot when
we might not care at all about this data; e.g. in temporary tuple
descriptors during execution, in intermediate planner nodes.
I've not done it in the patch, but one way to get some of that back is
to ditch pg_attribute.attcacheoff. There's no need for it after this
patch. That's only 4 out of 8 bytes, however. I think in most cases
due to FormData_pg_attribute being so huge, the aset.c power-of-2
roundup behaviour will be unlikely to cross a power-of-2 boundary.
The following demonstrates for which column counts this actually makes
a difference:

select c as n_cols, old_bytes, new_bytes
from (select c,
             24 + 104 * c as old_bytes,
             32 + 100 * c + 8 * c as new_bytes
      from generate_series(1, 1024) c)
where position('1' in old_bytes::bit(32)::text)
   != position('1' in new_bytes::bit(32)::text);
That returns just 46 column counts out of 1024 where we cross a
power-of-2 boundary with the patched code that we didn't cross in master.
Of course, larger pallocs will result in a malloc() directly, so
perhaps that's not a good measure. At least for smaller column counts
it should be mainly the same amount of memory used. There are only 6
rows in there for column counts below 100. I think if we were worried
about memory there are likely 100 other things we could do to reclaim
some. It would only take some shuffling of fields in RelationData. I
count 50 bytes of holes in that struct out of the 488 bytes. There are
probably a few that could be moved without upsetting the
struct-field-order-lords too much.
Did you test for performance gains (or losses) with an out-of-line
TupleDescDeformAttr array? One benefit from this would be that we
could reuse the deform array for suffix truncated TupleDescs, reuse of
which currently would require temporarily updating TupleDesc->natts
with a smaller value; but with out-of-line ->attrs and ->deform_attrs,
we could reuse these arrays between TupleDescs if one is shorter than
the other, but has otherwise fully matching attributes. I know that
btree split code would benefit from this, as it wouldn't have to
construct a full new TupleDesc when it creates a suffix-truncated
tuple during page splits.
No, but it sounds easy to test as patch 0002 moves that out of line
and does nothing else.
0004: Adjusts the attalign to change it from char to uint8. See below.
The 0004 patch changes the TupleDescDeformAttr.attalign to a uint8
rather than a char containing 'c', 's', 'i' or 'd'. This allows much
simpler code in the att_align_nominal() macro. What's in master is
quite a complex expression to evaluate every time we deform a column,
as it must translate: 'c' -> 1, 's' -> 2, 'i' -> 4, 'd' -> 8. If we
just store that numeric value in the struct, that macro can become a
simple TYPEALIGN() so the operation becomes simple bit masking rather
than a poorly branch-predictable series of compares and jumps.
+1, that's something I'd missed in my patches, and is probably the
largest contributor to the speedup.
I think so too, and I did consider whether we should try to do that to
pg_attribute, renaming the column to attalignby. I started but didn't
finish a patch for that.
I'll stick this in the July CF. It would be good to get some feedback
on the idea and feedback on whether more work on this is worthwhile.
Do you plan to remove the ->attcacheoff catalog field from the
FormData_pg_attribute, now that (with your patch) it isn't used
anymore as a placeholder field for fast (de)forming of tuples?
Yes, I plan to do that once I get more confidence I'm on to a winner here.
Thanks for having a look at this.
David
On Mon, 1 Jul 2024 at 12:49, David Rowley <dgrowleyml@gmail.com> wrote:
On Mon, 1 Jul 2024 at 22:07, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
One thing I'm slightly concerned about is that this allocates another
8 bytes for each attribute in the tuple descriptor. While that's not a
lot when compared with the ->attrs array, it's still quite a lot when
we might not care at all about this data; e.g. in temporary tuple
descriptors during execution, in intermediate planner nodes.
I've not done it in the patch, but one way to get some of that back is
to ditch pg_attribute.attcacheoff. There's no need for it after this
patch. That's only 4 out of 8 bytes, however.
FormData_pg_attribute as a C struct has 4-byte alignment; AFAIK it
doesn't have any fields that require 8-byte alignment? Only on 64-bit
systems we align the tuples on pages with 8-byte alignment, but
in-memory arrays of the struct wouldn't have to deal with that, AFAIK.
I think in most cases
due to FormData_pg_attribute being so huge, the aset.c power-of-2
roundup behaviour will be unlikely to cross a power-of-2 boundary.
ASet isn't the only allocator, but default enough for this to make sense, yes.
The following demonstrates for which column counts this actually makes
a difference:
select c as n_cols, old_bytes, new_bytes from (select c, 24+104*c as
old_bytes, 32+100*c+8*c as new_bytes from generate_series(1,1024) c)
where position('1' in old_bytes::bit(32)::text) != position('1' in
new_bytes::bit(32)::text);
That returns just 46 column counts out of 1024 where we cross a
power-of-2 boundary with the patched code that we didn't cross in master.
Of course, larger pallocs will result in a malloc() directly, so
perhaps that's not a good measure. At least for smaller column counts
it should be mainly the same amount of memory used. There are only 6
rows in there for column counts below 100. I think if we were worried
about memory there are likely 100 other things we could do to reclaim
some. It would only take some shuffling of fields in RelationData. I
count 50 bytes of holes in that struct out of the 488 bytes. There are
probably a few that could be moved without upsetting the
struct-field-order-lords too much.
I'd love for RelationData to be split into IndexRelation,
TableRelation, ForeignTableRelation, etc., as there's a lot of wastage
caused by exclusive fields, too.
0004: Adjusts the attalign to change it from char to uint8. See below.
The 0004 patch changes the TupleDescDeformAttr.attalign to a uint8
rather than a char containing 'c', 's', 'i' or 'd'. This allows much
simpler code in the att_align_nominal() macro. What's in master is
quite a complex expression to evaluate every time we deform a column,
as it must translate: 'c' -> 1, 's' -> 2, 'i' -> 4, 'd' -> 8. If we
just store that numeric value in the struct, that macro can become a
simple TYPEALIGN() so the operation becomes simple bit masking rather
than a poorly branch-predictable series of compares and jumps.
+1, that's something I'd missed in my patches, and is probably the
largest contributor to the speedup.
I think so too, and I did consider whether we should try to do that to
pg_attribute, renaming the column to attalignby. I started but didn't
finish a patch for that.
I'm not sure we have a pg_type entry that supports numeric
attalign values without increasing the size of the field, as the one
single-byte integer-like type (char) is always used as a printable
character, and implied to always be printable through its usage in
e.g. nodeToString infrastructure.
I'd love to have a better option, but we don't seem to have one yet.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On Mon, 1 Jul 2024 at 23:42, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Mon, 1 Jul 2024 at 12:49, David Rowley <dgrowleyml@gmail.com> wrote:
On Mon, 1 Jul 2024 at 22:07, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
One thing I'm slightly concerned about is that this allocates another
8 bytes for each attribute in the tuple descriptor. While that's not a
lot when compared with the ->attrs array, it's still quite a lot when
we might not care at all about this data; e.g. in temporary tuple
descriptors during execution, in intermediate planner nodes.
I've not done it in the patch, but one way to get some of that back is
to ditch pg_attribute.attcacheoff. There's no need for it after this
patch. That's only 4 out of 8 bytes, however.
FormData_pg_attribute as a C struct has 4-byte alignment; AFAIK it
doesn't have any fields that require 8-byte alignment? Only on 64-bit
systems we align the tuples on pages with 8-byte alignment, but
in-memory arrays of the struct wouldn't have to deal with that, AFAIK.
Yeah, 4-byte alignment. The "out of 8 bytes" I was talking about is
sizeof(TupleDescDeformAttr), which I believe is the same "another 8
bytes" you had mentioned. What I meant was that deleting attcacheoff
only reduces FormData_pg_attribute by 4 bytes per column, while adding
TupleDescDeformAttr adds 8 per column, so we still use 4 more bytes
per column with the patch.
I really doubt the 4 bytes of extra memory is a big concern here. It
would be more concerning for a patch that wanted to do something like
change NAMEDATALEN to 128, but I think the main concern with that
would be even slower tuple deforming. Additional memory would also be
a concern, but I doubt that's more important than the issue of making
all queries slower due to slower tuple deformation, which is what such
a patch would result in.
I think in most cases
due to FormData_pg_attribute being so huge, the aset.c power-of-2
roundup behaviour will be unlikely to cross a power-of-2 boundary.
ASet isn't the only allocator, but default enough for this to make sense, yes.
It's the only allocator we use for allocating TupleDescs, so other
types and their behaviour are not relevant here.
I'm not sure we have a pg_type entry that supports numeric
attalign values without increasing the size of the field, as the one
single-byte integer-like type (char) is always used as a printable
character, and implied to always be printable through its usage in
e.g. nodeToString infrastructure.
I'd love to have a better option, but we don't seem to have one yet.
Yeah, "select typname from pg_type where typalign = 'c' and typlen = 1;"
returns just bool and char.
I'm happy to keep going with this version of the patch unless someone
points out some good reason that we should go with the alternative
instead.
David
David Rowley <dgrowleyml@gmail.com> writes:
You can see the branch predictor has done a *much* better job in the
patched code vs master with about 10x fewer misses. This should have
helped contribute to the "insn per cycle" increase. 4.29 is quite
good for postgres. I often see that around 0.5. According to [1]
(relating to Zen4), "We get a ridiculous 12 NOPs per cycle out of the
micro-op cache". I'm unsure how micro-ops translate to "insn per
cycle" that's shown in perf stat. I thought 4-5 was about the maximum
pipeline size from today's era of CPUs. Maybe someone else can explain
better than I can. In simpler terms, generally, the higher the
"insn per cycle", the better; likewise, the lower the idle and
branch-miss percentages, the better. However,
you'll notice that the patched version has more front and backend
stalls. I assume this is due to performing more instructions per cycle
from improved branch prediction causing memory and instruction stalls
to occur more frequently, effectively (I think) it's just hitting the
next bottleneck(s) - memory and instruction decoding. At least, modern
CPUs should be able to out-pace RAM in many workloads, so perhaps it's
not that surprising that "backend cycles idle" has gone up due to such
a large increase in instructions per cycle due to improved branch
prediction.
Thanks for the answer; just another area that deserves exploring.
It would be nice to see this tested on some modern Intel CPU. A 13th
series or 14th series, for example, or even any intel from the past 5
years would be better than nothing.
I have two kind of CPUs.
a). Intel Xeon Processor (Icelake) for my ECS
b). Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz at Mac.
My ECS reports "<not supported> branch-misses", probably because it
runs in virtualization software, and Mac doesn't support perf yet :(
--
Best Regards
Andy Fan
On Tue, 2 Jul 2024 at 02:23, David Rowley <dgrowleyml@gmail.com> wrote:
On Mon, 1 Jul 2024 at 23:42, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Mon, 1 Jul 2024 at 12:49, David Rowley <dgrowleyml@gmail.com> wrote:
On Mon, 1 Jul 2024 at 22:07, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
One thing I'm slightly concerned about is that this allocates another
8 bytes for each attribute in the tuple descriptor. While that's not a
lot when compared with the ->attrs array, it's still quite a lot when
we might not care at all about this data; e.g. in temporary tuple
descriptors during execution, in intermediate planner nodes.
I've not done it in the patch, but one way to get some of that back is
to ditch pg_attribute.attcacheoff. There's no need for it after this
patch. That's only 4 out of 8 bytes, however.
FormData_pg_attribute as a C struct has 4-byte alignment; AFAIK it
doesn't have any fields that require 8-byte alignment? Only on 64-bit
systems we align the tuples on pages with 8-byte alignment, but
in-memory arrays of the struct wouldn't have to deal with that, AFAIK.
Yeah, 4-byte alignment. The "out of 8 bytes" I was talking about is
sizeof(TupleDescDeformAttr), which I believe is the same "another 8
bytes" you had mentioned. What I meant was that deleting attcacheoff
only reduces FormData_pg_attribute by 4 bytes per column, while adding
TupleDescDeformAttr adds 8 per column, so we still use 4 more bytes
per column with the patch.
I see I was confused, thank you for clarifying. As I said, the
concerns were only small; 4 more bytes only in memory shouldn't matter
much in the grand scheme of things.
I'm happy to keep going with this version of the patch
+1, go for it.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On Mon, Jul 1, 2024 at 5:07 PM David Rowley <dgrowleyml@gmail.com> wrote:
       8505168      stalled-cycles-backend:u  #    0.02% backend cycles idle
  165442142326      instructions:u            #    3.35  insn per cycle
                                              #    0.00  stalled cycles per insn
   39409877343      branches:u                #    3.945 G/sec
     146350275      branch-misses:u           #    0.37% of all branches
patched
      24259785      stalled-cycles-backend:u  #    0.05% backend cycles idle
  213688149862      instructions:u            #    4.29  insn per cycle
                                              #    0.00  stalled cycles per insn
   44147675129      branches:u                #    4.420 G/sec
      14282567      branch-misses:u           #    0.03% of all branches
You can see the branch predictor has done a *much* better job in the
patched code vs master with about 10x fewer misses. This should have
Nice!
helped contribute to the "insn per cycle" increase. 4.29 is quite
good for postgres. I often see that around 0.5. According to [1]
(relating to Zen4), "We get a ridiculous 12 NOPs per cycle out of the
micro-op cache". I'm unsure how micro-ops translate to "insn per
cycle" that's shown in perf stat. I thought 4-5 was about the maximum
pipeline size from today's era of CPUs.
"ins per cycle" is micro-ops retired (i.e. excludes those executed
speculatively on a mispredicted branch).
That article mentions that 6 micro-ops per cycle can enter the backend
from the frontend, but that can happen only with internally cached
ops, since only 4 instructions per cycle can be decoded. In specific
cases, CPUs can fuse multiple front-end instructions into a single
macro-op, which I think means a pair of micro-ops that can "travel
together" as one. The authors concluded further down that "Zen 4’s
reorder buffer is also special, because each entry can hold up to 4
NOPs. Pairs of NOPs are likely fused by the decoders, and pairs of
fused NOPs are fused again at the rename stage."
On Tue, 16 Jul 2024 at 00:13, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Tue, 2 Jul 2024 at 02:23, David Rowley <dgrowleyml@gmail.com> wrote:
I'm happy to keep going with this version of the patch
+1, go for it.
I've attached an updated patch series which are a bit more polished
than the last set. I've attempted to document the distinction between
FormData_pg_attribute and the abbreviated struct and tried to give an
indication of which one should be used.
Apart from that, changes include:
* I pushed the v1-0001 patch, so that's removed from the patch series.
* Rename TupleDescDeformAttr struct. It's now called CompactAttribute.
* Rename TupleDescDeformAttr() macro. It's now called TupleDescCompactAttr()
* Other macro renaming. e.g. ATT_IS_PACKABLE_FAST to COMPACT_ATTR_IS_PACKABLE
* In 0003, renamed CompactAttribute.attalign to attalignby to make it
easier to understand the distinction between the align char and the
number of bytes.
* Added 0004 patch to remove pg_attribute.attcacheoff.
There are a few more things that could be done to optimise a few more
things. For example, a bunch of places still use att_align_nominal().
With a bit more work, these could use att_nominal_alignby(). I'm not
yet sure of the cleanest way to do this. Adding alignby to the
typecache might be one way, or maybe just a function that converts the
attalign to the number of bytes. This would be useful in all places
where att_align_nominal() is used in loops, as converting the char to
the number of bytes would only be done once rather than once per loop.
I feel like this patch series is probably big enough for now, so I'd
like to opt for those improvements to take place as follow-on work.
I'll put this up for the CF bot to run with for a bit as the patch has
needed a rebase since I pushed the v1-0001 patch.
David
Attachments:
- v2-0001-Move-TupleDesc.attrs-out-of-line.patch
- v2-0002-Introduce-CompactAttribute-array-in-TupleDesc.patch
- v2-0003-Optimize-alignment-calculations-in-tuple-form-def.patch
- v2-0004-Remove-pg_attribute.attcacheoff-column.patch
On Tue, 6 Aug 2024 at 13:11, David Rowley <dgrowleyml@gmail.com> wrote:
I'll put this up for the CF bot to run with for a bit as the patch has
needed a rebase since I pushed the v1-0001 patch.
I've been doing more work on this patch set as I'd been concerned
there wasn't any validation to ensure the TupleDesc's
FormData_pg_attribute and the CompactAttribute are kept in sync when
the TupleDesc is altered in various places around the codebase. To
make this more robust, in USE_ASSERT_CHECKING builds, I made it so the
TupleDescCompactAttr() macro is turned into an inline function with an
Assert to validate the stored CompactAttribute vs one freshly
populated from the FormData_pg_attribute. Doing this caused me to find
a missed call to populate_compact_attribute(), so was worth the
effort. There's no apparent performance difference when running all
the tests with and without this extra checking.
I also spent time doing performance tests using 3 different machines.
I didn't document the previous performance tests, but I expect I ran
them on my AMD 7945hx laptop. On testing again today, I used that
Zen4 laptop plus an AMD 3990x (Zen2) and a 10-core Apple M2. I found
that it was only the 7945hx laptop that was showing any decent gains
from this patch :(. After thinking for a bit, I decided to expand the
CompactAttribute.attflags where I'd been bit packing in 5 boolean
fields from pg_attribute and expand those into bool fields. This made
the performance much better. The 0005 contains this change
independently.
Please see the attached "patches-0001-0005_results.png". This shows
the test query running 25% faster on the 7945hx laptop with gcc. The
least gains were from the Apple M2 at about a 9-10% increase.
The "patches-0001-0004_results.png" shows the results with the smaller
bit-packed CompactAttribute struct. You can see that without the 0005
patch, there are some performance regressions, so I propose including
0005 which widens CompactAttribute from 8 bytes to 16 bytes.
David
Attachments:
v3-0001-Move-TupleDesc.attrs-out-of-line.patch
v3-0002-Introduce-CompactAttribute-array-in-TupleDesc.patch
v3-0003-Optimize-alignment-calculations-in-tuple-form-def.patch
v3-0004-Remove-pg_attribute.attcacheoff-column.patch
v3-0005-Try-a-larger-CompactAttribute-struct-without-flag.patch
deform_test.sh.txt
patches-0001-0004_results.png
Hi,
A few weeks ago David and I discussed this patch. We were curious *why* the
flags approach was slower. It turns out that, at least on my machine, this is
just a compiler optimization issue. Putting a pg_compiler_barrier() just
after:
for (; attnum < natts; attnum++)
{
-	Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, attnum);
+	CompactAttribute *thisatt = TupleDescCompactAttr(tupleDesc, attnum);
addressed the issue. The problem basically is that instead of computing the
address of thisatt once, gcc for some reason instead uses complex addressing
lea instructions, which are slower on some machines.
Not sure what a good way to deal with that is. I haven't tried, but it could
be that just advancing thisatt using ++ would do the trick?
I think this generally looks quite good and the performance wins are quite
nice! I'm a bit concerned about the impact on various extensions - I haven't
looked, but I wouldn't be surprised if this requires a bunch of changes.
Perhaps we could reduce that a bit?
Could it make sense to use bitfields instead of flag values, to reduce the
impact?
From 35a6cdc2056decc31d67ad552826b573d2b66073 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Tue, 28 May 2024 20:10:50 +1200
Subject: [PATCH v3 2/5] Introduce CompactAttribute array in TupleDesc

This array stores a subset of the fields of FormData_pg_attribute,
primarily the ones for deforming tuples, but since we have additional
space, pack a few additional boolean columns in the attflags field.

Many areas of the code can get away with only accessing the
CompactAttribute array. Because the details of each attribute are
stored much more densely than in FormData_pg_attribute, many operations
can be performed while accessing fewer cachelines, which can improve
performance.

This also makes pg_attribute.attcacheoff redundant. A follow-on commit
will remove it.
Yay.
diff --git a/src/include/access/tupdesc.h b/src/include/access/tupdesc.h
index 2c435cdcb2..478ebbe1f4 100644
--- a/src/include/access/tupdesc.h
+++ b/src/include/access/tupdesc.h
@@ -45,6 +45,46 @@ typedef struct TupleConstr
 	bool		has_generated_stored;
 } TupleConstr;
+/*
+ * CompactAttribute
+ *		Cut-down version of FormData_pg_attribute for faster access for tasks
+ *		such as tuple deformation.
+ */
+typedef struct CompactAttribute
+{
+	int32		attcacheoff;	/* fixed offset into tuple, if known, or -1 */
For a short while I thought we could never have a large offset here, due to
toast, but that's only when tuples come from a table...
+	int16		attlen;			/* attr len in bytes or -1 = varlen, -2 =
+								 * cstring */
It's hard to imagine fixed-width types longer than 256 bits making sense,
but even if we wanted to change that, it'd not be sensible to do so in
this patch...
+#ifdef USE_ASSERT_CHECKING
+
+/*
+ * Accessor for the i'th CompactAttribute of tupdesc. In Assert enabled
+ * builds we verify that the CompactAttribute is populated correctly.
+ * This helps find bugs in places such as ALTER TABLE where code makes changes
+ * to the FormData_pg_attribute but forgets to call populate_compact_attribute
+ */
+static inline CompactAttribute *
+TupleDescCompactAttr(TupleDesc tupdesc, int i)
+{
+	CompactAttribute snapshot;
+	CompactAttribute *cattr = &tupdesc->compact_attrs[i];
+
+	/*
+	 * Take a snapshot of how the CompactAttribute is now before calling
+	 * populate_compact_attribute to make it up-to-date with the
+	 * FormData_pg_attribute.
+	 */
+	memcpy(&snapshot, cattr, sizeof(CompactAttribute));
+
+	populate_compact_attribute(tupdesc, i);
+
+	/* reset attcacheoff back to what it was */
+	cattr->attcacheoff = snapshot.attcacheoff;
+
+	/* Ensure the snapshot matches the freshly populated CompactAttribute */
+	Assert(memcmp(&snapshot, cattr, sizeof(CompactAttribute)) == 0);
+
+	return cattr;
+}
+
+#else
+/* Accessor for the i'th CompactAttribute of tupdesc */
+#define TupleDescCompactAttr(tupdesc, i) (&(tupdesc)->compact_attrs[(i)])
+#endif
Personally I'd have the ifdef inside the static inline function, rather than
changing between a static inline and a macro depending on
USE_ASSERT_CHECKING..
 #ifndef FRONTEND
 /*
- * Given a Form_pg_attribute and a pointer into a tuple's data area,
+ * Given a CompactAttribute pointer and a pointer into a tuple's data area,
  * return the correct value or pointer.
  *
  * We return a Datum value in all cases. If the attribute has "byval" false,
@@ -43,7 +43,7 @@ att_isnull(int ATT, const bits8 *BITS)
  *
  * Note that T must already be properly aligned for this to work correctly.
  */
-#define fetchatt(A,T) fetch_att(T, (A)->attbyval, (A)->attlen)
+#define fetchatt(A, T) fetch_att(T, CompactAttrByVal(A), (A)->attlen)
Stuff like this seems like it might catch some extensions unaware. I think it
might make sense to change macros like this to be a static inline, so that you
get proper type mismatch errors, rather than errors about invalid casts or
nonexistent fields.
From c75d072b8bc43ba4f9b7fbe2f99b65edc7421a15 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Wed, 29 May 2024 12:19:03 +1200
Subject: [PATCH v3 3/5] Optimize alignment calculations in tuple form/deform

This converts CompactAttribute.attalign from a char which is directly
derived from pg_attribute.attalign into a uint8 which specifies the
number of bytes to align the column by. Also, rename the field to
attalignby to make the distinction more clear in code.

This removes the complexity of checking each char value and transforming
that into the appropriate alignment call. This can just be a simple
TYPEALIGN passing in the number of bytes.
I like this a lot.
diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 08772de39f..b66eb178b9 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -1592,7 +1592,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	/* Skip non-varlena values, but update offset first */
 	if (thisatt->attlen != -1)
 	{
-		ctx->offset = att_align_nominal(ctx->offset, thisatt->attalign);
+		ctx->offset = att_nominal_alignby(ctx->offset, thisatt->attalignby);
 		ctx->offset = att_addlength_pointer(ctx->offset, thisatt->attlen,
 											tp + ctx->offset);
 		if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
A bit confused about the change in naming policy here...
From d1ec19a46a480b0c75f9df468b2765ad4e51dce2 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Tue, 3 Sep 2024 14:05:30 +1200
Subject: [PATCH v3 5/5] Try a larger CompactAttribute struct without flags

Benchmarks have shown that making the CompactAttribute struct larger and
getting rid of the flags to reduce the bitwise-ANDing requirements makes
things go faster.
I think we have some way of not needing this. I don't like my compiler barrier
hack, but I'm sure we can hold the hands of the compiler sufficiently to
generate useful code.
But this made me wonder:
---
src/backend/access/common/tupdesc.c | 21 ++++++----------
src/include/access/tupdesc.h | 39 ++++++++++-------------------
2 files changed, 20 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c
index 74f22cffb9..95c92e6585 100644
--- a/src/backend/access/common/tupdesc.c
+++ b/src/backend/access/common/tupdesc.c
@@ -67,20 +67,13 @@ populate_compact_attribute(TupleDesc tupdesc, int i)
 	dst->attcacheoff = -1;
 	dst->attlen = src->attlen;
-	dst->attflags = 0;
-
-	if (src->attbyval)
-		dst->attflags |= COMPACT_ATTR_FLAG_BYVAL;
-	if (src->attstorage != TYPSTORAGE_PLAIN)
-		dst->attflags |= COMPACT_ATTR_FLAG_IS_PACKABLE;
-	if (src->atthasmissing)
-		dst->attflags |= COMPACT_ATTR_FLAG_HAS_MISSING;
-	if (src->attisdropped)
-		dst->attflags |= COMPACT_ATTR_FLAG_IS_DROPPED;
-	if (src->attgenerated)
-		dst->attflags |= COMPACT_ATTR_FLAG_IS_GENERATED;
-	if (src->attnotnull)
-		dst->attflags |= COMPACT_ATTR_FLAG_IS_NOTNULL;
+
+	dst->attbyval = src->attbyval;
+	dst->attispackable = (src->attstorage != TYPSTORAGE_PLAIN);
+	dst->atthasmissing = src->atthasmissing;
+	dst->attisdropped = src->attisdropped;
+	dst->attgenerated = src->attgenerated;
+	dst->attnotnull = src->attnotnull;
It'd sure be nice if we could condense some of these fields in pg_attribute
too. It obviously shouldn't be a requirement for this patch, don't get me
wrong!
Production systems often have very large pg_attribute tables, which makes this
a fairly worthwhile optimization target.
I wonder if we could teach the bootstrap and deform logic about bool:1 or such
and have it generate the right code for that.
Greetings,
Andres Freund
On Fri, 4 Oct 2024 at 10:23, Andres Freund <andres@anarazel.de> wrote:
A few weeks ago David and I discussed this patch. We were curious *why* the
flags approach was slower. It turns out that, at least on my machine, this is
just a compiler optimization issue. Putting a pg_compiler_barrier() just
after:

for (; attnum < natts; attnum++)
{
-	Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, attnum);
+	CompactAttribute *thisatt = TupleDescCompactAttr(tupleDesc, attnum);

addressed the issue. The problem basically is that instead of computing the
address of thisatt once, gcc for some reason instead uses complex addressing
lea instructions, which are slower on some machines.

Not sure what a good way to deal with that is. I haven't tried, but it could
be that just advancing thisatt using ++ would do the trick?
I'm not seeing the same thing when I test with GCC. Subsequent calls
to TupleDescCompactAttr() are just adding 8 to the pointer value in
the register. Here's perf annotation output using gcc11.4.0:
patches 0001-0004 applied:
│ for (; attnum < natts; attnum++)
1.02 │ f7: add $0x1,%r12d
0.02 │ add $0x8,%r15
2.81 │ add $0x1,%r14
11.02 │ add $0x8,%rbp
and with 0005 applied:
│ for (; attnum < natts; attnum++)
4.45 │ e9: add $0x1,%r12d
0.09 │ add $0x8,%r15
0.09 │ add $0x1,%r13
9.09 │ add $0x10,%rbx
You can see the final add switches to adding 16.
With clang, it's a different story. I'm unsure what happened to the
debug symbols, but I think it's this part:
0.30 │ ↓ jne 220
1.10 │1e9: mov 0x30(%rsp),%rcx
0.62 │ mov %rax,(%rcx,%r15,8)
and with 0005:
0.70 │ movzbl %bpl,%ebp
1.34 │ cmovle %r8d,%ebp
1.95 │ d1: add $0x1,%r15
1.20 │ add $0x10,%r13
With clang, the performance does improve quite a bit after applying
0005. 152 tps with 0001-0004 and 167 tps with 0005. See attached
bench_v4.txt. I tried pg_compiler_barrier() after the
TupleDescCompactAttr() in slot_deform_heap_tuple() and it does not
switch to using "add".
I think this generally looks quite good and the performance wins are quite
nice! I'm a bit concerned about the impact on various extensions - I haven't
looked, but I wouldn't be surprised if this requires a bunch of changes.
Perhaps we could reduce that a bit?

Could it make sense to use bitfields instead of flag values, to reduce the
impact?
Yeah. That's a good idea. Using bitfields allows me to get rid of the
macros I had which were extracting the bool values from attflags.
That allows me to leave the fetchatt() macro alone so it still uses
the attbyval field. With bitfields both Form_pg_attribute and
CompactAttribute have that field, so the macro works for both. I think
this is likely the easiest way to reduce the number of changes and not
break existing users of that macro.
+#else
+/* Accessor for the i'th CompactAttribute of tupdesc */
+#define TupleDescCompactAttr(tupdesc, i) (&(tupdesc)->compact_attrs[(i)])
+#endif

Personally I'd have the ifdef inside the static inline function, rather than
changing between a static inline and a macro depending on
USE_ASSERT_CHECKING..
Yeah, that's a good idea. I must have been trying to keep it a macro
the same as TupleDescAttr().
-#define fetchatt(A,T) fetch_att(T, (A)->attbyval, (A)->attlen)
+#define fetchatt(A, T) fetch_att(T, CompactAttrByVal(A), (A)->attlen)

Stuff like this seems like it might catch some extensions unaware. I think it
might make sense to change macros like this to be a static inline, so that you
get proper type mismatch errors, rather than errors about invalid casts or
nonexistent fields.
That's no longer required after switching to bitfields since the
CompactAttribute has an attbyval field again so I've left this macro
untouched in the latest patch.
From c75d072b8bc43ba4f9b7fbe2f99b65edc7421a15 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Wed, 29 May 2024 12:19:03 +1200
Subject: [PATCH v3 3/5] Optimize alignment calculations in tuple form/deform

This converts CompactAttribute.attalign from a char which is directly
derived from pg_attribute.attalign into a uint8 which specifies the
number of bytes to align the column by. Also, rename the field to
attalignby to make the distinction more clear in code.

This removes the complexity of checking each char value and transforming
that into the appropriate alignment call. This can just be a simple
TYPEALIGN passing in the number of bytes.

I like this a lot.
Great.
diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 08772de39f..b66eb178b9 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -1592,7 +1592,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	/* Skip non-varlena values, but update offset first */
 	if (thisatt->attlen != -1)
 	{
-		ctx->offset = att_align_nominal(ctx->offset, thisatt->attalign);
+		ctx->offset = att_nominal_alignby(ctx->offset, thisatt->attalignby);
 		ctx->offset = att_addlength_pointer(ctx->offset, thisatt->attlen,
 											tp + ctx->offset);
 		if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)

A bit confused about the change in naming policy here...
The idea here is that after making CompactAttribute's attalign a uint8
of bytes instead of the char attalign, I renamed it "attalignby" so
that any code using a field named attalign wouldn't compile. That
gives me a bit more confidence when changing FormData_pg_attributes
into CompactAttributes that nothing is looking at the uint8 and
treating it like the attalign char. The reason I swapped the order of
the words around is because att_alignby_nominal() didn't really sound
correct. I was wondering if I'd keep that macro at all and not just
use TYPEALIGN() directly...
From d1ec19a46a480b0c75f9df468b2765ad4e51dce2 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Tue, 3 Sep 2024 14:05:30 +1200
Subject: [PATCH v3 5/5] Try a larger CompactAttribute struct without flags

Benchmarks have shown that making the CompactAttribute struct larger and
getting rid of the flags to reduce the bitwise-ANDing requirements makes
things go faster.

I think we have some way of not needing this. I don't like my compiler barrier
hack, but I'm sure we can hold the hands of the compiler sufficiently to
generate useful code.
I'm not sure what to do about this part. I did quite a bit of staring
at the attached benchmark results. Using gcc13.2 on my Zen2 machine,
the 0005 patch is quite a bit faster (138 tps vs 127 tps) than with
just 0004 and both versions use an "add" instruction to bump to the
next CompactAttribute element. So, with that CPU, maybe it's just
faster from getting rid of the bitwise-AND code to extract the
booleans.
It'd sure be nice if we could condense some of these fields in pg_attribute
too. It obviously shouldn't be a requirement for this patch, don't get me
wrong!

Production systems often have very large pg_attribute tables, which makes this
a fairly worthwhile optimization target.

I wonder if we could teach the bootstrap and deform logic about bool:1 or such
and have it generate the right code for that.
It would be good to do something about this. I think the reduction in
the sizeof(FormData_pg_attribute) would be good for using less memory
in RelationData.rd_att.
David
Attachments:
bench_v4.txt
v4-0001-Move-TupleDesc.attrs-out-of-line.patch
v4-0002-Introduce-CompactAttribute-array-in-TupleDesc.patch
v4-0003-Optimize-alignment-calculations-in-tuple-form-def.patch
v4-0004-Remove-pg_attribute.attcacheoff-column.patch
v4-0005-Try-a-larger-CompactAttribute-struct-without-bitf.patch
ср, 9 окт. 2024 г. в 17:00, David Rowley <dgrowleyml@gmail.com>:
I'm not sure what to do about this part. I did quite a bit of staring
at the attached benchmark results. Using gcc13.2 on my Zen2 machine,
the 0005 patch is quite a bit faster (138 tps vs 127 tps) than with
just 0004 and both versions use an "add" instruction to bump to the
next CompactAttribute element. So, with that CPU, maybe it's just
faster from getting rid of the bitwise-AND code to extract the
booleans.
Hey.
I've been testing this patch for the last week, I have M3 and i7 based MBP
around.
I've modified deform_test.sh to have an equal number of columns for all
cases, so that table size is always
the same, and also increased the extra count to 48. Attached.
i7 shows nice and stable results, there's a clear win from this patch (see
pic-1).
M3 behaves very strangely, I cannot explain the first burst and the
following drop in TPS. But it's reproducible.
Also, making tests do some IO yields better TPS up to 20 extra columns (see
pic-2). I tend to just ignore M3
results, as it's something to do with the HW, not the patch.
Combined results for in-memory sets are on pic-3.
Construct
sizeof(FormData_pg_attribute) * (src)->natts
is used in 7 places (in various forms), I thought it might be good
to use a macro here, say TupleArraySize(natts).
In v4-0002-Introduce-CompactAttribute-array-in-TupleDesc.patch
+#define COMPACT_ATTR_IS_PACKABLE(att) \
+	((att)->attlen == -1 && att->attispackable)

Seems the second att needs parentheses around it.
Although I haven't seen 30% speedup, I find this change very good to have.
--
Victor Yegorov
Attachments:
deform_test.sh.txt
pic-1.jpeg
pic-2.jpeg
20241129-i7-48-deform-master-5.txt
20241127-M3-48-deform-master-5.txt
20241129-i7-48-deform-patched-5.txt
20241127-M3-48-deform-patched-5.txt
On Sat, 30 Nov 2024 at 02:54, Victor Yegorov <vyegorov@gmail.com> wrote:
I've been testing this patch for the last week, I have M3 and i7 based MBP around.
Thanks for having a look at this and running the benchmarks.
Construct
sizeof(FormData_pg_attribute) * (src)->natts
is used in 7 places (in various forms), I thought it might be good
to use a macro here, say TupleArraySize(natts).
I ended up adjusting the code here so that TupleDescSize() returns the
full size and TupleDescAttrAddress() manually calculates the offset to
start the FormData_pg_attribute array. That allows
TupleDescFullSize() to be deleted. I changed how TupleDescCopy()
works as it used to perform the memcpy in 2 parts. I've changed that
to now perform a single memcpy() and reset the ->attrs field after the
memcpy so that it correctly points to the address for its own
TupleDesc rather than the one from the source.
In v4-0002-Introduce-CompactAttribute-array-in-TupleDesc.patch
+#define COMPACT_ATTR_IS_PACKABLE(att) \
+	((att)->attlen == -1 && att->attispackable)

Seems the second att needs parentheses around it.
Adjusted. Thanks.
Although I haven't seen 30% speedup, I find this change very good to have.
I think there's a bit more juice to squeeze out still. I started
another thread for a much different approach to increasing the tuple
deform performance over in [1]. The benchmarks I showed over there
show the results with all the v4 patches on this thread plus that
patch, and also another set of results from just the v4 patches from
here. My Apple M2 very much likes the patch from the other thread. I
don't have any decent Intel hardware to test on.
I've attached a v5 set of patches, which I think addresses everything
you mentioned. I've also shuffled the patches around a little to how
I think they should be committed. Here's a summary:
v5-0001: Adds the CompactAttribute struct, includes it in TupleDesc
and adds all the code to populate it. Also includes a very small
number of users of CompactAttribute.
v5-0002: Adjusts dozens of locations to use CompactAttribute struct
instead of the Form_pg_attribute struct. Lots of churn, but not
complex changes. Separated out from v5-0001 so it's easier to see the
important changes 0001 is making.
v5-0003: Change the CompactAttribute.attalign char field into a uint8
attalignby field to optimise alignment calculations and remove branching.
v5-0004: Delete the now unused pg_attribute.attcacheoff column and
Form_pg_attribute field.
v5-0005: This is the patch from [1] rebased atop this patch set.
I'll pick up the discussion on that thread, but offering a rebased
version here in case you'd like to try that one.
I spent all day today reviewing and fixing up a few missing comments
for the v5 patch series. I'm quite happy with these now. If nobody
else wants to look or test, I plan on pushing these tomorrow (Tuesday
UTC+13). If anyone wants me to delay so they can look, they better let
me know soon.
David
[1]: /messages/by-id/CAApHDvo9e0XG71WrefYaRv5n4xNPLK4k8LjD0mSR3c9KR2vi2Q@mail.gmail.com
Attachments:
v5-0001-Introduce-CompactAttribute-array-in-TupleDesc.patch
v5-0002-Use-CompactAttribute-instead-of-FormData_pg_attri.patch
v5-0003-Optimize-alignment-calculations-in-tuple-form-def.patch
v5-0004-Remove-pg_attribute.attcacheoff-column.patch
v5-0005-Speedup-tuple-deformation-with-additional-functio.patch
пн, 2 дек. 2024 г. в 13:24, David Rowley <dgrowleyml@gmail.com>:
I ended up adjusting the code here so that TupleDescSize() returns the
full size and TupleDescAttrAddress() manually calculates the offset to
start the FormData_pg_attribute array. That allows
TupleDescFullSize() to be deleted. I changed how TupleDescCopy()
works as it used to perform the memcpy in 2 parts. I've changed that
to now perform a single memcpy() and reset the ->attrs field after the
memcpy so that it correctly points to the address for its own
TupleDesc rather than the one from the source.
Nice!
I've attached a v5 set of patches, which I think addresses everything
you mentioned. I've also shuffled the patches around a little to how
I think they should be committed.
I'm glad that the patch from “More tuple deformation speedups” is moved
here; I wanted to mention that both patches should be committed together.
All is good, and the benefits are clearly visible (same setup used).
--
Victor Yegorov
Attachments:
deform-v5.jpeg
On Mon, 2 Dec 2024 at 23:24, David Rowley <dgrowleyml@gmail.com> wrote:
I spent all day today reviewing and fixing up a few missing comments
for the v5 patch series. I'm quite happy with these now. If nobody
else wants to look or test, I plan on pushing these tomorrow (Tuesday
UTC+13). If anyone wants me to delay so they can look, they better let
me know soon.
After making some last-minute cosmetic adjustments, I've just pushed
the 0001 patch.
I'm planning on pushing up to 0004 over the next 8 hours, pending
feedback from the buildfarm.
David
On Tue, 3 Dec 2024 at 16:54, David Rowley <dgrowleyml@gmail.com> wrote:
After making some last-minute cosmetic adjustments, I've just pushed
the 0001 patch.
So that commit broke all of the debug_parallel_query = regress
buildfarm animals. This was due to how I'd changed the
TupleDescData.attrs array to a pointer which I was trying to keep set
to point to the FormData_pg_attribute array which came after the
variable length CompactAttribute array in the TupleDesc. That turned
out to be hard to maintain as in typcache.c, we can get TupleDescs
from shared memory. When that happened, I was adjusting the ->attrs
pointer to the local process's address after the call to
dsa_get_address(), but I didn't consider the fact that that was
changing the shared TupleDesc :-(. Other processes (of course) didn't
like that; that's why the regression tests failed with
debug_parallel_query = regress. The bad attrs address caused issues
during lookups of the RecordCacheHash hash table.
The following was enough to break it:
set debug_parallel_query = regress;
SELECT pg_input_is_valid('34', 'int2');
SELECT pg_input_is_valid('asdf', 'int2');
SELECT pg_input_is_valid('50000', 'int2');
SELECT * FROM pg_input_error_info('50000', 'int2');
-- While we're here, check int2vector as well
SELECT pg_input_is_valid(' 1 3 5 ', 'int2vector');
SELECT * FROM pg_input_error_info('1 asdf', 'int2vector');
In the attached v7-0001 patch, I've now got rid of the
TupleDescData->attrs field. Figuring out the base address of the
FormData_pg_attribute array is now done by the TupleDescAttr() inline
function (this used to be a macro). Because that function is called
often in a tight loop, I wanted to ensure compilers wouldn't calculate
the base address of the array on each iteration of the loop. Looking
at [1] (https://godbolt.org/z/dvnGfnqxz), it seems gcc and clang are smart about this and calculate the
base address before the loop and add sizeof(FormData_pg_attribute) to
the register that's being used for the element pointer.
Changing this then caused some other issues... The GIST code was doing
the following to get a TupleDesc without the INCLUDE columns of the
index:
giststate->nonLeafTupdesc = CreateTupleDescCopyConstr(index->rd_att);
giststate->nonLeafTupdesc->natts =
IndexRelationGetNumberOfKeyAttributes(index);
Since I'm calculating the base address of the FormData_pg_attribute
array in TupleDesc by looking at natts, when this code changes natts
on the fly, that means calls to TupleDescAttr end up looking in the
wrong place for the required FormData_pg_attribute element. To fix
this I invented CreateTupleDescTruncatedCopy() and used it in all the
places that were fiddling with the natts field. I also didn't see any
reason to copy the constraints for the truncated TupleDesc. Indexes
can't have constraints.
The attached v7-0001 patch is the previous patch adjusted as per
above. v7-0002 is not for commit. That's me cheekily exploiting the
CFBot's CI testing machines to run in debug_parallel_query = regress
mode.
David
Hi,
On 2024-12-05 01:42:36 +1300, David Rowley wrote:
On Tue, 3 Dec 2024 at 16:54, David Rowley <dgrowleyml@gmail.com> wrote:
After making some last-minute cosmetic adjustments, I've just pushed
the 0001 patch.

So that commit broke all of the debug_parallel_query = regress
buildfarm animals.
Seems we need a query in the regression tests that triggers the issue even
without debug_parallel_query = regress.
In the attached v7-0001 patch, I've now got rid of the
TupleDescData->attrs field. Figuring out the base address of the
FormData_pg_attribute array is now done by the TupleDescAttr() inline
function (this used to be a macro).
Possibly stupid idea: Could we instead store the attributes *before* the main
TupleDescData, with increasing "distance" for increased attnos? That way we
wouldn't need to calculate any variable offsets. Of course the price would be
to have some slightly more complicated invocation of pfree(), but that's
comparatively rare.
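A compilable sketch of that idea, with toy structs (illustrative only): the attributes sit before the header at constant negative offsets, so the accessor needs no natts-dependent arithmetic, and only the free path has to step back to the allocation start.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Allocation layout: [att N-1] ... [att 1] [att 0] [TupleDescData].
 * The TupleDesc pointer points at the header; attribute i lives at a
 * fixed negative offset that does not depend on natts.
 */
typedef struct CompactAttribute { int attlen; } CompactAttribute;
typedef struct TupleDescData { int natts; } TupleDescData;
typedef TupleDescData *TupleDesc;

static TupleDesc
make_tupdesc(int natts)
{
	char	   *alloc = malloc(natts * sizeof(CompactAttribute)
							   + sizeof(TupleDescData));
	TupleDesc	td = (TupleDesc) (alloc + natts * sizeof(CompactAttribute));

	td->natts = natts;
	return td;
}

static inline CompactAttribute *
TupleDescCompactAttr(TupleDesc td, int i)
{
	/* constant-scaled negative index; no dependence on natts */
	return (CompactAttribute *) td - 1 - i;
}

static void
free_tupdesc(TupleDesc td)
{
	/* the "slightly more complicated pfree": step back to the real start */
	free((char *) td - td->natts * sizeof(CompactAttribute));
}

int
main(void)
{
	TupleDesc	td = make_tupdesc(3);

	for (int i = 0; i < 3; i++)
		TupleDescCompactAttr(td, i)->attlen = i + 1;
	printf("%d %d %d\n", TupleDescCompactAttr(td, 0)->attlen,
		   TupleDescCompactAttr(td, 1)->attlen,
		   TupleDescCompactAttr(td, 2)->attlen);
	free_tupdesc(td);
	return 0;
}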
Because that function is called often in a tight loop, I wanted to ensure
compilers wouldn't calculate the base address of the array on each iteration
of the loop. Looking at [1], it seems gcc and clang are smart about this
and calculate the base address before the loop and add
sizeof(FormData_pg_attribute) to the register that's being used for the
element pointer.Changing this then caused some other issues... The GIST code was doing
the following to get a TupleDesc without the INCLUDE columns of the
index:giststate->nonLeafTupdesc = CreateTupleDescCopyConstr(index->rd_att);
giststate->nonLeafTupdesc->natts =
IndexRelationGetNumberOfKeyAttributes(index);
That is somewhat ugly...
Since I'm calculating the base address of the FormData_pg_attribute
array in TupleDesc by looking at natts, when this code changes natts
on the fly, that means calls to TupleDescAttr end up looking in the
wrong place for the required FormData_pg_attribute element.
It's possible out-of-core code is doing that too, could we detect this in
assert enabled builds?
To fix this I invented CreateTupleDescTruncatedCopy() and used it in all the
places that were fiddling with the natts field.
Makes sense.
v7-0002 is not for commit. That's me cheekily exploiting the CFBot's CI
testing machines to run in debug_parallel_query = regress mode.
I think it might actually make sense to enable debug_parallel_query = regress
in one of the CI tasks... Perhaps FreeBSD? That's reasonably fast right now
(and a *LOT* cheaper to run than macos).
Greetings,
Andres Freund