Move pg_attribute.attcompression to earlier in struct for reduced size?
Hi,
pg_attribute is one of the biggest table in a new cluster, and often the
biggest table in production clusters. Its size is also quite relevant in
memory, due to all the TupleDescs we allocate.
I just noticed that the new attcompression increased the size not just
by 1 byte, but by 4, due to padding. While an increase from 112 to 116
bytes isn't the end of the world, it does seem worth considering using
existing unused bytes instead?
If we moved attcompression to all the other bool/char fields, we'd avoid
that size increase, as there's an existing 2 byte hole.
Of course there's the argument that we shouldn't change the column order
for existing SELECT * queries, but the existing placement already does
(the CATALOG_VARLEN columns follow).
Greetings,
Andres Freund
On Mon, May 17, 2021 at 01:48:03PM -0700, Andres Freund wrote:
pg_attribute is one of the biggest table in a new cluster, and often the
biggest table in production clusters. Its size is also quite relevant in
memory, due to all the TupleDescs we allocate.I just noticed that the new attcompression increased the size not just
by 1 byte, but by 4, due to padding. While an increase from 112 to 116
bytes isn't the end of the world, it does seem worth considering using
existing unused bytes instead?
+1
FYI: attcompression was an OID until a few weeks before the feature was merged,
and there were several issues related to that:
aa25d1089 - fixed two issues
226e2be38
--
Justin
Andres Freund <andres@anarazel.de> writes:
If we moved attcompression to all the other bool/char fields, we'd avoid
that size increase, as there's an existing 2 byte hole.
+1. Looks to me like its existing placement was according to the good
old "add new things at the end" anti-pattern. It certainly isn't
related to the adjacent fields.
Putting it just after attalign seems like a reasonably sane choice
from the standpoint of grouping things affecting physical storage;
and as you say, that wins from the standpoint of using up alignment
padding rather than adding more.
Personally I'd think the most consistent order in that area would
be attbyval, attalign, attstorage, attcompression; but perhaps it's
too late to swap the order of attstorage and attalign.
regards, tom lane
Hi,
On 2021-05-17 17:06:32 -0400, Tom Lane wrote:
Putting it just after attalign seems like a reasonably sane choice
from the standpoint of grouping things affecting physical storage;
and as you say, that wins from the standpoint of using up alignment
padding rather than adding more.
Makes sense to me.
Personally I'd think the most consistent order in that area would
be attbyval, attalign, attstorage, attcompression; but perhaps it's
too late to swap the order of attstorage and attalign.
Given that we've put in new fields in various positions on a fairly
regular basis, I don't think swapping around attalign, attstorage would
cause a meaningful amount of additional pain. Personally I don't have a
preference for how these are ordered.
Greetings,
Andres Freund
On Mon, May 17, 2021 at 02:28:57PM -0700, Andres Freund wrote:
On 2021-05-17 17:06:32 -0400, Tom Lane wrote:
Putting it just after attalign seems like a reasonably sane choice
from the standpoint of grouping things affecting physical storage;
and as you say, that wins from the standpoint of using up alignment
padding rather than adding more.Makes sense to me.
+1.
Personally I'd think the most consistent order in that area would
be attbyval, attalign, attstorage, attcompression; but perhaps it's
too late to swap the order of attstorage and attalign.Given that we've put in new fields in various positions on a fairly
regular basis, I don't think swapping around attalign, attstorage would
cause a meaningful amount of additional pain. Personally I don't have a
preference for how these are ordered.
If you switch attcompression, I'd say to go for the others while on
it. It would not be the first time in history there is a catalog
version bump between betas.
--
Michael
On Tue, May 18, 2021 at 10:24:36AM +0900, Michael Paquier wrote:
If you switch attcompression, I'd say to go for the others while on
it. It would not be the first time in history there is a catalog
version bump between betas.
This is still an open item. FWIW, I can get behind the reordering
proposed by Tom for the consistency gained with pg_type, leading to
the attached to reduce the size of FormData_pg_attribute from 116b to
112b.
--
Michael
Attachments:
pg-att-align.patchtext/x-diff; charset=us-asciiDownload+10-10
On Fri, May 21, 2021 at 12:02 PM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, May 18, 2021 at 10:24:36AM +0900, Michael Paquier wrote:
If you switch attcompression, I'd say to go for the others while on
it. It would not be the first time in history there is a catalog
version bump between betas.This is still an open item. FWIW, I can get behind the reordering
proposed by Tom for the consistency gained with pg_type, leading to
the attached to reduce the size of FormData_pg_attribute from 116b to
112b.
This makes sense, thanks for working on this.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Michael Paquier <michael@paquier.xyz> writes:
This is still an open item. FWIW, I can get behind the reordering
proposed by Tom for the consistency gained with pg_type, leading to
the attached to reduce the size of FormData_pg_attribute from 116b to
112b.
I think we need to do more than that. It's certainly not okay to
leave catalogs.sgml out of sync with reality. And maybe I'm just
an overly anal-retentive sort, but I think that code that manipulates
tuples ought to match the declared field order if there's not some
specific reason to do otherwise. So that led me to the attached.
It was a good thing I went through this code, too, because I noticed
one serious bug (attcompression not checked in equalTupleDescs) and
another thing that looks like a bug: there are two places that set
up attcompression depending on
if (rel->rd_rel->relkind == RELKIND_RELATION ||
rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
This seems fairly nuts; in particular, why are matviews excluded?
regards, tom lane
Attachments:
reorder-pg_attribute-2.patchtext/x-diff; charset=us-ascii; name=reorder-pg_attribute-2.patchDownload+66-64
Hi,
On 2021-05-21 11:01:03 -0400, Tom Lane wrote:
It was a good thing I went through this code, too, because I noticed
one serious bug (attcompression not checked in equalTupleDescs) and
another thing that looks like a bug: there are two places that set
up attcompression depending onif (rel->rd_rel->relkind == RELKIND_RELATION ||
rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)This seems fairly nuts; in particular, why are matviews excluded?
Yea, that doesn't seem right. I was confused why this appears to work at
all right now. It only does because REFRESH always inserts into a
staging table first - which is created as a normal table. For
non-concurrent refresh that relation's relfilenode is swapped with the
MV's. For concurrent refresh we actually do insert into the MV - but we
never need to compress a datum at that point, because it'll already have
been compressed during the insert into the temp table.
I think there might something slightly off with concurrent refresh - the
TEMPORARY diff table that is created doesn't use the matview's
compression settings. Which means all tuples need to be recompressed
unnecessarily, if default_toast_compression differs from a column in the
materialized view.
SET default_toast_compression = 'lz4';
DROP MATERIALIZED VIEW IF EXISTS wide_mv;
CREATE MATERIALIZED VIEW wide_mv AS SELECT 1::int4 AS key, random() || string_agg(i::text, '') data FROM generate_series(1, 10000) g(i);CREATE UNIQUE INDEX ON wide_mv(key);
ALTER MATERIALIZED VIEW wide_mv ALTER COLUMN data SET COMPRESSION pglz;
REFRESH MATERIALIZED VIEW CONCURRENTLY wide_mv;
With the SET COMPRESSION pglz I see the following compression calls:
1) pglz in refresh_matview_datafill
2) lz4 during temp table CREATE TEMP TABLE AS
3) pglz during the INSERT into the matview
Without I only see 1) and 2).
Greetings,
Andres Freund
Hi,
On 2021-05-21 11:01:03 -0400, Tom Lane wrote:
It was a good thing I went through this code, too, because I noticed
one serious bug (attcompression not checked in equalTupleDescs) and
another thing that looks like a bug:
Grepping for attcompression while trying to understand the issue Tom
reported I found a substantial, but transient, memory leak:
During VACUUM FULL reform_and_rewrite_tuple() detoasts the old value if
it was compressed with a different method, while in
TopTransactionContext. There's nothing freeing that until
TopTransactionContext ends - obviously not great for a large relation
being VACUUM FULLed.
SET default_toast_compression = 'lz4';
DROP TABLE IF EXISTS wide CASCADE;
CREATE TABLE wide(data text not null);
INSERT INTO wide(data) SELECT random() || (SELECT string_agg(i::text, '') data FROM generate_series(1, 100000) g(i)) FROM generate_series(1, 1000);
\c
SET client_min_messages = 'log';
SET log_statement_stats = on;
VACUUM FULL wide;
...
DETAIL: ! system usage stats:
! 0.836638 s user, 0.375344 s system, 1.268705 s elapsed
! [2.502369 s user, 0.961681 s system total]
! 18052 kB max resident size
! 0/1789088 [0/3530048] filesystem blocks in/out
! 0/277 [0/205655] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 22/1 [55/6] voluntary/involuntary context switches
LOCATION: ShowUsage, postgres.c:4886
VACUUM
Time: 1269.029 ms (00:01.269)
\c
ALTER TABLE wide ALTER COLUMN data SET COMPRESSION pglz;
SET client_min_messages = 'log';
SET log_statement_stats = on;
VACUUM FULL wide;
...
DETAIL: ! system usage stats:
! 19.816867 s user, 0.493233 s system, 20.320711 s elapsed
! [19.835995 s user, 0.493233 s system total]
! 491588 kB max resident size
! 0/656032 [0/656048] filesystem blocks in/out
! 0/287363 [0/287953] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 1/24 [13/26] voluntary/involuntary context switches
Note the drastically different "max resident size". This is with huge
pages (removing s_b from RSS), but it's visible even without.
Random fun note:
time for VACUUM FULL wide with recompression:
pglz->lz4: 3.2s
lz4->pglz: 20.3s
Greetings,
Andres Freund
I wrote:
I think we need to do more than that. It's certainly not okay to
leave catalogs.sgml out of sync with reality. And maybe I'm just
an overly anal-retentive sort, but I think that code that manipulates
tuples ought to match the declared field order if there's not some
specific reason to do otherwise. So that led me to the attached.
Pushed that after another round of review.
... there are two places that set
up attcompression depending on
if (rel->rd_rel->relkind == RELKIND_RELATION ||
rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
This seems fairly nuts; in particular, why are matviews excluded?
While I've not actually tested this, it seems to me that we could
just drop these relkind tests altogether. It won't hurt anything
to set up attcompression in relation descriptors where it'll never
be consulted.
However, the more I looked at that code the less I liked it.
I think the way that compression selection is handled for indexes,
ie consult default_toast_compression on-the-fly, is *far* saner
than what is currently implemented for tables. So I think we
should redefine attcompression as "ID of a compression method
to use, or \0 to select the prevailing default. Ignored if
attstorage does not permit the use of compression". This would
result in approximately 99.44% of all columns just having zero
attcompression, greatly simplifying the tupdesc setup code, and
also making it much easier to flip an installation over to a
different preferred compression method.
I'm happy to prepare a patch if that sketch sounds sane.
(Note that the existing comment claiming that attcompression
"Must be InvalidCompressionMethod if and only if typstorage is
'plain' or 'external'" is a flat out lie in any case; *both*
directions of that claim are wrong.)
regards, tom lane
On Fri, May 21, 2021 at 02:19:29PM -0700, Andres Freund wrote:
During VACUUM FULL reform_and_rewrite_tuple() detoasts the old value if
it was compressed with a different method, while in
TopTransactionContext. There's nothing freeing that until
TopTransactionContext ends - obviously not great for a large relation
being VACUUM FULLed.
Yeah, that's not good. The confusion comes from the fact that we'd
just overwrite the values without freeing them out if recompressed, so
something like the attached would be fine?
--
Michael
Attachments:
detoast-recompress.patchtext/x-diff; charset=us-asciiDownload+19-2
On Sun, May 23, 2021 at 12:25:10PM -0400, Tom Lane wrote:
While I've not actually tested this, it seems to me that we could
just drop these relkind tests altogether. It won't hurt anything
to set up attcompression in relation descriptors where it'll never
be consulted.
Wouldn't it be confusing to set up attcompression for relkinds without
storage, like views?
However, the more I looked at that code the less I liked it.
I think the way that compression selection is handled for indexes,
ie consult default_toast_compression on-the-fly, is *far* saner
than what is currently implemented for tables. So I think we
should redefine attcompression as "ID of a compression method
to use, or \0 to select the prevailing default. Ignored if
attstorage does not permit the use of compression". This would
result in approximately 99.44% of all columns just having zero
attcompression, greatly simplifying the tupdesc setup code, and
also making it much easier to flip an installation over to a
different preferred compression method.
Would there be any impact when it comes to CTAS or matviews where the
current code assumes that the same compression method as the one from
the original value gets used, making the creation of the new relation
cheaper because there is less de-toasting and re-toasting?
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
On Sun, May 23, 2021 at 12:25:10PM -0400, Tom Lane wrote:
While I've not actually tested this, it seems to me that we could
just drop these relkind tests altogether. It won't hurt anything
to set up attcompression in relation descriptors where it'll never
be consulted.
Wouldn't it be confusing to set up attcompression for relkinds without
storage, like views?
No more so than setting up attstorage, surely.
... I think we
should redefine attcompression as "ID of a compression method
to use, or \0 to select the prevailing default. Ignored if
attstorage does not permit the use of compression". This would
result in approximately 99.44% of all columns just having zero
attcompression, greatly simplifying the tupdesc setup code, and
also making it much easier to flip an installation over to a
different preferred compression method.
Would there be any impact when it comes to CTAS or matviews where the
current code assumes that the same compression method as the one from
the original value gets used, making the creation of the new relation
cheaper because there is less de-toasting and re-toasting?
I'd still envision copying the source attcompression setting in such
cases. I guess the question is (a) does that code path actually
recompress values that have the "wrong" compression, and (b) if it
does, is that wrong? If you think (a) is correct behavior, then
I don't see why refreshing after changing default_toast_compression
shouldn't cause that to happen.
regards, tom lane
On Fri, May 21, 2021 at 8:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
if (rel->rd_rel->relkind == RELKIND_RELATION ||
rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)This seems fairly nuts; in particular, why are matviews excluded?
The matviews are excluded only in "ATExecAddColumn()" right? But we
can not ALTER TABLE ADD COLUMN to matviews right? I agree that even
if we don't skip matview it will not create any issue as matview will
not reach here. Am I missing something?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, May 24, 2021 at 9:39 AM Michael Paquier <michael@paquier.xyz> wrote:
On Fri, May 21, 2021 at 02:19:29PM -0700, Andres Freund wrote:
During VACUUM FULL reform_and_rewrite_tuple() detoasts the old value if
it was compressed with a different method, while in
TopTransactionContext. There's nothing freeing that until
TopTransactionContext ends - obviously not great for a large relation
being VACUUM FULLed.Yeah, that's not good. The confusion comes from the fact that we'd
just overwrite the values without freeing them out if recompressed, so
something like the attached would be fine?
/* Be sure to null out any dropped columns */
for (i = 0; i < newTupDesc->natts; i++)
{
+ tup_values[i] = values[i];
+
if (TupleDescAttr(newTupDesc, i)->attisdropped)
isnull[i] = true;
I think you don't need to initialize tup_values[i] with the
values[i];, other than that looks fine to me.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, May 24, 2021 at 11:32:22AM +0530, Dilip Kumar wrote:
I think you don't need to initialize tup_values[i] with the
values[i];, other than that looks fine to me.
You mean because heap_deform_tuple() does this job, right? Sure.
--
Michael
On Mon, May 24, 2021 at 2:23 PM Michael Paquier <michael@paquier.xyz> wrote:
On Mon, May 24, 2021 at 11:32:22AM +0530, Dilip Kumar wrote:
I think you don't need to initialize tup_values[i] with the
values[i];, other than that looks fine to me.You mean because heap_deform_tuple() does this job, right? Sure.
Sorry, I just noticed that my statement was incomplete in last mail,
what I wanted to say is that if the attisdropped then we can avoid
"tup_values[i] = values[i]", so in short we can move "tup_values[i] =
values[i]" in the else part of " if (TupleDescAttr(newTupDesc,
i)->attisdropped)" check.
Like this.
if (TupleDescAttr(newTupDesc, i)->attisdropped)
isnull[i] = true;
else
tup_values[i] = values[i];
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, May 24, 2021 at 02:46:11PM +0530, Dilip Kumar wrote:
Like this.
if (TupleDescAttr(newTupDesc, i)->attisdropped)
isnull[i] = true;
else
tup_values[i] = values[i];
That would work. Your suggestion, as I understood it first, makes the
code simpler by not using tup_values at all as the set of values[] is
filled when the values and nulls are extracted. So I have gone with
this simplification, and applied the patch (moved a bit the comments
while on it).
--
Michael
On Tue, 25 May 2021 at 11:16 AM, Michael Paquier <michael@paquier.xyz>
wrote:
On Mon, May 24, 2021 at 02:46:11PM +0530, Dilip Kumar wrote:
Like this.
if (TupleDescAttr(newTupDesc, i)->attisdropped)
isnull[i] = true;
else
tup_values[i] = values[i];That would work. Your suggestion, as I understood it first, makes the
code simpler by not using tup_values at all as the set of values[] is
filled when the values and nulls are extracted. So I have gone with
this simplification, and applied the patch (moved a bit the comments
while on it).
Perfect. That looks much better.