CLUSTER can change t_len

Started by Jeff Davisabout 15 years ago7 messages
#1Jeff Davis
pgsql@j-davis.com

I don't think that this is a bug exactly, but it seems inconsistent.

See case below. After the item length gets changed, then when reading
the tuple later you get a t_len that includes padding.

We should document in a comment that t_len can mean multiple things. Or,
we should fix raw_heap_insert() to be consistent with the rest of the
code, which doesn't MAXALIGN the t_len.

Regards,
Jeff Davis

create table foo(i int4 unique);
insert into foo values(1);
insert into foo values(2);
checkpoint;
select relfilenode from pg_class where relname = 'foo';

relfilenode
-------------
16413
(1 row)

--
-- Look at on-disk item lengths, which are 28 (0x38 >> 1)
-- on my machine
--

cluster foo using foo_i_key;
select relfilenode from pg_class where relname = 'foo';

relfilenode
-------------
16418
(1 row)

checkpoint;

--
-- Now look again. They are 32 (0x40 >> 1) on my machine.
--

#2Itagaki Takahiro
itagaki.takahiro@gmail.com
In reply to: Jeff Davis (#1)
Re: CLUSTER can change t_len

On Tue, Nov 9, 2010 at 12:44 PM, Jeff Davis <pgsql@j-davis.com> wrote:

See case below. After the item length gets changed, then when reading
the tuple later you get a t_len that includes padding.

We can easily find it with pageinspect:

\i pageinspect.sql
create table foo(i int4);
insert into foo values(1);
SELECT lp, lp_len FROM heap_page_items(get_raw_page('foo', 0));
lp | lp_len
----+--------
1 | 28
VACUUM FULL foo;
SELECT lp, lp_len FROM heap_page_items(get_raw_page('foo', 0));
lp | lp_len
----+--------
1 | 32

We should document in a comment that t_len can mean multiple things. Or,
we should fix raw_heap_insert() to be consistent with the rest of the
code, which doesn't MAXALIGN the t_len.

We have a comment /* be conservative */ in the function, but I'm not sure
we actually need the MAXALIGN. However, there would be almost no benefits
to keep t_len in small value because we often treat memory in MAXALIGN unit.

diff --git a/src/backend/access/heap/rewriteheap.c
b/src/backend/access/heap/rewriteheap.c
index 0bd1865..0ed94ef 100644
*** a/src/backend/access/heap/rewriteheap.c
--- b/src/backend/access/heap/rewriteheap.c
*************** raw_heap_insert(RewriteState state, Heap
*** 586,592 ****
    else
        heaptup = tup;

! len = MAXALIGN(heaptup->t_len); /* be conservative */

    /*
     * If we're gonna fail for oversize tuple, do it right away
--- 586,592 ----
    else
        heaptup = tup;

! len = heaptup->t_len;

/*
* If we're gonna fail for oversize tuple, do it right away

--
Itagaki Takahiro

#3Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Itagaki Takahiro (#2)
Re: CLUSTER can change t_len

On 09.11.2010 11:11, Itagaki Takahiro wrote:

On Tue, Nov 9, 2010 at 12:44 PM, Jeff Davis<pgsql@j-davis.com> wrote:

See case below. After the item length gets changed, then when reading
the tuple later you get a t_len that includes padding.

We can easily find it with pageinspect:

\i pageinspect.sql
create table foo(i int4);
insert into foo values(1);
SELECT lp, lp_len FROM heap_page_items(get_raw_page('foo', 0));
lp | lp_len
----+--------
1 | 28
VACUUM FULL foo;
SELECT lp, lp_len FROM heap_page_items(get_raw_page('foo', 0));
lp | lp_len
----+--------
1 | 32

We should document in a comment that t_len can mean multiple things. Or,
we should fix raw_heap_insert() to be consistent with the rest of the
code, which doesn't MAXALIGN the t_len.

We have a comment /* be conservative */ in the function, but I'm not sure
we actually need the MAXALIGN. However, there would be almost no benefits
to keep t_len in small value because we often treat memory in MAXALIGN unit.

Hmm, the conservatism at that point affects the free space calculations.
I'm not sure if it makes any difference in practice, but I'm also not
sure it doesn't. pd_upper is always MAXALIGNed, but pd_lower is not.

This would be more in line with what the main heap_insert code does:

--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -641,7 +641,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
  	}
  	/* And now we can insert the tuple into the page */
-	newoff = PageAddItem(page, (Item) heaptup->t_data, len,
+	newoff = PageAddItem(page, (Item) heaptup->t_data, heaptup->t_len,
  						 InvalidOffsetNumber, false, true);
  	if (newoff == InvalidOffsetNumber)
  		elog(ERROR, "failed to add tuple");

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#4Greg Stark
gsstark@mit.edu
In reply to: Heikki Linnakangas (#3)
Re: CLUSTER can change t_len

On Tue, Nov 9, 2010 at 10:20 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

We have a comment /* be conservative */ in the function, but I'm not sure
we actually need the MAXALIGN. However, there would be almost no benefits
to keep t_len in small value because we often treat memory in MAXALIGN
unit.

Hmm, the conservatism at that point affects the free space calculations. I'm
not sure if it makes any difference in practice, but I'm also not sure it
doesn't. pd_upper is always MAXALIGNed, but pd_lower is not.

This would be more in line with what the main heap_insert code does:

Doesn't this cause assertion failures in heap_fill_tuple when the data
size isn't what's expected? I guess we never actually use the t_len
for later tuple reconstructions, we just recompute the needed size?
--
greg

#5Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Greg Stark (#4)
Re: CLUSTER can change t_len

On 09.11.2010 15:57, Greg Stark wrote:

On Tue, Nov 9, 2010 at 10:20 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

We have a comment /* be conservative */ in the function, but I'm not sure
we actually need the MAXALIGN. However, there would be almost no benefits
to keep t_len in small value because we often treat memory in MAXALIGN
unit.

Hmm, the conservatism at that point affects the free space calculations. I'm
not sure if it makes any difference in practice, but I'm also not sure it
doesn't. pd_upper is always MAXALIGNed, but pd_lower is not.

This would be more in line with what the main heap_insert code does:

Doesn't this cause assertion failures in heap_fill_tuple when the data
size isn't what's expected? I guess we never actually use the t_len
for later tuple reconstructions, we just recompute the needed size?

Right, the length from t_len or the item pointer is never passed to
heap_fill_tuple.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#3)
Re: CLUSTER can change t_len

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

On 09.11.2010 11:11, Itagaki Takahiro wrote:

We have a comment /* be conservative */ in the function, but I'm not sure
we actually need the MAXALIGN. However, there would be almost no benefits
to keep t_len in small value because we often treat memory in MAXALIGN unit.

Hmm, the conservatism at that point affects the free space calculations.
I'm not sure if it makes any difference in practice, but I'm also not
sure it doesn't. pd_upper is always MAXALIGNed, but pd_lower is not.

I tend to agree with Jeff's original point that the behavior should
match regular tuple insertion exactly. This isn't about saving space,
because it won't; it's about not confusing readers by doing the same
thing in randomly different ways. I will also note that the regular
path is FAR better tested than raw_heap_insert. If there are any bugs
here, it's 1000:1 they're in raw_heap_insert not the regular path.

regards, tom lane

#7Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#6)
Re: CLUSTER can change t_len

On 09.11.2010 17:14, Tom Lane wrote:

Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:

On 09.11.2010 11:11, Itagaki Takahiro wrote:

We have a comment /* be conservative */ in the function, but I'm not sure
we actually need the MAXALIGN. However, there would be almost no benefits
to keep t_len in small value because we often treat memory in MAXALIGN unit.

Hmm, the conservatism at that point affects the free space calculations.
I'm not sure if it makes any difference in practice, but I'm also not
sure it doesn't. pd_upper is always MAXALIGNed, but pd_lower is not.

I tend to agree with Jeff's original point that the behavior should
match regular tuple insertion exactly. This isn't about saving space,
because it won't; it's about not confusing readers by doing the same
thing in randomly different ways. I will also note that the regular
path is FAR better tested than raw_heap_insert. If there are any bugs
here, it's 1000:1 they're in raw_heap_insert not the regular path.

Agreed. I've committed my patch to make it behave like heap_insert.
Thank you, Itagaki, for the easy test case using pageinspect.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com