Remove xmin and cmin from frozen tuples
Hi Hackers,
I think it would be a waste to retain xmin and cmin for frozen tuples
because their values represent only 'visible for all transactions'.
Additionally, most tuples in database can be frozen potentially.
I wrote a makeshift patch to compress xmin and cmin (8bytes) to
1-bit flag, using tuple overlaping.
Is this idea worth trying?
Also, it will be useful to combine it and more aggressive freeze vacuum,
for example, a freezer integrated with bgwriter.
(The following is test of the attached patch)
* Test query
1. create table test (a int);
2. insert into test select * from generate_series(1, 100000);
3. update test set a = a where a % 100 = 0; # to force defrag
4. select * from pgstattuple('test');
5. vacuum freeze test;
6. select * from pgstattuple('test');
* Results of pgstattuple
-[ before vacuum ]-+--------
table_len | 3645440
tuple_count | 100000
tuple_len | 3200000
tuple_percent | 87.78
dead_tuple_count | 1000
dead_tuple_len | 32000
dead_tuple_percent | 0.88
free_space | 536
free_percent | 0.01
-[ 8.1beta1 orig ]-+--------
table_len | 3645440
tuple_count | 100000
tuple_len | 3200000
tuple_percent | 87.78
dead_tuple_count | 0
dead_tuple_len | 0
dead_tuple_percent | 0
free_space | 30772 <-- about 32byte * 1000 (dead tuples)
free_percent | 0.84
-[ patched ]-------+--------
table_len | 3645440
tuple_count | 100000
tuple_len | 3200000
tuple_percent | 87.78
dead_tuple_count | 0
dead_tuple_len | 0
dead_tuple_percent | 0
free_space | 823628 <-- + 8byte * 100000 (whole tuples)
free_percent | 22.59
---
ITAGAKI Takahiro
NTT Cyber Space Laboratories
Attachments:
xmincut.patchapplication/octet-stream; name=xmincut.patchDownload+46-45
On Thu, Sep 01, 2005 at 10:45:44AM +0900, ITAGAKI Takahiro wrote:
Hi,
I think it would be a waste to retain xmin and cmin for frozen tuples
because their values represent only 'visible for all transactions'.
Additionally, most tuples in database can be frozen potentially.
I think this is an interesting idea. I was thinking that when the tuple
needs to be obsoleted it would need to grow to accomodate the Xmax, but
you are not actually proposing to remove that, so it seems sensible. In
fact, it is perfectly reasonable to remove Xmin and Cmin, because after
the tuple is frozen, the Xmin never changes again.
Now, one thing of note is that you need to "compress" the page in order
to actually be able to use the just-freed space. VACUUM could do that,
but maybe it would be better to do it on-line -- the freezing process is
going to have to write the page regardless. I wonder if with your patch
the page is compressed on the same VACUUM execution that freezes the
tuple?
One thing that comes to mind is that this makes somewhat easier to build
a tool to write pre-built tables, for bulk-loading purposes. You just
construct the binary file with the HEAP_FROZEN bit set, and then attach
the file to a dummy table. (Then again, you can do it today, using a
Xmin of FrozenTransactionId. I wonder why the Bizgres people isn't
advocating a tool to do that. It is very hard to do with user-defined
types, but for BI/DW you mostly don't need those, do you?)
--
Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com
"Cuando no hay humildad las personas se degradan" (A. Christie)
ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:
I think it would be a waste to retain xmin and cmin for frozen tuples
because their values represent only 'visible for all transactions'.
True, but the hard part is getting rid of the storage for them.
I wrote a makeshift patch to compress xmin and cmin (8bytes) to
1-bit flag, using tuple overlaping.
Is this idea worth trying?
I think this is incredibly ugly :-(. It eliminates a fairly basic
assumption which is that items on a page don't overlap. The space
savings cannot be worth the loss in testability and reliability.
To take just one problem, it is no longer possible to check an item
offset for validity against pd_upper. If we're going to do this,
we need a more invasive patch that changes the structure of heaptuple
headers in a more fundamental way, and avoids breaking the page layout
representation. (Something like the way Oids are now handled might
work, although there are alignment issues to worry about, and it'd
take more work on VACUUM's part to convert a tuple to frozen state.)
I'm also less than enthused about using up our last infomask bit for
a relatively unimportant purpose. We might need that for something
bigger someday... though I can't presently guess what.
regards, tom lane
Alvaro,
One thing that comes to mind is that this makes somewhat easier to build
a tool to write pre-built tables, for bulk-loading purposes. You just
construct the binary file with the HEAP_FROZEN bit set, and then attach
the file to a dummy table. (Then again, you can do it today, using a
Xmin of FrozenTransactionId. I wonder why the Bizgres people isn't
advocating a tool to do that. It is very hard to do with user-defined
types, but for BI/DW you mostly don't need those, do you?)
Hmmm ... can you expand on this a little? We'd discussed "frozen partitions"
but hadn't thought to get around to them for a while, expecting the kind of
issues which Tom just raised.
--
Josh Berkus
Aglio Database Solutions
San Francisco
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Now, one thing of note is that you need to "compress" the page in order
to actually be able to use the just-freed space. VACUUM could do that,
but maybe it would be better to do it on-line -- the freezing process is
going to have to write the page regardless.
I agree. I think an good position of freezer is on bgwriter.
My idea is:
1. Just before bgwriter writes an dirty page in LRU order,
2. Freeze tuples in the page and repair fragmentation.
3. (Replace the fsm page that has least freespace.)
4. Flush the page.
I wonder if with your patch
the page is compressed on the same VACUUM execution that freezes the tuple?
Yes, defragmentation is performed after freezing, but the page has at least
one dead tuple. In current VACUUM implementation, pages that have no dead
tuples will not be defraged. So you cannot "compress" just after bulk-load.
---
ITAGAKI Takahiro
NTT Cyber Space Laboratories
Tom Lane <tgl@sss.pgh.pa.us> wrote:
I think this is incredibly ugly :-(.
Yes, I think so, too :-( My patch is product of the thought that
I don't want to modify codes widely. So if we want to do it more cool way,
lots of changes are needed as you said.
I'm also less than enthused about using up our last infomask bit for
a relatively unimportant purpose. We might need that for something
bigger someday... though I can't presently guess what.
I think it is not a problem, because the header still has rooms for several
bits. I assume that the combination of HEAP_XMIN_COMMITTED + HEAP_XMIN_INVALID
has currently no meaning, right? If so, HEAP_FROZEN can be assigned here.
Also, t_natts is currently 16-bits, but it can be cut to 11-bits
because MaxTupleAttributeNumber is 1664 < 2^11.
---
ITAGAKI Takahiro
NTT Cyber Space Laboratories
ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:
I agree. I think an good position of freezer is on bgwriter.
My idea is:
1. Just before bgwriter writes an dirty page in LRU order,
2. Freeze tuples in the page and repair fragmentation.
3. (Replace the fsm page that has least freespace.)
4. Flush the page.
This is a bad idea. The bgwriter isn't the place to be doing freezing,
because there is no reasonable way for it to guarantee that all old
tuples in a table (or any larger unit) have been frozen. So you'd still
need VACUUM to ensure no wraparound. Plus, you can't do such changes
without emitting an XLOG record, which is something we don't want
happening in the bgwriter's inner loop. Even more to the point, you
can't do such changes without getting a superexclusive lock on the page
(not only locked, but no one else has it pinned), which is a real
nonstarter for the bgwriter, both for performance and possible deadlock
issues.
regards, tom lane
On Wed, Aug 31, 2005 at 09:14:42PM -0700, Josh Berkus wrote:
One thing that comes to mind is that this makes somewhat easier to build
a tool to write pre-built tables, for bulk-loading purposes. You just
construct the binary file with the HEAP_FROZEN bit set, and then attach
the file to a dummy table. (Then again, you can do it today, using a
Xmin of FrozenTransactionId. I wonder why the Bizgres people isn't
advocating a tool to do that. It is very hard to do with user-defined
types, but for BI/DW you mostly don't need those, do you?)Hmmm ... can you expand on this a little? We'd discussed "frozen partitions"
but hadn't thought to get around to them for a while, expecting the kind of
issues which Tom just raised.
What issues did he raise on this?
What I'm saying is that you can write a heap file, on which the tuples
would all have xmin=FrozenTransactionId, xmax=Invalid, and the
corresponding bits set in the infomask. This ensures that no matter the
state of the server, you can plug the file in and all tuples will be
valid.
The "only" problem is figuring out how to lay the data in the tuples
themselves, w.r.t endianness and such. This is platform-dependent, so
you have to write code to do it correctly. In absence of user-defined
types, this should not be _too_ hard to do. Of course, such a program
would in general also be Postgres-version-dependent.
Note that this is a very different business from skipping the Xmin and
Cmin from the tuple header -- in fact, there's no relation to that at
all.
--
Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com
FOO MANE PADME HUM
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
What I'm saying is that you can write a heap file, on which the tuples
would all have xmin=FrozenTransactionId, xmax=Invalid, and the
corresponding bits set in the infomask. This ensures that no matter the
state of the server, you can plug the file in and all tuples will be
valid.
The "only" problem is figuring out how to lay the data in the tuples
themselves, w.r.t endianness and such. This is platform-dependent, so
you have to write code to do it correctly. In absence of user-defined
types, this should not be _too_ hard to do. Of course, such a program
would in general also be Postgres-version-dependent.
Of course, it's fair to ask whether such a program would be any faster
than binary-mode COPY by the time you got done ... or enough faster to
justify your effort, anyway.
THe only fundamental disadvantage that COPY labors under is having to
write WAL records. It might be interesting to do something similar to
the recent hacks for CREATE TABLE AS, so that a COPY into a table just
created in the current transaction would skip writing WAL and instead
fsync the table at the end.
regards, tom lane
On Wed, 2005-08-31 at 22:25 -0400, Alvaro Herrera wrote:
On Thu, Sep 01, 2005 at 10:45:44AM +0900, ITAGAKI Takahiro wrote:
Hi,
I think it would be a waste to retain xmin and cmin for frozen tuples
because their values represent only 'visible for all transactions'.
Additionally, most tuples in database can be frozen potentially.I think this is an interesting idea.
Agreed, especially since it would avoid the need to vacuum altogether.
I was thinking that when the tuple
needs to be obsoleted it would need to grow to accomodate the Xmax, but
you are not actually proposing to remove that, so it seems sensible. In
fact, it is perfectly reasonable to remove Xmin and Cmin, because after
the tuple is frozen, the Xmin never changes again.
It's a good idea, but the Xmin is set to FrozenTransactionId, which is
how we know it is frozen, so how can we remove Xmin? The way to do this
is surely by using a row version id that is different for this format.
Getting 8 or 16 bytes per row back would be a very useful gain.
Now, one thing of note is that you need to "compress" the page in order
to actually be able to use the just-freed space. VACUUM could do that,
but maybe it would be better to do it on-line -- the freezing process is
going to have to write the page regardless. I wonder if with your patch
the page is compressed on the same VACUUM execution that freezes the
tuple?
Only if you do a FULL, which is currently incompatible with a FREEZE.
There's no point in compressing a block if you can't also redistribute
rows between blocks to fill up the spaces, so another reason why it has
to be a FULL. Unless you do this at load time, which is why I guess you
mention....
One thing that comes to mind is that this makes somewhat easier to build
a tool to write pre-built tables, for bulk-loading purposes. You just
construct the binary file with the HEAP_FROZEN bit set, and then attach
the file to a dummy table. (Then again, you can do it today, using a
Xmin of FrozenTransactionId. I wonder why the Bizgres people isn't
advocating a tool to do that. It is very hard to do with user-defined
types, but for BI/DW you mostly don't need those, do you?)
Loading a table using COPY with frozen bits set was suggested in May, so
yeh... it was suggested. At that time it was rejected, since earlier
transactions would then be able to see rows they ought not be able to
see. Thinking some more about this, this is only the inverse situation
of a TRUNCATE. With truncate we remove tuples that ought to still be
visible to pre-existing transactions. So there shouldn't really be an
issue with loading pre-frozen tuples - as long as you accept the
consequences for row visibility.
Externally writing blocks is possible, but it bypasses a lot of other
features. My current preference would be to have bulk_heap_insert()
function to add a whole page at a time rather than inserting rows one at
at a time. The main objective for a load is to make it disk bound; once
we've achieved that by some further tuning, writing an external file
would cost around the same as writing it internally from the DBMS.
Oracle (direct path loader) and Teradata (Fastload) load data in
complete blocks using a reduced code pathway, so I guess I was just
following on, but I'm genuinely open to further persuasion if there is a
better way.
Having a table marked as INSERT ONLY would allow us to save 8 bytes/row,
loading it pre-frozen (in some way) would save another 8 bytes/row and
allow us to permanently avoid VACUUMing the table. That would be even
better when we have per-table XID wrap avoidance.
Best Regards, Simon Riggs
On Thu, Sep 01, 2005 at 11:08:36AM -0400, Tom Lane wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
What I'm saying is that you can write a heap file, on which the tuples
would all have xmin=FrozenTransactionId, xmax=Invalid, and the
corresponding bits set in the infomask. This ensures that no matter the
state of the server, you can plug the file in and all tuples will be
valid.The "only" problem is figuring out how to lay the data in the tuples
themselves, w.r.t endianness and such. This is platform-dependent, so
you have to write code to do it correctly. In absence of user-defined
types, this should not be _too_ hard to do. Of course, such a program
would in general also be Postgres-version-dependent.Of course, it's fair to ask whether such a program would be any faster
than binary-mode COPY by the time you got done ... or enough faster to
justify your effort, anyway.
It may not be faster generating the data in the first place, but you
don't have to vacuum the table, nor you are subject to hint bits
changing, resulting in more unnecessary I/O.
This can't be avoided with COPY, because there's always the chance that
it will fail partway through, so you can't write frozen tuples. With an
external program, you can just dump the invalid line somewhere else and
continue with the rest.
--
Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com
"Just treat us the way you want to be treated + some extra allowance
for ignorance." (Michael Brusser)
Tom,
THe only fundamental disadvantage that COPY labors under is having to
write WAL records. It might be interesting to do something similar to
the recent hacks for CREATE TABLE AS, so that a COPY into a table just
created in the current transaction would skip writing WAL and instead
fsync the table at the end.
Yes, I thought we discussed doing this for empty tables -- it would be, per
our tests, a +10% to +30% boost to COPY.
But there was some problem the patch?
--
Josh Berkus
Aglio Database Solutions
San Francisco
Alvaro,
What issues did he raise on this?
On having no Xmin.
What I'm saying is that you can write a heap file, on which the tuples
would all have xmin=FrozenTransactionId, xmax=Invalid, and the
corresponding bits set in the infomask. This ensures that no matter the
state of the server, you can plug the file in and all tuples will be
valid.The "only" problem is figuring out how to lay the data in the tuples
themselves, w.r.t endianness and such. This is platform-dependent, so
you have to write code to do it correctly. In absence of user-defined
types, this should not be _too_ hard to do. Of course, such a program
would in general also be Postgres-version-dependent.
So, bulk loading by file generation? So the idea is that you would generate
a properly formatted PostgreSQL table file, and then in one transaction
create the table and attach it?
Seems like this would have the additional limitation of being useful only for
loading new partitions/new tables. However, it would have some significant
advantages for bulk loading ... chiefly that the data page generation and
associated computations could be done *off* the database server. This might
help considerably in getting around the 100mb/s data computation ceiling
we're hitting ...
--
Josh Berkus
Aglio Database Solutions
San Francisco
On Thu, Sep 01, 2005 at 04:34:09PM +0100, Simon Riggs wrote:
On Wed, 2005-08-31 at 22:25 -0400, Alvaro Herrera wrote:
I was thinking that when the tuple
needs to be obsoleted it would need to grow to accomodate the Xmax, but
you are not actually proposing to remove that, so it seems sensible. In
fact, it is perfectly reasonable to remove Xmin and Cmin, because after
the tuple is frozen, the Xmin never changes again.It's a good idea, but the Xmin is set to FrozenTransactionId, which is
how we know it is frozen, so how can we remove Xmin? The way to do this
is surely by using a row version id that is different for this format.
Per Takahiro's patch, you don't need to set the Xmin to
FrozenTransactionId -- what you do instead is set a bit in the infomask.
Now, one thing of note is that you need to "compress" the page in order
to actually be able to use the just-freed space. VACUUM could do that,
but maybe it would be better to do it on-line -- the freezing process is
going to have to write the page regardless. I wonder if with your patch
the page is compressed on the same VACUUM execution that freezes the
tuple?Only if you do a FULL, which is currently incompatible with a FREEZE.
Well, if we are going to mess with what FREEZE is doing, we can as well
make it compress the page. Note that to compress the page you don't
need to touch the indexes.
I don't remember the exact reason why FULL is incompatible with FREEZE,
but AFAIR it's not fundamentally unsolvable (just very hard.)
There's no point in compressing a block if you can't also redistribute
rows between blocks to fill up the spaces, so another reason why it has
to be a FULL.
That's a good point.
One thing that comes to mind is that this makes somewhat easier to build
a tool to write pre-built tables, for bulk-loading purposes. You just
construct the binary file with the HEAP_FROZEN bit set, and then attach
the file to a dummy table.Loading a table using COPY with frozen bits set was suggested in May, so
yeh... it was suggested.
I'm not proposing to use COPY for that. It has loads of problems, which
is why the patch was rejected. Using an external program is a different
matter.
Externally writing blocks is possible, but it bypasses a lot of other
features.
Like what?
I don't really care for this feature, mind you -- I was merely
mentioning the idea as it crossed my mind.
--
Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com
"The problem with the facetime model is not just that it's demoralizing, but
that the people pretending to work interrupt the ones actually working."
(Paul Graham)
On Thu, Sep 01, 2005 at 09:20:48AM -0700, Josh Berkus wrote:
What I'm saying is that you can write a heap file, on which the tuples
would all have xmin=FrozenTransactionId, xmax=Invalid, and the
corresponding bits set in the infomask. This ensures that no matter the
state of the server, you can plug the file in and all tuples will be
valid.So, bulk loading by file generation? So the idea is that you would generate
a properly formatted PostgreSQL table file, and then in one transaction
create the table and attach it?
Exactly.
--
Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com
"Changing the world ... one keyboard at a time!"
(www.DVzine.org)
Josh Berkus <josh@agliodbs.com> writes:
THe only fundamental disadvantage that COPY labors under is having to
write WAL records. It might be interesting to do something similar to
the recent hacks for CREATE TABLE AS, so that a COPY into a table just
created in the current transaction would skip writing WAL and instead
fsync the table at the end.
Yes, I thought we discussed doing this for empty tables -- it would be, per
our tests, a +10% to +30% boost to COPY.
But there was some problem the patch?
I have seen no such patch AFAIR.
regards, tom lane
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
On Thu, Sep 01, 2005 at 04:34:09PM +0100, Simon Riggs wrote:
On Wed, 2005-08-31 at 22:25 -0400, Alvaro Herrera wrote:
Now, one thing of note is that you need to "compress" the page in order
to actually be able to use the just-freed space. VACUUM could do that,
but maybe it would be better to do it on-line -- the freezing process is
going to have to write the page regardless. I wonder if with your patch
the page is compressed on the same VACUUM execution that freezes the
tuple?Only if you do a FULL, which is currently incompatible with a FREEZE.
Well, if we are going to mess with what FREEZE is doing, we can as well
make it compress the page.
Anyone looked at the code lately???
PageRepairFragmentation is part of any kind of vacuum. As long as you
don't reassign tuple IDs (which it doesn't) there's no impact on indexes.
regards, tom lane
On Thu, Sep 01, 2005 at 09:22:35AM -0400, Tom Lane wrote:
ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:
I agree. I think an good position of freezer is on bgwriter.
My idea is:
1. Just before bgwriter writes an dirty page in LRU order,
2. Freeze tuples in the page and repair fragmentation.
3. (Replace the fsm page that has least freespace.)
4. Flush the page.This is a bad idea. The bgwriter isn't the place to be doing freezing,
because there is no reasonable way for it to guarantee that all old
tuples in a table (or any larger unit) have been frozen. So you'd still
need VACUUM to ensure no wraparound. Plus, you can't do such changes
without emitting an XLOG record, which is something we don't want
happening in the bgwriter's inner loop. Even more to the point, you
can't do such changes without getting a superexclusive lock on the page
(not only locked, but no one else has it pinned), which is a real
nonstarter for the bgwriter, both for performance and possible deadlock
issues.
So is this something that another daemon could handle? Presumably one
that would operate on pages before they were written out by bgwriter.
Basically, right now any time someone thinks of something that could be
done in the background, bgwriter is the automatic candidate because it's
the only daemon in the backend. And it's often rejected for valid
technical reasons, but that doesn't mean we can't have additional
daemons that operate either before or after bgwriter.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes:
On Thu, Sep 01, 2005 at 09:22:35AM -0400, Tom Lane wrote:
This is a bad idea. The bgwriter isn't the place to be doing freezing,
So is this something that another daemon could handle?
Possibly, but I'd be inclined to think of it as autovacuum's problem.
regards, tom lane
On Thu, Sep 01, 2005 at 11:22:07PM -0400, Tom Lane wrote:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
On Thu, Sep 01, 2005 at 09:22:35AM -0400, Tom Lane wrote:
This is a bad idea. The bgwriter isn't the place to be doing freezing,
So is this something that another daemon could handle?
Possibly, but I'd be inclined to think of it as autovacuum's problem.
Possibly, although what tends to make bgwriter interesting for these
things is that we want to perform some operation on pages between when
they get modified and when they get written out to disk. AFAIK
autovacuum wouldn't currently serve that purpose (though I could be
wrong). In any case, the big point is that there are ideas out there
that might warrant an additional daemon besides bgwriter (my
recollection is that this isn't the first proposal that's been objected
to on the basis of bgwriter being the wrong place to do something).
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com 512-569-9461