64 bit transaction id

Started by Павел Ерёмин · about 6 years ago · 20 messages
#1 Павел Ерёмин <shnoor111gmail@yandex.ru>

Hi.
Sorry for my English.
I want to once again open the topic of 64-bit transaction ids. I did
not manage to find in the archives the variant that I want to discuss,
so I am writing; if I searched poorly, then please forgive me.
The idea is not very original and has probably already been considered;
again, I did not find it, so please do not scold me severely.
In discussions of 64-bit transaction ids I did not find any mention of
an algorithm for storing them the way it is done, for example, in MS
SQL Server.
What if, instead of two fields (xmin and xmax) with a total length of
64 bits, we used one 64-bit field (let's call it xid) in the tuple
header? This field would store the xid of the transaction that created
the version. A new transaction, in order to decide whether the version
it has read is suitable for it, would then have to read the next version
as well. I.e. the downside of this approach is, of course, an increase
in I/O: transactions have to read one extra version. On the plus side,
the tuple header stays the same length.

regards, Eremin Pavel.
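
A minimal sketch of the tuple header this proposal implies (names and
layout are hypothetical; the actual HeapTupleHeaderData keeps two
separate 32-bit fields, xmin and xmax):

#include <stdint.h>

/* Minimal stand-in for PostgreSQL's ItemPointerData: block number + offset. */
typedef struct ItemPointerData
{
    uint32_t block;
    uint16_t offset;
} ItemPointerData;

/* Hypothetical header implied by the proposal above. */
typedef struct ProposedTupleHeader
{
    uint64_t        xid;     /* 64-bit XID of the transaction that created
                              * this version; "deleted by" must be inferred
                              * from the next version in the chain */
    ItemPointerData t_ctid;  /* link to the next version, as today */
    /* infomask bits, t_hoff, null bitmap, ... unchanged */
} ProposedTupleHeader;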

#2 Pavel Stehule <pavel.stehule@gmail.com>
In reply to: Павел Ерёмин (#1)
Re: 64 bit transaction id

Hi

On Fri, 1 Nov 2019 at 10:11, Павел Ерёмин <shnoor111gmail@yandex.ru> wrote:

Hi.
Sorry for my English.
I want to once again open the topic of 64-bit transaction ids. I did
not manage to find in the archives the variant that I want to discuss,
so I am writing; if I searched poorly, then please forgive me.
The idea is not very original and has probably already been considered;
again, I did not find it, so please do not scold me severely.
In discussions of 64-bit transaction ids I did not find any mention of
an algorithm for storing them the way it is done, for example, in MS
SQL Server.
What if, instead of two fields (xmin and xmax) with a total length of
64 bits, we used one 64-bit field (let's call it xid) in the tuple
header? This field would store the xid of the transaction that created
the version. A new transaction, in order to decide whether the version
it has read is suitable for it, would then have to read the next version
as well. I.e. the downside of this approach is, of course, an increase
in I/O: transactions have to read one extra version. On the plus side,
the tuple header stays the same length.

Is a 32-bit xid really a problem? Why do you need to know the state of
the last 2^31 transactions? Isn't the problem rather too low usage (or
maybe too high overhead) of VACUUM FREEZE?

I am not sure whether increasing this range could have much more fatal
problems (maybe later).

Pavel


#3 Tomas Vondra <tomas.vondra@2ndquadrant.com>
In reply to: Pavel Stehule (#2)
Re: 64 bit transaction id

On Fri, Nov 01, 2019 at 10:25:17AM +0100, Pavel Stehule wrote:

Hi

On Fri, 1 Nov 2019 at 10:11, Павел Ерёмин <shnoor111gmail@yandex.ru> wrote:

Hi.
Sorry for my English.
I want to once again open the topic of 64-bit transaction ids. I did
not manage to find in the archives the variant that I want to discuss,
so I am writing; if I searched poorly, then please forgive me.
The idea is not very original and has probably already been considered;
again, I did not find it, so please do not scold me severely.
In discussions of 64-bit transaction ids I did not find any mention of
an algorithm for storing them the way it is done, for example, in MS
SQL Server.
What if, instead of two fields (xmin and xmax) with a total length of
64 bits, we used one 64-bit field (let's call it xid) in the tuple
header? This field would store the xid of the transaction that created
the version. A new transaction, in order to decide whether the version
it has read is suitable for it, would then have to read the next version
as well. I.e. the downside of this approach is, of course, an increase
in I/O: transactions have to read one extra version. On the plus side,
the tuple header stays the same length.

Is a 32-bit xid really a problem? Why do you need to know the state of
the last 2^31 transactions? Isn't the problem rather too low usage (or
maybe too high overhead) of VACUUM FREEZE?

It certainly can be an issue for large and busy systems, which may need an
anti-wraparound vacuum every couple of days. If that requires rewriting a
couple of TB of data, it's not particularly nice. That's why 64-bit XIDs
were discussed repeatedly in the past, and it's likely to get even more
pressing as systems get larger.

I am not sure whether increasing this range could have much more fatal
problems (maybe later).

Well, not fatal, but naive approaches can increase per-tuple overhead.
And we already have plenty of that, hence there were proposals to use
page epochs and so on.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4 Tomas Vondra <tomas.vondra@2ndquadrant.com>
In reply to: Павел Ерёмин (#1)
Re: 64 bit transaction id

On Fri, Nov 01, 2019 at 12:05:12PM +0300, Павел Ерёмин wrote:

Hi.
Sorry for my English.
I want to once again open the topic of 64-bit transaction ids. I did
not manage to find in the archives the variant that I want to discuss,
so I am writing; if I searched poorly, then please forgive me.
The idea is not very original and has probably already been considered;
again, I did not find it, so please do not scold me severely.
In discussions of 64-bit transaction ids I did not find any mention of
an algorithm for storing them the way it is done, for example, in MS
SQL Server.
What if, instead of two fields (xmin and xmax) with a total length of
64 bits, we used one 64-bit field (let's call it xid) in the tuple
header? This field would store the xid of the transaction that created
the version. A new transaction, in order to decide whether the version
it has read is suitable for it, would then have to read the next version
as well. I.e. the downside of this approach is, of course, an increase
in I/O: transactions have to read one extra version. On the plus side,
the tuple header stays the same length.
 

I think that assumes we can easily identify the next version of a tuple,
and I don't think we can do that. We may be able to do that for HOT
chains, but that only works when the next version fits onto the same
page (and does not update indexed columns). But when we store the new
version on a separate page, we don't have any link between those tuples.
And adding it may easily mean more overhead than the 8B we'd save by
only storing a single XID.

IMO the most promising solution to this is the "page epoch" approach
discussed some time ago (1-2 years?).
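
As a rough illustration of the page-epoch idea (field and function names
are invented here; the actual proposals differ in detail), each page
would carry one 64-bit base value and the per-tuple 32-bit values would
become offsets from it:

#include <stdint.h>

/* Illustrative only: a per-page 64-bit base from which full XIDs are
 * reconstructed, so the tuple header can keep its 32-bit xmin/xmax. */
typedef struct PageXidBaseSketch
{
    uint64_t pd_xid_base;     /* base XID for all tuples on this page */
} PageXidBaseSketch;

static inline uint64_t
full_xmin(const PageXidBaseSketch *page, uint32_t tuple_xmin)
{
    return page->pd_xid_base + tuple_xmin;   /* 32-bit value is an offset */
}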

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#5 Павел Ерёмин <shnoor111gmail@yandex.ru>
In reply to: Tomas Vondra (#3)
Re: 64 bit transaction id

The proposed option does not need to change the length of either the page
header or tuple header. Therefore, you will not need to physically change
the data.

regards

#6 Tomas Vondra <tomas.vondra@2ndquadrant.com>
In reply to: Павел Ерёмин (#5)
Re: 64 bit transaction id

On Sat, Nov 02, 2019 at 07:07:17PM +0300, Павел Ерёмин wrote:

 
The proposed option does not need to change the length of either the page
header or tuple header. Therefore, you will not need to physically change
the data.
 

So how do you link the tuple versions together? Clearly, that has to be
stored somewhere ...

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#7 Павел Ерёмин <shnoor111gmail@yandex.ru>
In reply to: Tomas Vondra (#6)
Re: 64 bit transaction id

The proposed option is not much different from what we have now. We are not
trying to save some space - we will reuse the existing space; we just work
with 64-bit transaction counters. Correct me if I'm wrong - the address of
the next version of the row is stored in the 6-byte field t_cid in the tuple
header, which is not tied to the current page in any way and can point
anywhere in the table. Nothing changes there.
I often explain things very poorly, but I will try.
For example:
Each transaction is assigned a unique 64-bit xid.
In the tuple header we replace the 32-bit xmin and xmax with one 64-bit
field - let's call it xid.
Suppose:
Transaction 1 does an INSERT.
The first version is created (Tuple1):
Tuple1.Tuple_header.xid = Transaction1.xid and Tuple1.Tuple_header.t_cid = 0;
Transaction 3 (started after Transaction 1) does an UPDATE.
The second version is created (Tuple2):
Tuple1.Tuple_header.t_cid = (address) Tuple2;
Tuple2.Tuple_header.xid = Transaction3.xid and Tuple2.Tuple_header.t_cid = 0;
Transaction 2 (started between Transaction 1 and Transaction 3) does a SELECT.
It reads Tuple1. Transaction 2 sees that Tuple1.Tuple_header.xid <
Transaction2.xid, sees that Tuple1.Tuple_header.t_cid is filled in, follows
it and reads Tuple2.
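
A rough sketch of the visibility rule this example implies (all names
hypothetical; a real check would also consult snapshots, commit status
and hint bits):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for a tuple version under the proposed scheme. */
typedef struct VersionSketch
{
    uint64_t xid;                  /* creator's 64-bit xid */
    struct VersionSketch *next;    /* stand-in for following t_ctid; NULL if none */
} VersionSketch;

/* Grossly simplified: treats any xid lower than the reader's as visible. */
static bool
xid_visible(uint64_t xid, uint64_t reader_xid)
{
    return xid < reader_xid;
}

/* A version counts as current only if its creator is visible and no
 * visible successor exists -- note the extra read of the next version,
 * which is the additional I/O conceded above. */
static bool
version_visible(const VersionSketch *v, uint64_t reader_xid)
{
    if (!xid_visible(v->xid, reader_xid))
        return false;
    if (v->next != NULL && xid_visible(v->next->xid, reader_xid))
        return false;
    return true;
}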

#8 Tomas Vondra <tomas.vondra@2ndquadrant.com>
In reply to: Павел Ерёмин (#7)
Re: 64 bit transaction id

On Sat, Nov 02, 2019 at 11:35:09PM +0300, Павел Ерёмин wrote:

The proposed option is not much different from what we have now. We are not
trying to save some space - we will reuse the existing space; we just work
with 64-bit transaction counters. Correct me if I'm wrong - the address of
the next version of the row is stored in the 6-byte field t_cid in the tuple
header, which is not tied to the current page in any way and can point
anywhere in the table. Nothing changes there.

I think you mean t_ctid, not t_cid (which is a 4-byte CommandId, not any
sort of item pointer).

I think this comment from htup_details.h explains the issue:

* ... Beware however that VACUUM might
* erase the pointed-to (newer) tuple before erasing the pointing (older)
* tuple. Hence, when following a t_ctid link, it is necessary to check
* to see if the referenced slot is empty or contains an unrelated tuple.
* Check that the referenced tuple has XMIN equal to the referencing tuple's
* XMAX to verify that it is actually the descendant version and not an
* unrelated tuple stored into a slot recently freed by VACUUM. If either
* check fails, one may assume that there is no live descendant version.

Now, imagine you have a tuple that gets updated repeatedly (say, 3x) and
each version ends up on a different page. Say, pages #1, #2, #3. And then
VACUUM happens on the "middle" page (this may happen when trying to fit a
new row onto a page to allow HOT, but it might happen even during regular
VACUUM).

So we started with 3 tuples on pages #1, #2, #3, but now we have this

#1 - tuple exists, points to tuple on page #2
#2 - tuple no longer exists, cleaned up by vacuum
#3 - tuple exists

The scheme you proposed requires existence of all the tuples in the
chain to determine visibility. When tuple #2 no longer exists, it's
impossible to decide whether tuple on page #1 is visible or not.
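
Sketched in code, the chain-validity check the htup_details.h comment
describes looks roughly like this (simplified, illustrative names), and it
is exactly what the single-XID scheme loses:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Today, a t_ctid link is only trusted if the pointed-to tuple's xmin
 * matches the pointing tuple's xmax; a recycled slot fails this test. */
static bool
is_descendant_version(TransactionId older_xmax, TransactionId newer_xmin)
{
    return newer_xmin == older_xmax;
}

/* With a single 64-bit per-tuple XID there is no xmax on the older tuple
 * to compare against, so once VACUUM removes the middle version the chain
 * cannot be validated and visibility of tuple #1 cannot be decided. */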

This also significantly increases the amount of random I/O, pretty much by
a factor of 2, because whenever you look at a row, you also have to look at
the "next version", which may be on another page. That's pretty bad, both
for I/O and cache hit ratio. I don't think that's a reasonable trade-off
(at least compared to simply making the XIDs 64-bit).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#9 Павел Ерёмин <shnoor111gmail@yandex.ru>
In reply to: Tomas Vondra (#8)
Re: 64 bit transaction id

I completely agree with all of the above. Therefore, the proposed mechanism
may entail larger improvements (and not only VACUUM).
I can offer the following solution.
For VACUUM, create a hash table.
While scanning the table, VACUUM sees that a version (tuple1) has t_ctid
filled in and pointing to the address of tuple2; it creates a structure into
which it writes the address of tuple1, tuple1.xid, the length of tuple1
(and other information that is needed), and puts this structure into the
hash table keyed by the address of tuple2.
When VACUUM reaches tuple2, it checks the address of tuple2 in the hash
table - if it finds it, it evaluates the connection between them and makes
a decision about cleaning.

regards
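
Sketched for concreteness (the structures and the two-pass flow below are
purely illustrative, not existing PostgreSQL code):

#include <stdint.h>

typedef struct ItemPointerSketch
{
    uint32_t block;
    uint16_t offset;
} ItemPointerSketch;

/* Pass 1: while scanning, each forward t_ctid link is remembered, keyed by
 * the address it points to (the newer version). */
typedef struct ChainEntry
{
    ItemPointerSketch older_tid;   /* where the pointing (older) version lives */
    uint64_t          older_xid;   /* its creator xid */
    uint32_t          older_len;   /* its length, for the eventual cleanup */
} ChainEntry;

/* Pass 2: when the scan reaches a tuple, its own address is looked up in
 * the hash table; on a hit, the older/newer pair can be judged together and
 * a removal decision made.  The cost is an extra hash table sized by the
 * number of cross-page update chains in the relation. */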

#10 Tomas Vondra <tomas.vondra@2ndquadrant.com>
In reply to: Павел Ерёмин (#9)
Re: 64 bit transaction id

On Sun, Nov 03, 2019 at 02:17:15PM +0300, Павел Ерёмин wrote:

I completely agree with all of the above. Therefore, the proposed
mechanism may entail larger improvements (and not only VACUUM).

I think the best thing you can do is try implementing this ...

I'm afraid the "improvements" essentially mean making various important
parts of the system much more complicated and expensive. There's a
trade-off between saving 8B per row and additional overhead (during
vacuum etc.), and it does not seem like a winning strategy. What started
as "we can simply look at the next row version" is clearly way more
complicated and expensive.

The trouble here is that it adds dependency between pages in the data
file. That for example means that during cleanup of a page it may be
necessary to modify the other page, when originally that would be
read-only in that checkpoint interval. That's essentially write
amplification, and may significantly increase the amount of WAL due to
generating FPW for the other page.

I can offer the following solution.
For VACUUM, create a hash table.
While scanning the table, VACUUM sees that a version (tuple1) has t_ctid
filled in and pointing to the address of tuple2; it creates a structure into
which it writes the address of tuple1, tuple1.xid, the length of tuple1
(and other information that is needed), and puts this structure into the
hash table keyed by the address of tuple2.
When VACUUM reaches tuple2, it checks the address of tuple2 in the hash
table - if it finds it, it evaluates the connection between them and makes
a decision about cleaning.

We know VACUUM is already pretty expensive, so making it even more
expensive seems pretty awful. And the proposed solution seems damn
expensive. We already do something similar for indexes - we track
pointers for removed rows, so that we can remove them from indexes. And
it's damn expensive because we don't know where in the index the tuples
are - so we have to scan the indexes in full.

This would mean we have to do the same thing for the table, because we
don't know where in the table the older versions of those rows are. That
seems mighty expensive.

Not to mention that this does nothing for page-level vacuum, which we
do when trying to fit another row on a page (e.g. for HOT). This has to
be absolutely cheap, we certainly are not going to do lookups of other
pages or looking for older versions of the row, and so on.

Being able to make visibility decisions based on the tuple alone (or
possibly page-level + tuple information) has a lot of value, and I don't
think we want to make this more complicated.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#11 Thomas Munro <thomas.munro@gmail.com>
In reply to: Tomas Vondra (#4)
Re: 64 bit transaction id

On Sat, Nov 2, 2019 at 6:15 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Fri, Nov 01, 2019 at 12:05:12PM +0300, Павел Ерёмин wrote:

Hi.
Sorry for my English.
I want to once again open the topic of 64-bit transaction ids. I did
not manage to find in the archives the variant that I want to discuss,
so I am writing; if I searched poorly, then please forgive me.
The idea is not very original and has probably already been considered;
again, I did not find it, so please do not scold me severely.
In discussions of 64-bit transaction ids I did not find any mention of
an algorithm for storing them the way it is done, for example, in MS
SQL Server.
What if, instead of two fields (xmin and xmax) with a total length of
64 bits, we used one 64-bit field (let's call it xid) in the tuple
header? This field would store the xid of the transaction that created
the version. A new transaction, in order to decide whether the version
it has read is suitable for it, would then have to read the next version
as well. I.e. the downside of this approach is, of course, an increase
in I/O: transactions have to read one extra version. On the plus side,
the tuple header stays the same length.

I think that assumes we can easily identify the next version of a tuple,
and I don't think we can do that. We may be able to do that for HOT
chains, but that only works when the next version fits onto the same
page (and does not update indexed columns). But when we store the new
version on a separate page, we don't have any link between those tuples.
And adding it may easily mean more overhead than the 8B we'd save by
only storing a single XID.

IMO the most promising solution to this is the "page epoch" approach
discussed some time ago (1-2 years?).

There have been so many discussions of this topic that it's hard to search for.

Since we have in fact begun to take some baby steps towards using 64
bit transaction IDs in a few places, I decided to create a new wiki
page to try to keep track of the various discussions. If you know
where to find the 'best' discussions (for example the one where, if I
recall correctly, it was Heikki who proposed a 'reference'
FullTransactionId on the page header) and any proposals that came with
patches, then I'd be grateful if you could add links to them on this
wiki page!

https://wiki.postgresql.org/wiki/FullTransactionId
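
For context, recent PostgreSQL versions already define the 64-bit
FullTransactionId type in access/transam.h; in simplified form, together
with the epoch+xid construction a per-page 'reference' value would build on:

#include <stdint.h>

/* Simplified rendering of PostgreSQL's 64-bit XID type (access/transam.h). */
typedef struct FullTransactionId
{
    uint64_t value;                /* epoch in the high 32 bits, xid in the low 32 */
} FullTransactionId;

static inline FullTransactionId
FullTransactionIdFromEpochAndXid(uint32_t epoch, uint32_t xid)
{
    FullTransactionId result;

    result.value = ((uint64_t) epoch << 32) | xid;
    return result;
}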

#12 Павел Ерёмин <shnoor111gmail@yandex.ru>
In reply to: Tomas Vondra (#10)
Re: 64 bit transaction id

And yet, if I try to implement a similar mechanism, if successful, will my
revision be considered?

regards

#13 Tomas Vondra <tomas.vondra@2ndquadrant.com>
In reply to: Павел Ерёмин (#12)
Re: 64 bit transaction id

On Mon, Nov 04, 2019 at 04:39:44PM +0300, Павел Ерёмин wrote:

And yet, if I try to implement a similar mechanism, if successful, will my
revision be considered?
 

Why wouldn't it be considered? If you submit a patch that demonstrably
improves the behavior (in this case reduces per-tuple overhead without
causing significant issues elsewhere), we'd be crazy not to consider it.

The bar is pretty high, though, because this touches one of the core
pieces of the database.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#14 Andres Freund <andres@anarazel.de>
In reply to: Tomas Vondra (#13)
Re: 64 bit transaction id

Hi,

(I've not read the rest of this thread yet)

On 2019-11-04 16:07:23 +0100, Tomas Vondra wrote:

On Mon, Nov 04, 2019 at 04:39:44PM +0300, Павел Ерёмин wrote:

And yet, if I try to implement a similar mechanism, if successful, will my
revision be considered?
 

Why wouldn't it be considered? If you submit a patch that demonstrably
improves the behavior (in this case reduces per-tuple overhead without
causing significant issues elsewhere), we'd be crazy not to consider it.

And "without causing significant issues elsewhere" unfortunately
includes continuing to allow pg_upgrade to work.

Greetings,

Andres Freund

#15 Tomas Vondra <tomas.vondra@2ndquadrant.com>
In reply to: Andres Freund (#14)
Re: 64 bit transaction id

On Mon, Nov 04, 2019 at 10:04:09AM -0800, Andres Freund wrote:

Hi,

(I've not read the rest of this thread yet)

On 2019-11-04 16:07:23 +0100, Tomas Vondra wrote:

On Mon, Nov 04, 2019 at 04:39:44PM +0300, Павел Ерёмин wrote:

And yet, if I try to implement a similar mechanism, if successful, will my
revision be considered?
 

Why wouldn't it be considered? If you submit a patch that demonstrably
improves the behavior (in this case reduces per-tuple overhead without
causing significant issues elsewhere), we'd be crazy not to consider it.

And "without causing significant issues elsewhere" unfortunately
includes continuing to allow pg_upgrade to work.

Yeah. I suppose we could have a different AM implementing this, but
maybe that's not possible ...

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#16 Andres Freund <andres@anarazel.de>
In reply to: Tomas Vondra (#15)
Re: 64 bit transaction id

Hi,

On 2019-11-04 19:39:18 +0100, Tomas Vondra wrote:

On Mon, Nov 04, 2019 at 10:04:09AM -0800, Andres Freund wrote:

And "without causing significant issues elsewhere" unfortunately
includes continuing to allow pg_upgrade to work.

Yeah. I suppose we could have a different AM implementing this, but
maybe that's not possible ...

Entirely possible. But the amount of code duplication / unnecessary
branching and the user confusion from two very similar AMs would have
to be weighed against the benefits.

Greetings,

Andres Freund

#17 Tomas Vondra <tomas.vondra@2ndquadrant.com>
In reply to: Andres Freund (#16)
Re: 64 bit transaction id

On Mon, Nov 04, 2019 at 10:44:53AM -0800, Andres Freund wrote:

Hi,

On 2019-11-04 19:39:18 +0100, Tomas Vondra wrote:

On Mon, Nov 04, 2019 at 10:04:09AM -0800, Andres Freund wrote:

And "without causing significant issues elsewhere" unfortunately
includes continuing to allow pg_upgrade to work.

Yeah. I suppose we could have a different AM implementing this, but
maybe that's not possible ...

Entirely possible. But the amount of code duplication / unnecessary
branching and the user confusion from two very similar AMs would have
to be weighed against the benefits.

Agreed. I think code complexity is part of the trade-off. IMO it's fine
to hack existing heap AM initially, and only explore the separate AM if
that turns out to be promising.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#18 Thomas Munro <thomas.munro@gmail.com>
In reply to: Tomas Vondra (#17)
Re: 64 bit transaction id

On Tue, Nov 5, 2019 at 8:45 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

On Mon, Nov 04, 2019 at 10:44:53AM -0800, Andres Freund wrote:

On 2019-11-04 19:39:18 +0100, Tomas Vondra wrote:

On Mon, Nov 04, 2019 at 10:04:09AM -0800, Andres Freund wrote:

And "without causing significant issues elsewhere" unfortunately
includes continuing to allow pg_upgrade to work.

Yeah. I suppose we could have a different AM implementing this, but
maybe that's not possible ...

Entirely possible. But the amount of code duplication / unnecessary
branching and the user confusion from two very similar AMs, would have
to be weighed against the benefits.

Agreed. I think code complexity is part of the trade-off. IMO it's fine
to hack existing heap AM initially, and only explore the separate AM if
that turns out to be promising.

I thought a bit about how to make a minimally-different-from-heap
non-freezing table AM using 64 bit xids, as a thought experiment when
trying to understand or explain to others what zheap is about.
Committed transactions are easy (you don't have to freeze fxid
references from the ancient past because they don't wrap around so
they always look old), but how do you deal with *aborted* transactions
when truncating the CLOG (given that our current rule is "if it's
before the CLOG begins, it must be committed")? I see three
possibilities: (1) don't truncate the CLOG anymore (use 64 bit
addressing and let it leak disk forever, like we did before commit
2589735d and later work), (2) freeze aborted transactions only, using
a wraparound vacuum (and now you have failed, if the goal was to avoid
having to scan all tuples periodically to freeze stuff, though
admittedly it will require less IO to freeze only the aborted
transactions), (3) go and remove aborted fxid references eagerly, when
you roll back (this could be done using the undo technology that we
have been developing to support zheap). Another way to explain (3) is
that this hypothetical table AM, let's call it "yheap", takes the
minimum parts of the zheap technology stack required to get rid of
vacuum-for-wraparound, without doing in-place updates or any of that
hard stuff. To make this really work you'd also have to deal with
multixacts, which also require freezing. If that all sounds too
complicated, you're back to (2) which seems a bit weak to me. Or
perhaps I'm missing something?
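
A rough restatement of the constraint in code (hypothetical, just to make
the rule explicit): once CLOG segments are truncated, anything older must
be assumed committed, so aborted fxid references have to be gone -- frozen
(option 2) or undone at rollback (option 3) -- before their segments can be
removed.

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t FullXid;

/* Hypothetical status check for a non-freezing 64-bit-XID table AM. */
static bool
fxid_did_commit(FullXid fxid, FullXid oldest_clog_fxid,
                bool (*clog_says_committed)(FullXid))
{
    if (fxid < oldest_clog_fxid)
        return true;              /* status discarded: must assume committed */
    return clog_says_committed(fxid);
}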

#19 Bruce Momjian <bruce@momjian.us>
In reply to: Thomas Munro (#18)
Re: 64 bit transaction id

On Tue, Nov 5, 2019 at 09:34:52AM +1300, Thomas Munro wrote:

On Tue, Nov 5, 2019 at 8:45 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Agreed. I think code complexity is part of the trade-off. IMO it's fine
to hack existing heap AM initially, and only explore the separate AM if
that turns out to be promising.

I thought a bit about how to make a minimally-different-from-heap
non-freezing table AM using 64 bit xids, as a thought experiment when
trying to understand or explain to others what zheap is about.
Committed transactions are easy (you don't have to freeze fxid
references from the ancient past because they don't wrap around so
they always look old), but how do you deal with *aborted* transactions
when truncating the CLOG (given that our current rule is "if it's
before the CLOG begins, it must be committed")? I see three
possibilities: (1) don't truncate the CLOG anymore (use 64 bit
addressing and let it leak disk forever, like we did before commit
2589735d and later work), (2) freeze aborted transactions only, using
a wraparound vacuum (and now you have failed, if the goal was to avoid
having to scan all tuples periodically to freeze stuff, though
admittedly it will require less IO to freeze only the aborted
transactions), (3) go and remove aborted fxid references eagerly, when
you roll back (this could be done using the undo technology that we
have been developing to support zheap). Another way to explain (3) is
that this hypothetical table AM, let's call it "yheap", takes the
minimum parts of the zheap technology stack required to get rid of
vacuum-for-wraparound, without doing in-place updates or any of that
hard stuff. To make this really work you'd also have to deal with
multixacts, which also require freezing. If that all sounds too
complicated, you're back to (2) which seems a bit weak to me. Or
perhaps I'm missing something?

The above is a very good summary of the constraints that have led to our
current handling of XID wraparound. If we are concerned about excessive
vacuum freeze overhead, why is the default autovacuum_freeze_max_age =
200000000 so low? That causes freezing when the pg_xact directory holds
200 million xids, or 50 megabytes of xid status.

As far as I understand it, we cause the database to stop writes when the
xid counter gets within 2 billion xids of the current transaction
counter, so 200 million is only 1/10th of the way to that limit, and even
then, I am not sure why we couldn't make it stop writes at 3 billion or
something. My point is that increasing the default
autovacuum_freeze_max_age value seems like an easy way to reduce vacuum
freezing. (While the visibility map helps vacuum freeze avoid reading all
heap pages, we still need to read all index pages.)
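
A quick check of that figure, assuming the usual two commit-status bits per
transaction in pg_xact:

#include <stdio.h>

int
main(void)
{
    long xids  = 200000000L;      /* default autovacuum_freeze_max_age */
    long bytes = xids / 4;        /* 2 bits per xid => 4 xids per byte */

    printf("%ld bytes (~%ld MB) of xid status\n", bytes, bytes / 1000000);
    /* prints: 50000000 bytes (~50 MB) of xid status */
    return 0;
}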

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

#20 Robert Haas <robertmhaas@gmail.com>
In reply to: Bruce Momjian (#19)
Re: 64 bit transaction id

On Thu, Nov 7, 2019 at 10:28 AM Bruce Momjian <bruce@momjian.us> wrote:

The above is a very good summary of the constraints that have led to our
current handling of XID wraparound. If we are concerned about excessive
vacuum freeze overhead, why is the default autovacuum_freeze_max_age =
200000000 so low? That causes freezing when the pg_xact directory holds
200 million xids, or 50 megabytes of xid status.

As far as I understand it, we cause the database to stop writes when the
xid counter gets within 2 billion xids of the current transaction
counter, so 200 million is only 1/10th of the way to that limit, and even
then, I am not sure why we couldn't make it stop writes at 3 billion or
something. My point is that increasing the default
autovacuum_freeze_max_age value seems like an easy way to reduce vacuum
freezing. (While the visibility map helps vacuum freeze avoid reading all
heap pages, we still need to read all index pages.)

Yeah, I've also wondered why this isn't higher by default, but it's a
somewhat tricky topic.

Three billion won't work, because it's deeply baked into PostgreSQL's
architecture that at most two billion XIDs are used at one time. For
comparison purposes, the four billion XIDs form a ring, so that from
the perspective of any individual XID, half of the XIDs are in the
future and the other half are in the past. If three billion XIDs were
in use simultaneously, say starting with XID 5 and ending with XID
3,000,000,004, then XID 5 would see XID 3,000,000,004 as being in the
past rather than the future, while XID 1,500,000,000 would (correctly)
see XID 5 as in the past and XID 3,000,000,004 as in the future. So
XID comparison would not be transitive, which would break a lot of
code. Allowing at most two billion XIDs to be in use at one time fixes
this problem.
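
This modulo-2^32 ordering is what TransactionIdPrecedes in transam.c
implements; in simplified form (permanent/special XIDs omitted):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Each XID sees the 2^31 XIDs behind it as "past" and the 2^31 ahead as
 * "future"; with more than ~2 billion XIDs in use at once this ordering
 * stops being transitive, as described above. */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
    int32_t diff = (int32_t) (id1 - id2);

    return diff < 0;
}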

That doesn't mean we couldn't raise the setting. It just means that
the hard limit is two billion, not four billion. But, how high should
we raise it? The highest safe value depends on how many XIDs you'll
burn while the freezing vacuums are running, which depends on both the
size of the database and the rate of XID consumption, and those values
can be very different on different systems. I think most people could
get by with a significantly higher value, but even with the current
value I think there are probably some people who run out of XIDs, at
which point they can no longer write to the database. The higher we
make the default, the more people are going to have that problem. It's
true that a lot of people only hit the limit because something has
gone wrong, like they've forgotten about a prepared transaction or an
unused replication slot, but still, on high-velocity systems you can't
afford to cut it too close because you're still going to be burning
through XIDs while vacuum is running.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company