Freezing without write I/O
Since we're bashing around ideas around freezing, let me write down the
idea I've been pondering and discussing with various people for years. I
don't think I invented this myself, apologies to whoever did for not
giving credit.
The reason we have to freeze is that otherwise our 32-bit XIDs wrap
around and become ambiguous. The obvious solution is to extend XIDs to
64 bits, but that would waste a lot of space. The trick is to add a field
to the page header indicating the 'epoch' of the XID, while keeping the
XIDs in the tuple header 32 bits wide (*).
The other reason we freeze is to truncate the clog. But with 64-bit
XIDs, we wouldn't actually need to change old XIDs on disk to FrozenXid.
Instead, we could implicitly treat anything older than relfrozenxid as
frozen.
That's the basic idea. Vacuum freeze only needs to remove dead tuples,
but doesn't need to dirty pages that contain no dead tuples.
Since we're not storing 64-bit wide XIDs on every tuple, we'd still need
to replace the XIDs with FrozenXid whenever the difference between the
smallest and largest XID on a page exceeds 2^31. But that would only
happen when you're updating the page, in which case the page is dirtied
anyway, so it wouldn't cause any extra I/O.
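To illustrate that rule with a toy model (a Python sketch with invented names, not actual PostgreSQL code), the freeze step piggybacks on a page modification that dirties the page anyway:

```python
FROZEN_XID = 2       # stand-in for PostgreSQL's FrozenTransactionId
HALF_RANGE = 2**31   # widest span of 32-bit XIDs that stays unambiguous

def stamp_tuple(page, new_xid):
    """Stamp a tuple with a new (64-bit) XID, first replacing any XID
    that would end up 2^31 or more behind it with FrozenXid.

    The page is being modified, so it is dirtied regardless; the
    freezing step adds no extra write I/O of its own.
    """
    unfrozen = [x for x in page["xids"] if x != FROZEN_XID]
    if unfrozen and new_xid - min(unfrozen) >= HALF_RANGE:
        page["xids"] = [x if x == FROZEN_XID or new_xid - x < HALF_RANGE
                        else FROZEN_XID
                        for x in page["xids"]]
    page["xids"].append(new_xid)

page = {"xids": [100, 200]}
stamp_tuple(page, 100 + 2**31)   # xid 100 is now 2^31 behind: frozen
```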
This would also be the first step in allowing the clog to grow larger
than 2 billion transactions, eliminating the need for anti-wraparound
freezing altogether. You'd still want to truncate the clog eventually,
but it would be nice to not be pressed against the wall with "run vacuum
freeze now, or the system will shut down".
(*) "Adding an epoch" is inaccurate, but I like to use that as my mental
model. If you just add a 32-bit epoch field, then you cannot have xids
from different epochs on the page, which would be a problem. In reality,
you would store one 64-bit XID value in the page header, and use that as
the "reference point" for all the 32-bit XIDs on the tuples. See
existing convert_txid() function for how that works. Another method is
to store the 32-bit xid values in tuple headers as offsets from the
per-page 64-bit value, but then you'd always need to have the 64-bit
value at hand when interpreting the XIDs, even if they're all recent.
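The reference-point arithmetic can be sketched as follows (Python, illustrative only; the function name is invented, but the correction step mirrors what convert_txid() does with the epoch):

```python
def xid32_to_xid64(xid32, ref_xid64):
    """Recover the full 64-bit XID for a 32-bit on-tuple value, given
    the page's 64-bit reference XID. Valid as long as the true XID lies
    within 2^31 of the reference point."""
    # Start from the reference's epoch, then shift by one epoch if the
    # 32-bit values wrapped around between the two.
    candidate = (ref_xid64 & ~0xFFFFFFFF) | xid32
    if candidate > ref_xid64 + 2**31:
        candidate -= 2**32
    elif candidate < ref_xid64 - 2**31:
        candidate += 2**32
    return candidate

ref = (5 << 32) | 1000          # reference: epoch 5, low-order xid 1000
xid32_to_xid64(1010, ref)       # same epoch: (5 << 32) + 1010
xid32_to_xid64(2**32 - 5, ref)  # just before the wrap: epoch 4
```

This is why one page can legitimately mix XIDs from two adjacent epochs, which a plain per-page epoch field could not represent.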
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Heikki,
This sounds a lot like my idea for 9.3, which didn't go anywhere.
You've worked out the issues I couldn't, I think.
Another method is
to store the 32-bit xid values in tuple headers as offsets from the
per-page 64-bit value, but then you'd always need to have the 64-bit
value at hand when interpreting the XIDs, even if they're all recent.
Yeah, -1 on the latter, not least because it would require a 100%
rewrite of the tables in order to upgrade.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
The reason we have to freeze is that otherwise our 32-bit XIDs wrap around
and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but
that would waste a lot of space. The trick is to add a field to the page header
indicating the 'epoch' of the XID, while keeping the XIDs in the tuple header
32 bits wide (*).
Check.
The other reason we freeze is to truncate the clog. But with 64-bit XIDs, we
wouldn't actually need to change old XIDs on disk to FrozenXid. Instead, we
could implicitly treat anything older than relfrozenxid as frozen.
Check.
That's the basic idea. Vacuum freeze only needs to remove dead tuples, but
doesn't need to dirty pages that contain no dead tuples.
Check.
Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to
replace the XIDs with FrozenXid whenever the difference between the smallest
and largest XID on a page exceeds 2^31. But that would only happen when
you're updating the page, in which case the page is dirtied anyway, so it
wouldn't cause any extra I/O.
It would cause some extra WAL activity, but it wouldn't dirty the page
an extra time.
This would also be the first step in allowing the clog to grow larger than 2
billion transactions, eliminating the need for anti-wraparound freezing
altogether. You'd still want to truncate the clog eventually, but it would
be nice to not be pressed against the wall with "run vacuum freeze now, or
the system will shut down".
Interesting. That seems like a major advantage.
(*) "Adding an epoch" is inaccurate, but I like to use that as my mental
model. If you just add a 32-bit epoch field, then you cannot have xids from
different epochs on the page, which would be a problem. In reality, you
would store one 64-bit XID value in the page header, and use that as the
"reference point" for all the 32-bit XIDs on the tuples. See existing
convert_txid() function for how that works. Another method is to store the
32-bit xid values in tuple headers as offsets from the per-page 64-bit
value, but then you'd always need to have the 64-bit value at hand when
interpreting the XIDs, even if they're all recent.
As I see it, the main downsides of this approach are:
(1) It breaks binary compatibility (unless you do something to
provide for it, like putting the epoch in the special space).
(2) It consumes 8 bytes per page. I think it would be possible to get
this down to say 5 bytes per page pretty easily; we'd simply decide
that the low-order 3 bytes of the reference XID must always be 0.
Possibly you could even do with 4 bytes, or 4 bytes plus some number
of extra bits.
(3) You still need to periodically scan the entire relation, or else
have a freeze map as Simon and Josh suggested.
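The 5-byte variant in (2) could look something like this (a Python sketch of the packing arithmetic only; the names are invented):

```python
def pack_ref_xid(ref_xid64):
    """Store a 64-bit reference XID in 5 bytes by requiring its
    low-order 3 bytes to be zero, i.e. rounding it down to a multiple
    of 2^24 before storing."""
    truncated = ref_xid64 & ~0xFFFFFF          # clear the low 3 bytes
    return (truncated >> 24).to_bytes(5, "big")

def unpack_ref_xid(stored):
    return int.from_bytes(stored, "big") << 24

ref = 0x0123456789ABCDEF
packed = pack_ref_xid(ref)      # 5 bytes on disk
unpack_ref_xid(packed)          # 0x0123456789000000
```

Rounding the reference down by up to 2^24 narrows the usable XID window on the page by the same amount, which seems consistent with the note that 4 bytes alone might need some extra bits.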
The upsides of this approach as compared with what Andres and I are
proposing are:
(1) It provides a stepping stone towards allowing indefinite expansion
of CLOG, which is quite appealing as an alternative to a hard
shut-down.
(2) It doesn't place any particular requirements on PD_ALL_VISIBLE. I
don't personally find this much of a benefit as I want to keep
PD_ALL_VISIBLE, but I know Jeff and perhaps others disagree.
Random thought: Could you compute the reference XID based on the page
LSN? That would eliminate the storage overhead.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, May 30, 2013 at 1:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
The reason we have to freeze is that otherwise our 32-bit XIDs wrap around
and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but
that would waste a lot of space. The trick is to add a field to the page header
indicating the 'epoch' of the XID, while keeping the XIDs in the tuple header
32 bits wide (*).
(3) You still need to periodically scan the entire relation, or else
have a freeze map as Simon and Josh suggested.
Why is this scan required?
Also, what happens if you delete a tuple on a page when another tuple
on the same page, with age > 2^32, belongs to a transaction that is still open?
merlin
On 30.05.2013 21:46, Merlin Moncure wrote:
On Thu, May 30, 2013 at 1:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
The reason we have to freeze is that otherwise our 32-bit XIDs wrap around
and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but
that would waste a lot of space. The trick is to add a field to the page header
indicating the 'epoch' of the XID, while keeping the XIDs in the tuple header
32 bits wide (*).
(3) You still need to periodically scan the entire relation, or else
have a freeze map as Simon and Josh suggested.
Why is this scan required?
To find all the dead tuples and remove them, and advance relfrozenxid.
That in turn is required so that you can truncate the clog. This scheme
relies on assuming that everything older than relfrozenxid committed, so
if there are any aborted XIDs present in the table, you can't advance
relfrozenxid past them.
Come to think of it, if there are no aborted XIDs in a range of XIDs,
only commits, then you could just advance relfrozenxid past that range
and truncate away the clog, without scanning the table. But that's quite
a special case - generally there would be at least a few aborted XIDs -
so it's probably not worth adding any special code to take advantage of
that.
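A deliberately simplified model of the relfrozenxid rule described above (a Python sketch with invented names; it assumes dead tuples have already been removed and that clog_status records each remaining XID as 'committed' or 'aborted'):

```python
def advance_relfrozenxid(relfrozenxid, table_xids, clog_status):
    """Advance relfrozenxid as far as possible. Everything older than
    relfrozenxid is implicitly treated as committed (frozen), so it
    must not be advanced past any aborted XID still in the table."""
    aborted = [x for x in table_xids
               if x >= relfrozenxid and clog_status[x] == "aborted"]
    if aborted:
        return min(aborted)      # stuck at the oldest aborted XID
    # No aborted XIDs remain: the whole clog range behind us can go.
    return max(table_xids, default=relfrozenxid - 1) + 1

status = {120: "committed", 150: "aborted", 180: "committed"}
advance_relfrozenxid(100, [120, 150, 180], status)   # stuck at 150
```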
Also, what happens if you delete a tuple on a page when another tuple
on the same page, with age > 2^32, belongs to a transaction that is still open?
Can't let that happen. Same as today.
- Heikki
On 2013-05-30 14:39:46 -0400, Robert Haas wrote:
Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to
replace the XIDs with FrozenXid whenever the difference between the smallest
and largest XID on a page exceeds 2^31. But that would only happen when
you're updating the page, in which case the page is dirtied anyway, so it
wouldn't cause any extra I/O.
It would cause some extra WAL activity, but it wouldn't dirty the page
an extra time.
You probably could do it similarly to how we currently do
XLOG_HEAP_ALL_VISIBLE_CLEARED and just recheck the page on replay. The
insert/update/delete record will already contain a FPI if necessary, so
that should be safe.
This would also be the first step in allowing the clog to grow larger than 2
billion transactions, eliminating the need for anti-wraparound freezing
altogether. You'd still want to truncate the clog eventually, but it would
be nice to not be pressed against the wall with "run vacuum freeze now, or
the system will shut down".
Interesting. That seems like a major advantage.
Hm. Why? If freezing gets notably cheaper I don't really see much need
for keeping that much clog around? If we still run into anti-wraparound
areas, there has to be some major operational problem.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, May 30, 2013 at 04:33:50PM +0300, Heikki Linnakangas wrote:
This would also be the first step in allowing the clog to grow
larger than 2 billion transactions, eliminating the need for
anti-wraparound freezing altogether. You'd still want to truncate
the clog eventually, but it would be nice to not be pressed against
the wall with "run vacuum freeze now, or the system will shut down".
Keep in mind that autovacuum_freeze_max_age is 200M to allow faster clog
truncation. Does this help that?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Thu, May 30, 2013 at 3:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-05-30 14:39:46 -0400, Robert Haas wrote:
Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to
replace the XIDs with FrozenXid whenever the difference between the smallest
and largest XID on a page exceeds 2^31. But that would only happen when
you're updating the page, in which case the page is dirtied anyway, so it
wouldn't cause any extra I/O.
It would cause some extra WAL activity, but it wouldn't dirty the page
an extra time.
You probably could do it similarly to how we currently do
XLOG_HEAP_ALL_VISIBLE_CLEARED and just recheck the page on replay. The
insert/update/delete record will already contain a FPI if necessary, so
that should be safe.
Ah, good point.
This would also be the first step in allowing the clog to grow larger than 2
billion transactions, eliminating the need for anti-wraparound freezing
altogether. You'd still want to truncate the clog eventually, but it would
be nice to not be pressed against the wall with "run vacuum freeze now, or
the system will shut down".
Interesting. That seems like a major advantage.
Hm. Why? If freezing gets notably cheaper I don't really see much need
for keeping that much clog around? If we still run into anti-wraparound
areas, there has to be some major operational problem.
That is true, but we have a decent number of customers who do in fact
have such problems. I think that's only going to get more common. As
hardware gets faster and PostgreSQL improves, people are going to
process more and more transactions in shorter and shorter periods of
time. Heikki's benchmark results for the XLOG scaling patch show
rates of >80,000 tps. Even at a more modest 10,000 tps, with default
settings, you'll do anti-wraparound vacuums of the entire cluster
about every 8 hours. That's not fun.
Being able to do such scans only of the not-all-visible pages would be
a huge step forward, of course. But not having to do them on any
particular deadline would be a whole lot better.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, May 30, 2013 at 2:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Random thought: Could you compute the reference XID based on the page
LSN? That would eliminate the storage overhead.
After mulling this over a bit, I think this is definitely possible.
We begin a new "half-epoch" every 2 billion transactions. We remember
the LSN at which the current half-epoch began and the LSN at which the
previous half-epoch began. When a new half-epoch begins, the first
backend that wants to stamp a tuple with an XID from the new
half-epoch must first emit a "new half-epoch" WAL record, which
becomes the starting LSN for the new half-epoch.
We define a new page-level bit, something like PD_RECENTLY_FROZEN.
When this bit is set, it means there are no unfrozen tuples on the
page with XIDs that predate the current half-epoch. Whenever we know
this to be true, we set the bit. If the page LSN crosses more than
one half-epoch boundary at a time, we freeze the page and set the bit.
If the page LSN crosses exactly one half-epoch boundary, then (1) if
the bit is set, we clear it and (2) if the bit is not set, we freeze
the page and set the bit. The advantage of this is that we avoid an
epidemic of freezing right after a half-epoch change. Immediately
after a half-epoch change, many pages will mix tuples from the current
and previous half-epoch - but relatively few pages will have tuples
from the current half-epoch and a half-epoch more than one in the
past.
As things stand today, we really only need to remember the last two
half-epoch boundaries; they could be stored, for example, in the
control file. But if we someday generalize CLOG to allow indefinite
retention as you suggest, we could instead remember all half-epoch
boundaries that have ever occurred; just maintain a file someplace
with 8 bytes of data for every 2 billion XIDs consumed over the
lifetime of the cluster. In fact, we might want to do it that way
anyhow, just to keep our options open, and perhaps for forensics.
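The crossing count and the PD_RECENTLY_FROZEN rules above can be sketched like so (Python, illustrative only; a real patch would operate on page headers, and the boundary list stands in for the per-half-epoch LSN record described above):

```python
import bisect

def crossings(page_lsn, current_lsn, boundaries):
    """How many half-epoch boundary LSNs fall between the page's last
    write and the current insert position (boundaries sorted ascending)."""
    return (bisect.bisect_right(boundaries, current_lsn)
            - bisect.bisect_right(boundaries, page_lsn))

def on_modify(page, current_lsn, boundaries):
    """Apply the PD_RECENTLY_FROZEN rules when dirtying a page."""
    n = crossings(page["lsn"], current_lsn, boundaries)
    if n > 1:
        page["needs_freeze_pass"] = True   # freeze pre-current-half-epoch tuples
        page["pd_recently_frozen"] = True
    elif n == 1:
        if page["pd_recently_frozen"]:
            page["pd_recently_frozen"] = False  # tuples at most one half-epoch old
        else:
            page["needs_freeze_pass"] = True
            page["pd_recently_frozen"] = True
    page["lsn"] = current_lsn

bounds = [1000, 2000, 3000]    # LSNs at which each half-epoch began
page = {"lsn": 500, "pd_recently_frozen": False, "needs_freeze_pass": False}
on_modify(page, 2500, bounds)  # crosses two boundaries: freeze and set bit
```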
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 31.05.2013 00:06, Bruce Momjian wrote:
On Thu, May 30, 2013 at 04:33:50PM +0300, Heikki Linnakangas wrote:
This would also be the first step in allowing the clog to grow
larger than 2 billion transactions, eliminating the need for
anti-wraparound freezing altogether. You'd still want to truncate
the clog eventually, but it would be nice to not be pressed against
the wall with "run vacuum freeze now, or the system will shut down".
Keep in mind that autovacuum_freeze_max_age is 200M to allow faster clog
truncation. Does this help that?
No. If you want to keep autovacuum_freeze_max_age set at less than 2
billion, you don't need support for more than 2 billion transactions.
But for those who would like to set autovacuum_freeze_max_age higher
than 2B, it would be nice to allow it.
Actually, even with autovacuum_freeze_max_age = 200 M, it would be nice
to not have the hard stop at 2 billion, in case autovacuum falls behind
really badly. With autovacuum_freeze_max_age = 200M, there's a lot of
safety margin, but with 1000M or so, not so much.
- Heikki
On Thu, May 30, 2013 at 10:04:23PM -0400, Robert Haas wrote:
Hm. Why? If freezing gets notably cheaper I don't really see much need
for keeping that much clog around? If we still run into anti-wraparound
areas, there has to be some major operational problem.
That is true, but we have a decent number of customers who do in fact
have such problems. I think that's only going to get more common. As
hardware gets faster and PostgreSQL improves, people are going to
process more and more transactions in shorter and shorter periods of
time. Heikki's benchmark results for the XLOG scaling patch show
rates of >80,000 tps. Even at a more modest 10,000 tps, with default
settings, you'll do anti-wraparound vacuums of the entire cluster
about every 8 hours. That's not fun.
Are you assuming these are all write transactions, hence consuming xids?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Fri, May 31, 2013 at 1:26 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Thu, May 30, 2013 at 10:04:23PM -0400, Robert Haas wrote:
Hm. Why? If freezing gets notably cheaper I don't really see much need
for keeping that much clog around? If we still run into anti-wraparound
areas, there has to be some major operational problem.
That is true, but we have a decent number of customers who do in fact
have such problems. I think that's only going to get more common. As
hardware gets faster and PostgreSQL improves, people are going to
process more and more transactions in shorter and shorter periods of
time. Heikki's benchmark results for the XLOG scaling patch show
rates of >80,000 tps. Even at a more modest 10,000 tps, with default
settings, you'll do anti-wraparound vacuums of the entire cluster
about every 8 hours. That's not fun.
Are you assuming these are all write transactions, hence consuming xids?
Well, there might be read-only transactions as well, but the point is
about how many write transactions there can be. 10,000 tps or more is
not out of the question even today, and progressively higher numbers
are only going to become more and more common.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 30 May 2013 14:33, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
Since we're bashing around ideas around freezing, let me write down the idea
I've been pondering and discussing with various people for years. I don't
think I invented this myself, apologies to whoever did for not giving
credit.
The reason we have to freeze is that otherwise our 32-bit XIDs wrap around
and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but
that would waste a lot of space. The trick is to add a field to the page header
indicating the 'epoch' of the XID, while keeping the XIDs in the tuple header
32 bits wide (*).
The other reason we freeze is to truncate the clog. But with 64-bit XIDs, we
wouldn't actually need to change old XIDs on disk to FrozenXid. Instead, we
could implicitly treat anything older than relfrozenxid as frozen.
That's the basic idea. Vacuum freeze only needs to remove dead tuples, but
doesn't need to dirty pages that contain no dead tuples.
I have to say this is pretty spooky. I'd not read hackers all week, so
I had no idea so many other people were thinking about freezing as
well. This idea is damn near identical to what I've suggested. My
suggestion came because I was looking to get rid of fields from the
tuple header, which didn't come to much. The good news is that this is
complete chance, so it must mean we're on the right track.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 30 May 2013 19:39, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, May 30, 2013 at 9:33 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
The reason we have to freeze is that otherwise our 32-bit XIDs wrap around
and become ambiguous. The obvious solution is to extend XIDs to 64 bits, but
that would waste a lot of space. The trick is to add a field to the page header
indicating the 'epoch' of the XID, while keeping the XIDs in the tuple header
32 bits wide (*).
Check.
The other reason we freeze is to truncate the clog. But with 64-bit XIDs, we
wouldn't actually need to change old XIDs on disk to FrozenXid. Instead, we
could implicitly treat anything older than relfrozenxid as frozen.
Check.
That's the basic idea. Vacuum freeze only needs to remove dead tuples, but
doesn't need to dirty pages that contain no dead tuples.
Check.
Yes, this is the critical point. Large insert-only tables don't need
to be completely re-written twice.
Since we're not storing 64-bit wide XIDs on every tuple, we'd still need to
replace the XIDs with FrozenXid whenever the difference between the smallest
and largest XID on a page exceeds 2^31. But that would only happen when
you're updating the page, in which case the page is dirtied anyway, so it
wouldn't cause any extra I/O.
It would cause some extra WAL activity, but it wouldn't dirty the page
an extra time.
This would also be the first step in allowing the clog to grow larger than 2
billion transactions, eliminating the need for anti-wraparound freezing
altogether. You'd still want to truncate the clog eventually, but it would
be nice to not be pressed against the wall with "run vacuum freeze now, or
the system will shut down".
Interesting. That seems like a major advantage.
(*) "Adding an epoch" is inaccurate, but I like to use that as my mental
model. If you just add a 32-bit epoch field, then you cannot have xids from
different epochs on the page, which would be a problem. In reality, you
would store one 64-bit XID value in the page header, and use that as the
"reference point" for all the 32-bit XIDs on the tuples. See existing
convert_txid() function for how that works. Another method is to store the
32-bit xid values in tuple headers as offsets from the per-page 64-bit
value, but then you'd always need to have the 64-bit value at hand when
interpreting the XIDs, even if they're all recent.
As I see it, the main downsides of this approach are:
(1) It breaks binary compatibility (unless you do something to
provide for it, like putting the epoch in the special space).
(2) It consumes 8 bytes per page. I think it would be possible to get
this down to say 5 bytes per page pretty easily; we'd simply decide
that the low-order 3 bytes of the reference XID must always be 0.
Possibly you could even do with 4 bytes, or 4 bytes plus some number
of extra bits.
Yes, the idea of having a "base Xid" on every page is complicated and
breaks compatibility. The same idea can work well if we do this via tuple
headers.
(3) You still need to periodically scan the entire relation, or else
have a freeze map as Simon and Josh suggested.
I don't think that is needed with this approach.
(The freeze map was Andres' idea, not mine. I just accepted it as what
I thought was the only way forwards. Now I see other ways)
The upsides of this approach as compared with what Andres and I are
proposing are:
(1) It provides a stepping stone towards allowing indefinite expansion
of CLOG, which is quite appealing as an alternative to a hard
shut-down.
I would be against expansion of the CLOG beyond its current size. If
we have removed all aborted rows and marked hints, then we don't need
the CLOG values and can trim that down.
I don't mind the hints; it's the freezing we don't need.
convert_txid() function for how that works. Another method is to store the
32-bit xid values in tuple headers as offsets from the per-page 64-bit
value, but then you'd always need to have the 64-bit value at hand when
interpreting the XIDs, even if they're all recent.
You've touched here on the idea of putting the epoch in the tuple
header, which is where what I posted comes together. We don't need
anything at page level, we just need something on each tuple.
Please can you look at my recent post on how to put this in the tuple header?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 31.05.2013 06:02, Robert Haas wrote:
On Thu, May 30, 2013 at 2:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Random thought: Could you compute the reference XID based on the page
LSN? That would eliminate the storage overhead.
After mulling this over a bit, I think this is definitely possible.
We begin a new "half-epoch" every 2 billion transactions. We remember
the LSN at which the current half-epoch began and the LSN at which the
previous half-epoch began. When a new half-epoch begins, the first
backend that wants to stamp a tuple with an XID from the new
half-epoch must first emit a "new half-epoch" WAL record, which
becomes the starting LSN for the new half-epoch.
Clever! Pages in unlogged tables need some extra treatment, as they
don't normally have a valid LSN, but that shouldn't be too hard.
We define a new page-level bit, something like PD_RECENTLY_FROZEN.
When this bit is set, it means there are no unfrozen tuples on the
page with XIDs that predate the current half-epoch. Whenever we know
this to be true, we set the bit. If the page LSN crosses more than
one half-epoch boundary at a time, we freeze the page and set the bit.
If the page LSN crosses exactly one half-epoch boundary, then (1) if
the bit is set, we clear it and (2) if the bit is not set, we freeze
the page and set the bit.
Yep, I think that would work. Want to write the patch, or should I? ;-)
- Heikki
On 1 June 2013 19:48, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 31.05.2013 06:02, Robert Haas wrote:
On Thu, May 30, 2013 at 2:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Random thought: Could you compute the reference XID based on the page
LSN? That would eliminate the storage overhead.
After mulling this over a bit, I think this is definitely possible.
We begin a new "half-epoch" every 2 billion transactions. We remember
the LSN at which the current half-epoch began and the LSN at which the
previous half-epoch began. When a new half-epoch begins, the first
backend that wants to stamp a tuple with an XID from the new
half-epoch must first emit a "new half-epoch" WAL record, which
becomes the starting LSN for the new half-epoch.
Clever! Pages in unlogged tables need some extra treatment, as they don't
normally have a valid LSN, but that shouldn't be too hard.
I like the idea of using the LSN to indicate the epoch. It saves any
other work we might consider, such as setting page or tuple level
epochs.
We define a new page-level bit, something like PD_RECENTLY_FROZEN.
When this bit is set, it means there are no unfrozen tuples on the
page with XIDs that predate the current half-epoch. Whenever we know
this to be true, we set the bit. If the page LSN crosses more than
one half-epoch boundary at a time, we freeze the page and set the bit.
If the page LSN crosses exactly one half-epoch boundary, then (1) if
the bit is set, we clear it and (2) if the bit is not set, we freeze
the page and set the bit.
Yep, I think that would work. Want to write the patch, or should I? ;-)
If we set a bit, surely we need to write the page. Isn't that what we
were trying to avoid?
Why set a bit at all? If we know the LSN of the page and we know the
epoch boundaries, then we can work out when the page was last written
to and infer that the page is "virtually frozen".
As soon as we make a change to a "virtually frozen" page, we can
actually freeze it and then make the change.
But we still have the problem of knowing which pages have been frozen
and which haven't.
Can we clear up those points first? Or at least my understanding of them.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Jun 1, 2013 at 2:48 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
We define a new page-level bit, something like PD_RECENTLY_FROZEN.
When this bit is set, it means there are no unfrozen tuples on the
page with XIDs that predate the current half-epoch. Whenever we know
this to be true, we set the bit. If the page LSN crosses more than
one half-epoch boundary at a time, we freeze the page and set the bit.
If the page LSN crosses exactly one half-epoch boundary, then (1) if
the bit is set, we clear it and (2) if the bit is not set, we freeze
the page and set the bit.
Yep, I think that would work. Want to write the patch, or should I? ;-)
Have at it. I think the tricky part is going to be figuring out the
synchronization around half-epoch boundaries.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Jun 1, 2013 at 3:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
If we set a bit, surely we need to write the page. Isn't that what we
were trying to avoid?
No, the bit only gets set in situations when we were going to dirty
the page for some other reason anyway. Specifically, if a page
modification discovers that we've switched epochs (but just once) and
the bit isn't already set, we can set it in lieu of scanning the
entire page for tuples that need freezing.
Under this proposal, pages that don't contain any dead tuples needn't
be dirtied for freezing, ever. Smells like awesome.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 1 June 2013 21:26, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Jun 1, 2013 at 3:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
If we set a bit, surely we need to write the page. Isn't that what we
were trying to avoid?
No, the bit only gets set in situations when we were going to dirty
the page for some other reason anyway. Specifically, if a page
modification discovers that we've switched epochs (but just once) and
the bit isn't already set, we can set it in lieu of scanning the
entire page for tuples that need freezing.
Under this proposal, pages that don't contain any dead tuples needn't
be dirtied for freezing, ever. Smells like awesome.
Agreed, well done both.
What I especially like about it is how little logic it will require,
and no page format changes.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, May 31, 2013 at 3:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Even at a more modest 10,000 tps, with default
settings, you'll do anti-wraparound vacuums of the entire cluster
about every 8 hours. That's not fun.
I've forgotten now. What happens if you have a long-lived transaction
still alive from > 2B xid ago?
--
greg