[WIP] In-place upgrade

Started by Zdenek Kotala over 17 years ago, 70 messages, hackers
#1Zdenek Kotala
Zdenek.Kotala@Sun.COM

This is really the first patch that is not just cleanup: it adds in-place upgrade
functionality. The patch requires the other cleanup patches which I have already sent.
You can also find a GIT repository with a "workable" version.

The main point is that tuples are converted to the latest version in the SeqScan and
IndexScan nodes. All storage/access modules are able to process databases from 8.1 to 8.4
(Page Layout 3 and 4).

What works:
- select - heap scan is OK, but index scan does not work on varlena datatypes. I
need to convert the index key somewhere in index access.

What does not work:
- conversion of tuples which contain arrays, composite datatypes and toast
- vacuum - it tries to clean up old pages; it would probably be better to convert
them to the new format during processing...
- insert/delete/update

The patch contains a lot of extra comments and rubbish, but it is in the process of
cleanup.

What I need to know/solve:

1) yes/no for this kind of online upgrade method
2) I'm not sure if the call to ExecStoreTuple is correct.
3) I'm still looking for the best place to store old data structures and conversion
functions. My idea is to create new directories:
src/include/odf/v03/...
src/backend/storage/upgrade/
src/backend/access/upgrade
(odf = On Disk Format)

Links:
http://git.postgresql.org/?p=~davidfetter/upgrade_in_place/.git;a=summary
http://src.opensolaris.org/source/xref/sfw/usr/src/cmd/postgres/postgresql-upgrade/

Thanks for your comments

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql

Attachments:

inplaceupgrade.patch (text/x-diff; +1287 -696)
#2Robert Haas
robertmhaas@gmail.com
In reply to: Zdenek Kotala (#1)
Re: [WIP] In-place upgrade

I tried to apply this patch to CVS HEAD and it blew up all over the
place. It doesn't seem to be intended to apply against CVS HEAD; for
example, I don't have backend/access/heap/htup.c at all, so can't
apply changes to that file. I was able to clone the GIT repository
with the following command...

git clone http://git.postgresql.org/git/~davidfetter/upgrade_in_place/.git

...but now I'm confused, because I don't see the changes from the diff
reflected in the resulting tree. As you can see, I am not a git
wizard. Any help would be appreciated.

Here are a few initial thoughts based mostly on reading the diff:

In the minor nit department, I don't really like the idea of
PageHeaderData_04, SizeOfPageHeaderData04, PageLayoutIsValid_04, etc.
I think the latest version should just be PageHeaderData and
SizeOfPageHeaderData, and previous versions should be, e.g.
PageHeaderDataV3. It looks to me like this would cut a few hunks out
of this and maybe make it a bit easier to understand what is going on.
At any rate, if we are going to stick with an explicit version number
in both versions, it should be marked in a consistent way, not _04
sometimes and just 04 other times. My suggestion is e.g. "V4" but
YMMV.

The changes to nodeIndexscan.c and nodeSeqscan.c are worrisome to me.
It looks like the added code is (nearly?) identical in both places, so
probably it needs to be refactored to avoid code duplication. I'm
also a bit skeptical about the idea of doing the tuple conversion
here. Why here rather than ExecStoreTuple()? If you decide to
convert the tuple, you can palloc the new one, pfree the old one if
shouldFree is set, and reset shouldFree to true.
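
The conversion step suggested above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patch's code: DemoTuple, DemoSlot, demo_convert_v3_to_v4 and demo_store_tuple are hypothetical names, the tuple is reduced to a version number plus one payload field, and malloc/free stand in for palloc/pfree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified tuple: a layout version plus one payload field. */
typedef struct DemoTuple
{
    int version;            /* 3 = old layout, 4 = current layout */
    int payload;
} DemoTuple;

/* Hypothetical slot that remembers whether it owns the tuple's memory. */
typedef struct DemoSlot
{
    DemoTuple *tuple;
    bool       shouldFree;
} DemoSlot;

/* Build a freshly allocated V4 copy of a V3 tuple (malloc stands in for palloc). */
static DemoTuple *
demo_convert_v3_to_v4(const DemoTuple *old)
{
    DemoTuple *converted = malloc(sizeof(DemoTuple));

    assert(converted != NULL);
    converted->version = 4;
    converted->payload = old->payload;
    return converted;
}

/* Store a tuple in the slot, converting it first if it is in the old layout. */
static void
demo_store_tuple(DemoSlot *slot, DemoTuple *tuple, bool shouldFree)
{
    assert(tuple->version == 3 || tuple->version == 4);

    if (tuple->version == 3)
    {
        DemoTuple *converted = demo_convert_v3_to_v4(tuple);

        if (shouldFree)
            free(tuple);    /* the old copy is no longer needed */
        tuple = converted;
        shouldFree = true;  /* the slot now owns the converted copy */
    }

    slot->tuple = tuple;
    slot->shouldFree = shouldFree;
}
```

In this sketch a V4 tuple is stored untouched, so the current-format path pays only for one comparison; all allocation happens on the V3 path.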

I am pretty skeptical of the idea that all of the HeapTuple* functions
can just be conditionalized on the page version and everything will
Just Work. It seems like that is too low a level to be worrying about
such things. Even if it happens to work for the changes between V3
and V4, what happens when V5 or V6 is changed in such a way that the
answer to HeapTupleIsWhatever is neither "Yes" nor "No", but rather
"Maybe" or "Seven"? The performance hit also sounds painful. I don't
have a better idea right now though...

I think it's going to be absolutely imperative to begin vacuuming away
old V3 pages as quickly as possible after the upgrade. If you go with
the approach of converting the tuple in, or just before,
ExecStoreTuple, then you're going to introduce a lot of overhead when
working with V3 pages. I think that's fine. You should plan to do
your in-place upgrade at 1AM on Christmas morning (or whenever your
load hits rock bottom...) and immediately start converting the
database, starting with your most important and smallest tables. In
fact, I would look whenever possible for ways to make the V4 case a
fast-path and just accept that the system is going to labor a bit when
dealing with V3 stuff. Any overhead you introduce when dealing with
V3 pages can go away; any V4 overhead is permanent and therefore much
more difficult to accept.

That's about all I have for now... if you can give me some pointers on
working with this git repository, or provide a complete patch that
applies cleanly to CVS HEAD, I will try to look at this in more
detail.

...Robert

#3Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Robert Haas (#2)
Re: [WIP] In-place upgrade

Big thanks for the review.

Robert Haas wrote:

I tried to apply this patch to CVS HEAD and it blew up all over the
place. It doesn't seem to be intended to apply against CVS HEAD; for
example, I don't have backend/access/heap/htup.c at all, so can't
apply changes to that file.

You also need to apply two other patches, which are located here:
http://wiki.postgresql.org/wiki/CommitFestInProgress#Upgrade-in-place_and_related_issues
I moved one related patch from another category to its correct place there.

The problem is that it is difficult to keep it in sync with head, because they
change a lot of things. That is the reason why I also put everything into a GIT repository,
but ...

I was able to clone the GIT repository
with the following command...

git clone http://git.postgresql.org/git/~davidfetter/upgrade_in_place/.git

...but now I'm confused, because I don't see the changes from the diff
reflected in the resulting tree. As you can see, I am not a git
wizard. Any help would be appreciated.

I'm a GIT newbie; I use Mercurial for development and I manually applied the changes
into GIT. I asked David Fetter for help with getting the correct clone back. In the
meantime you can download a tarball.

http://git.postgresql.org/?p=~davidfetter/upgrade_in_place/.git;a=snapshot;h=c72bafada59ed278ffac59657c913bc375f77808;sf=tgz

It should contain everything, including yesterday's improvements (delete,
insert, update work - insert/update only on tables without an index).

Here are a few initial thoughts based mostly on reading the diff:

In the minor nit department, I don't really like the idea of
PageHeaderData_04, SizeOfPageHeaderData04, PageLayoutIsValid_04, etc.
I think the latest version should just be PageHeaderData and
SizeOfPageHeaderData, and previous versions should be, e.g.
PageHeaderDataV3. It looks to me like this would cut a few hunks out
of this and maybe make it a bit easier to understand what is going on.
At any rate, if we are going to stick with an explicit version number
in both versions, it should be marked in a consistent way, not _04
sometimes and just 04 other times. My suggestion is e.g. "V4" but
YMMV.

Yeah, it is the most difficult part :-) finding correct names for it. I think that each
version of a structure should have a version suffix, including the last one. And of
course for the last one we should also have a general name without a suffix - see example:

typedef struct PageHeaderData_04 { ... } PageHeaderData_04;
typedef struct PageHeaderData_03 { ... } PageHeaderData_03;
typedef PageHeaderData_04 PageHeaderData;

This allows you to specify the exact version in places where you need it and keep
the general name where the version is not relevant.
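
A minimal sketch of how this naming scheme might be used, assuming heavily reduced headers: layout_version, demo_new_field, demo_upgrade_header and demo_free_space are hypothetical names invented for illustration, and the real page headers have more fields than shown.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced old-layout header. */
typedef struct PageHeaderData_03
{
    uint16_t pd_lower;
    uint16_t pd_upper;
    uint16_t layout_version;
} PageHeaderData_03;

/* Hypothetical, reduced current-layout header. */
typedef struct PageHeaderData_04
{
    uint16_t pd_lower;
    uint16_t pd_upper;
    uint16_t layout_version;
    uint16_t demo_new_field;    /* stand-in for whatever the newer layout adds */
} PageHeaderData_04;

/* The unsuffixed name always means the latest version. */
typedef PageHeaderData_04 PageHeaderData;

/* Conversion code names both versions explicitly. */
static PageHeaderData_04
demo_upgrade_header(const PageHeaderData_03 *old)
{
    PageHeaderData_04 upgraded;

    assert(old->layout_version == 3);
    upgraded.pd_lower = old->pd_lower;
    upgraded.pd_upper = old->pd_upper;
    upgraded.layout_version = 4;
    upgraded.demo_new_field = 0;
    return upgraded;
}

/* Version-agnostic code uses only the general name. */
static uint16_t
demo_free_space(const PageHeaderData *hdr)
{
    assert(hdr->pd_upper >= hdr->pd_lower);
    return (uint16_t) (hdr->pd_upper - hdr->pd_lower);
}
```

Under this convention, only conversion code would ever mention a suffixed name; everything else would compile against whichever version the unsuffixed typedef currently points to.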

What the suffix should look like is another question. I prefer to have 04, not only 4.
What about PageHeaderData_V04?

By the way, what does YMMV mean?

The changes to nodeIndexscan.c and nodeSeqscan.c are worrisome to me.
It looks like the added code is (nearly?) identical in both places, so
probably it needs to be refactored to avoid code duplication. I'm
also a bit skeptical about the idea of doing the tuple conversion
here. Why here rather than ExecStoreTuple()? If you decide to
convert the tuple, you can palloc the new one, pfree the old one if
shouldFree is set, and reset shouldFree to true.

Good point. I thought about it as one variant. And now that I look at it closely, it
is really a much better place. It should fix the problem of why REINDEX does not work.
I will move it.

I am pretty skeptical of the idea that all of the HeapTuple* functions
can just be conditionalized on the page version and everything will
Just Work. It seems like that is too low a level to be worrying about
such things. Even if it happens to work for the changes between V3
and V4, what happens when V5 or V6 is changed in such a way that the
answer to HeapTupleIsWhatever is neither "Yes" nor "No", but rather
"Maybe" or "Seven"? The performance hit also sounds painful. I don't
have a better idea right now though...

OK. Currently it works (or I hope that it works). If somebody in the future invents
some special change, I think in most (maybe all) cases a mapping will be
possible.

The speed is the key point. When I checked it last time I got a 1% performance drop on a
fresh database. I think 1% is a good price for in-place online upgrade.

I think it's going to be absolutely imperative to begin vacuuming away
old V3 pages as quickly as possible after the upgrade. If you go with
the approach of converting the tuple in, or just before,
ExecStoreTuple, then you're going to introduce a lot of overhead when
working with V3 pages. I think that's fine. You should plan to do
your in-place upgrade at 1AM on Christmas morning (or whenever your
load hits rock bottom...) and immediately start converting the
database, starting with your most important and smallest tables. In
fact, I would look whenever possible for ways to make the V4 case a
fast-path and just accept that the system is going to labor a bit when
dealing with V3 stuff. Any overhead you introduce when dealing with
V3 pages can go away; any V4 overhead is permanent and therefore much
more difficult to accept.

Yes, the plan is to improve vacuum to convert old pages to new ones, but as a
second step. I already have page converter code. With some modification it could
be integrated easily into the vacuum code.

That's about all I have for now... if you can give me some pointers on
working with this git repository, or provide a complete patch that
applies cleanly to CVS HEAD, I will try to look at this in more
detail.

Thanks for your comments. Try the snapshot link. I hope that it will work.

Zdenek

PS: I'm sorry about the response time, but I'm in training this week.

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql

#4Robert Haas
robertmhaas@gmail.com
In reply to: Zdenek Kotala (#3)
Re: [WIP] In-place upgrade

You also need to apply two other patches, which are located here:
http://wiki.postgresql.org/wiki/CommitFestInProgress#Upgrade-in-place_and_related_issues
I moved one related patch from another category to its correct place there.

Just to confirm, which two?

http://git.postgresql.org/?p=~davidfetter/upgrade_in_place/.git;a=snapshot;h=c72bafada59ed278ffac59657c913bc375f77808;sf=tgz

It should contain everything, including yesterday's improvements (delete,
insert, update work - insert/update only on tables without an index).

Wow, sounds like great improvements. I understand your difficulties
in keeping up with HEAD, but I hope we can figure out some solution,
because right now I have a diff (that I can't apply) and a tarball
(that I can't diff) and that is not ideal for reviewing.

Yeah, it is the most difficult part :-) finding correct names for it. I think that
each version of a structure should have a version suffix, including the last one. And
of course for the last one we should also have a general name without a suffix - see
example:

typedef struct PageHeaderData_04 { ... } PageHeaderData_04;
typedef struct PageHeaderData_03 { ... } PageHeaderData_03;
typedef PageHeaderData_04 PageHeaderData;

This allows you to specify the exact version in places where you need it and keep
the general name where the version is not relevant.

That doesn't make sense to me. If PageHeaderData and
PageHeaderData_04 are the same type, how do you decide which one to
use in any particular place in the code?

What the suffix should look like is another question. I prefer to have 04, not only 4.
What about PageHeaderData_V04?

I prefer "V" as a delimiter rather than "_" because that makes it more
clear that the number which follows is a version number, but I think
"_V" is overkill. However, I don't really want to argue the point;
I'm just throwing in my $0.02 and I am sure others will have their own
views as well.

By the way, what does YMMV mean?

"Your Mileage May Vary."
http://www.urbandictionary.com/define.php?term=YMMV

I am pretty skeptical of the idea that all of the HeapTuple* functions
can just be conditionalized on the page version and everything will
Just Work. It seems like that is too low a level to be worrying about
such things. Even if it happens to work for the changes between V3
and V4, what happens when V5 or V6 is changed in such a way that the
answer to HeapTupleIsWhatever is neither "Yes" nor "No", but rather
"Maybe" or "Seven"? The performance hit also sounds painful. I don't
have a better idea right now though...

OK. Currently it works (or I hope that it works). If somebody in the future
invents some special change, I think in most (maybe all) cases a mapping will be
possible.

The speed is the key point. When I checked it last time I got a 1% performance drop
on a fresh database. I think 1% is a good price for in-place online upgrade.

I think that's arguable and something that needs to be more broadly
discussed. I wouldn't be keen to pay a 1% performance drop for this
feature, because it's not a feature I really need. Sure, in-place
upgrade would be nice to have, but for me, dump and reload isn't a
huge problem. It's a lot better than the 5% number you quoted
previously, but I'm not sure whether it is good enough.

I would feel more comfortable if the feature could be completely
disabled via compile-time defines. Then you could build the system
either with or without in-place upgrade, according to your needs. But
I don't think that's very practical with HeapTuple* as functions. You
could conditionalize away the switch, but the function call overhead
would remain. To get rid of that, you'd need some enormous, fragile
hack that I don't even want to contemplate.

Really, what I'd ideally like to see here is a system where the V3
code is in essence error-recovery code. Everything should be V4-only
unless you detect a V3 page, and then you error out (if in-place
upgrade is not enabled) or jump to the appropriate V3-aware code (if
in-place upgrade is enabled). In theory, with a system like this, it
seems like the overhead for V4 ought to be no more than the cost of
checking the page version on each page read, which is a cheap sanity
check we'd be willing to pay for anyway, and trivial in cost.
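
The structure described above, where old-layout handling is reached only as a fallback and can be compiled out, might look roughly like this. It is a sketch under assumptions: DEMO_ENABLE_INPLACE_UPGRADE, DemoPage, DemoReadResult, demo_handle_old_page and demo_check_page are all hypothetical names, and the page is reduced to a single version byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CURRENT_LAYOUT 4

typedef enum DemoReadResult
{
    DEMO_READ_OK,           /* page is in the current layout */
    DEMO_READ_CONVERTED,    /* page was handed to old-layout code */
    DEMO_READ_ERROR         /* page layout is not supported */
} DemoReadResult;

/* Hypothetical page, reduced to its layout version. */
typedef struct DemoPage
{
    uint8_t layout_version;
} DemoPage;

#ifdef DEMO_ENABLE_INPLACE_UPGRADE
/* Slow path, compiled in only when in-place upgrade support is requested. */
static DemoReadResult
demo_handle_old_page(const DemoPage *page)
{
    return page->layout_version == 3 ? DEMO_READ_CONVERTED : DEMO_READ_ERROR;
}
#endif

/* The current layout costs one comparison; everything else is a fallback. */
static DemoReadResult
demo_check_page(const DemoPage *page)
{
    assert(page != NULL);

    if (page->layout_version == DEMO_CURRENT_LAYOUT)
        return DEMO_READ_OK;

#ifdef DEMO_ENABLE_INPLACE_UPGRADE
    return demo_handle_old_page(page);
#else
    return DEMO_READ_ERROR;
#endif
}
```

Built without the define, an old page is simply rejected; built with it, the same check dispatches to the V3-aware code, and the V4 path is identical in both builds.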

But I think we probably need some input from -core on this topic as well.

...Robert

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#4)
Re: [WIP] In-place upgrade

"Robert Haas" <robertmhaas@gmail.com> writes:

Really, what I'd ideally like to see here is a system where the V3
code is in essence error-recovery code. Everything should be V4-only
unless you detect a V3 page, and then you error out (if in-place
upgrade is not enabled) or jump to the appropriate V3-aware code (if
in-place upgrade is enabled). In theory, with a system like this, it
seems like the overhead for V4 ought to be no more than the cost of
checking the page version on each page read, which is a cheap sanity
check we'd be willing to pay for anyway, and trivial in cost.

We already do check the page version on read-in --- see PageHeaderIsValid.

But I think we probably need some input from -core on this topic as well.

I concur that I don't want to see this patch adding more than the
absolute unavoidable minimum of overhead for data that meets the
"current" layout definition. I'm disturbed by the proposal to stick
overhead into tuple header access, for example.

regards, tom lane

#6Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#5)
Re: [WIP] In-place upgrade

We already do check the page version on read-in --- see PageHeaderIsValid.

Right, but the only place this is called is in ReadBuffer_common,
which doesn't seem like a suitable place to deal with the possibility
of a V3 page since you don't yet know what you plan to do with it.
I'm not quite sure what the right solution to that problem is...

But I think we probably need some input from -core on this topic as well.

I concur that I don't want to see this patch adding more than the
absolute unavoidable minimum of overhead for data that meets the
"current" layout definition. I'm disturbed by the proposal to stick
overhead into tuple header access, for example.

...but it seems like we both agree that conditionalizing heap tuple
header access on page version is not the right answer. Based on that,
I'm going to move the "htup and bufpage API clean up" patch to
"Returned with feedback" and continue reviewing the remainder of these
patches.

As I'm looking at this, I'm realizing another problem - there is a lot
of code that looks like this:

void HeapTupleSetXmax(HeapTuple tuple, TransactionId xmax)
{
    switch(tuple->t_ver)
    {
        case 4 : tuple->t_data->t_choice.t_heap.t_xmax = xmax;
                 break;
        case 3 : TPH03(tuple)->t_choice.t_heap.t_xmax = xmax;
                 break;
        default: elog(PANIC, "HeapTupleSetXmax is not supported.");
    }
}

TPH03 is a macro that is casting tuple->t_data to HeapTupleHeader_03.
Unless I'm missing something, that means that given an arbitrary
pointer to HeapTuple, there is absolutely no guarantee that
tuple->t_data->t_choice actually points to that field at all. It will
if tuple->t_ver happens to be 4 OR if HeapTupleHeader and
HeapTupleHeader_03 happen to agree on where t_choice is; otherwise it
points to some other member of HeapTupleHeader_03, or off the end of
the structure. To me that seems unacceptably fragile, because it
means the compiler can't warn us that we're using a pointer
inappropriately. If we truly want to be safe here then we need to
create an opaque HeapTupleHeader structure that contains only those
elements that HeapTupleHeader_03 and HeapTupleHeader_04 have in
common, and cast BOTH of them after checking the version. That way if
someone writes a function that attempts to dereference a HeapTupleHeader
without going through the API, it will fail to compile rather than
mostly working but possibly failing on a V3 page.
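
One way to read the opaque-header suggestion is sketched below, in its strictest form where the opaque type exposes no members at all. The names DemoTupleHeader, DemoTupleHeaderV3, DemoTupleHeaderV4, demo_set_xmax and demo_get_xmax are hypothetical, and the two layouts are invented so that xmax sits at different offsets; they are not the real tuple headers.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t DemoXid;

/* Opaque type: no members are visible, so hdr->field does not compile. */
typedef struct DemoTupleHeader DemoTupleHeader;

/* Hypothetical old layout: an extra leading field shifts xmin and xmax. */
typedef struct DemoTupleHeaderV3
{
    uint32_t extra;
    DemoXid  xmin;
    DemoXid  xmax;
} DemoTupleHeaderV3;

/* Hypothetical current layout. */
typedef struct DemoTupleHeaderV4
{
    DemoXid xmin;
    DemoXid xmax;
} DemoTupleHeaderV4;

typedef struct DemoTuple
{
    int              t_ver;     /* layout version of t_data */
    DemoTupleHeader *t_data;    /* must be cast before use */
} DemoTuple;

/* Both versions are cast explicitly, and only after checking t_ver. */
static void
demo_set_xmax(DemoTuple *tuple, DemoXid xmax)
{
    switch (tuple->t_ver)
    {
        case 4:
            ((DemoTupleHeaderV4 *) tuple->t_data)->xmax = xmax;
            break;
        case 3:
            ((DemoTupleHeaderV3 *) tuple->t_data)->xmax = xmax;
            break;
        default:
            assert(!"unsupported tuple version");
    }
}

static DemoXid
demo_get_xmax(const DemoTuple *tuple)
{
    switch (tuple->t_ver)
    {
        case 4:
            return ((const DemoTupleHeaderV4 *) tuple->t_data)->xmax;
        case 3:
            return ((const DemoTupleHeaderV3 *) tuple->t_data)->xmax;
        default:
            assert(!"unsupported tuple version");
            return 0;
    }
}
```

With this arrangement, code that writes tuple->t_data->xmax directly is rejected by the compiler because DemoTupleHeader is an incomplete type, which is the safety property argued for above.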

...Robert

#7Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Robert Haas (#4)
Re: [WIP] In-place upgrade

Robert Haas wrote:

Really, what I'd ideally like to see here is a system where the V3
code is in essence error-recovery code. Everything should be V4-only
unless you detect a V3 page, and then you error out (if in-place
upgrade is not enabled) or jump to the appropriate V3-aware code (if
in-place upgrade is enabled). In theory, with a system like this, it
seems like the overhead for V4 ought to be no more than the cost of
checking the page version on each page read, which is a cheap sanity
check we'd be willing to pay for anyway, and trivial in cost.

OK. The original idea was to do "convert on read", which has several problems
with no easy solution. One is that the new data may not fit on the page, and the second
big problem is how to convert TOAST table data. Another, more general problem
is how to convert indexes...

Convert on read has minimal impact on core when the latest version is processed. But the
problem is what happens when you need to migrate a tuple from its page to a new one,
modify the index, and also convert TOAST value(s)... The problem is that the response
could take a long time for some queries, because it triggers a lot of changes and conversion.
I think in a corner case it could require converting all indexes when you request
one record.

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql

#8Robert Haas
robertmhaas@gmail.com
In reply to: Zdenek Kotala (#7)
Re: [WIP] In-place upgrade

OK. The original idea was to do "convert on read", which has several
problems with no easy solution. One is that the new data may not fit on the
page, and the second big problem is how to convert TOAST table data. Another,
more general problem is how to convert indexes...

Convert on read has minimal impact on core when the latest version is processed.
But the problem is what happens when you need to migrate a tuple from its page to a new
one, modify the index, and also convert TOAST value(s)... The problem is that the
response could take a long time for some queries, because it triggers a lot of changes
and conversion. I think in a corner case it could require converting all indexes
when you request one record.

I don't think I'm proposing convert on read, exactly. If you actually
try to convert the entire page when you read it in, I think you're
doomed to failure, because, as you rightly point out, there is
absolutely no guarantee that the page contents in their new format
will still fit into one block. I think what you want to do is convert
the structures within the page one by one as you read them out of the
page. The proposed refactoring of ExecStoreTuple will do exactly
this, for example.

HEAD uses a pointer into the actual buffer for a V4 tuple that comes
from an existing relation, and a pointer to a palloc'd structure for a
tuple that is generated during query execution. The proposed
refactoring will keep these rules, plus add a new rule that if you
happen to read a V3 page, you will palloc space for a new V4 tuple
that is semantically equivalent to the V3 tuple on the page, and use
that pointer instead. That, it seems to me, is exactly the right
balance - the PAGE is still a V3 page, but all of the tuples that the
upper-level code ever sees are V4 tuples.

I'm not sure how far this particular approach can be generalized.
ExecStoreTuple has the advantage that it already has to deal with both
direct buffer pointers and palloc'd structures, so the code doesn't
need to be much more complex to handle this case as well. I think the
thing to do is go through and scrutinize all of the ReadBuffer call
sites and figure out an approach to each one. I haven't looked at
your latest code yet, so you may have already done this, but just for
example, RelationGetBufferForTuple should probably just reject any V3
pages encountered as if they were full, including updating the FSM
where appropriate. I would think that it would be possible to
implement that with almost zero performance impact. I'm happy to look
at and discuss the problem cases with you, and hopefully others will
chime in as well since my knowledge of the code is far from
exhaustive.
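
The idea of treating V3 pages as full when looking for insertion space could be sketched like this. It is a toy model, not RelationGetBufferForTuple itself: DemoPageInfo, demo_get_page_for_tuple and the plain array standing in for the free space map are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-page summary. */
typedef struct DemoPageInfo
{
    int    layout_version;
    size_t free_bytes;      /* actual free space on the page */
} DemoPageInfo;

/*
 * Return the index of a page that can take `needed` bytes, or -1 if none can.
 * fsm[] is a toy free space map: the space each page is believed to have.
 */
static int
demo_get_page_for_tuple(const DemoPageInfo *pages, size_t *fsm,
                        size_t npages, size_t needed)
{
    assert(pages != NULL && fsm != NULL);

    for (size_t i = 0; i < npages; i++)
    {
        if (fsm[i] < needed)
            continue;               /* map says it will not fit */

        if (pages[i].layout_version != 4)
        {
            fsm[i] = 0;             /* old layout: record the page as full */
            continue;
        }

        if (pages[i].free_bytes >= needed)
            return (int) i;

        fsm[i] = pages[i].free_bytes;   /* map was stale; correct it */
    }
    return -1;
}
```

Because an old page is marked full the first time it is visited, later searches skip it on the map lookup alone, so the extra version check is paid at most once per old page in this model.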

...Robert

#9Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Robert Haas (#8)
Re: [WIP] In-place upgrade

Robert Haas wrote:

OK. The original idea was to do "convert on read", which has several
problems with no easy solution. One is that the new data may not fit on the
page, and the second big problem is how to convert TOAST table data. Another,
more general problem is how to convert indexes...

Convert on read has minimal impact on core when the latest version is processed.
But the problem is what happens when you need to migrate a tuple from its page to a new
one, modify the index, and also convert TOAST value(s)... The problem is that the
response could take a long time for some queries, because it triggers a lot of changes
and conversion. I think in a corner case it could require converting all indexes
when you request one record.

I don't think I'm proposing convert on read, exactly. If you actually
try to convert the entire page when you read it in, I think you're
doomed to failure, because, as you rightly point out, there is
absolutely no guarantee that the page contents in their new format
will still fit into one block. I think what you want to do is convert
the structures within the page one by one as you read them out of the
page. The proposed refactoring of ExecStoreTuple will do exactly
this, for example.

I see. But vacuum and other internal functions access heap pages directly
without ExecStoreTuple. However, you point to one idea which I'm currently
thinking about too. Here is my version:

If you look into the new page API, it has PageGetHeapTuple. It could do the
conversion job. The problem is that you don't have relation info there, so you
cannot convert the data, but transaction information can be converted.

I am thinking about modifying the HeapTupleData structure. It will have a pointer to
transaction info, t_transinfo, which will point to the page tuple for V4. For V3,
the PageGetHeapTuple function will allocate memory and put the converted data there.

ExecStoreTuple will finally convert the data, because it knows about the relation and it
does not make sense to convert data early. Who wants to convert invisible or dead data?

With this approach tuples will be processed the same way as V4 without any overhead
(there will be a small overhead from allocating and freeing HeapTupleData in some
places - mostly vacuum).

Only multi-version access will be driven on a page basis.

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql

#10Robert Haas
robertmhaas@gmail.com
In reply to: Zdenek Kotala (#9)
Re: [WIP] In-place upgrade

I see. But vacuum and other internal functions access heap pages directly
without ExecStoreTuple.

Right. I don't think there's any getting around the fact that any
function which accesses heap pages directly is going to need
modification. The key is to make those modifications as non-invasive
as possible. For example, in the case of vacuum, as soon as it
detects that a V3 page has been read, it should call a special
function whose only purpose in life is to move the data out of that V3
page and onto one or more V4 pages, and return. What you shouldn't do
is try to make the regular vacuum code handle both V3 and V4 pages,
because that will lead to code that may be slow and will almost
certainly be complicated and difficult to maintain.
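
The vacuum suggestion above, where an old page is handed to a single-purpose routine and the regular code stays V4-only, might be sketched as follows. This is a toy model with hypothetical names (DemoPage, DEMO_PAGE_CAPACITY, demo_evacuate_v3_page, demo_vacuum_page); tuples are plain integers and a page holds a fixed number of them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_PAGE_CAPACITY 4

/* Hypothetical page holding up to DEMO_PAGE_CAPACITY integer "tuples". */
typedef struct DemoPage
{
    int    layout_version;
    size_t ntuples;
    int    tuples[DEMO_PAGE_CAPACITY];
} DemoPage;

/*
 * Move every tuple off a V3 page onto V4 target pages, then reinitialize the
 * emptied page as V4. Returns false, changing nothing, if the targets lack room.
 */
static bool
demo_evacuate_v3_page(DemoPage *old, DemoPage *targets, size_t ntargets)
{
    size_t room = 0;
    size_t t = 0;

    assert(old->layout_version == 3);
    for (size_t i = 0; i < ntargets; i++)
    {
        assert(targets[i].layout_version == 4);
        room += DEMO_PAGE_CAPACITY - targets[i].ntuples;
    }
    if (room < old->ntuples)
        return false;

    for (size_t i = 0; i < old->ntuples; i++)
    {
        while (targets[t].ntuples == DEMO_PAGE_CAPACITY)
            t++;            /* safe: total room was checked above */
        targets[t].tuples[targets[t].ntuples++] = old->tuples[i];
    }

    old->layout_version = 4;
    old->ntuples = 0;
    return true;
}

/* Vacuum entry point: old pages take the special path and return at once. */
static void
demo_vacuum_page(DemoPage *page, DemoPage *targets, size_t ntargets)
{
    if (page->layout_version == 3)
    {
        (void) demo_evacuate_v3_page(page, targets, ntargets);
        return;
    }
    /* regular V4-only vacuum work would go here */
}
```

The point of the shape is that the V4 branch never inspects old-layout data: the only cost added to regular vacuum is the version test at the top.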

I'll read through the rest of this when I have a bit more time.

...Robert

#11Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Zdenek Kotala (#7)
Re: [WIP] In-place upgrade

Zdenek Kotala wrote:

Robert Haas wrote:

Really, what I'd ideally like to see here is a system where the V3
code is in essence error-recovery code. Everything should be V4-only
unless you detect a V3 page, and then you error out (if in-place
upgrade is not enabled) or jump to the appropriate V3-aware code (if
in-place upgrade is enabled). In theory, with a system like this, it
seems like the overhead for V4 ought to be no more than the cost of
checking the page version on each page read, which is a cheap sanity
check we'd be willing to pay for anyway, and trivial in cost.

OK. The original idea was to do "convert on read", which has several
problems with no easy solution. One is that the new data may not fit on the
page, and the second big problem is how to convert TOAST table data. Another,
more general problem is how to convert indexes...

We've talked about this many times before, so I'm sure you know what my
opinion is. Let me phrase it one more time:

1. You *will* need a function to convert a page from old format to new
format. We do want to get rid of the old format pages eventually,
whether it's during VACUUM, whenever a page is read in, or by using an
extra utility. And that process needs to be online. Please speak up now if
you disagree with that.

2. It follows from point 1, that you *will* need to solve the problems
with pages where the data doesn't fit on the page in new format, as well
as converting TOAST data.

We've discussed various solutions to those problems; it's not
insurmountable. For the "data doesn't fit anymore" problem, a fairly
simple solution is to run a pre-upgrade utility in the old version, that
reserves some free space on each page, to make sure everything fits
after converting to new format. For TOAST, you can retoast tuples when
the heap page is read in. I'm not sure what the problem with indexes is,
but you can split pages if necessary, for example.
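
The arithmetic behind such a pre-upgrade utility can be sketched in a few lines. The function name demo_bytes_to_free and its parameters are hypothetical, and the growth figures a caller passes in are assumptions to be measured for the actual pair of layouts, not values taken from this thread.

```c
#include <assert.h>
#include <stddef.h>

/*
 * How many bytes must be cleared from a page, in the old version, so that its
 * contents still fit after conversion? Returns 0 if the page already fits.
 */
static size_t
demo_bytes_to_free(size_t page_size, size_t used_bytes, size_t ntuples,
                   size_t header_growth, size_t per_tuple_growth)
{
    size_t needed;

    assert(used_bytes <= page_size);
    needed = used_bytes + header_growth + ntuples * per_tuple_growth;
    return needed > page_size ? needed - page_size : 0;
}
```

For example, with an 8192-byte page holding 8100 used bytes and 50 tuples, and an assumed growth of 4 bytes for the page header and 4 bytes per tuple, the converted contents need 8100 + 4 + 200 = 8304 bytes, so 112 bytes would have to be moved elsewhere before the upgrade.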

Assuming everyone agrees with point 1, could we focus on these issues?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#12Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#11)
Re: [WIP] In-place upgrade

We've talked about this many times before, so I'm sure you know what my
opinion is. Let me phrase it one more time:

1. You *will* need a function to convert a page from old format to new
format. We do want to get rid of the old format pages eventually, whether
it's during VACUUM, whenever a page is read in, or by using an extra
utility. And that process needs to be online. Please speak up now if you
disagree with that.

Well, I just proposed an approach that doesn't work this way, so I
guess I'll have to put myself in the disagree category, or anyway yet
to be convinced. As long as you can move individual tuples onto new
pages, you can eventually empty V3 pages and reinitialize them as new,
empty V4 pages. You can force that process along via, say, VACUUM,
but in the meantime you can still continue to read the old pages
without being forced to change them to the new format. That's not the
only possible approach, but it's not obvious to me that it's insane.
If you think it's a non-starter, it would be good to know why.

...Robert

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#12)
Re: [WIP] In-place upgrade

"Robert Haas" <robertmhaas@gmail.com> writes:

Well, I just proposed an approach that doesn't work this way, so I
guess I'll have to put myself in the disagree category, or anyway yet
to be convinced. As long as you can move individual tuples onto new
pages, you can eventually empty V3 pages and reinitialize them as new,
empty V4 pages. You can force that process along via, say, VACUUM,
but in the meantime you can still continue to read the old pages
without being forced to change them to the new format. That's not the
only possible approach, but it's not obvious to me that it's insane.
If you think it's a non-starter, it would be good to know why.

That's sane *if* you can guarantee that only negligible overhead is
added for accessing data that is in the up-to-date format. I don't
think that will be the case if we start putting version checks into
every tuple access macro.

regards, tom lane

#14Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#13)
Re: [WIP] In-place upgrade

That's sane *if* you can guarantee that only negligible overhead is
added for accessing data that is in the up-to-date format. I don't
think that will be the case if we start putting version checks into
every tuple access macro.

Yes, the point is that you'll read the page as V3 or V4, whichever it
is, but if it's V3, you'll convert the tuples to V4 format before you
try to do anything with them (for example by modifying
ExecStoreTuple to copy any V3 tuple into a palloc'd buffer, which fits
nicely into what that function already does).

...Robert

#15Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#12)
Re: [WIP] In-place upgrade

"Robert Haas" <robertmhaas@gmail.com> writes:

We've talked about this many times before, so I'm sure you know what my
opinion is. Let me phrase it one more time:

1. You *will* need a function to convert a page from old format to new
format. We do want to get rid of the old format pages eventually, whether
it's during VACUUM, whenever a page is read in, or by using an extra
utility. And that process needs to be online. Please speak up now if you
disagree with that.

Well, I just proposed an approach that doesn't work this way, so I
guess I'll have to put myself in the disagree category, or anyway yet
to be convinced. As long as you can move individual tuples onto new
pages, you can eventually empty V3 pages and reinitialize them as new,
empty V4 pages. You can force that process along via, say, VACUUM,

No, if you can force that process along via some command, whatever it is, then
you're still in the category he described.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!

#16Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#15)
Re: [WIP] In-place upgrade

Well, I just proposed an approach that doesn't work this way, so I
guess I'll have to put myself in the disagree category, or anyway yet
to be convinced. As long as you can move individual tuples onto new
pages, you can eventually empty V3 pages and reinitialize them as new,
empty V4 pages. You can force that process along via, say, VACUUM,

No, if you can force that process along via some command, whatever it is, then
you're still in the category he described.

Maybe. The difference is that I'm talking about converting tuples,
not pages, so "What happens when the data doesn't fit on the new
page?" is a meaningless question. Since that seemed to be Heikki's
main concern, I thought we must be talking about different things. My
thought was that the code path for converting a tuple would be very
similar to what heap_update does today, and large tuples would be
handled via TOAST just as they are now - by converting the relation
one tuple at a time, you might end up with a new relation that has
either more or fewer pages than the old relation, and it really
doesn't matter which.
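
A toy model of the tuple-at-a-time idea, again in Python and with everything invented for illustration: `PAGE_CAPACITY`, `migrate` and the first-fit placement are assumptions, not how heap_update picks a target page. Pages are just lists of tuple sizes, and `convert_size` stands in for whatever the format change does to a tuple's size.

```python
PAGE_CAPACITY = 100  # toy number of usable bytes per page (assumption)


def migrate(old_pages, convert_size):
    """Move every tuple from old pages onto new pages, one tuple at a time.

    old_pages    -- list of pages, each a list of old-format tuple sizes
    convert_size -- hypothetical function giving a tuple's new-format size
    Returns the new pages as lists of new-format tuple sizes.
    """
    new_pages = []
    for page in old_pages:
        for size in page:
            new_size = convert_size(size)
            if new_size > PAGE_CAPACITY:
                # The thread suggests TOAST for large tuples; not modelled here.
                raise ValueError("tuple too large for one page in this toy model")
            for target in new_pages:
                if sum(target) + new_size <= PAGE_CAPACITY:
                    target.append(new_size)  # first page with enough room
                    break
            else:
                new_pages.append([new_size])  # no room anywhere: add a page
    return new_pages
```

With two old pages of two 50-byte tuples each, `migrate(old, lambda s: s + 10)` needs four new pages and `migrate(old, lambda s: s - 25)` needs one. In both cases no single page ever has to absorb its own converted contents, which is why the "doesn't fit on the new page" question does not come up in this scheme.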

I haven't really thought through all of the other kinds of things that
might need to be converted, though. That's where it would be useful
for someone more experienced to weigh in on indexes, etc.

...Robert

#17Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#16)
Re: [WIP] In-place upgrade

"Robert Haas" <robertmhaas@gmail.com> writes:

Well, I just proposed an approach that doesn't work this way, so I
guess I'll have to put myself in the disagree category, or anyway yet
to be convinced. As long as you can move individual tuples onto new
pages, you can eventually empty V3 pages and reinitialize them as new,
empty V4 pages. You can force that process along via, say, VACUUM,

No, if you can force that process along via some command, whatever it is, then
you're still in the category he described.

Maybe. The difference is that I'm talking about converting tuples,
not pages, so "What happens when the data doesn't fit on the new
page?" is a meaningless question.

No it's not, because as you pointed out you still need a way for the user to
force it to happen sometime. Unless you're going to be happy with telling
users they need to update all their tuples which would not be an online
process.

In any case it sounds like you're saying you want to allow multiple versions
of tuples on the same page -- which a) would be much harder and b) doesn't
solve the problem since the page still has to be converted sometime anyways.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!

#18Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#17)
Re: [WIP] In-place upgrade

Maybe. The difference is that I'm talking about converting tuples,
not pages, so "What happens when the data doesn't fit on the new
page?" is a meaningless question.

No it's not, because as you pointed out you still need a way for the user to
force it to happen sometime. Unless you're going to be happy with telling
users they need to update all their tuples which would not be an online
process.

In any case it sounds like you're saying you want to allow multiple versions
of tuples on the same page -- which a) would be much harder and b) doesn't
solve the problem since the page still has to be converted sometime anyways.

No, that's not what I'm suggesting. My thought was that any V3 page
would be treated as if it were completely full, with the exception of
a completely empty page which can be reinitialized as a V4 page. So
you would never add any tuples to a V3 page, but you would need to
update xmax, hint bits, etc. Eventually when all the tuples were dead
you could reuse the page.
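
As a rough illustration of that rule, here is a toy Python sketch. `Page`, `PAGE_CAPACITY` and `space_for_insert` are made-up names, and in-place updates of xmax and hint bits are not modelled. A V3 page that still holds tuples reports no room for inserts, an empty V3 page is reinitialized as V4 on the spot, and V4 pages behave normally.

```python
from dataclasses import dataclass, field

PAGE_CAPACITY = 100  # toy number of usable bytes per page (assumption)


@dataclass
class Page:
    layout: int                                 # 3 or 4
    tuples: list = field(default_factory=list)  # sizes of tuples still on the page


def space_for_insert(page: Page) -> int:
    """Return how many bytes an insert may use on this page."""
    if page.layout == 3:
        if page.tuples:
            return 0     # old-format page with contents: treat as full
        page.layout = 4  # completely empty: reinitialize as a V4 page
    return PAGE_CAPACITY - sum(page.tuples)
```

Under this rule a V3 page only ever loses tuples, so it is converted exactly when its last tuple has been removed; nothing in the sketch forces that to happen.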

...Robert

#19Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#18)
Re: [WIP] In-place upgrade

"Robert Haas" <robertmhaas@gmail.com> writes:

Maybe. The difference is that I'm talking about converting tuples,
not pages, so "What happens when the data doesn't fit on the new
page?" is a meaningless question.

No it's not, because as you pointed out you still need a way for the user to
force it to happen sometime. Unless you're going to be happy with telling
users they need to update all their tuples which would not be an online
process.

In any case it sounds like you're saying you want to allow multiple versions
of tuples on the same page -- which a) would be much harder and b) doesn't
solve the problem since the page still has to be converted sometime anyways.

No, that's not what I'm suggesting. My thought was that any V3 page
would be treated as if it were completely full, with the exception of
a completely empty page which can be reinitialized as a V4 page. So
you would never add any tuples to a V3 page, but you would need to
update xmax, hint bits, etc. Eventually when all the tuples were dead
you could reuse the page.

But there's no guarantee that will ever happen. Heikki claimed you would need
a mechanism to convert the page some day and you said you proposed a system
where that wasn't true.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!

#20Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#19)
Re: [WIP] In-place upgrade

No, that's not what I'm suggesting. My thought was that any V3 page
would be treated as if it were completely full, with the exception of
a completely empty page which can be reinitialized as a V4 page. So
you would never add any tuples to a V3 page, but you would need to
update xmax, hint bits, etc. Eventually when all the tuples were dead
you could reuse the page.

But there's no guarantee that will ever happen. Heikki claimed you would need
a mechanism to convert the page some day and you said you proposed a system
where that wasn't true.

What's the scenario you're concerned about? An old snapshot that
never goes away?

Can we lock the old and new pages, move the tuple to a V4 page, and
update index entries without changing xmin/xmax?

...Robert

#21Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#20)
#22Joshua D. Drake
jd@commandprompt.com
In reply to: Bruce Momjian (#21)
#23Bruce Momjian
bruce@momjian.us
In reply to: Joshua D. Drake (#22)
#24Joshua D. Drake
jd@commandprompt.com
In reply to: Bruce Momjian (#23)
#25Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#21)
#26Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Heikki Linnakangas (#11)
#27Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Tom Lane (#5)
#28Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#25)
#29Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Bruce Momjian (#28)
#30Martijn van Oosterhout
kleptog@svana.org
In reply to: Zdenek Kotala (#29)
#31Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Martijn van Oosterhout (#30)
#32Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zdenek Kotala (#31)
#33Robert Haas
robertmhaas@gmail.com
In reply to: Zdenek Kotala (#31)
#34Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#33)
#35Martijn van Oosterhout
kleptog@svana.org
In reply to: Bruce Momjian (#34)
#36Bruce Momjian
bruce@momjian.us
In reply to: Martijn van Oosterhout (#35)
#37Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#36)
#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#37)
#39Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Tom Lane (#38)
#40Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#38)
#41Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#40)
#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#41)
#43Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#42)
#44Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#42)
#45Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#44)
#46Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#42)
#47Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#45)
#48Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#47)
#49Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#46)
#50Greg Smith
gsmith@gregsmith.com
In reply to: Tom Lane (#49)
#51Robert Haas
robertmhaas@gmail.com
In reply to: Greg Smith (#50)
#52Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Smith (#50)
#53Greg Smith
gsmith@gregsmith.com
In reply to: Tom Lane (#52)
#54Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Heikki Linnakangas (#46)
#55Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Tom Lane (#49)
#56Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Tom Lane (#42)
#57Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zdenek Kotala (#56)
#58Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Bruce Momjian (#45)
#59Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Nasby (#58)
#60Joshua D. Drake
jd@commandprompt.com
In reply to: Tom Lane (#59)
#61Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Jim Nasby (#58)
#62Matthew T. O'Connor
matthew@zeut.net
In reply to: Tom Lane (#59)
#63Joshua D. Drake
jd@commandprompt.com
In reply to: Matthew T. O'Connor (#62)
#64Jeff
threshar@torgo.978.org
In reply to: Joshua D. Drake (#60)
#65Robert Haas
robertmhaas@gmail.com
In reply to: Jeff (#64)
#66Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Robert Haas (#65)
#67Robert Haas
robertmhaas@gmail.com
In reply to: Zdenek Kotala (#66)
#68Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#65)
#69Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Alvaro Herrera (#68)
#70Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Robert Haas (#67)