switch UNLOGGED to LOGGED
Hi,
I read the discussion at
http://archives.postgresql.org/pgsql-hackers/2011-01/msg00315.php
From what I can understand, going from/to unlogged to/from logged in
the wal_level == minimal case is not too complicated.
Suppose I try to write a patch that allows
ALTER TABLE tablename SET LOGGED (or UNLOGGED)
(proper sql wording to be discussed...)
only in the wal_level == minimal case: would it be accepted as a
"first step"? Or rejected because it doesn't allow it in the other
cases?
From what I got in the discussion, handling the other wal_level cases
can be very complicated (example: the issues in case "we *crash*
without writing an abort record").
Leonardo
On Fri, Apr 8, 2011 at 6:01 AM, Leonardo Francalanci <m_lists@yahoo.it> wrote:
I read the discussion at
http://archives.postgresql.org/pgsql-hackers/2011-01/msg00315.php
From what I can understand, going from/to unlogged to/from logged in
the wal_level == minimal case is not too complicated.Suppose I try to write a patch that allows
ALTER TABLE tablename SET LOGGED (or UNLOGGED)
(proper sql wording to be discussed...)only in the wal_level == minimal case: would it be accepted as a
"first step"? Or rejected because it doesn't allow it in the other
cases?
I'm pretty sure we wouldn't accept a patch for a feature that would
only work with wal_level=minimal, but it might be a useful starting
point for someone else to keep hacking on.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I'm pretty sure we wouldn't accept a patch for a feature that would
only work with wal_level=minimal, but it might be a useful starting
point for someone else to keep hacking on.
I understand.
Reading your post at
http://archives.postgresql.org/pgsql-hackers/2011-01/msg00315.php
I thought I got the part:
"what happens if we *crash* without writing an abort record? It
seems like that could leave a stray file around on a standby,
because the current code only cleans things up on the standby
at the start of recovery"
But re-reading it, I don't understand: what's the difference in creating
a new "regular" table and crashing before emitting the abort record,
and converting an unlogged table to logged and crashing before
emitting the abort record? How do the standby servers handle a
"CREATE TABLE" followed by a ROLLBACK if the master crashes
before writing the abort record? I thought that too would "leave a
stray file around on a standby".
Leonardo
On Sat, Apr 9, 2011 at 3:29 AM, Leonardo Francalanci <m_lists@yahoo.it> wrote:
I'm pretty sure we wouldn't accept a patch for a feature that would
only work with wal_level=minimal, but it might be a useful starting
point for someone else to keep hacking on.I understand.
Reading your post at
http://archives.postgresql.org/pgsql-hackers/2011-01/msg00315.php
I thought I got the part:"what happens if we *crash* without writing an abort record? It
seems like that could leave a stray file around on a standby,
because the current code only cleans things up on the standby
at the start of recovery"But re-reading it, I don't understand: what's the difference in creating
a new "regular" table and crashing before emitting the abort record,
and converting an unlogged table to logged and crashing before
emitting the abort record? How do the standby servers handle a
"CREATE TABLE" followed by a ROLLBACK if the master crashes
before writing the abort record? I thought that too would "leave a
stray file around on a standby".
I've been thinking about the same thing. And AFAICS, your analysis is
correct, though there may be some angle to it I'm not seeing.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
But re-reading it, I don't understand: what's the difference in creating
a new "regular" table and crashing before emitting the abort record,
and converting an unlogged table to logged and crashing before
emitting the abort record? How do the standby servers handle a
"CREATE TABLE" followed by a ROLLBACK if the master crashes
before writing the abort record? I thought that too would "leave a
stray file around on a standby".I've been thinking about the same thing. And AFAICS, your analysis is
correct, though there may be some angle to it I'm not seeing.
Anyone else? I would like to know if what I'm trying to do is, in fact,
possible... otherwise starting with thewal_level=minimal case first
will be wasted effort in case the other cases can't be integrated
somehow...
Leonardo
On Mon, Apr 11, 2011 at 11:41:18AM +0100, Leonardo Francalanci wrote:
But re-reading it, I don't understand: what's the difference in creating
a new "regular" table and crashing before emitting the abort record,
and converting an unlogged table to logged and crashing before
emitting the abort record? How do the standby servers handle a
"CREATE TABLE" followed by a ROLLBACK if the master crashes
before writing the abort record? I thought that too would "leave a
stray file around on a standby".I've been thinking about the same thing. And AFAICS, your analysis is
correct, though there may be some angle to it I'm not seeing.Anyone else? I would like to know if what I'm trying to do is, in fact,
possible... otherwise starting with thewal_level=minimal case first
will be wasted effort in case the other cases can't be integrated
somehow...
If the master crashes while a transaction that used CREATE TABLE is unfinished,
both the master and the standby will indefinitely retain identical, stray (not
referenced by pg_class) files. The catalogs do reference the relfilenode of
each unlogged relation; currently, that relfilenode never exists on a standby
while that standby is accepting connections. By the time the startup process
releases the AccessExclusiveLock acquired by the proposed UNLOGGED -> normal
conversion process, that relfilenode needs to be either fully copied or unlinked
all over again. (Alternately, find some other way to make sure queries don't
read the half-copied file.) In effect, the problem is that the relfilenode is
*not* stray, so its final state does need to be well-defined.
nm
On Mon, Apr 11, 2011 at 10:29 AM, Noah Misch <noah@leadboat.com> wrote:
On Mon, Apr 11, 2011 at 11:41:18AM +0100, Leonardo Francalanci wrote:
But re-reading it, I don't understand: what's the difference in creating
a new "regular" table and crashing before emitting the abort record,
and converting an unlogged table to logged and crashing before
emitting the abort record? How do the standby servers handle a
"CREATE TABLE" followed by a ROLLBACK if the master crashes
before writing the abort record? I thought that too would "leave a
stray file around on a standby".I've been thinking about the same thing. And AFAICS, your analysis is
correct, though there may be some angle to it I'm not seeing.Anyone else? I would like to know if what I'm trying to do is, in fact,
possible... otherwise starting with thewal_level=minimal case first
will be wasted effort in case the other cases can't be integrated
somehow...If the master crashes while a transaction that used CREATE TABLE is unfinished,
both the master and the standby will indefinitely retain identical, stray (not
referenced by pg_class) files. The catalogs do reference the relfilenode of
each unlogged relation; currently, that relfilenode never exists on a standby
while that standby is accepting connections. By the time the startup process
releases the AccessExclusiveLock acquired by the proposed UNLOGGED -> normal
conversion process, that relfilenode needs to be either fully copied or unlinked
all over again. (Alternately, find some other way to make sure queries don't
read the half-copied file.) In effect, the problem is that the relfilenode is
*not* stray, so its final state does need to be well-defined.
Oh, right.
Maybe we should just put in a rule that a server in Hot Standby mode
won't ever try to read from an unlogged table (right now we count on
the fact that there will be nothing to read). If we crash before
copying the whole file, it won't matter, because the catalogs won't
have been updated, so we'll refuse to look at it anyway. And we have
to reinitialize on entering normal running anyway, so we can clean it
up then.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
If the master crashes while a transaction that used CREATE TABLE is
unfinished,
both the master and the standby will indefinitely retain identical, stray
(not
referenced by pg_class) files. The catalogs do reference the relfilenode
of
each unlogged relation; currently, that relfilenode never exists on a
standby
while that standby is accepting connections. By the time the startup
process
releases the AccessExclusiveLock acquired by the proposed UNLOGGED ->
normal
conversion process, that relfilenode needs to be either fully copied or
unlinked
all over again. (Alternately, find some other way to make sure queries
don't
read the half-copied file.) In effect, the problem is that the relfilenode
is
*not* stray, so its final state does need to be well-defined.
Oh, right.
Maybe we should just put in a rule that a server in Hot Standby mode
won't ever try to read from an unlogged table (right now we count on
the fact that there will be nothing to read). If we crash before
copying the whole file, it won't matter, because the catalogs won't
have been updated, so we'll refuse to look at it anyway. And we have
to reinitialize on entering normal running anyway, so we can clean it
up then.
Ok then... I'll try to code the "easy" version first (the wal_level=minimal
case)
and then we'll see....
Leonardo
I think I coded a very basic version of the UNLOGGED to LOGGED patch
(only wal_level=minimal case for the moment).
To remove the INIT fork, I changed somehow PendingRelDelete to have
a flag "bool onlyInitFork" so that the delete would remove only the INIT
fork at commit.
Everything "works" (note the quotes...) except in the case of not-clean
shutdown ("-m immediate" to pg_ctl stop). The reason is that the replay
code doesn't have any idea that it has to remove only the INIT fork: it will
remove ALL forks; so at the end of the redo procedure (at startup, after
a "pg_ctl -m immediate stop") the table doesn't have any forks at all.
See xact_redo_commit:
/* Make sure files supposed to be dropped are dropped */
for (i = 0; i < xlrec->nrels; i++)
{
[...]
for (fork = 0; fork <= MAX_FORKNUM; fork++)
{
if (smgrexists(srel, fork))
{
XLogDropRelation(xlrec->xnodes[i], fork);
smgrdounlink(srel, fork, true);
}
}
smgrclose(srel);
}
[...]
Should I change xl_xact_commit in order to have, instead of:
/* Array of RelFileNode(s) to drop at commit */
RelFileNode xnodes[1]; /* VARIABLE LENGTH ARRAY */
an array of structures such as:
{
RelFileNode relfilenode;
bool onlyInitFork;
}
???
Otherwise I don't know how to tell the redo commit code to delete only
the INIT fork, instead of all the relation forks...
(maybe I'm doing all wrong: I'm open to any kind of suggestion here...)
Leonardo
Excerpts from Leonardo Francalanci's message of lun abr 18 09:36:13 -0300 2011:
I think I coded a very basic version of the UNLOGGED to LOGGED patch
(only wal_level=minimal case for the moment).To remove the INIT fork, I changed somehow PendingRelDelete to have
a flag "bool onlyInitFork" so that the delete would remove only the INIT
fork at commit.Everything "works" (note the quotes...) except in the case of not-clean
shutdown ("-m immediate" to pg_ctl stop). The reason is that the replay
code doesn't have any idea that it has to remove only the INIT fork: it will
remove ALL forks; so at the end of the redo procedure (at startup, after
a "pg_ctl -m immediate stop") the table doesn't have any forks at all.
Maybe you should change xl_act_commit to have a separate list of rels to
drop the init fork for (instead of mixing those with the list of files to
drop as a whole).
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Maybe you should change xl_act_commit to have a separate list of rels to
drop the init fork for (instead of mixing those with the list of files to
drop as a whole).
I tried to follow your suggestion, thank you very much.
Here's a first attempt at the patch.
I "tested" it with:
create table forei (v integer primary key);
insert into forei select * from generate_series(1,10000);
create unlogged table pun (c integer primary key, constraint
con foreign key (c) references forei(v));
insert into pun select * from generate_series(1,10000);
alter table pun set logged;
then shutdown the master with "immediate":
bin/pg_ctl -D data -m immediate stop
bin/pg_ctl -D data start
and "pun" still has data:
select * from pun where c=100;
Question/comments:
1) it's a very-first-stage patch; I would need to know if something is
*very* wrong before cleaning it.
2) there are some things I implemented using a logic like "let's see how it
worked 10 lines above, and I'll do the same". For example, the 2PC stuff
is totally "copied" from the other places, I have no idea if the code makes
sense at all (how can I test it?)
3) Should we have a "cascade" option? I don't know if I have to handle
inherited tables and other dependent objects
4) During the check for dependencies problems, I stop as soon as I find an
error; would it be enough?
Leonardo
Attachments:
unl2log_20110422.patchapplication/octet-stream; name=unl2log_20110422.patchDownload+386-30
On Fri, Apr 22, 2011 at 4:13 AM, Leonardo Francalanci <m_lists@yahoo.it> wrote:
Maybe you should change xl_act_commit to have a separate list of rels to
drop the init fork for (instead of mixing those with the list of files to
drop as a whole).I tried to follow your suggestion, thank you very much.
I have to admit I don't like this approach very much. I can't see
adding 4 bytes to every commit record for this feature.
3) Should we have a "cascade" option? I don't know if I have to handle
inherited tables and other dependent objects
Look at the way ALTER TABLE [ONLY] works for other action types, and copy it.
4) During the check for dependencies problems, I stop as soon as I find an
error; would it be enough?
It's a bit awkwardly phrased the way you have it. I would suggest
something like:
ERROR: constraints on permanent tables may reference only permanent tables
HINT: constraint %s
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, May 6, 2011 at 10:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
ERROR: constraints on permanent tables may reference only permanent tables
HINT: constraint %s
Argh, hit send too soon.
HINT: constraint %s references table %s
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Fri, May 6, 2011 at 10:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
ERROR: �constraints on permanent tables may reference only permanent tables
HINT: �constraint %s
Argh, hit send too soon.
HINT: constraint %s references table %s
<nitpick>
That looks like a DETAIL, not a HINT. Also see message style guide
about how that should be a complete sentence.
</nitpick>
regards, tom lane
On Fri, Apr 22, 2011 at 4:13 AM, Leonardo Francalanci <m_lists@yahoo.it>
wrote:
Maybe you should change xl_act_commit to have a separate list of rels to
drop the init fork for (instead of mixing those with the list of files
to
drop as a whole).
I tried to follow your suggestion, thank you very much.
I have to admit I don't like this approach very much. I can't see
adding 4 bytes to every commit record for this feature.
I understand.
What if, in xl_xact_commit, instead of
RelFileNode xnodes
I use another struct for xnodes, something like:
{
RelFileNode xnode;
bool onlyInitFork;
}
That would increase the commit record size only when there are
RelFileNode(s) to drop at commit. So, instead of 4 bytes in
every commit, there are "wasted" bytes when the commit record
contains deleted permanent relations (that should happen much
less). I'm open to suggestions here...
3) Should we have a "cascade" option? I don't know if I have to handle
inherited tables and other dependent objectsLook at the way ALTER TABLE [ONLY] works for other action types, and copy it.
Ok
Thank you very much
Leonardo
Excerpts from Robert Haas's message of vie may 06 23:25:09 -0300 2011:
On Fri, Apr 22, 2011 at 4:13 AM, Leonardo Francalanci <m_lists@yahoo.it> wrote:
Maybe you should change xl_act_commit to have a separate list of rels to
drop the init fork for (instead of mixing those with the list of files to
drop as a whole).I tried to follow your suggestion, thank you very much.
I have to admit I don't like this approach very much. I can't see
adding 4 bytes to every commit record for this feature.
Hmm, yeah. Maybe we can add a "flags" int8 somewhere in that struct and
set a bit in it if nrels, nsubxacts, nmsgs and respective arrays are present.
That would save some int's that are already in there.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Mon, May 9, 2011 at 12:51 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Excerpts from Robert Haas's message of vie may 06 23:25:09 -0300 2011:
On Fri, Apr 22, 2011 at 4:13 AM, Leonardo Francalanci <m_lists@yahoo.it> wrote:
Maybe you should change xl_act_commit to have a separate list of rels to
drop the init fork for (instead of mixing those with the list of files to
drop as a whole).I tried to follow your suggestion, thank you very much.
I have to admit I don't like this approach very much. I can't see
adding 4 bytes to every commit record for this feature.Hmm, yeah. Maybe we can add a "flags" int8 somewhere in that struct and
set a bit in it if nrels, nsubxacts, nmsgs and respective arrays are present.
That would save some int's that are already in there.
Yes, that seems like a very appealing approach. There is plenty of
bit-space available in xinfo, and we could reserve a bit each for
nrels, nsubxacts, and nmsgs, with set meaning that an integer count of
that item is present and clear meaning that the count is omitted from
the structure (and zero). This will probably require a bit of tricky
code reorganization so I think it should be done separately from the
main patch. With that done, then it's not a big deal for the main
patch to add in one more array that will normally get omitted. And in
the process, we can save 12 bytes on every commit record in the common
case, which is quite appealing: I don't expect a huge performance
gain, but a penny saved is a penny earned.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Yes, that seems like a very appealing approach. There is plenty of
bit-space available in xinfo, and we could reserve a bit each for
nrels, nsubxacts, and nmsgs, with set meaning that an integer count of
that item is present and clear meaning that the count is omitted from
the structure (and zero). This will probably require a bit of tricky
code reorganization so I think it should be done separately from the
main patch.
Ok, I'll try and send a patch with this change only.
BTW xinfo is 32 bit long, but I think only 2 bits are used right now?
I think I can make it a 8 bits, and add another 8 bits for nrels,
nsubxacts, and nmsgs and the new thing. That should save
another 2 bytes, while leaving space for extention. Or we can make
it a 8 bits only, but only 2 bits would be left "empty" for future
extentions; I don't know if we care about it...
Leonardo
On Tue, May 10, 2011 at 3:35 AM, Leonardo Francalanci <m_lists@yahoo.it> wrote:
Yes, that seems like a very appealing approach. There is plenty of
bit-space available in xinfo, and we could reserve a bit each for
nrels, nsubxacts, and nmsgs, with set meaning that an integer count of
that item is present and clear meaning that the count is omitted from
the structure (and zero). This will probably require a bit of tricky
code reorganization so I think it should be done separately from the
main patch.Ok, I'll try and send a patch with this change only.
BTW xinfo is 32 bit long, but I think only 2 bits are used right now?
I think I can make it a 8 bits, and add another 8 bits for nrels,
nsubxacts, and nmsgs and the new thing. That should save
another 2 bytes, while leaving space for extention. Or we can make
it a 8 bits only, but only 2 bits would be left "empty" for future
extentions; I don't know if we care about it...
I don't think making xinfo shorter will save anything, because
whatever follows it is going to be a 4-byte quantity and therefore
4-byte aligned.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I don't think making xinfo shorter will save anything, because
whatever follows it is going to be a 4-byte quantity and therefore
4-byte aligned.
ups, didn't notice it.
I'll split xinfo into:
uint16 xinfo;
uint16 presentFlags;
I guess it helps with the reading? I mean, instead
of having a single uint32?