switch UNLOGGED to LOGGED

Started by Leonardo Francalancialmost 15 years ago44 messages
#1Leonardo Francalanci
m_lists@yahoo.it

Hi,

I read the discussion at

http://archives.postgresql.org/pgsql-hackers/2011-01/msg00315.php

From what I can understand, going from/to unlogged to/from logged in
the wal_level == minimal case is not too complicated.

Suppose I try to write a patch that allows

ALTER TABLE tablename SET LOGGED (or UNLOGGED)
(proper sql wording to be discussed...)

only in the wal_level == minimal case: would it be accepted as a
"first step"? Or rejected because it doesn't allow it in the other
cases?

From what I got in the discussion, handling the other wal_level cases
can be very complicated (example: the issues in case "we *crash*
without writing an abort record").

Leonardo

#2Robert Haas
robertmhaas@gmail.com
In reply to: Leonardo Francalanci (#1)
Re: switch UNLOGGED to LOGGED

On Fri, Apr 8, 2011 at 6:01 AM, Leonardo Francalanci <m_lists@yahoo.it> wrote:

I read the discussion at

http://archives.postgresql.org/pgsql-hackers/2011-01/msg00315.php

From what I can understand, going from/to unlogged to/from logged in
the wal_level == minimal case is not too complicated.

Suppose I try to write a patch that allows

ALTER TABLE tablename SET LOGGED (or UNLOGGED)
(proper sql wording to be discussed...)

only in the wal_level == minimal case: would it be accepted as a
"first step"? Or rejected because it doesn't allow it in the other
cases?

I'm pretty sure we wouldn't accept a patch for a feature that would
only work with wal_level=minimal, but it might be a useful starting
point for someone else to keep hacking on.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3Leonardo Francalanci
m_lists@yahoo.it
In reply to: Robert Haas (#2)
Re: switch UNLOGGED to LOGGED

I'm pretty sure we wouldn't accept a patch for a feature that would
only work with wal_level=minimal, but it might be a useful starting
point for someone else to keep hacking on.

I understand.

Reading your post at
http://archives.postgresql.org/pgsql-hackers/2011-01/msg00315.php
I thought I got the part:

"what happens if we *crash* without writing an abort record? It
seems like that could leave a stray file around on a standby,
because the current code only cleans things up on the standby
at the start of recovery"

But re-reading it, I don't understand: what's the difference in creating
a new "regular" table and crashing before emitting the abort record,
and converting an unlogged table to logged and crashing before
emitting the abort record? How do the standby servers handle a
"CREATE TABLE" followed by a ROLLBACK if the master crashes
before writing the abort record? I thought that too would "leave a
stray file around on a standby".

Leonardo

#4Robert Haas
robertmhaas@gmail.com
In reply to: Leonardo Francalanci (#3)
Re: switch UNLOGGED to LOGGED

On Sat, Apr 9, 2011 at 3:29 AM, Leonardo Francalanci <m_lists@yahoo.it> wrote:

I'm pretty sure we wouldn't accept a patch for a  feature that would
only work with wal_level=minimal, but it might be a useful  starting
point for someone else to keep hacking on.

I understand.

Reading your post at
http://archives.postgresql.org/pgsql-hackers/2011-01/msg00315.php
I thought I got the part:

"what happens if we *crash* without writing an abort record?  It
seems  like that could leave a stray file around on a standby,
because the  current code only cleans things up on the standby
at the start of  recovery"

But re-reading it, I don't understand: what's the difference in creating
a new "regular" table and crashing before emitting the abort record,
and converting an unlogged table to logged and crashing before
emitting the abort record? How do the standby servers handle a
"CREATE TABLE" followed by a ROLLBACK if the master crashes
before writing the abort record? I thought that too would "leave a
stray file around on a standby".

I've been thinking about the same thing. And AFAICS, your analysis is
correct, though there may be some angle to it I'm not seeing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5Leonardo Francalanci
m_lists@yahoo.it
In reply to: Robert Haas (#4)
Re: switch UNLOGGED to LOGGED

But re-reading it, I don't understand: what's the difference in creating
a new "regular" table and crashing before emitting the abort record,
and converting an unlogged table to logged and crashing before
emitting the abort record? How do the standby servers handle a
"CREATE TABLE" followed by a ROLLBACK if the master crashes
before writing the abort record? I thought that too would "leave a
stray file around on a standby".

I've been thinking about the same thing. And AFAICS, your analysis is
correct, though there may be some angle to it I'm not seeing.

Anyone else? I would like to know if what I'm trying to do is, in fact,
possible... otherwise starting with thewal_level=minimal case first
will be wasted effort in case the other cases can't be integrated
somehow...

Leonardo

#6Noah Misch
noah@leadboat.com
In reply to: Leonardo Francalanci (#5)
Re: switch UNLOGGED to LOGGED

On Mon, Apr 11, 2011 at 11:41:18AM +0100, Leonardo Francalanci wrote:

But re-reading it, I don't understand: what's the difference in creating
a new "regular" table and crashing before emitting the abort record,
and converting an unlogged table to logged and crashing before
emitting the abort record? How do the standby servers handle a
"CREATE TABLE" followed by a ROLLBACK if the master crashes
before writing the abort record? I thought that too would "leave a
stray file around on a standby".

I've been thinking about the same thing. And AFAICS, your analysis is
correct, though there may be some angle to it I'm not seeing.

Anyone else? I would like to know if what I'm trying to do is, in fact,
possible... otherwise starting with thewal_level=minimal case first
will be wasted effort in case the other cases can't be integrated
somehow...

If the master crashes while a transaction that used CREATE TABLE is unfinished,
both the master and the standby will indefinitely retain identical, stray (not
referenced by pg_class) files. The catalogs do reference the relfilenode of
each unlogged relation; currently, that relfilenode never exists on a standby
while that standby is accepting connections. By the time the startup process
releases the AccessExclusiveLock acquired by the proposed UNLOGGED -> normal
conversion process, that relfilenode needs to be either fully copied or unlinked
all over again. (Alternately, find some other way to make sure queries don't
read the half-copied file.) In effect, the problem is that the relfilenode is
*not* stray, so its final state does need to be well-defined.

nm

#7Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#6)
Re: switch UNLOGGED to LOGGED

On Mon, Apr 11, 2011 at 10:29 AM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Apr 11, 2011 at 11:41:18AM +0100, Leonardo Francalanci wrote:

But re-reading  it, I don't understand: what's the difference in creating
a new  "regular" table and crashing before emitting the abort record,
and  converting an unlogged table to logged and crashing before
emitting the  abort record? How do the standby servers handle a
"CREATE TABLE"  followed by a ROLLBACK if the master crashes
before writing the abort  record? I thought that too would "leave a
stray file around on a  standby".

I've been thinking about the same thing.  And AFAICS, your  analysis is
correct, though there may be some angle to it I'm not  seeing.

Anyone else? I would like to know if what I'm trying to do is, in fact,
possible... otherwise starting with thewal_level=minimal case first
will be wasted effort in case the other cases can't be integrated
somehow...

If the master crashes while a transaction that used CREATE TABLE is unfinished,
both the master and the standby will indefinitely retain identical, stray (not
referenced by pg_class) files.  The catalogs do reference the relfilenode of
each unlogged relation; currently, that relfilenode never exists on a standby
while that standby is accepting connections.  By the time the startup process
releases the AccessExclusiveLock acquired by the proposed UNLOGGED -> normal
conversion process, that relfilenode needs to be either fully copied or unlinked
all over again.  (Alternately, find some other way to make sure queries don't
read the half-copied file.)  In effect, the problem is that the relfilenode is
*not* stray, so its final state does need to be well-defined.

Oh, right.

Maybe we should just put in a rule that a server in Hot Standby mode
won't ever try to read from an unlogged table (right now we count on
the fact that there will be nothing to read). If we crash before
copying the whole file, it won't matter, because the catalogs won't
have been updated, so we'll refuse to look at it anyway. And we have
to reinitialize on entering normal running anyway, so we can clean it
up then.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8Leonardo Francalanci
m_lists@yahoo.it
In reply to: Robert Haas (#7)
Re: switch UNLOGGED to LOGGED

If the master crashes while a transaction that used CREATE TABLE is

unfinished,

both the master and the standby will indefinitely retain identical, stray

(not

referenced by pg_class) files. The catalogs do reference the relfilenode

of

each unlogged relation; currently, that relfilenode never exists on a

standby

while that standby is accepting connections. By the time the startup

process

releases the AccessExclusiveLock acquired by the proposed UNLOGGED ->

normal

conversion process, that relfilenode needs to be either fully copied or

unlinked

all over again. (Alternately, find some other way to make sure queries

don't

read the half-copied file.) In effect, the problem is that the relfilenode

is

*not* stray, so its final state does need to be well-defined.

Oh, right.

Maybe we should just put in a rule that a server in Hot Standby mode
won't ever try to read from an unlogged table (right now we count on
the fact that there will be nothing to read). If we crash before
copying the whole file, it won't matter, because the catalogs won't
have been updated, so we'll refuse to look at it anyway. And we have
to reinitialize on entering normal running anyway, so we can clean it
up then.

Ok then... I'll try to code the "easy" version first (the wal_level=minimal
case)
and then we'll see....

Leonardo

#9Leonardo Francalanci
m_lists@yahoo.it
In reply to: Leonardo Francalanci (#8)
Re: switch UNLOGGED to LOGGED

I think I coded a very basic version of the UNLOGGED to LOGGED patch
(only wal_level=minimal case for the moment).

To remove the INIT fork, I changed somehow PendingRelDelete to have
a flag "bool onlyInitFork" so that the delete would remove only the INIT
fork at commit.

Everything "works" (note the quotes...) except in the case of not-clean
shutdown ("-m immediate" to pg_ctl stop). The reason is that the replay
code doesn't have any idea that it has to remove only the INIT fork: it will
remove ALL forks; so at the end of the redo procedure (at startup, after
a "pg_ctl -m immediate stop") the table doesn't have any forks at all.

See xact_redo_commit:

/* Make sure files supposed to be dropped are dropped */
for (i = 0; i < xlrec->nrels; i++)
{
[...]
for (fork = 0; fork <= MAX_FORKNUM; fork++)
{
if (smgrexists(srel, fork))
{
XLogDropRelation(xlrec->xnodes[i], fork);
smgrdounlink(srel, fork, true);
}
}
smgrclose(srel);
}
[...]

Should I change xl_xact_commit in order to have, instead of:

/* Array of RelFileNode(s) to drop at commit */
RelFileNode xnodes[1]; /* VARIABLE LENGTH ARRAY */

an array of structures such as:

{
RelFileNode relfilenode;
bool onlyInitFork;
}

???

Otherwise I don't know how to tell the redo commit code to delete only
the INIT fork, instead of all the relation forks...
(maybe I'm doing all wrong: I'm open to any kind of suggestion here...)

Leonardo

#10Alvaro Herrera
alvherre@commandprompt.com
In reply to: Leonardo Francalanci (#9)
Re: switch UNLOGGED to LOGGED

Excerpts from Leonardo Francalanci's message of lun abr 18 09:36:13 -0300 2011:

I think I coded a very basic version of the UNLOGGED to LOGGED patch
(only wal_level=minimal case for the moment).

To remove the INIT fork, I changed somehow PendingRelDelete to have
a flag "bool onlyInitFork" so that the delete would remove only the INIT
fork at commit.

Everything "works" (note the quotes...) except in the case of not-clean
shutdown ("-m immediate" to pg_ctl stop). The reason is that the replay
code doesn't have any idea that it has to remove only the INIT fork: it will
remove ALL forks; so at the end of the redo procedure (at startup, after
a "pg_ctl -m immediate stop") the table doesn't have any forks at all.

Maybe you should change xl_act_commit to have a separate list of rels to
drop the init fork for (instead of mixing those with the list of files to
drop as a whole).

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#11Leonardo Francalanci
m_lists@yahoo.it
In reply to: Alvaro Herrera (#10)
1 attachment(s)
Re: switch UNLOGGED to LOGGED

Maybe you should change xl_act_commit to have a separate list of rels to
drop the init fork for (instead of mixing those with the list of files to
drop as a whole).

I tried to follow your suggestion, thank you very much.

Here's a first attempt at the patch.

I "tested" it with:

create table forei (v integer primary key);
insert into forei select * from generate_series(1,10000);
create unlogged table pun (c integer primary key, constraint
con foreign key (c) references forei(v));
insert into pun select * from generate_series(1,10000);
alter table pun set logged;

then shutdown the master with "immediate":

bin/pg_ctl -D data -m immediate stop
bin/pg_ctl -D data start

and "pun" still has data:

select * from pun where c=100;

Question/comments:

1) it's a very-first-stage patch; I would need to know if something is
*very* wrong before cleaning it.
2) there are some things I implemented using a logic like "let's see how it
worked 10 lines above, and I'll do the same". For example, the 2PC stuff
is totally "copied" from the other places, I have no idea if the code makes
sense at all (how can I test it?)
3) Should we have a "cascade" option? I don't know if I have to handle
inherited tables and other dependent objects
4) During the check for dependencies problems, I stop as soon as I find an
error; would it be enough?

Leonardo

Attachments:

unl2log_20110422.patchapplication/octet-stream; name=unl2log_20110422.patchDownload
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
new file mode 100644
index 2812681..d285cf2
*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
*************** static void RecordTransactionCommitPrepa
*** 151,156 ****
--- 151,158 ----
  								RelFileNode *rels,
  								int ninvalmsgs,
  								SharedInvalidationMessage *invalmsgs,
+ 								int ninitforks,
+ 								RelFileNode *initforks,
  								bool initfileinval);
  static void RecordTransactionAbortPrepared(TransactionId xid,
  							   int nchildren,
*************** typedef struct TwoPhaseFileHeader
*** 769,774 ****
--- 771,777 ----
  	int32		ncommitrels;	/* number of delete-on-commit rels */
  	int32		nabortrels;		/* number of delete-on-abort rels */
  	int32		ninvalmsgs;		/* number of cache invalidation messages */
+ 	int32		ncommitinitforks;/* number of delete-on-commit init fork rels */
  	bool		initfileinval;	/* does relcache init file need invalidation? */
  	char		gid[GIDSIZE];	/* GID for transaction */
  } TwoPhaseFileHeader;
*************** StartPrepare(GlobalTransaction gxact)
*** 845,850 ****
--- 848,854 ----
  	TransactionId *children;
  	RelFileNode *commitrels;
  	RelFileNode *abortrels;
+ 	RelFileNode *commitinitforks;
  	SharedInvalidationMessage *invalmsgs;
  
  	/* Initialize linked list */
*************** StartPrepare(GlobalTransaction gxact)
*** 872,877 ****
--- 876,882 ----
  	hdr.nabortrels = smgrGetPendingDeletes(false, &abortrels);
  	hdr.ninvalmsgs = xactGetCommittedInvalidationMessages(&invalmsgs,
  														  &hdr.initfileinval);
+ 	hdr.ncommitinitforks = smgrGetPendingInitForkDeletes(true, &commitinitforks);
  	StrNCpy(hdr.gid, gxact->gid, GIDSIZE);
  
  	save_state_data(&hdr, sizeof(TwoPhaseFileHeader));
*************** StartPrepare(GlobalTransaction gxact)
*** 902,907 ****
--- 907,918 ----
  						hdr.ninvalmsgs * sizeof(SharedInvalidationMessage));
  		pfree(invalmsgs);
  	}
+ 	if (hdr.ncommitinitforks > 0)
+ 	{
+ 		save_state_data(commitinitforks,
+ 						hdr.ncommitinitforks * sizeof(RelFileNode));
+ 		pfree(commitinitforks);
+ 	}
  }
  
  /*
*************** FinishPreparedTransaction(const char *gi
*** 1250,1255 ****
--- 1261,1267 ----
  	RelFileNode *commitrels;
  	RelFileNode *abortrels;
  	RelFileNode *delrels;
+ 	RelFileNode *commitinitforks;
  	int			ndelrels;
  	SharedInvalidationMessage *invalmsgs;
  	int			i;
*************** FinishPreparedTransaction(const char *gi
*** 1285,1290 ****
--- 1297,1304 ----
  	bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
  	invalmsgs = (SharedInvalidationMessage *) bufptr;
  	bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+ 	commitinitforks = (RelFileNode *) bufptr;
+ 	bufptr += MAXALIGN(hdr->ncommitinitforks * sizeof(RelFileNode));
  
  	/* compute latestXid among all children */
  	latestXid = TransactionIdLatest(xid, hdr->nsubxacts, children);
*************** FinishPreparedTransaction(const char *gi
*** 1302,1307 ****
--- 1316,1323 ----
  										hdr->nsubxacts, children,
  										hdr->ncommitrels, commitrels,
  										hdr->ninvalmsgs, invalmsgs,
+ 										hdr->ncommitinitforks,
+ 										commitinitforks,
  										hdr->initfileinval);
  	else
  		RecordTransactionAbortPrepared(xid,
*************** FinishPreparedTransaction(const char *gi
*** 1349,1354 ****
--- 1365,1381 ----
  		smgrclose(srel);
  	}
  
+ 	if (isCommit)
+ 	{
+ 		/* Make sure init fork files supposed to be dropped are dropped */
+ 		for (i = 0; i < hdr->ncommitinitforks; i++)
+ 		{
+ 			SMgrRelation srel = smgropen(commitinitforks[i], InvalidBackendId);
+ 			if (smgrexists(srel, INIT_FORKNUM))
+ 				smgrdounlink(srel, INIT_FORKNUM, true);
+ 			smgrclose(srel);
+ 		}
+ 	}
  	/*
  	 * Handle cache invalidation messages.
  	 *
*************** RecoverPreparedTransactions(void)
*** 1890,1895 ****
--- 1917,1923 ----
  			bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
  			bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
  			bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+ 			bufptr += MAXALIGN(hdr->ncommitinitforks * sizeof(RelFileNode));
  
  			/*
  			 * It's possible that SubTransSetParent has been set before, if
*************** RecordTransactionCommitPrepared(Transact
*** 1961,1969 ****
  								RelFileNode *rels,
  								int ninvalmsgs,
  								SharedInvalidationMessage *invalmsgs,
  								bool initfileinval)
  {
! 	XLogRecData rdata[4];
  	int			lastrdata = 0;
  	xl_xact_commit_prepared xlrec;
  	XLogRecPtr	recptr;
--- 1989,1999 ----
  								RelFileNode *rels,
  								int ninvalmsgs,
  								SharedInvalidationMessage *invalmsgs,
+ 								int ninitforks,
+ 								RelFileNode *initforks,
  								bool initfileinval)
  {
! 	XLogRecData rdata[5];
  	int			lastrdata = 0;
  	xl_xact_commit_prepared xlrec;
  	XLogRecPtr	recptr;
*************** RecordTransactionCommitPrepared(Transact
*** 1981,1986 ****
--- 2011,2017 ----
  	xlrec.crec.nrels = nrels;
  	xlrec.crec.nsubxacts = nchildren;
  	xlrec.crec.nmsgs = ninvalmsgs;
+ 	xlrec.crec.ninitforks = ninitforks;
  
  	rdata[0].data = (char *) (&xlrec);
  	rdata[0].len = MinSizeOfXactCommitPrepared;
*************** RecordTransactionCommitPrepared(Transact
*** 2012,2017 ****
--- 2043,2057 ----
  		rdata[3].buffer = InvalidBuffer;
  		lastrdata = 3;
  	}
+ 	/* dump init fork deletions */
+ 	if (ninitforks > 0)
+ 	{
+ 		rdata[lastrdata].next = &(rdata[4]);
+ 		rdata[4].data = (char *) initforks;
+ 		rdata[4].len = ninitforks * sizeof(RelFileNode);
+ 		rdata[4].buffer = InvalidBuffer;
+ 		lastrdata = 4;
+ 	}
  	rdata[lastrdata].next = NULL;
  
  	recptr = XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT_PREPARED, rdata);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
new file mode 100644
index 8a4c4ec..a0bb222
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
*************** RecordTransactionCommit(void)
*** 907,912 ****
--- 907,914 ----
  	TransactionId latestXid = InvalidTransactionId;
  	int			nrels;
  	RelFileNode *rels;
+ 	int			ninitforks;
+ 	RelFileNode *initforks;
  	int			nchildren;
  	TransactionId *children;
  	int			nmsgs = 0;
*************** RecordTransactionCommit(void)
*** 917,922 ****
--- 919,926 ----
  	/* Get data needed for commit record */
  	nrels = smgrGetPendingDeletes(true, &rels);
  	nchildren = xactGetCommittedChildren(&children);
+ 	/* TODO 2PC needs this too? */
+ 	ninitforks = smgrGetPendingInitForkDeletes(true, &initforks);
  	if (XLogStandbyInfoActive())
  		nmsgs = xactGetCommittedInvalidationMessages(&invalMessages,
  													 &RelcacheInitFileInval);
*************** RecordTransactionCommit(void)
*** 955,961 ****
  		/*
  		 * Begin commit critical section and insert the commit XLOG record.
  		 */
! 		XLogRecData rdata[4];
  		int			lastrdata = 0;
  		xl_xact_commit xlrec;
  
--- 959,965 ----
  		/*
  		 * Begin commit critical section and insert the commit XLOG record.
  		 */
! 		XLogRecData rdata[5];
  		int			lastrdata = 0;
  		xl_xact_commit xlrec;
  
*************** RecordTransactionCommit(void)
*** 999,1004 ****
--- 1003,1009 ----
  		xlrec.nrels = nrels;
  		xlrec.nsubxacts = nchildren;
  		xlrec.nmsgs = nmsgs;
+ 		xlrec.ninitforks = ninitforks;
  		rdata[0].data = (char *) (&xlrec);
  		rdata[0].len = MinSizeOfXactCommit;
  		rdata[0].buffer = InvalidBuffer;
*************** RecordTransactionCommit(void)
*** 1029,1034 ****
--- 1034,1048 ----
  			rdata[3].buffer = InvalidBuffer;
  			lastrdata = 3;
  		}
+ 		/* dump init fork deletions */
+ 		if (ninitforks > 0)
+ 		{
+ 			rdata[lastrdata].next = &(rdata[4]);
+ 			rdata[4].data = (char *) initforks;
+ 			rdata[4].len = ninitforks * sizeof(RelFileNode);
+ 			rdata[4].buffer = InvalidBuffer;
+ 			lastrdata = 4;
+ 		}
  		rdata[lastrdata].next = NULL;
  
  		(void) XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT, rdata);
*************** cleanup:
*** 1143,1148 ****
--- 1157,1164 ----
  	/* Clean up local data */
  	if (rels)
  		pfree(rels);
+ 	if (initforks)
+ 		pfree(initforks);
  
  	return latestXid;
  }
*************** xact_redo_commit(xl_xact_commit *xlrec, 
*** 4438,4443 ****
--- 4454,4460 ----
  {
  	TransactionId *sub_xids;
  	SharedInvalidationMessage *inval_msgs;
+ 	RelFileNode *initforks;
  	TransactionId max_xid;
  	int			i;
  
*************** xact_redo_commit(xl_xact_commit *xlrec, 
*** 4445,4450 ****
--- 4462,4469 ----
  	sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]);
  	/* invalidation messages array follows subxids */
  	inval_msgs = (SharedInvalidationMessage *) &(sub_xids[xlrec->nsubxacts]);
+ 	/* init fork deletes array follows invalidation messages */
+ 	initforks = (RelFileNode *) &(inval_msgs[xlrec->nmsgs]);
  
  	max_xid = TransactionIdLatest(xid, xlrec->nsubxacts, sub_xids);
  
*************** xact_redo_commit(xl_xact_commit *xlrec, 
*** 4534,4539 ****
--- 4553,4570 ----
  		smgrclose(srel);
  	}
  
+ 	/* Make sure init fork files supposed to be dropped are dropped */
+ 	for (i = 0; i < xlrec->ninitforks; i++)
+ 	{
+ 		SMgrRelation srel = smgropen(initforks[i], InvalidBackendId);
+ 		if (smgrexists(srel, INIT_FORKNUM))
+ 		{
+ 			XLogDropRelation(initforks[i], INIT_FORKNUM);
+ 			smgrdounlink(srel, INIT_FORKNUM, true);
+ 		}
+ 		smgrclose(srel);
+ 	}
+ 
  	/*
  	 * We issue an XLogFlush() for the same reason we emit ForceSyncCommit()
  	 * in normal operation. For example, in DROP DATABASE, we delete all the
*************** xact_desc_commit(StringInfo buf, xl_xact
*** 4749,4754 ****
--- 4780,4801 ----
  				appendStringInfo(buf, " unknown id %d", msg->id);
  		}
  	}
+ 	if (xlrec->ninitforks > 0)
+ 	{
+ 		RelFileNode *rl;
+ 
+ 		rl = (RelFileNode *) &xacts[xlrec->nmsgs];
+ 
+ 		appendStringInfo(buf, "; init fork rels:");
+ 		for (i = 0; i < xlrec->ninitforks; i++)
+ 		{
+ 			char	   *path = relpathperm(rl[i], INIT_FORKNUM);
+ 
+ 			appendStringInfo(buf, " %s", path);
+ 			pfree(path);
+ 		}
+ 	}
+ 
  }
  
  static void
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
new file mode 100644
index 57987be..3363d7e
*** a/src/backend/catalog/storage.c
--- b/src/backend/catalog/storage.c
*************** typedef struct PendingRelDelete
*** 55,60 ****
--- 55,61 ----
  	BackendId	backend;		/* InvalidBackendId if not a temp rel */
  	bool		atCommit;		/* T=delete at commit; F=delete at abort */
  	int			nestLevel;		/* xact nesting level of request */
+ 	bool		onlyInitFork;	/* delete only INIT_FORKNUM */
  	struct PendingRelDelete *next;		/* linked-list link */
  } PendingRelDelete;
  
*************** RelationCreateStorage(RelFileNode rnode,
*** 135,140 ****
--- 136,142 ----
  	pending->backend = backend;
  	pending->atCommit = false;	/* delete if abort */
  	pending->nestLevel = GetCurrentTransactionNestLevel();
+ 	pending->onlyInitFork = false;
  	pending->next = pendingDeletes;
  	pendingDeletes = pending;
  }
*************** log_smgrcreate(RelFileNode *rnode, ForkN
*** 163,173 ****
  }
  
  /*
!  * RelationDropStorage
   *		Schedule unlinking of physical storage at transaction commit.
   */
! void
! RelationDropStorage(Relation rel)
  {
  	PendingRelDelete *pending;
  
--- 165,176 ----
  }
  
  /*
!  * RelationDropStorageForks
   *		Schedule unlinking of physical storage at transaction commit.
+  *	onlyInitFork: true if only the init fork as to be deleted
   */
! static void
! RelationDropStorageForks(Relation rel, bool onlyInitFork)
  {
  	PendingRelDelete *pending;
  
*************** RelationDropStorage(Relation rel)
*** 178,183 ****
--- 181,187 ----
  	pending->backend = rel->rd_backend;
  	pending->atCommit = true;	/* delete if commit */
  	pending->nestLevel = GetCurrentTransactionNestLevel();
+ 	pending->onlyInitFork = onlyInitFork;
  	pending->next = pendingDeletes;
  	pendingDeletes = pending;
  
*************** RelationDropStorage(Relation rel)
*** 195,200 ****
--- 199,225 ----
  }
  
  /*
+  * RelationDropStorage
+  *		Schedule unlinking of physical storage at transaction commit.
+  */
+ void
+ RelationDropStorage(Relation rel)
+ {
+ 	RelationDropStorageForks(rel, false);
+ }
+ 
+ /*
+  * RelationDropInitForkStorage
+  *		Schedule unlinking of physical storage of the init fork at transaction
+  *		commit.
+  */
+ void
+ RelationDropInitForkStorage(Relation rel)
+ {
+ 	RelationDropStorageForks(rel, true);
+ }
+ 
+ /*
   * RelationPreserveStorage
   *		Mark a relation as not to be deleted after all.
   *
*************** smgrDoPendingDeletes(bool isCommit)
*** 358,367 ****
  				int			i;
  
  				srel = smgropen(pending->relnode, pending->backend);
! 				for (i = 0; i <= MAX_FORKNUM; i++)
  				{
! 					if (smgrexists(srel, i))
! 						smgrdounlink(srel, i, false);
  				}
  				smgrclose(srel);
  			}
--- 383,400 ----
  				int			i;
  
  				srel = smgropen(pending->relnode, pending->backend);
! 				if (pending->onlyInitFork)
  				{
! 					if (smgrexists(srel, INIT_FORKNUM))
! 						smgrdounlink(srel, INIT_FORKNUM, false);
! 				}
! 				else
! 				{
! 					for (i = 0; i <= MAX_FORKNUM; i++)
! 					{
! 						if (smgrexists(srel, i))
! 							smgrdounlink(srel, i, false);
! 					}
  				}
  				smgrclose(srel);
  			}
*************** smgrDoPendingDeletes(bool isCommit)
*** 373,378 ****
--- 406,449 ----
  }
  
  /*
+  * smgrGetPendingInitForkDeletes() -- Get a list of init forksto be deleted.
+ */
+ int
+ smgrGetPendingInitForkDeletes(bool forCommit, RelFileNode **ptr)
+ {
+ 	int			nestLevel = GetCurrentTransactionNestLevel();
+ 	int			nrels;
+ 	RelFileNode *rptr;
+ 	PendingRelDelete *pending;
+ 
+ 	nrels = 0;
+ 	for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ 	{
+ 		if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
+ 			&& pending->backend == InvalidBackendId && pending->onlyInitFork)
+ 			nrels++;
+ 	}
+ 	if (nrels == 0)
+ 	{
+ 		*ptr = NULL;
+ 		return 0;
+ 	}
+ 	rptr = (RelFileNode *) palloc(nrels * sizeof(RelFileNode));
+ 	*ptr = rptr;
+ 	for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ 	{
+ 		if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
+ 			&& pending->backend == InvalidBackendId && pending->onlyInitFork)
+ 		{
+ 			*rptr = pending->relnode;
+ 			rptr++;
+ 		}
+ 	}
+ 	return nrels;
+ }
+ 
+ 
+ /*
   * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
   *
   * The return value is the number of relations scheduled for termination.
*************** smgrGetPendingDeletes(bool forCommit, Re
*** 401,407 ****
  	for (pending = pendingDeletes; pending != NULL; pending = pending->next)
  	{
  		if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
! 			&& pending->backend == InvalidBackendId)
  			nrels++;
  	}
  	if (nrels == 0)
--- 472,478 ----
  	for (pending = pendingDeletes; pending != NULL; pending = pending->next)
  	{
  		if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
! 			&& pending->backend == InvalidBackendId && !pending->onlyInitFork)
  			nrels++;
  	}
  	if (nrels == 0)
*************** smgrGetPendingDeletes(bool forCommit, Re
*** 414,420 ****
  	for (pending = pendingDeletes; pending != NULL; pending = pending->next)
  	{
  		if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
! 			&& pending->backend == InvalidBackendId)
  		{
  			*rptr = pending->relnode;
  			rptr++;
--- 485,491 ----
  	for (pending = pendingDeletes; pending != NULL; pending = pending->next)
  	{
  		if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
! 			&& pending->backend == InvalidBackendId && !pending->onlyInitFork)
  		{
  			*rptr = pending->relnode;
  			rptr++;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
new file mode 100644
index 790bc2a..bbdcc64
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
*************** static void ATPrepAddInherit(Relation ch
*** 358,363 ****
--- 358,365 ----
  static void ATExecAddInherit(Relation child_rel, RangeVar *parent, LOCKMODE lockmode);
  static void ATExecDropInherit(Relation rel, RangeVar *parent, LOCKMODE lockmode);
  static void ATExecGenericOptions(Relation rel, List *options);
+ static void ATPrepSetLogged(Relation rel);
+ static void ATExecSetLogged(Relation rel);
  
  static void copy_relation_data(SMgrRelation rel, SMgrRelation dst,
  				   ForkNumber forkNum, char relpersistence);
*************** AlterTableGetLockLevel(List *cmds)
*** 2614,2619 ****
--- 2616,2622 ----
  			case AT_DropNotNull:		/* may change some SQL plans */
  			case AT_SetNotNull:
  			case AT_GenericOptions:
+ 			case AT_SetLogged:				/* must drop shared buffers */
  				cmd_lockmode = AccessExclusiveLock;
  				break;
  
*************** ATPrepCmd(List **wqueue, Relation rel, A
*** 2958,2963 ****
--- 2961,2973 ----
  			/* No command-specific prep needed */
  			pass = AT_PASS_MISC;
  			break;
+ 		case AT_SetLogged:	/* ALTER TABLE SET LOGGED */
+ 			ATSimpleRecursion(wqueue, rel, cmd, recurse, lockmode);
+ 			ATSimplePermissions(rel, ATT_TABLE);
+ 			/* Performs also own permission checks */
+ 			ATPrepSetLogged(rel);
+ 			pass = AT_PASS_MISC;
+ 			break;
  		default:				/* oops */
  			elog(ERROR, "unrecognized alter table type: %d",
  				 (int) cmd->subtype);
*************** ATExecCmd(List **wqueue, AlteredTableInf
*** 3221,3226 ****
--- 3231,3239 ----
  		case AT_GenericOptions:
  			ATExecGenericOptions(rel, (List *) cmd->def);
  			break;
+ 		case AT_SetLogged:
+ 			ATExecSetLogged(rel);
+ 			break;
  		default:				/* oops */
  			elog(ERROR, "unrecognized alter table type: %d",
  				 (int) cmd->subtype);
*************** ATPrepSetTableSpace(AlteredTableInfo *ta
*** 7603,7608 ****
--- 7616,7800 ----
  }
  
  /*
+  * 0) check we're in the wal_level=minimal case - this check will have to be
+  * removed later...
+  * 1) check the table is, in fact, unlogged
+  * 2) unlogged referencing permanent is OK, but permanent referencing unlogged
+  * is a no-no, so what matters when upgrading is "outbound" foreign keys
+  */
+ static void
+ ATPrepSetLogged(Relation rel)
+ {
+ 	Relation	pg_constraint;
+ 	HeapTuple	tuple;
+ 	SysScanDesc scan;
+ 	ScanKeyData skey[1];
+ 
+ 	if (wal_level != WAL_LEVEL_MINIMAL)
+ 	{
+ 		/* TODO change ERRCODE_INSUFFICIENT_PRIVILEGE to something else */
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ 				 errmsg("cannot change table to LOGGED in wal_level != minimal case")));
+ 
+ 	}
+ 
+ 	if (rel->rd_rel->relpersistence != RELPERSISTENCE_UNLOGGED)
+ 	{
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
+ 				 errmsg("table %s is not unlogged",
+ 						 RelationGetRelationName(rel))));
+ 	}
+ 
+ 	/*
+ 	 * Fetch the constraint tuple from pg_constraint.
+ 	 */
+ 	pg_constraint = heap_open(ConstraintRelationId, AccessShareLock);
+ 
+ 	ScanKeyInit(&skey[0],
+ 				Anum_pg_constraint_conrelid,
+ 				BTEqualStrategyNumber, F_OIDEQ,
+ 				ObjectIdGetDatum(RelationGetRelid(rel)));
+ 
+ 	scan = systable_beginscan(pg_constraint, ConstraintRelidIndexId, true,
+ 							  SnapshotNow, 1, skey);
+ 
+ 	while (HeapTupleIsValid(tuple = systable_getnext(scan)))
+ 	{
+ 		Form_pg_constraint con = (Form_pg_constraint) GETSTRUCT(tuple);
+ 		if (con->contype == CONSTRAINT_FOREIGN)
+ 		{
+ 			Relation relfk = relation_open(con->confrelid, AccessShareLock);
+ 			if (relfk->rd_rel->relpersistence != RELPERSISTENCE_PERMANENT)
+ 			{
+ 				/* TODO better error description */
+ 				/* TODO: exit at first error, or list all dependencies? */
+ 				ereport(ERROR,
+ 						(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
+ 						 errmsg("table %s references unlogged table %s",
+ 								 RelationGetRelationName(rel),
+ 								 RelationGetRelationName(relfk))));
+ 			}
+ 			relation_close(relfk, AccessShareLock);
+ 		}
+ 	}
+ 
+ 	systable_endscan(scan);
+ 
+ 	heap_close(pg_constraint, AccessShareLock);
+ 
+ }
+ 
+ /*
+ * 1) Write out all shared buffers for the target table/index, and drop them
+ * 2) fsync() any segments of the target relation - of any fork except
+ * that init fork
+ * 3) Arrange for the appropriate file deletions at commit or abort, by
+ * updating pendingDeletes
+ */
+ static void
+ ATUpdateToLogged(Relation rel, Relation	pg_classRel)
+ {
+ 	HeapTuple	reltup;
+ 	Form_pg_class relStruct;
+ 	bool fsm;
+ 	bool vm;
+ 
+ 	/* main heap */
+ 	FlushRelationBuffers(rel);
+ 
+ 	/* FlushRelationBuffers will have opened rd_smgr */
+ 	DropRelFileNodeBuffers(rel->rd_smgr->smgr_rnode, MAIN_FORKNUM, 0);
+ 	smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
+ 
+ 	fsm = smgrexists(rel->rd_smgr, FSM_FORKNUM);
+ 	if (fsm)
+ 		smgrimmedsync(rel->rd_smgr, FSM_FORKNUM);
+ 
+ 	vm = smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
+ 	if (vm)
+ 		smgrimmedsync(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
+ 
+ 	/* init fork delete */
+ 	RelationDropInitForkStorage(rel);
+ 
+ 	reltup = SearchSysCacheCopy1(RELOID,
+ 								 ObjectIdGetDatum(RelationGetRelid(rel)));
+ 	if (!HeapTupleIsValid(reltup))
+ 		elog(ERROR, "cache lookup failed for relation %u",
+ 			 RelationGetRelid(rel));
+ 	relStruct = (Form_pg_class) GETSTRUCT(reltup);
+ 
+ 	Assert(relStruct->relpersistence == RELPERSISTENCE_UNLOGGED);
+ 
+ 	relStruct->relpersistence = RELPERSISTENCE_PERMANENT;
+ 
+ 	simple_heap_update(pg_classRel, &reltup->t_self, reltup);
+ 
+ 	/* keep catalog indexes current */
+ 	CatalogUpdateIndexes(pg_classRel, reltup);
+ 
+ 	heap_freetuple(reltup);
+ 
+ }
+ 
+ /*
+  * update main heap, tost and index(es) of the target Relation to logged
+ */
+ static void
+ ATExecSetLogged(Relation rel)
+ {
+ 	Relation	indexRelation;
+ 	ScanKeyData skey;
+ 	SysScanDesc scan;
+ 	HeapTuple	indexTuple;
+ 	Relation	relrel;
+ 
+ 	/* open pg_class to update relpersistence */
+ 	relrel = heap_open(RelationRelationId, RowExclusiveLock);
+ 
+ 	/* main heap */
+ 	ATUpdateToLogged(rel, relrel);
+ 
+ 	/* toast heap, if any */
+ 	if (OidIsValid(rel->rd_rel->reltoastrelid))
+ 	{
+ 		Relation	toastrel;
+ 
+ 		toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+ 		ATUpdateToLogged(toastrel, relrel);
+ 		heap_close(toastrel, AccessShareLock);
+ 	}
+ 
+ 	/* indexes init fork delete */
+ 	/* Prepare to scan pg_index for entries having indrelid = this rel. */
+ 	indexRelation = heap_open(IndexRelationId, AccessShareLock);
+ 	ScanKeyInit(&skey,
+ 				Anum_pg_index_indrelid,
+ 				BTEqualStrategyNumber, F_OIDEQ,
+ 				ObjectIdGetDatum(RelationGetRelid(rel)));
+ 
+ 	scan = systable_beginscan(indexRelation, IndexIndrelidIndexId, true,
+ 							  SnapshotNow, 1, &skey);
+ 
+ 	while (HeapTupleIsValid(indexTuple = systable_getnext(scan)))
+ 	{
+ 		Form_pg_index index = (Form_pg_index) GETSTRUCT(indexTuple);
+ 		Relation indxrel = index_open(index->indexrelid, AccessShareLock);
+ 
+ 		ATUpdateToLogged(indxrel, relrel);
+ 
+ 		index_close(indxrel, AccessShareLock);
+ 	}
+ 
+ 	systable_endscan(scan);
+ 	heap_close(indexRelation, AccessShareLock);
+ 
+ 	heap_close(relrel, RowExclusiveLock);
+ }
+ 
+ /*
   * ALTER TABLE/INDEX SET (...) or RESET (...)
   */
  static void
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
new file mode 100644
index a22ab66..e1a50a4
*** a/src/backend/parser/gram.y
--- b/src/backend/parser/gram.y
*************** static void SplitColQualList(List *qualL
*** 520,526 ****
  
  	LABEL LANGUAGE LARGE_P LAST_P LC_COLLATE_P LC_CTYPE_P LEADING
  	LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL LOCALTIME LOCALTIMESTAMP
! 	LOCATION LOCK_P
  
  	MAPPING MATCH MAXVALUE MINUTE_P MINVALUE MODE MONTH_P MOVE
  
--- 520,526 ----
  
  	LABEL LANGUAGE LARGE_P LAST_P LC_COLLATE_P LC_CTYPE_P LEADING
  	LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL LOCALTIME LOCALTIMESTAMP
! 	LOCATION LOCK_P LOGGED
  
  	MAPPING MATCH MAXVALUE MINUTE_P MINVALUE MODE MONTH_P MOVE
  
*************** alter_table_cmd:
*** 1795,1800 ****
--- 1795,1807 ----
  					n->missing_ok = FALSE;
  					$$ = (Node *)n;
  				}
+ 			/* ALTER TABLE <name> SET LOGGED  */
+ 			| SET LOGGED
+ 				{
+ 					AlterTableCmd *n = makeNode(AlterTableCmd);
+ 					n->subtype = AT_SetLogged;
+ 					$$ = (Node *)n;
+ 				}
  			/* ALTER TABLE <name> SET WITH OIDS  */
  			| SET WITH OIDS
  				{
*************** unreserved_keyword:
*** 11935,11940 ****
--- 11942,11948 ----
  			| LOCAL
  			| LOCATION
  			| LOCK_P
+ 			| LOGGED
  			| MAPPING
  			| MATCH
  			| MAXVALUE
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
new file mode 100644
index cb440d4..ef56d17
*** a/src/include/access/xact.h
--- b/src/include/access/xact.h
*************** typedef struct xl_xact_commit
*** 123,134 ****
--- 123,136 ----
  	int			nrels;			/* number of RelFileNodes */
  	int			nsubxacts;		/* number of subtransaction XIDs */
  	int			nmsgs;			/* number of shared inval msgs */
+ 	int			ninitforks;  /* number of init fork RelFileNodes */
  	Oid			dbId;			/* MyDatabaseId */
  	Oid			tsId;			/* MyDatabaseTableSpace */
  	/* Array of RelFileNode(s) to drop at commit */
  	RelFileNode xnodes[1];		/* VARIABLE LENGTH ARRAY */
  	/* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
  	/* ARRAY OF SHARED INVALIDATION MESSAGES FOLLOWS */
+ 	/* ARRAY OF INIT FORK RelFileNodes FOLLOWS */
  } xl_xact_commit;
  
  #define MinSizeOfXactCommit offsetof(xl_xact_commit, xnodes)
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
new file mode 100644
index 8dee8cf..04e1118
*** a/src/include/catalog/storage.h
--- b/src/include/catalog/storage.h
***************
*** 22,27 ****
--- 22,28 ----
  
  extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
  extern void RelationDropStorage(Relation rel);
+ extern void RelationDropInitForkStorage(Relation rel);
  extern void RelationPreserveStorage(RelFileNode rnode);
  extern void RelationTruncate(Relation rel, BlockNumber nblocks);
  
*************** extern void RelationTruncate(Relation re
*** 31,36 ****
--- 32,38 ----
   */
  extern void smgrDoPendingDeletes(bool isCommit);
  extern int	smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+ extern int	smgrGetPendingInitForkDeletes(bool forCommit, RelFileNode **ptr);
  extern void AtSubCommit_smgr(void);
  extern void AtSubAbort_smgr(void);
  extern void PostPrepare_smgr(void);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
new file mode 100644
index 566cd3a..56f1384
*** a/src/include/nodes/parsenodes.h
--- b/src/include/nodes/parsenodes.h
*************** typedef enum AlterTableType
*** 1219,1224 ****
--- 1219,1225 ----
  	AT_AddInherit,				/* INHERIT parent */
  	AT_DropInherit,				/* NO INHERIT parent */
  	AT_GenericOptions,			/* OPTIONS (...) */
+ 	AT_SetLogged,				/* SET LOGGED */
  } AlterTableType;
  
  typedef struct AlterTableCmd	/* one subcommand of an ALTER TABLE */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
new file mode 100644
index 12c2faf..45e17ea
*** a/src/include/parser/kwlist.h
--- b/src/include/parser/kwlist.h
*************** PG_KEYWORD("localtime", LOCALTIME, RESER
*** 227,232 ****
--- 227,233 ----
  PG_KEYWORD("localtimestamp", LOCALTIMESTAMP, RESERVED_KEYWORD)
  PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD)
  PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD)
+ PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD)
  PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD)
  PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD)
  PG_KEYWORD("maxvalue", MAXVALUE, UNRESERVED_KEYWORD)
#12Robert Haas
robertmhaas@gmail.com
In reply to: Leonardo Francalanci (#11)
Re: switch UNLOGGED to LOGGED

On Fri, Apr 22, 2011 at 4:13 AM, Leonardo Francalanci <m_lists@yahoo.it> wrote:

Maybe you should change  xl_act_commit to have a separate list of rels to
drop the init fork for  (instead of mixing those with the list of files to
drop as a  whole).

I tried to follow your suggestion, thank you very much.

I have to admit I don't like this approach very much. I can't see
adding 4 bytes to every commit record for this feature.

3) Should we have a "cascade" option? I don't know if I have to handle
inherited tables and other dependent objects

Look at the way ALTER TABLE [ONLY] works for other action types, and copy it.

4) During the check for dependencies problems, I stop as soon as I find an
error; would it be enough?

It's a bit awkwardly phrased the way you have it. I would suggest
something like:

ERROR: constraints on permanent tables may reference only permanent tables
HINT: constraint %s

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#12)
Re: switch UNLOGGED to LOGGED

On Fri, May 6, 2011 at 10:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:

ERROR:  constraints on permanent tables may reference only permanent tables
HINT:  constraint %s

Argh, hit send too soon.

HINT: constraint %s references table %s

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#13)
Re: switch UNLOGGED to LOGGED

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, May 6, 2011 at 10:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:

ERROR: �constraints on permanent tables may reference only permanent tables
HINT: �constraint %s

Argh, hit send too soon.

HINT: constraint %s references table %s

<nitpick>
That looks like a DETAIL, not a HINT. Also see message style guide
about how that should be a complete sentence.
</nitpick>

regards, tom lane

#15Leonardo Francalanci
m_lists@yahoo.it
In reply to: Robert Haas (#12)
Re: switch UNLOGGED to LOGGED

On Fri, Apr 22, 2011 at 4:13 AM, Leonardo Francalanci <m_lists@yahoo.it>

wrote:

Maybe you should change xl_act_commit to have a separate list of rels to
drop the init fork for (instead of mixing those with the list of files

to

drop as a whole).

I tried to follow your suggestion, thank you very much.

I have to admit I don't like this approach very much. I can't see
adding 4 bytes to every commit record for this feature.

I understand.

What if, in xl_xact_commit, instead of

RelFileNode xnodes

I use another struct for xnodes, something like:

{
RelFileNode xnode;
bool onlyInitFork;
}

That would increase the commit record size only when there are
RelFileNode(s) to drop at commit. So, instead of 4 bytes in
every commit, there are "wasted" bytes when the commit record
contains deleted permanent relations (that should happen much
less). I'm open to suggestions here...

3) Should we have a "cascade" option? I don't know if I have to handle
inherited tables and other dependent objects

Look at the way ALTER TABLE [ONLY] works for other action types, and copy it.

Ok

Thank you very much

Leonardo

#16Alvaro Herrera
alvherre@commandprompt.com
In reply to: Robert Haas (#12)
Re: switch UNLOGGED to LOGGED

Excerpts from Robert Haas's message of vie may 06 23:25:09 -0300 2011:

On Fri, Apr 22, 2011 at 4:13 AM, Leonardo Francalanci <m_lists@yahoo.it> wrote:

Maybe you should change  xl_act_commit to have a separate list of rels to
drop the init fork for  (instead of mixing those with the list of files to
drop as a  whole).

I tried to follow your suggestion, thank you very much.

I have to admit I don't like this approach very much. I can't see
adding 4 bytes to every commit record for this feature.

Hmm, yeah. Maybe we can add a "flags" int8 somewhere in that struct and
set a bit in it if nrels, nsubxacts, nmsgs and respective arrays are present.
That would save some int's that are already in there.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#17Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#16)
Re: switch UNLOGGED to LOGGED

On Mon, May 9, 2011 at 12:51 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

Excerpts from Robert Haas's message of vie may 06 23:25:09 -0300 2011:

On Fri, Apr 22, 2011 at 4:13 AM, Leonardo Francalanci <m_lists@yahoo.it> wrote:

Maybe you should change  xl_act_commit to have a separate list of rels to
drop the init fork for  (instead of mixing those with the list of files to
drop as a  whole).

I tried to follow your suggestion, thank you very much.

I have to admit I don't like this approach very much.  I can't see
adding 4 bytes to every commit record for this feature.

Hmm, yeah.  Maybe we can add a "flags" int8 somewhere in that struct and
set a bit in it if nrels, nsubxacts, nmsgs and respective arrays are present.
That would save some int's that are already in there.

Yes, that seems like a very appealing approach. There is plenty of
bit-space available in xinfo, and we could reserve a bit each for
nrels, nsubxacts, and nmsgs, with set meaning that an integer count of
that item is present and clear meaning that the count is omitted from
the structure (and zero). This will probably require a bit of tricky
code reorganization so I think it should be done separately from the
main patch. With that done, then it's not a big deal for the main
patch to add in one more array that will normally get omitted. And in
the process, we can save 12 bytes on every commit record in the common
case, which is quite appealing: I don't expect a huge performance
gain, but a penny saved is a penny earned.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18Leonardo Francalanci
m_lists@yahoo.it
In reply to: Robert Haas (#17)
Re: switch UNLOGGED to LOGGED

Yes, that seems like a very appealing approach. There is plenty of
bit-space available in xinfo, and we could reserve a bit each for
nrels, nsubxacts, and nmsgs, with set meaning that an integer count of
that item is present and clear meaning that the count is omitted from
the structure (and zero). This will probably require a bit of tricky
code reorganization so I think it should be done separately from the
main patch.

Ok, I'll try and send a patch with this change only.
BTW xinfo is 32 bit long, but I think only 2 bits are used right now?
I think I can make it a 8 bits, and add another 8 bits for nrels,
nsubxacts, and nmsgs and the new thing. That should save
another 2 bytes, while leaving space for extention. Or we can make
it a 8 bits only, but only 2 bits would be left "empty" for future
extentions; I don't know if we care about it...

Leonardo

#19Robert Haas
robertmhaas@gmail.com
In reply to: Leonardo Francalanci (#18)
Re: switch UNLOGGED to LOGGED

On Tue, May 10, 2011 at 3:35 AM, Leonardo Francalanci <m_lists@yahoo.it> wrote:

Yes, that seems like a very appealing approach.   There is plenty of
bit-space available in xinfo, and we could reserve a bit  each for
nrels, nsubxacts, and nmsgs, with set meaning that an integer count  of
that item is present and clear meaning that the count is omitted  from
the structure (and zero).  This will probably require a bit of  tricky
code reorganization so I think it should be done separately from  the
main patch.

Ok, I'll try and send a patch with this change only.
BTW  xinfo  is 32 bit long, but I think only 2 bits are used right now?
I think I can make it a 8 bits, and add another 8 bits for nrels,
nsubxacts, and nmsgs and the new thing. That should save
another 2 bytes, while leaving space for extention. Or we can make
it a 8 bits only, but only 2 bits would be left "empty" for future
extentions; I don't know if we care about it...

I don't think making xinfo shorter will save anything, because
whatever follows it is going to be a 4-byte quantity and therefore
4-byte aligned.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20Leonardo Francalanci
m_lists@yahoo.it
In reply to: Robert Haas (#19)
Re: switch UNLOGGED to LOGGED

I don't think making xinfo shorter will save anything, because
whatever follows it is going to be a 4-byte quantity and therefore
4-byte aligned.

ups, didn't notice it.

I'll split xinfo into:

uint16 xinfo;
uint16 presentFlags;

I guess it helps with the reading? I mean, instead
of having a single uint32?

#21Robert Haas
robertmhaas@gmail.com
In reply to: Leonardo Francalanci (#20)
Re: switch UNLOGGED to LOGGED

On Tue, May 10, 2011 at 8:03 AM, Leonardo Francalanci <m_lists@yahoo.it> wrote:

I don't  think making xinfo shorter will save anything, because
whatever follows it is  going to be a 4-byte quantity and therefore
4-byte aligned.

ups, didn't notice it.

I'll split    xinfo into:

uint16   xinfo;
uint16   presentFlags;

I guess it helps with the reading? I mean, instead
of having a single uint32?

My feeling would be just keep it as uint32. Breaking it up into
chunks doesn't seem useful to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#22Leonardo Francalanci
m_lists@yahoo.it
In reply to: Noah Misch (#6)
Re: switch UNLOGGED to LOGGED

By the time the startup process
releases the AccessExclusiveLock acquired by the proposed
UNLOGGED -> normal conversion process, that relfilenode
needs to be either fully copied or unlinked all over again.
(Alternately, find some other way to make sure queries don't
read the half-copied file.)

About this issue: how are AccessExclusiveLocks released on
the standby when the master crashes?

#23Noah Misch
noah@leadboat.com
In reply to: Leonardo Francalanci (#22)
Re: switch UNLOGGED to LOGGED

On Wed, May 18, 2011 at 04:02:59PM +0100, Leonardo Francalanci wrote:

By the time the startup process
releases the AccessExclusiveLock acquired by the proposed
UNLOGGED -> normal conversion process, that relfilenode
needs to be either fully copied or unlinked all over again.
(Alternately, find some other way to make sure queries don't
read the half-copied file.)

About this issue: how are AccessExclusiveLocks released on
the standby when the master crashes?

I assume those locks remain. It wouldn't be safe to release them; a master
crash is just one kind of WAL receipt latency.

When you promote the standby, though, ShutdownRecoveryTransactionEnvironment()
releases the locks.

#24Leonardo Francalanci
m_lists@yahoo.it
In reply to: Noah Misch (#23)
Re: switch UNLOGGED to LOGGED

On Wed, May 18, 2011 at 04:02:59PM +0100, Leonardo Francalanci wrote:

By the time the startup process
releases the AccessExclusiveLock acquired by the proposed
UNLOGGED -> normal conversion process, that relfilenode
needs to be either fully copied or unlinked all over again.
(Alternately, find some other way to make sure queries don't
read the half-copied file.)

About this issue: how are AccessExclusiveLocks released on
the standby when the master crashes?

I assume those locks remain. It wouldn't be safe to release them; a master
crash is just one kind of WAL receipt latency.

But somehow when the master restarts the standby gets notified it
has the unlock???

When you promote the standby, though,

ShutdownRecoveryTransactionEnvironment()

releases the locks.

Ok; then the problem in the UNLOGGED -> normal conversion is that:

1) the master and the standby acquire a lock on the table
2) the master starts sending the pages to the standby
3) the master crashes before committing

up until here no problems, as the standby still has the lock on the
table.

4) when the master restarts, it removes all the fork for rels with INIT forks;
are those "deletes" sent to the standby? And, if yes,
would those be replayed by the standby *before* releasing the lock?
If the answer is "yes", then I don't think we have a problem... but I think
that at the moment those unlogged-table-forks deletes aren't sent at all.

(When promoting the standby, we could have
ShutdownRecoveryTransactionEnvironment() remove all the fork for rels
with INIT forks before releasing the locks)

Leonardo

#25Noah Misch
noah@leadboat.com
In reply to: Leonardo Francalanci (#24)
Re: switch UNLOGGED to LOGGED

On Thu, May 19, 2011 at 09:23:53AM +0100, Leonardo Francalanci wrote:

On Wed, May 18, 2011 at 04:02:59PM +0100, Leonardo Francalanci wrote:

By the time the startup process
releases the AccessExclusiveLock acquired by the proposed
UNLOGGED -> normal conversion process, that relfilenode
needs to be either fully copied or unlinked all over again.
(Alternately, find some other way to make sure queries don't
read the half-copied file.)

About this issue: how are AccessExclusiveLocks released on
the standby when the master crashes?

I assume those locks remain. It wouldn't be safe to release them; a master
crash is just one kind of WAL receipt latency.

But somehow when the master restarts the standby gets notified it
has the unlock???

I'd guess some WAL record arising from the post-crash master restart makes the
standby do so. When a crash isn't involved, the commit or abort record is that
signal. You could test and find out how it happens after a master crash with a
procedure like this:

1. Start a master and standby on the same machine.
2. Connect to master; CREATE TABLE t(); BEGIN; ALTER TABLE t ADD c int;
3. kill -9 -`head -n1 $master_PGDATA/postmaster.pid`
4. Connect to standby and confirm that t is still locked.
5. Attach debugger to standby startup process and set breakpoints on
StandbyReleaseLocks and StandbyReleaseLocksMany.
6. Restart master.

When you promote the standby, though,

ShutdownRecoveryTransactionEnvironment()

releases the locks.

Ok; then the problem in the UNLOGGED -> normal conversion is that:

1) the master and the standby acquire a lock on the table
2) the master starts sending the pages to the standby
3) the master crashes before committing

up until here no problems, as the standby still has the lock on the
table.

Correct.

4) when the master restarts, it removes all the fork for rels with INIT forks;
are those "deletes" sent to the standby? And, if yes,
would those be replayed by the standby *before* releasing the lock?
If the answer is "yes", then I don't think we have a problem... but I think
that at the moment those unlogged-table-forks deletes aren't sent at all.

I think you are correct that they are not currently WAL-logged.

(When promoting the standby, we could have
ShutdownRecoveryTransactionEnvironment() remove all the fork for rels
with INIT forks before releasing the locks)

Makes sense.

nm

#26Leonardo Francalanci
m_lists@yahoo.it
In reply to: Noah Misch (#25)
Re: switch UNLOGGED to LOGGED

I'd guess some WAL record arising from the post-crash master restart makes

the

standby do so. When a crash isn't involved, the commit or abort record is
that
signal. You could test and find out how it happens after a master crash with
a
procedure like this:

1. Start a master and standby on the same machine.
2. Connect to master; CREATE TABLE t(); BEGIN; ALTER TABLE t ADD c int;
3. kill -9 -`head -n1 $master_PGDATA/postmaster.pid`
4. Connect to standby and confirm that t is still locked.
5. Attach debugger to standby startup process and set breakpoints on
StandbyReleaseLocks and StandbyReleaseLocksMany.
6. Restart master.

Well yes, based on the test the stack is something like:

StandbyReleaseLocksManyStandbyReleaseOldLocks
ProcArrayApplyRecoveryInfo
xlog_redo

It's not very clear to me what ProcArrayApplyRecoveryInfo does (not too
familiar with the standby part I guess) but I see it's called by xlog_redo in
the "info == XLOG_CHECKPOINT_SHUTDOWN" case and by StartupXLOG.

But I don't know if calling ResetUnloggedRelations before
the call to ProcArrayApplyRecoveryInfo in xlog_redo makes sense...
if it makes sense, it would solve the problem of the stray files in
the master crashing case I guess?

When you promote the standby, though,

ShutdownRecoveryTransactionEnvironment()

releases the locks.

If I understand the code, ResetUnloggedRelations is called before
ShutdownRecoveryTransactionEnvironment, so that part shouldn't be
an issue

Leonardo

#27Noah Misch
noah@leadboat.com
In reply to: Leonardo Francalanci (#26)
Re: switch UNLOGGED to LOGGED

On Thu, May 19, 2011 at 01:52:46PM +0100, Leonardo Francalanci wrote:

I'd guess some WAL record arising from the post-crash master restart makes

the

standby do so. When a crash isn't involved, the commit or abort record is
that
signal. You could test and find out how it happens after a master crash with
a
procedure like this:

1. Start a master and standby on the same machine.
2. Connect to master; CREATE TABLE t(); BEGIN; ALTER TABLE t ADD c int;
3. kill -9 -`head -n1 $master_PGDATA/postmaster.pid`
4. Connect to standby and confirm that t is still locked.
5. Attach debugger to standby startup process and set breakpoints on
StandbyReleaseLocks and StandbyReleaseLocksMany.
6. Restart master.

Well yes, based on the test the stack is something like:

StandbyReleaseLocksMany
StandbyReleaseOldLocks
ProcArrayApplyRecoveryInfo
xlog_redo

It's not very clear to me what ProcArrayApplyRecoveryInfo does (not too
familiar with the standby part I guess) but I see it's called by xlog_redo in
the "info == XLOG_CHECKPOINT_SHUTDOWN" case and by StartupXLOG.

But I don't know if calling ResetUnloggedRelations before
the call to ProcArrayApplyRecoveryInfo in xlog_redo makes sense...
if it makes sense, it would solve the problem of the stray files in
the master crashing case I guess?

It would solve the problem, but it would mean resetting unlogged relations on
the standby at every shutdown checkpoint. That's probably not a performance
problem, but it is a hack. Offhand, I'd add a new smgr WAL record issued by
ResetUnloggedRelations() when called with UNLOGGED_RELATION_CLEANUP. Another,
simpler, idea is to split XLOG_CHECKPOINT_SHUTDOWN into XLOG_CHECKPOINT_SHUTDOWN
and XLOG_CHECKPOINT_END_OF_RECOVERY, mirroring CreateCheckPoint()'s distinction.
(Given that I regularly lack good taste, you might want to wait for other people
to weigh in before spending too much time on that.)

When you promote the standby, though,

ShutdownRecoveryTransactionEnvironment()

releases the locks.

If I understand the code, ResetUnloggedRelations is called before
ShutdownRecoveryTransactionEnvironment, so that part shouldn't be
an issue

Seems correct.

nm

#28Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#27)
Re: switch UNLOGGED to LOGGED

On Thu, May 19, 2011 at 11:13 AM, Noah Misch <noah@leadboat.com> wrote:

It would solve the problem, but it would mean resetting unlogged relations on
the standby at every shutdown checkpoint.  That's probably not a performance
problem, but it is a hack.

I haven't thought about this too deeply, but I'm not sure I agree
that's a hack. Why do you think it is?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#29Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#28)
Re: switch UNLOGGED to LOGGED

On Thu, May 19, 2011 at 11:42:03AM -0400, Robert Haas wrote:

On Thu, May 19, 2011 at 11:13 AM, Noah Misch <noah@leadboat.com> wrote:

It would solve the problem, but it would mean resetting unlogged relations on
the standby at every shutdown checkpoint. ?That's probably not a performance
problem, but it is a hack.

I haven't thought about this too deeply, but I'm not sure I agree
that's a hack. Why do you think it is?

It would make the standby reset unlogged relations on both regular shutdowns and
crashes, while the master only does so on crashes. This creates no functional
hazard since unlogged relation contents are never accessible during hot standby.
It seems like a hack to rely on that fact at any distance, but perhaps a comment
is enough to assuage that.

nm

#30Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#29)
Re: switch UNLOGGED to LOGGED

On Thu, May 19, 2011 at 3:24 PM, Noah Misch <noah@leadboat.com> wrote:

On Thu, May 19, 2011 at 11:42:03AM -0400, Robert Haas wrote:

On Thu, May 19, 2011 at 11:13 AM, Noah Misch <noah@leadboat.com> wrote:

It would solve the problem, but it would mean resetting unlogged relations on
the standby at every shutdown checkpoint. ?That's probably not a performance
problem, but it is a hack.

I haven't thought about this too deeply, but I'm not sure I agree
that's a hack.  Why do you think it is?

It would make the standby reset unlogged relations on both regular shutdowns and
crashes, while the master only does so on crashes.  This creates no functional
hazard since unlogged relation contents are never accessible during hot standby.
It seems like a hack to rely on that fact at any distance, but perhaps a comment
is enough to assuage that.

I think I'd be more comfortable with that route if it seems like it'll
work. Whacking around the recovery code always makes me a little
nervous about bugs, since it's easy to fail to notice the problem
until something Bad happens.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#31Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#30)
Re: switch UNLOGGED to LOGGED

On Thu, May 19, 2011 at 03:53:12PM -0400, Robert Haas wrote:

On Thu, May 19, 2011 at 3:24 PM, Noah Misch <noah@leadboat.com> wrote:

On Thu, May 19, 2011 at 11:42:03AM -0400, Robert Haas wrote:

On Thu, May 19, 2011 at 11:13 AM, Noah Misch <noah@leadboat.com> wrote:

It would solve the problem, but it would mean resetting unlogged relations on
the standby at every shutdown checkpoint. ?That's probably not a performance
problem, but it is a hack.

I haven't thought about this too deeply, but I'm not sure I agree
that's a hack. ?Why do you think it is?

It would make the standby reset unlogged relations on both regular shutdowns and
crashes, while the master only does so on crashes. ?This creates no functional
hazard since unlogged relation contents are never accessible during hot standby.
It seems like a hack to rely on that fact at any distance, but perhaps a comment
is enough to assuage that.

I think I'd be more comfortable with that route if it seems like it'll
work. Whacking around the recovery code always makes me a little
nervous about bugs, since it's easy to fail to notice the problem
until something Bad happens.

No remaining objection from me, then. Thanks for reviewing.

#32Leonardo Francalanci
m_lists@yahoo.it
In reply to: Robert Haas (#30)
Re: switch UNLOGGED to LOGGED

I'll try to sum up what I understood:

1) the standby keeps the lock, so no problem with
stray files coming from the unlogged->logged log
reply, as the table can't be read during the operation

2) calling ResetUnloggedRelations before
ProcArrayApplyRecoveryInfo would remove the problem
of the stray files on the standby in case of master crash
before commit/abort

3) promoting the standby shouldn't be an issue,
since ResetUnloggedRelations is already called in
ShutdownRecoveryTransactionEnvironment

Now, to move forward, some questions:

- the patch is missing the "send all table pages to the
standby" part; is there some code I can use as base?
I guess I have to generate some special log type that
is only "played" by standby servers.

- on the standby, the commit part should be played as it
is on the master (that is, removing the INIT fork).
The abort case is different though: it would mean
doing nothing on the master, while removing every forks
but the INIT fork on the standby.
Would it be ok to add to xl_xact_abort a new array of
RelFileNode(s), where for each one at abort all the forks,
except the init fork, have to be deleted by the standby
(while the master shouldn't do anything with them)?
I bet there's a cleaner solution...

Leonardo

#33Leonardo Francalanci
m_lists@yahoo.it
In reply to: Leonardo Francalanci (#32)
Re: switch UNLOGGED to LOGGED

- the patch is missing the "send all table pages to the
standby" part; is there some code I can use as base?
I guess I have to generate some special log type that
is only "played" by standby servers.

Maybe I could use log_newpage, but instead of
XLOG_HEAP_NEWPAGE I could use something like
XLOG_HEAP_COPYPAGE; and in heap_redo, in the
XLOG_HEAP_COPYPAGE case, call heap_xlog_newpage
only in case we're in standby...

#34Noah Misch
noah@leadboat.com
In reply to: Leonardo Francalanci (#32)
Re: switch UNLOGGED to LOGGED

On Fri, May 20, 2011 at 09:37:20AM +0100, Leonardo Francalanci wrote:

I'll try to sum up what I understood:

1) the standby keeps the lock, so no problem with
stray files coming from the unlogged->logged log
reply, as the table can't be read during the operation

2) calling ResetUnloggedRelations before
ProcArrayApplyRecoveryInfo would remove the problem
of the stray files on the standby in case of master crash
before commit/abort

3) promoting the standby shouldn't be an issue,
since ResetUnloggedRelations is already called in
ShutdownRecoveryTransactionEnvironment

All correct, as far as I can tell.

Now, to move forward, some questions:

- the patch is missing the "send all table pages to the
standby" part; is there some code I can use as base?

Nothing comes to mind as especially similar.

I guess I have to generate some special log type that
is only "played" by standby servers.

What you described in your followup mail seemed reasonable.

- on the standby, the commit part should be played as it
is on the master (that is, removing the INIT fork).
The abort case is different though: it would mean
doing nothing on the master, while removing every forks
but the INIT fork on the standby.
Would it be ok to add to xl_xact_abort a new array of
RelFileNode(s), where for each one at abort all the forks,
except the init fork, have to be deleted by the standby
(while the master shouldn't do anything with them)?
I bet there's a cleaner solution...

Your "use less space in xl_xact_commit patch" seems to be going in a good
direction here. It would probably also be okay to do a ResetUnloggedRelations()
on the standby at every abort of a transaction that had started an UNLOGGED ->
LOGGED conversion. That is, just a flag might be enough.

nm

#35Leonardo Francalanci
m_lists@yahoo.it
In reply to: Noah Misch (#34)
Re: switch UNLOGGED to LOGGED

From: Noah Misch <noah@leadboat.com>

- the patch is missing the "send all table pages to the
standby" part; is there some code I can use as base?

Nothing comes to mind as especially similar.

I guess I have to generate some special log type that
is only "played" by standby servers.

What you described in your followup mail seemed reasonable.

So, it's ok to have a log item that is replayed only if

WalRcvInProgress()

is true?

Is it a correct approach? I couldn't find any other way to
find out if we are in a standby or a master...

- on the standby, the commit part should be played as it
is on the master (that is, removing the INIT fork).
The abort case is different though: it would mean
doing nothing on the master, while removing every forks
but the INIT fork on the standby.
Would it be ok to add to xl_xact_abort a new array of
RelFileNode(s), where for each one at abort all the forks,
except the init fork, have to be deleted by the standby
(while the master shouldn't do anything with them)?
I bet there's a cleaner solution...

Your "use less space in xl_xact_commit patch" seems to be going in a good
direction here. It would probably also be okay to do a
ResetUnloggedRelations()
on the standby at every abort of a transaction that had started an UNLOGGED

->

LOGGED conversion. That is, just a flag might be enough.

ok, but that would mean that a transaction that aborts a conversion
would try to reset all unlogged relations (traversing all the FS)...
I don't know if that's acceptable performance-wise.

#36Noah Misch
noah@leadboat.com
In reply to: Leonardo Francalanci (#35)
Re: switch UNLOGGED to LOGGED

On Fri, May 27, 2011 at 10:49:13AM +0100, Leonardo Francalanci wrote:

From: Noah Misch <noah@leadboat.com>

- the patch is missing the "send all table pages to the
standby" part; is there some code I can use as base?

Nothing comes to mind as especially similar.

I guess I have to generate some special log type that
is only "played" by standby servers.

What you described in your followup mail seemed reasonable.

So, it's ok to have a log item that is replayed only if

WalRcvInProgress()

is true?

No, that checks for WAL streaming in particular. A log-shipping standby needs
the same treatment.

Is it a correct approach? I couldn't find any other way to
find out if we are in a standby or a master...

InArchiveRecovery looks like the right thing, but it's currently static to
xlog.c. Perhaps exporting that is the way to go.

- on the standby, the commit part should be played as it
is on the master (that is, removing the INIT fork).
The abort case is different though: it would mean
doing nothing on the master, while removing every forks
but the INIT fork on the standby.
Would it be ok to add to xl_xact_abort a new array of
RelFileNode(s), where for each one at abort all the forks,
except the init fork, have to be deleted by the standby
(while the master shouldn't do anything with them)?
I bet there's a cleaner solution...

Your "use less space in xl_xact_commit patch" seems to be going in a good
direction here. It would probably also be okay to do a
ResetUnloggedRelations()
on the standby at every abort of a transaction that had started an UNLOGGED

->

LOGGED conversion. That is, just a flag might be enough.

ok, but that would mean that a transaction that aborts a conversion
would try to reset all unlogged relations (traversing all the FS)...
I don't know if that's acceptable performance-wise.

I'm not sure, either, but I don't figure such operations will be at all common.

nm

#37Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#36)
Re: switch UNLOGGED to LOGGED

On Fri, May 27, 2011 at 6:19 AM, Noah Misch <noah@leadboat.com> wrote:

So, it's ok to have a log item that is replayed only if

WalRcvInProgress()

is true?

No, that checks for WAL streaming in particular.  A log-shipping standby needs
the same treatment.

Is it a correct approach? I couldn't find any other way to
find out if we are in a standby or a master...

InArchiveRecovery looks like the right thing, but it's currently static to
xlog.c.  Perhaps exporting that is the way to go.

Why is it necessary to replay the operation only on the slave? Can we
just use XLOG_HEAP_NEWPAGE?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#38Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#37)
Re: switch UNLOGGED to LOGGED

On Sat, May 28, 2011 at 09:33:09PM -0400, Robert Haas wrote:

On Fri, May 27, 2011 at 6:19 AM, Noah Misch <noah@leadboat.com> wrote:

So, it's ok to have a log item that is replayed only if

WalRcvInProgress()

is true?

No, that checks for WAL streaming in particular. ?A log-shipping standby needs
the same treatment.

Is it a correct approach? I couldn't find any other way to
find out if we are in a standby or a master...

InArchiveRecovery looks like the right thing, but it's currently static to
xlog.c. ?Perhaps exporting that is the way to go.

Why is it necessary to replay the operation only on the slave? Can we
just use XLOG_HEAP_NEWPAGE?

I don't think it is *necessary*. If we're replaying WAL on a master, we'll also
be resetting unlogged relations after recovery; what we write or do not write to
them in the mean time has no functional impact. Seemed like a sensible
optimization, but maybe it's premature.

nm

#39Leonardo Francalanci
m_lists@yahoo.it
In reply to: Robert Haas (#37)
Re: switch UNLOGGED to LOGGED

Why is it necessary to replay the operation only on the slave? Can we
just use XLOG_HEAP_NEWPAGE?

Uh, I don't know why but I thought I shouldn't log a page on the master,
since all the pages are already there and fsync-ed. But if it makes no harm,
I can easily use XLOG_HEAP_NEWPAGE (of course, only in the
wal_level != minimal case).

#40Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#38)
Re: switch UNLOGGED to LOGGED

On Sun, May 29, 2011 at 4:29 AM, Noah Misch <noah@leadboat.com> wrote:

I don't think it is *necessary*.  If we're replaying WAL on a master, we'll also
be resetting unlogged relations after recovery; what we write or do not write to
them in the mean time has no functional impact.  Seemed like a sensible
optimization, but maybe it's premature.

Some jiggering may be necessary, because right now we remove the main
forks at the start of recovery and repopulate them at the end. It's
not immediately obvious to me that that's going to work well with
HEAP_XLOG_NEWPAGE, but then it's not immediately obvious to me that
it's going to work well with a new WAL record type, either. I think
we need a detailed design document for how this is all going to work.
We need to not only handle the master properly but also handle the
slave properly. Consider, for example, the case where the slave
begins to replay the transaction, reaches a restartpoint after
replaying some of the new pages, and then crashes. If the subsequent
restart from the restartpoint blows away the main relation fork, we're
hosed. I fear we're plunging into implementation details without
having a good overall design in mind first.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#41Leonardo Francalanci
m_lists@yahoo.it
In reply to: Robert Haas (#40)
Re: switch UNLOGGED to LOGGED

I think
we need a detailed design document for how this is all going to work.
We need to not only handle the master properly but also handle the
slave properly. Consider, for example, the case where the slave
begins to replay the transaction, reaches a restartpoint after
replaying some of the new pages, and then crashes. If the subsequent
restart from the restartpoint blows away the main relation fork, we're
hosed. I fear we're plunging into implementation details without
having a good overall design in mind first.

As I said in my first post, I'm basing the patch on the post:

http://archives.postgresql.org/pgsql-hackers/2011-01/msg00315.php

So I assumed the design was ok (except for the "stray file around
on a standby" case, which has been discussed earlier on this thread).

If there are things to be discussed/analyzed (I guess the restart point
thing is one of those) we can do it... but I thought that the whole
design was somehow in place

Leonardo

#42Robert Haas
robertmhaas@gmail.com
In reply to: Leonardo Francalanci (#41)
Re: switch UNLOGGED to LOGGED

On Tue, May 31, 2011 at 3:39 AM, Leonardo Francalanci <m_lists@yahoo.it> wrote:

I think
we need a detailed design document for how  this is all going to work.
We need to not only handle the master properly but  also handle the
slave properly.  Consider, for example, the case where  the slave
begins to replay the transaction, reaches a restartpoint  after
replaying some of the new pages, and then crashes.  If the  subsequent
restart from the restartpoint blows away the main relation fork,  we're
hosed.  I fear we're plunging into implementation details  without
having a good overall design in mind first.

As I said in my first post, I'm basing the patch on the post:

http://archives.postgresql.org/pgsql-hackers/2011-01/msg00315.php

So I assumed the design was ok (except for the "stray file around
on a standby" case, which has been discussed earlier on this thread).

Well, I sort of assumed the design was OK, too, but the more we talk
about this WAL-logging stuff, the less convinced I am that I really
understand the problem. :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#43Leonardo Francalanci
m_lists@yahoo.it
In reply to: Robert Haas (#42)
Re: switch UNLOGGED to LOGGED

Well, I sort of assumed the design was OK, too, but the more we talk
about this WAL-logging stuff, the less convinced I am that I really
understand the problem. :-(

I see. In fact, I think nobody thought about restart points...

To sum up:

1) everything seems ok when in the wal_level = minimal
case. In this case, we fsync everything and at transaction commit
we remove the init fork; in case of a crash, we don't reply
anything (as nothing has been written to the log), and we
remove the main fork as we do now.

2) in the wal_level != minimal case things become more
complicated: if the standby reaches a restart point
and then crashes we are in trouble: it would remove the
main fork at startup, and would reply only a portion of
the table.
I guess the same applies to the master too? I mean:
if we log HEAP_XLOG_NEWPAGEs, reach a checkpoint,
and then crash, at server restart the main fork would be
deleted, and the pages logged on the log couldn't be
replayed. But the problem on the master can be removed
using another type of log instead of HEAP_XLOG_NEWPAGE
(to be replayed by the standbys only).

Is this analysis correct?

Leonardo

#44Robert Haas
robertmhaas@gmail.com
In reply to: Leonardo Francalanci (#43)
Re: switch UNLOGGED to LOGGED

On Tue, May 31, 2011 at 12:25 PM, Leonardo Francalanci <m_lists@yahoo.it> wrote:

Well, I  sort of assumed the design was OK, too, but the more we talk
about this  WAL-logging stuff, the less convinced I am that I really
understand the  problem.  :-(

I see. In fact, I think nobody thought about restart points...

To sum up:

1) everything seems ok when in the wal_level = minimal
case. In this case, we fsync everything and at transaction commit
we remove the init fork; in case of a crash, we don't reply
anything (as nothing has been written to the log), and we
remove the main fork as we do now.

Yeah, that seems like it should work.

2) in the   wal_level != minimal case things become more
complicated: if the standby reaches a restart point
and then crashes we are in trouble: it would remove the
main fork at startup, and would reply only a portion of
the table.
I guess the same applies to the master too? I mean:
if we log   HEAP_XLOG_NEWPAGEs, reach a checkpoint,
and then crash, at server restart the main fork would be
deleted, and the pages logged on the log couldn't be
replayed. But the problem on the master can be removed
using another type of log instead of   HEAP_XLOG_NEWPAGE
(to be replayed by the standbys only).

I think that's about right, except that I feel we're missing some
trick here that's needed to make all this work out nicely. Somehow we
need to maintain some state that an unlogged->logged conversion is in
progress; that state needs to then get cleaned up at commit or abort
time (including implicit abort).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company