WAL format and API changes (9.5)

Started by Heikki Linnakangasover 11 years ago106 messages
#1Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com

I'd like to do some changes to the WAL format in 9.5. I want to annotate
each WAL record with the blocks that they modify. Every WAL record
already includes that information, but it's done in an ad hoc way,
differently in every rmgr. The RelFileNode and block number are
currently part of the WAL payload, and it's the REDO routine's
responsibility to extract it. I want to include that information in a
common format for every WAL record type.

That makes life a lot easier for tools that are interested in knowing
which blocks a WAL record modifies. One such tool is pg_rewind; it
currently has to understand every WAL record the backend writes. There's
also a tool out there called pg_readahead, which does prefetching of
blocks accessed by WAL records, to speed up PITR. I don't think that
tool has been actively maintained, but at least part of the reason for
that is probably that it's a pain to maintain when it has to understand
the details of every WAL record type.

It'd also be nice for contrib/pg_xlogdump and backend code itself. The
boilerplate code in all WAL redo routines, and writing WAL records,
could be simplified.

So, here's my proposal:

Insertion
---------

The big change in creating WAL records is that the buffers involved in
the WAL-logged operation are explicitly registered, by calling a new
XLogRegisterBuffer function. Currently, buffers that need full-page
images are registered by including them in the XLogRecData chain, but
with the new system, you call the XLogRegisterBuffer() function instead.
And you call that function for every buffer involved, even if no
full-page image needs to be taken, e.g because the page is going to be
recreated from scratch at replay.

It is no longer necessary to include the RelFileNode and BlockNumber of
the modified pages in the WAL payload. That information is automatically
included in the WAL record, when XLogRegisterBuffer is called.

Currently, the backup blocks are implicitly numbered, in the order the
buffers appear in XLogRecData entries. With the new API, the blocks are
numbered explicitly. This is more convenient when a WAL record sometimes
modifies a buffer and sometimes not. For example, a B-tree split needs
to modify four pages: the original page, the new page, the right sibling
(unless it's the rightmost page) and if it's an internal page, the page
at the lower level whose split the insertion completes. So there are two
pages that are sometimes missing from the record. With the new API, you
can nevertheless always register e.g. original page as buffer 0, new
page as 1, right sibling as 2, even if some of them are actually
missing. SP-GiST contains even more complicated examples of that.

The new XLogRegisterBuffer would look like this:

void XLogRegisterBuffer(int blockref_id, Buffer buffer, bool buffer_std)

blockref_id: An arbitrary ID given to this block reference. It is used
in the redo routine to open/restore the same block.
buffer: the buffer involved
buffer_std: is the page in "standard" page layout?

That's for the normal cases. We'll need a couple of variants for also
registering buffers that don't need full-page images, and perhaps also a
function for registering a page that *always* needs a full-page image,
regardless of the LSN. A few existing WAL record types just WAL-log the
whole page, so those ad-hoc full-page images could be replaced with this.

With these changes, a typical WAL insertion would look like this:

/* register the buffer with the WAL record, with ID 0 */
XLogRegisterBuffer(0, buf, true);

rdata[0].data = (char *) &xlrec;
rdata[0].len = sizeof(BlahRecord);
rdata[0].buffer_id = -1; /* -1 means the data is always included */
rdata[0].next = &(rdata[1]);

rdata[1].data = (char *) mydata;
rdata[1].len = mydatalen;
rdata[1].buffer_id = 0; /* 0 here refers to the buffer registered above */
rdata[1].next = NULL

...
recptr = XLogInsert(RM_BLAH_ID, xlinfo, rdata);

PageSetLSN(buf, recptr);

(While we're at it, perhaps we should let XLogInsert set the LSN of all
the registered buffers, to reduce the amount of boilerplate code).

(Instead of using a new XLogRegisterBuffer() function to register the
buffers, perhaps they should be passed to XLogInsert as a separate list
or array. I'm not wedded on the details...)

Redo
----

There are four different states a block referenced by a typical WAL
record can be in:

1. The old page does not exist at all (because the relation was
truncated later)
2. The old page exists, but has an LSN higher than current WAL record,
so it doesn't need replaying.
3. The LSN is < current WAL record, so it needs to be replayed.
4. The WAL record contains a full-page image, which needs to be restored.

With the current API, that leads to a long boilerplate:

/* If we have a full-page image, restore it and we're done */
if (HasBackupBlock(record, 0))
{
(void) RestoreBackupBlock(lsn, record, 0, false, false);
return;
}
buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
/* If the page was truncated away, we're done */
if (!BufferIsValid(buffer))
return;

page = (Page) BufferGetPage(buffer);

/* Has this record already been replayed? */
if (lsn <= PageGetLSN(page))
{
UnlockReleaseBuffer(buffer);
return;
}

/* Modify the page */
...

PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
UnlockReleaseBuffer(buffer);

Let's simplify that, and have one new function, XLogOpenBuffer, which
returns a return code that indicates which of the four cases we're
dealing with. A typical redo function looks like this:

if (XLogOpenBuffer(0, &buffer) == BLK_REPLAY)
{
/* Modify the page */
...

PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
if (BufferIsValid(buffer))
UnlockReleaseBuffer(buffer);

The '0' in the XLogOpenBuffer call is the ID of the block reference
specified in the XLogRegisterBuffer call, when the WAL record was created.

WAL format
----------

The registered block references need to be included in the WAL record.
We already do that for backup blocks, so a naive implementation would be
to just include a BkpBlock struct for all the block references, even
those that don't need a full-page image. That would be rather bulky,
though, so that needs some optimization. Shouldn't be difficult to omit
duplicated/unnecessary information, and add a flags field indicating
which fields are present. Overall, I don't expect there to be any big
difference in the amount of WAL generated by a typical application.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#1)
Re: WAL format and API changes (9.5)

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

The big change in creating WAL records is that the buffers involved in
the WAL-logged operation are explicitly registered, by calling a new
XLogRegisterBuffer function.

Seems reasonable, especially if we can make the buffer numbering business
less error-prone.

void XLogRegisterBuffer(int blockref_id, Buffer buffer, bool buffer_std)

blockref_id: An arbitrary ID given to this block reference. It is used
in the redo routine to open/restore the same block.
buffer: the buffer involved
buffer_std: is the page in "standard" page layout?

That's for the normal cases. We'll need a couple of variants for also
registering buffers that don't need full-page images, and perhaps also a
function for registering a page that *always* needs a full-page image,
regardless of the LSN. A few existing WAL record types just WAL-log the
whole page, so those ad-hoc full-page images could be replaced with this.

Why not just one function with an additional flags argument?

Also, IIRC there are places that WAL-log full pages that aren't in a
shared buffer at all (btree build does this I think). How will that fit
into this model?

(While we're at it, perhaps we should let XLogInsert set the LSN of all
the registered buffers, to reduce the amount of boilerplate code).

Yeah, maybe so. I'm not sure why that was separate to begin with.

Let's simplify that, and have one new function, XLogOpenBuffer, which
returns a return code that indicates which of the four cases we're
dealing with. A typical redo function looks like this:

if (XLogOpenBuffer(0, &buffer) == BLK_REPLAY)
{
/* Modify the page */
...

PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
if (BufferIsValid(buffer))
UnlockReleaseBuffer(buffer);

The '0' in the XLogOpenBuffer call is the ID of the block reference
specified in the XLogRegisterBuffer call, when the WAL record was created.

+1, but one important step here is finding the data to be replayed.
That is, a large part of the complexity of replay routines has to do
with figuring out which parts of the WAL record were elided due to
full-page-images, and locating the remaining parts. What can we do
to make that simpler?

Ideally, if XLogOpenBuffer (bad name BTW) returns BLK_REPLAY, it would
also calculate and hand back the address/size of the logged data that
had been pointed to by the associated XLogRecData chain item. The
trouble here is that there might've been multiple XLogRecData items
pointing to the same buffer. Perhaps the magic ID number you give to
XLogOpenBuffer should be thought of as identifying an XLogRecData chain
item, not so much a buffer? It's fairly easy to see what to do when
there's just one chain item per buffer, but I'm not sure what to do
if there's more than one.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Tom Lane (#2)
Re: WAL format and API changes (9.5)

On 04/03/2014 06:37 PM, Tom Lane wrote:

Also, IIRC there are places that WAL-log full pages that aren't in a
shared buffer at all (btree build does this I think). How will that fit
into this model?

Hmm. We could provide a function for registering a block with given
content, without a Buffer. Something like:

XLogRegisterPage(int id, RelFileNode, BlockNumber, Page)

Let's simplify that, and have one new function, XLogOpenBuffer, which
returns a return code that indicates which of the four cases we're
dealing with. A typical redo function looks like this:

if (XLogOpenBuffer(0, &buffer) == BLK_REPLAY)
{
/* Modify the page */
...

PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
if (BufferIsValid(buffer))
UnlockReleaseBuffer(buffer);

The '0' in the XLogOpenBuffer call is the ID of the block reference
specified in the XLogRegisterBuffer call, when the WAL record was created.

+1, but one important step here is finding the data to be replayed.
That is, a large part of the complexity of replay routines has to do
with figuring out which parts of the WAL record were elided due to
full-page-images, and locating the remaining parts. What can we do
to make that simpler?

We can certainly add more structure to the WAL records, but any extra
information you add will make the records larger. It might be worth it,
and would be lost in the noise for more complex records like page
splits, but we should keep frequently-used records like heap insertions
as lean as possible.

Ideally, if XLogOpenBuffer (bad name BTW) returns BLK_REPLAY, it would
also calculate and hand back the address/size of the logged data that
had been pointed to by the associated XLogRecData chain item. The
trouble here is that there might've been multiple XLogRecData items
pointing to the same buffer. Perhaps the magic ID number you give to
XLogOpenBuffer should be thought of as identifying an XLogRecData chain
item, not so much a buffer? It's fairly easy to see what to do when
there's just one chain item per buffer, but I'm not sure what to do
if there's more than one.

Hmm. You could register a separate XLogRecData chain for each buffer.
Along the lines of:

rdata[0].data = data for buffer
rdata[0].len = ...
rdata[0].next = &rdata[1];
rdata[1].data = more data for same buffer
rdata[1].len = ...
rdata[2].next = NULL;

XLogRegisterBuffer(0, buffer, &data[0]);

At replay:

if (XLogOpenBuffer(0, &buffer, &xldata, &len) == BLK_REPLAY)
{
/* xldata points to the data registered for this buffer */
}

Plus one more chain for the data not associated with a buffer.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#3)
Re: WAL format and API changes (9.5)

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

On 04/03/2014 06:37 PM, Tom Lane wrote:

+1, but one important step here is finding the data to be replayed.
That is, a large part of the complexity of replay routines has to do
with figuring out which parts of the WAL record were elided due to
full-page-images, and locating the remaining parts. What can we do
to make that simpler?

We can certainly add more structure to the WAL records, but any extra
information you add will make the records larger. It might be worth it,
and would be lost in the noise for more complex records like page
splits, but we should keep frequently-used records like heap insertions
as lean as possible.

Sure, but in simple WAL records there would just be a single data chunk
that would be present in the normal case and not present in the FPI case.
Seems like we ought to be able to handle that degenerate case with no
significant wastage (probably just a flag bit someplace).

More generally, I'm pretty sure that your proposal is already going to
involve some small growth of WAL records compared to today, but I think
that's probably all right; the benefits seem significant.

Hmm. You could register a separate XLogRecData chain for each buffer.
Plus one more chain for the data not associated with a buffer.

That would probably work.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Tom Lane (#4)
Re: WAL format and API changes (9.5)

On 04/03/2014 07:11 PM, Tom Lane wrote:

More generally, I'm pretty sure that your proposal is already going to
involve some small growth of WAL records compared to today,

Quite possible.

but I think
that's probably all right; the benefits seem significant.

Yep.

OTOH, once we store the relfilenode+block in a common format, we can
then try to optimize that format more heavily. Just as an example, omit
the tablespace oid in the RelFileNode, when it's the default tablespace
(with a flag bit indicating we did that). Or use a variable-length
endoding for the block number, on the assumption that smaller numbers
are more common. Probably not be worth the extra complexity, but we can
easily experiment with that kind of stuff once we have the
infrastructure in place.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#1)
Re: WAL format and API changes (9.5)

On Thu, Apr 3, 2014 at 10:14 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

(Instead of using a new XLogRegisterBuffer() function to register the
buffers, perhaps they should be passed to XLogInsert as a separate list or
array. I'm not wedded on the details...)

That would have the advantage of avoiding the function-call overhead,
which seems appealing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#6)
Re: WAL format and API changes (9.5)

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Apr 3, 2014 at 10:14 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

(Instead of using a new XLogRegisterBuffer() function to register the
buffers, perhaps they should be passed to XLogInsert as a separate list or
array. I'm not wedded on the details...)

That would have the advantage of avoiding the function-call overhead,
which seems appealing.

You're kidding, right? One function call per buffer touched by an update
is going to be lost in the noise.

A somewhat more relevant concern is where are we going to keep the state
data involved in all this. Since this code is, by definition, going to be
called in critical sections, any solution involving internal pallocs is
right out. The existing arrangement keeps all its data in local variables
of the function doing the update, which is probably a nice property to
preserve if we can manage it. That might force us to do it as above.

I'd still like something that *looks* more like a list of function calls,
though, even if they're macros under the hood. The existing approach to
setting up the XLogRecData chains is awfully prone to bugs of omission.
There are a few places that went so far as to set up custom macros to help
with that. I'm not a fan of doing the same thing in randomly different
ways in different places; but if we did it that way uniformly, it might be
better/safer.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Andres Freund
Andres Freund
andres@2ndquadrant.com
In reply to: Tom Lane (#7)
Re: WAL format and API changes (9.5)

On 2014-04-03 19:08:27 -0400, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Apr 3, 2014 at 10:14 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

(Instead of using a new XLogRegisterBuffer() function to register the
buffers, perhaps they should be passed to XLogInsert as a separate list or
array. I'm not wedded on the details...)

That would have the advantage of avoiding the function-call overhead,
which seems appealing.

You're kidding, right? One function call per buffer touched by an update
is going to be lost in the noise.

A somewhat more relevant concern is where are we going to keep the state
data involved in all this. Since this code is, by definition, going to be
called in critical sections, any solution involving internal pallocs is
right out.

We actually already allocate memory in XLogInsert() :(, although only in
the first XLogInsert() a backend does... I didn't remember it, but I
complained about it before:
http://archives.postgresql.org/message-id/4FEB223A.3060707%40enterprisedb.com

I didn't realize the implications for it back then and thus didn't press
hard for a change, but I think it should be fixed.

The existing arrangement keeps all its data in local variables
of the function doing the update, which is probably a nice property to
preserve if we can manage it. That might force us to do it as above.

We could simply allocate enough scratch space for the state at process
startup. I don't think there'll ever be a need for XLogInsert() to be
reentrant, so that should be fine.

I'd still like something that *looks* more like a list of function calls,
though, even if they're macros under the hood. The existing approach to
setting up the XLogRecData chains is awfully prone to bugs of
omission.

Agreed. There's some pretty ugly code around this.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#8)
Re: WAL format and API changes (9.5)

Andres Freund <andres@2ndquadrant.com> writes:

On 2014-04-03 19:08:27 -0400, Tom Lane wrote:

A somewhat more relevant concern is where are we going to keep the state
data involved in all this. Since this code is, by definition, going to be
called in critical sections, any solution involving internal pallocs is
right out.

We actually already allocate memory in XLogInsert() :(, although only in
the first XLogInsert() a backend does...

Ouch. I wonder if we should put an Assert(not-in-critical-section)
into MemoryContextAlloc.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Andres Freund
Andres Freund
andres@2ndquadrant.com
In reply to: Tom Lane (#9)
Re: WAL format and API changes (9.5)

On 2014-04-03 19:33:12 -0400, Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

On 2014-04-03 19:08:27 -0400, Tom Lane wrote:

A somewhat more relevant concern is where are we going to keep the state
data involved in all this. Since this code is, by definition, going to be
called in critical sections, any solution involving internal pallocs is
right out.

We actually already allocate memory in XLogInsert() :(, although only in
the first XLogInsert() a backend does...

Ouch. I wonder if we should put an Assert(not-in-critical-section)
into MemoryContextAlloc.

XLogInsert() is using malloc() directly, so that wouldn't detect this
case...

It's not a bad idea tho. I wonder how far the regression tests
get...

Not even through initdb. GetVirtualXIDsDelayingChkpt() is to blame.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#10)
1 attachment(s)
Re: WAL format and API changes (9.5)

On 04/04/2014 02:40 AM, Andres Freund wrote:

On 2014-04-03 19:33:12 -0400, Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

On 2014-04-03 19:08:27 -0400, Tom Lane wrote:

A somewhat more relevant concern is where are we going to keep the state
data involved in all this. Since this code is, by definition, going to be
called in critical sections, any solution involving internal pallocs is
right out.

We actually already allocate memory in XLogInsert() :(, although only in
the first XLogInsert() a backend does...

Ouch. I wonder if we should put an Assert(not-in-critical-section)
into MemoryContextAlloc.

Good idea!

XLogInsert() is using malloc() directly, so that wouldn't detect this
case...

It's not a bad idea tho. I wonder how far the regression tests
get...

Not even through initdb. GetVirtualXIDsDelayingChkpt() is to blame.

Hmm, the checkpointer process seems to ignore the rule, and just hopes
for the best. The allocation in GetVirtualXIDsDelayingChkpt() was
probably just an oversight, and that call could be moved outside the
critical section, but we also have this in AbsortFsyncRequests():

/*
* We have to PANIC if we fail to absorb all the pending requests (eg,
* because our hashtable runs out of memory). This is because the system
* cannot run safely if we are unable to fsync what we have been told to
* fsync. Fortunately, the hashtable is so small that the problem is
* quite unlikely to arise in practice.
*/
START_CRIT_SECTION();

Perhaps we could fix that by just calling fsync() directly if the
allocation fails, like we do if ForwardFsyncRequest() fails.

But if we give the checkpointer process a free pass, running the
regression tests with an assertion in AllocSetAlloc catches five genuine
bugs:

1. _bt_newroot
2. XLogFileInit
3. spgPageIndexMultiDelete
4. PageRepairFragmentation
5. PageIndexMultiDelete

I'll commit the attached patch to fix those shortly.

- Heikki

Attachments:

avoid-allocations-in-crit-section.patchtext/x-diff; name=avoid-allocations-in-crit-section.patch
#12Andres Freund
Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#11)
Re: WAL format and API changes (9.5)

Hi,

On 2014-04-04 10:48:32 +0300, Heikki Linnakangas wrote:

But if we give the checkpointer process a free pass, running the regression
tests with an assertion in AllocSetAlloc catches five genuine bugs:

1. _bt_newroot
2. XLogFileInit
3. spgPageIndexMultiDelete
4. PageRepairFragmentation
5. PageIndexMultiDelete

Some of those, like PageRepairFragmentation, are somewhat bad...

@@ -484,10 +483,11 @@ PageRepairFragmentation(Page page)
((PageHeader) page)->pd_upper = pd_special;
}
else
-	{							/* nstorage != 0 */
+	{
/* Need to compact the page the hard way */
-		itemidbase = (itemIdSort) palloc(sizeof(itemIdSortData) * nstorage);
-		itemidptr = itemidbase;
+		itemIdSortData itemidbase[MaxHeapTuplesPerPage];
+		itemIdSort	itemidptr = itemidbase;
+

That's a fair bit of stack, and it can be called somewhat deep on the
stack via heap_page_prune_opt(). I wonder if we ought to add a
check_stack_depth() somewhere.

Thanks,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#12)
Re: WAL format and API changes (9.5)

On 04/04/2014 11:41 AM, Andres Freund wrote:

On 2014-04-04 10:48:32 +0300, Heikki Linnakangas wrote:

@@ -484,10 +483,11 @@ PageRepairFragmentation(Page page)
((PageHeader) page)->pd_upper = pd_special;
}
else
-	{							/* nstorage != 0 */
+	{
/* Need to compact the page the hard way */
-		itemidbase = (itemIdSort) palloc(sizeof(itemIdSortData) * nstorage);
-		itemidptr = itemidbase;
+		itemIdSortData itemidbase[MaxHeapTuplesPerPage];
+		itemIdSort	itemidptr = itemidbase;
+

That's a fair bit of stack, and it can be called somewhat deep on the
stack via heap_page_prune_opt(). I wonder if we ought to add a
check_stack_depth() somewhere.

Hmm, on my 64-bit laptop, that array is 24*291=6984 bytes with 8k block
size. That's fairly large, but not unheard of; there are a few places
where we allocate a BLCKSZ-sized buffer from stack.

We could easily reduce the stack consumption here by changing itemIdSort
to use 16-bit ints; all the lengths and offsets that
PageRepairFragmentation deals with fit in 16-bits.

But overall I wouldn't worry about it. check_stack_depth() leaves a fair
amount of headroom: STACK_DEPTH_SLOP is 512kB. As long as we don't
recurse, that's plenty.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Andres Freund
Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#13)
Re: WAL format and API changes (9.5)

On 2014-04-04 12:50:25 +0300, Heikki Linnakangas wrote:

On 04/04/2014 11:41 AM, Andres Freund wrote:

On 2014-04-04 10:48:32 +0300, Heikki Linnakangas wrote:

@@ -484,10 +483,11 @@ PageRepairFragmentation(Page page)
((PageHeader) page)->pd_upper = pd_special;
}
else
-	{							/* nstorage != 0 */
+	{
/* Need to compact the page the hard way */
-		itemidbase = (itemIdSort) palloc(sizeof(itemIdSortData) * nstorage);
-		itemidptr = itemidbase;
+		itemIdSortData itemidbase[MaxHeapTuplesPerPage];
+		itemIdSort	itemidptr = itemidbase;
+

That's a fair bit of stack, and it can be called somewhat deep on the
stack via heap_page_prune_opt(). I wonder if we ought to add a
check_stack_depth() somewhere.

Hmm, on my 64-bit laptop, that array is 24*291=6984 bytes with 8k block
size. That's fairly large, but not unheard of; there are a few places where
we allocate a BLCKSZ-sized buffer from stack.

Yea, I am not complaing about using so much stack. Seems sensible here.

But overall I wouldn't worry about it. check_stack_depth() leaves a fair
amount of headroom: STACK_DEPTH_SLOP is 512kB. As long as we don't recurse,
that's plenty.

Well, there's no checks at nearby afair. That's why I was
wondering... But I don't have a big problem with not checking, I just
wanted to bring it up.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#14)
Allocations in critical section (was Re: WAL format and API changes (9.5))

Ok, I fixed the issues that the assertion fixed. I also committed a
patch to add the assertion itself; let's see if the buildfarm finds more
cases that violate the rule.

It ignores the checkpointer, because it's known to violate the rule, and
allocations in ErrorContext, which is used during error recovery, e.g if
you indeed PANIC while in a critical section for some other reason.

I didn't backpatch this. Although you shouldn't be running with
assertions enabled in production, it nevertheless seems too risky. There
might be some obscure cases where there is no real risk, e.g because the
current memory context always has enough free space because of a
previous pfree, and it doesn't seem worth tracking down and fixing such
issues in backbranches. You have to be pretty unlucky to run out of
memory in a critical section to begin with.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#15)
Re: Allocations in critical section (was Re: WAL format and API changes (9.5))

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

Ok, I fixed the issues that the assertion fixed. I also committed a
patch to add the assertion itself; let's see if the buildfarm finds more
cases that violate the rule.

It ignores the checkpointer, because it's known to violate the rule,

... uh, isn't that a bug to be fixed?

and
allocations in ErrorContext, which is used during error recovery, e.g if
you indeed PANIC while in a critical section for some other reason.

Yeah, I realized we'd have to do something about elog's own allocations.
Not sure if a blanket exemption for ErrorContext is the best way. I'd
been thinking of having a way to turn off the complaint once processing
of an elog(PANIC) has started.

I didn't backpatch this.

Agreed.

BTW, I'm pretty sure you added some redundant assertions in mcxt.c.
eg, palloc does not need its own copy.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Tom Lane (#16)
Re: Allocations in critical section (was Re: WAL format and API changes (9.5))

On 04/04/2014 04:56 PM, Tom Lane wrote:

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

Ok, I fixed the issues that the assertion fixed. I also committed a
patch to add the assertion itself; let's see if the buildfarm finds more
cases that violate the rule.

It ignores the checkpointer, because it's known to violate the rule,

... uh, isn't that a bug to be fixed?

Yes. One step a time ;-).

and
allocations in ErrorContext, which is used during error recovery, e.g if
you indeed PANIC while in a critical section for some other reason.

Yeah, I realized we'd have to do something about elog's own allocations.
Not sure if a blanket exemption for ErrorContext is the best way. I'd
been thinking of having a way to turn off the complaint once processing
of an elog(PANIC) has started.

Hmm. PANIC processing should avoid allocations too, except in
ErrorContext, because otherwise you might get an out-of-memory during
PANIC processing.

ErrorContext also covers elog(DEBUG2, ...). I presume we'll want to
ignore that too. Although I also tried without the exemption for
ErrorContext at first, and didn't get any failures from the regression
tests, so I guess we never do that in a critical section. I was a bit
surprised by that.

BTW, I'm pretty sure you added some redundant assertions in mcxt.c.
eg, palloc does not need its own copy.

palloc is copy-pasted from MemoryContextAlloc - it does need its own copy.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Heikki Linnakangas (#1)
Re: WAL format and API changes (9.5)

On Thu, Apr 3, 2014 at 7:44 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

I'd like to do some changes to the WAL format in 9.5. I want to annotate
each WAL record with the blocks that they modify. Every WAL record already
includes that information, but it's done in an ad hoc way, differently in
every rmgr. The RelFileNode and block number are currently part of the WAL
payload, and it's the REDO routine's responsibility to extract it. I want to
include that information in a common format for every WAL record type.

That makes life a lot easier for tools that are interested in knowing which
blocks a WAL record modifies. One such tool is pg_rewind; it currently has
to understand every WAL record the backend writes. There's also a tool out
there called pg_readahead, which does prefetching of blocks accessed by WAL
records, to speed up PITR. I don't think that tool has been actively
maintained, but at least part of the reason for that is probably that it's a
pain to maintain when it has to understand the details of every WAL record
type.

It'd also be nice for contrib/pg_xlogdump and backend code itself. The
boilerplate code in all WAL redo routines, and writing WAL records, could be
simplified.

I think it will also be useful, if we want to implement table/tablespace
PITR.

That's for the normal cases. We'll need a couple of variants for also
registering buffers that don't need full-page images, and perhaps also a
function for registering a page that *always* needs a full-page image,
regardless of the LSN. A few existing WAL record types just WAL-log the
whole page, so those ad-hoc full-page images could be replaced with this.

With these changes, a typical WAL insertion would look like this:

/* register the buffer with the WAL record, with ID 0 */
XLogRegisterBuffer(0, buf, true);

rdata[0].data = (char *) &xlrec;
rdata[0].len = sizeof(BlahRecord);
rdata[0].buffer_id = -1; /* -1 means the data is always included */
rdata[0].next = &(rdata[1]);

rdata[1].data = (char *) mydata;
rdata[1].len = mydatalen;
rdata[1].buffer_id = 0; /* 0 here refers to the buffer registered
above */
rdata[1].next = NULL

...
recptr = XLogInsert(RM_BLAH_ID, xlinfo, rdata);

PageSetLSN(buf, recptr);

If we do register buffer's (that require or don't require FPI) before
calling XLogInsert(), then will there be any impact to handle case
where we come to know that we need to backup the buffer after
taking WALInsertLock.. or will that part of code remains same as it is
today.

Redo
----

There are four different states a block referenced by a typical WAL record
can be in:

1. The old page does not exist at all (because the relation was truncated
later)
2. The old page exists, but has an LSN higher than current WAL record, so it
doesn't need replaying.
3. The LSN is < current WAL record, so it needs to be replayed.
4. The WAL record contains a full-page image, which needs to be restored.

I think there might be a need to have separate handling for some special
cases like Init Page which is used in few ops (Insert/Update/multi-insert).
Is there any criteria to decide if it needs to be a separate state or a special
handling for operations?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Tom Lane (#2)
1 attachment(s)
Re: WAL format and API changes (9.5)

On 04/03/2014 06:37 PM, Tom Lane wrote:

Let's simplify that, and have one new function, XLogOpenBuffer, which
returns a return code that indicates which of the four cases we're
dealing with. A typical redo function looks like this:
if (XLogOpenBuffer(0, &buffer) == BLK_REPLAY)
{
/* Modify the page */
...
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
if (BufferIsValid(buffer))
UnlockReleaseBuffer(buffer);
The '0' in the XLogOpenBuffer call is the ID of the block reference
specified in the XLogRegisterBuffer call, when the WAL record was created.

+1, but one important step here is finding the data to be replayed.
That is, a large part of the complexity of replay routines has to do
with figuring out which parts of the WAL record were elided due to
full-page-images, and locating the remaining parts. What can we do
to make that simpler?

Ideally, if XLogOpenBuffer (bad name BTW) returns BLK_REPLAY, it would
also calculate and hand back the address/size of the logged data that
had been pointed to by the associated XLogRecData chain item. The
trouble here is that there might've been multiple XLogRecData items
pointing to the same buffer. Perhaps the magic ID number you give to
XLogOpenBuffer should be thought of as identifying an XLogRecData chain
item, not so much a buffer? It's fairly easy to see what to do when
there's just one chain item per buffer, but I'm not sure what to do
if there's more than one.

Ok, I wrote the first version of a patch for this. The bulk of the patch
is fairly mechanical changes to all the routines that generate WAL
records and their redo routines, to use the new facilities. I'm sure
we'll have to go through a few rounds of discussions and review on the
APIs and names, but I wanted to go and change all the rest of the code
now anyway, to make sure the APIs cover the needs of all the WAL records.

The interesting part is the new APIs for constructing a WAL record, and
replaying it. I haven't polished the implementation of the APIs, but I'm
fairly happy with the way the WAL generation and redo routines now look
like.

Constructing a WAL record is now more convenient; no more constructing
of XLogRecData structs. There's a function, XLogRegisterData(int id,
char *data, int len), that takes the pointer and length directly as
arguments. You can call it multiple times with the same 'id', and the
data will be appended. (although I kept an API function that still
allows you to pass a chain of XLogRecDatas too, for the odd case where
that's more convenient)

Internally, XLogRegisterData() still constructs a chain of XLogRecDatas,
using a small number of static XLogRecData structs. That's a tad messy,
and I hope there is a better way to do that, but I wanted to focus on
the API for now.

Here's some more details on the API (also included in the README in the
patch):

Constructing a WAL record
=========================

A WAL record consists of multiple chunks of data. Each chunk is
identified by an ID number. A WAL record also contains information about
the blocks that it applies to, and each block reference is also
identified by an ID number. If the same ID number is used for data and a
block reference, the data is left out of the WAL record if a full-page
image of the block is taken.

The API for constructing a WAL record consists of four functions:
XLogBeginInsert, XLogRegisterBuffer, XLogRegisterData, and XLogInsert.
First, call XLogBeginInsert(). Then register all the buffers modified,
and data needed to replay the changes, using XLogRegister* functions.
Finally, insert the constructed record to the WAL with XLogInsert(). For
example:

XLogBeginInsert();

/* register buffers modified as part of this action */
XLogRegisterBuffer(0, lbuffer, REGBUF_STANDARD);
XLogRegisterBuffer(1, rbuffer, REGBUF_STANDARD);

/* register data that is always included in the WAL record */
XLogRegisterData(-1, &xlrec, SizeOfFictionalAction);

/*
* register data associated with lbuffer. This will not be
* included in the record if a full-page image is taken.
*/
XLogRegisterData(0, footuple->data, footuple->len);

/* data associated with rbuffer */
XLogRegisterData(0, data2, len2);

/*
* Ok, all the data and buffers to include in the WAL record
* have been registered. Insert the record.
*/
recptr = XLogInsert(RM_FOO_ID, XLOG_FOOBAR_DOSTUFF);

Details of the API functions:

void XLogRegisterBuffer(int id, Buffer buf, int flags);
-----

XLogRegisterBuffer adds information about a data block to the WAL
record. 'id' is an arbitrary number used to identify this page reference
in the redo routine. The information needed to re-find the page at redo
- the relfilenode, fork, and block number - are included in the WAL record.

XLogInsert will automatically include a full copy of the page contents,
if this is the first modification of the buffer since the last
checkpoint. It is important to register every buffer modified by the
action with XLogRegisterBuffer, to avoid torn-page hazards.

The flags control when and how the buffer contents are included in the
WAL record. REGBUF_STANDARD means that the page follows the standard
page layout, and causes the area between pd_lower and pd_upper to be
left out from the image, reducing the WAL volume. Normally, a full-page
image is taken only if the page has not been modified since the last
checkpoint, and only if full_page_writes=on or an online backup is in
progress. The REGBUF_FORCE_IMAGE flag can be used to force a full-page
image to always be included. If the REGBUF_WILL_INIT flag is given, a
full-page image is never taken. The redo routine must completely
re-generate the page contents, using the data included in the WAL record.

void XLogRegisterData(int id, char *data, int len);
-----

XLogRegisterData is used to include arbitrary data in the WAL record. If
XLogRegisterData() is called multiple times with the same 'id', the data
are appended, and will be made available to the redo routine as one
contiguous chunk.

If the same 'id' is used in an XLogRegisterBuffer and XLogRegisterData()
call, the data is not included in the WAL record if a full-page image of
the page is taken. Data registered with id -1 is not related with a
buffer, and is always included.

Writing a REDO routine
======================

A REDO routine uses the data and page references included in the WAL
record to reconstruct the new state of the page. To access the data and
pages included in the WAL record, the following functions are available:

char *XLogRecGetData(XLogRecord *record)
-----

Returns the "main" chunk of data included in the WAL record. That is,
the data included in the record with XLogRegisterData(-1, ...).

XLogReplayResult XLogReplayBuffer(int id, Buffer *buf)
-----

Reads a block associated with the WAL record currently being replayed,
with the given 'id'. The shared buffer is returned in *buf, or
InvalidBuffer if the page could not be found. The block is read into a
shared buffer and locked in exclusive mode. Returns one of the following
result codes:

typedef enum
{
BLK_NEEDS_REDO, /* block needs to be replayed */
BLK_DONE, /* block was already replayed */
BLK_RESTORED, /* block was restored from a full-page image */
BLK_NOTFOUND /* block was not found (and hence does not need
* to be replayed) */
} XLogReplayResult;

The REDO routine must redo the actions to the page if XLogReplayBuffer
returns BLK_NEEDS_REDO. In other cases, no further action is no
required, although the result code can be used to distinguish the reason.

After modifying the page (if it was necessary), the REDO routine must
unlock and release the buffer. Note that the buffer must be unlocked and
released even if no action was required.

XLogReplayResult XLogReplayBufferExtended(int id, ReadBufferMode mode,
bool get_cleanup_lock, Buffer *buf)
-----

Like XLogReplayBuffer(), but with a few extra options.

mode' can be passed to e.g force the page to be zeroed, instead of
reading it from disk. This RBM_ZERO mode should be used to re-initialize
pages registered in the REGBUF_WILL_INIT mode in XLogRegisterBuffer().

if 'get_cleanup_lock' is TRUE, a stronger "cleanup" lock on the page is
acquired, instead of a reguler exclusive-lock.

char *XLogGetPayload(XLogRecord *record, int id, int *len)
-----

Returns a chunk of data included in the WAL record (with
XLogRegisterData(id, ...)). The length of the data is returned in *len.
This is typically used after XLogReplayBuffer() returned BLK_NEEDS_REDO,
with the same 'id', to get the information required to redo the actions
on the page. If no data with the given id is included, perhaps because a
full-page image of the associated buffer was taken instead, an error is
thrown.

void XLogBlockRefGetTag(XLogRecord *record, int id, RelFileNode *rnode,
ForkNumber *forknum, BlockNumber *blknum);
-----

Returns the relfilenode, fork number, and block number of the page
registered with the given ID.

Finally, here's the usual pattern for how a REDO routine works (taken
from btree_xlog_insert):

if (XLogReplayBuffer(0, &buffer) == BLK_NEEDS_REDO)
{
int datalen;
char *datapos = XLogGetPayload(record, 0, &datalen);

page = (Page) BufferGetPage(buffer);

if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
false, false) == InvalidOffsetNumber)
elog(PANIC, "btree_insert_redo: failed to add item");

PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
if (BufferIsValid(buffer))
UnlockReleaseBuffer(buffer);

- Heikki

Attachments:

wal-format-and-api-changes-1.patch.gzapplication/gzip; name=wal-format-and-api-changes-1.patch.gz
#20Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Heikki Linnakangas (#19)
1 attachment(s)
Re: WAL format and API changes (9.5)

On 05/17/2014 02:54 PM, Heikki Linnakangas wrote:

On 04/03/2014 06:37 PM, Tom Lane wrote:

Let's simplify that, and have one new function, XLogOpenBuffer, which
returns a return code that indicates which of the four cases we're
dealing with. A typical redo function looks like this:
if (XLogOpenBuffer(0, &buffer) == BLK_REPLAY)
{
/* Modify the page */
...
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
if (BufferIsValid(buffer))
UnlockReleaseBuffer(buffer);
The '0' in the XLogOpenBuffer call is the ID of the block reference
specified in the XLogRegisterBuffer call, when the WAL record was created.

+1, but one important step here is finding the data to be replayed.
That is, a large part of the complexity of replay routines has to do
with figuring out which parts of the WAL record were elided due to
full-page-images, and locating the remaining parts. What can we do
to make that simpler?

Ideally, if XLogOpenBuffer (bad name BTW) returns BLK_REPLAY, it would
also calculate and hand back the address/size of the logged data that
had been pointed to by the associated XLogRecData chain item. The
trouble here is that there might've been multiple XLogRecData items
pointing to the same buffer. Perhaps the magic ID number you give to
XLogOpenBuffer should be thought of as identifying an XLogRecData chain
item, not so much a buffer? It's fairly easy to see what to do when
there's just one chain item per buffer, but I'm not sure what to do
if there's more than one.

Ok, I wrote the first version of a patch for this.

Here's a new version, rebased against master. No other changes.

- Heikki

Attachments:

wal-format-and-api-changes-2.patch.gzapplication/gzip; name=wal-format-and-api-changes-2.patch.gz
#21Alvaro Herrera
Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#20)
Re: WAL format and API changes (9.5)

Heikki Linnakangas wrote:

Here's a new version, rebased against master. No other changes.

This is missing xlogrecord.h ...

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#22Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Alvaro Herrera (#21)
1 attachment(s)
Re: WAL format and API changes (9.5)

On 06/14/2014 09:43 PM, Alvaro Herrera wrote:

Heikki Linnakangas wrote:

Here's a new version, rebased against master. No other changes.

This is missing xlogrecord.h ...

Oops, here you are.

- Heikki

Attachments:

wal-format-and-api-changes-3.patch.gzapplication/gzip; name=wal-format-and-api-changes-3.patch.gz
#23Michael Paquier
Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#22)
Re: WAL format and API changes (9.5)

On Sun, Jun 15, 2014 at 6:11 AM, Heikki Linnakangas <hlinnakangas@vmware.com

wrote:

On 06/14/2014 09:43 PM, Alvaro Herrera wrote:

Heikki Linnakangas wrote:

Here's a new version, rebased against master. No other changes.

This is missing xlogrecord.h ...

Oops, here you are.

Looking at this patch, it does not compile when assertions are enabled
because of this assertion in heapam.c:
Assert(recdata == data + datalen);
It seems like a thinko as removing it makes the code compile.

Here is a review of this patch:
- Compilation without warnings, regression tests pass
- That's not a small patch, but the changes mainly done are xlog record
insertion in accordance to the new API format.
$ git diff master --stat | tail -n1
52 files changed, 3601 insertions(+), 4371 deletions(-)

Some comments about the code:
1) In src/backend/access/transam/README:
's/no further action is no required/no further action is required/g'
2) Description of XLogBeginInsert, XLogHasBlockRef, XLogGetBlockRef and
XLogGetBlockRefIds are missing in src/backend/access/transam/README.
3) There are a couple of code blocks marked as FIXME and BROKEN. There are
as well some NOT_USED blocks.
4) Current patch crashes when running make check in contrib/test_decoding.
Looking closer, wal_level = logical is broken:
=# show wal_level;
wal_level
-----------
logical
(1 row)
=# create database foo;
The connection to the server was lost. Attempting reset: Failed.

Looking even closer, log_heap_new_cid is broken:
(lldb) bt
* thread #1: tid = 0x0000, 0x00007fff8a3c2866
libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
* frame #0: 0x00007fff8a3c2866 libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x00007fff8d37235c libsystem_pthread.dylib`pthread_kill + 92
frame #2: 0x00007fff9369db1a libsystem_c.dylib`abort + 125
frame #3: 0x0000000104428819
postgres`ExceptionalCondition(conditionName=0x00000001044a711c,
errorType=0x000000010448b19f, fileName=0x00000001044a2cfd, lineNumber=6879)
+ 137 at assert.c:54
frame #4: 0x0000000103f08dbe
postgres`log_heap_new_cid(relation=0x00007fae4482afb8,
tup=0x00007fae438062e8) + 126 at heapam.c:6879
frame #5: 0x0000000103f080cf
postgres`heap_insert(relation=0x00007fae4482afb8, tup=0x00007fae438062e8,
cid=0, options=0, bistate=0x0000000000000000) + 383 at heapam.c:2095
frame #6: 0x0000000103f09b6a
postgres`simple_heap_insert(relation=0x00007fae4482afb8,
tup=0x00007fae438062e8) + 74 at heapam.c:2533
frame #7: 0x000000010406aacf postgres`createdb(stmt=0x00007fae44843690)
+ 6335 at dbcommands.c:511
frame #8: 0x000000010429def8
postgres`standard_ProcessUtility(parsetree=0x00007fae44843690,
queryString=0x00007fae44842c38, context=PROCESS_UTILITY_TOPLEVEL,
params=0x0000000000000000, dest=0x00007fae44843a18,
completionTag=0x00007fff5bd4ee30) + 1720 at utility.c:56

5) Yes,there are some WAL records that have only data related to buffers,
XLOG_GIN_VACUUM_DATA_LEAF_PAGE being one, with nothing registered with
buffer_id = -d, but using a dummy entry seems counter-productive:
+       rdata = rdata_head;
+       if (rdata == NULL)
+       {
+               /*
+                * the rest of this function assumes that there is at least
some
+                * rdata entries, so fake an empty rdata entry
+                */
+               dummy_rdata.data = NULL;
+               dummy_rdata.len = 0;
+               dummy_rdata.next = NULL;
+               rdata = &dummy_rdata;
+       }
Could this be removed and replaced by a couple of checks on rdata being
NULL in some cases? XLogInsert should be updated in consequence.
6) XLogRegisterData creates a XLogRecData entry and appends it in one of
the static XLogRecData entries if buffer_id >= 0 or to
rdata_head/rdata_tail if buffer_id = -1 (for permanent data related to the
WAL record). XLogRegisterXLogRecData does something similar, but with an
entry of XLogRecData already built. What do you think about removing
XLogRegisterXLogRecData and keeping only XLogRegisterData. This makes the
interface simpler, and XLogRecData could be kept private in xlog.c. This
would need more work especially on gin side where I am seeing for example
constructLeafRecompressWALData directly building a XLogRecData entry.
By doing so, we could as remove the while loop at the bottom of
XLogRegisterXLogRecData as we would insert in the tail only the latest
record created, aka replacing that:
while ((*tail)->next)
        *tail = (*tail)->next;
By simply that:
*tail = (*tail)->next;
7) New structures at the top of xlog.c should have more comments about
their role, particularly rdata_head and rdata_tail that store main WAL data
(id of -1)
8) What do you think about adding in the README a description about how to
retrieve the block list modified by a WAL record, for a flow similar to
that:
record = XLogReadRecord(...);
nblocks = XLogGetBlockRefIds(record, blockids);
for (i = 0; i < nblocks; i++)
{
    bkpb = XLogGetBlockRef(record, id, NULL);
    // stuff
}
pg_xlogdump is a good example of what can be done as well.
9) In ginxlog.c, shouldn't this block of code use XLogGetPayload?
+               char       *payload = XLogRecGetData(record) +
sizeof(ginxlogInsert);
+#ifdef NOT_USED
                leftChildBlkno = BlockIdGetBlockNumber((BlockId) payload);
+#endif
10) Shouldn't the call to XLogBeginInsert come first here?
@@ -1646,7 +1647,16 @@ spgAddNodeAction(Relation index, SpGistState *state,
                {
                        XLogRecPtr      recptr;
-                       recptr = XLogInsert(RM_SPGIST_ID,
XLOG_SPGIST_ADD_NODE, rdata);
+                       XLogRegisterBuffer(0, saveCurrent.buffer,
REGBUF_STANDARD);     /* orig page */
+                       XLogRegisterBuffer(1, current->buffer,
REGBUF_STANDARD);                /* new page */
+                       if (xlrec.parentBlk == 2)
+                               XLogRegisterBuffer(2, parent->buffer,
REGBUF_STANDARD);
+
+                       XLogBeginInsert();
+                       XLogRegisterData(-1, (char *) &xlrec,
sizeof(xlrec));
+                       XLogRegisterData(-1, (char *) newInnerTuple,
newInnerTuple->size);
+
+                       recptr = XLogInsert(RM_SPGIST_ID,
XLOG_SPGIST_ADD_NODE);
11) Any reason for raising a PANIC here?
@@ -1126,7 +1126,7 @@ LWLockRelease(LWLock *l)
                        break;
        }
        if (i < 0)
-               elog(ERROR, "lock %s %d is not held", T_NAME(l), T_ID(l));
+               elog(PANIC, "lock %s %d is not held", T_NAME(l), T_ID(l));
        num_held_lwlocks--;
        for (; i < num_held_lwlocks; i++)
14) SPgist redo is crashing (during make installcheck run on a master, redo
done on a standby at recovery).
(lldb) bt
* thread #1: tid = 0x0000, 0x00007fff8a3c2866
libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff8a3c2866 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff8d37235c libsystem_pthread.dylib`pthread_kill + 92
    frame #2: 0x00007fff9369db1a libsystem_c.dylib`abort + 125
    frame #3: 0x000000010ba8d819
postgres`ExceptionalCondition(conditionName=0x000000010bb21ea9,
errorType=0x000000010baf019f, fileName=0x000000010bb212dd, lineNumber=65) +
137 at assert.c:54
    frame #4: 0x000000010b5c3a50
postgres`addOrReplaceTuple(page=0x000000010bfee980,
tuple=0x00007ffd41822902, size=770059, offset=18) + 688 at spgxlog.c:65
    frame #5: 0x000000010b5c031f postgres`spgRedoMoveLeafs(lsn=71708608,
record=0x00007ffd41822800) + 719 at spgxlog.c:263
    frame #6: 0x000000010b5bf492 postgres`spg_redo(lsn=71708608,
record=0x00007ffd41822800) + 402 at spgxlog.c:988
    frame #7: 0x000000010b5e774a postgres`StartupXLOG + 8474 at xlog.c:6780
    frame #8: 0x000000010b86a5fe postgres`StartupProcessMain + 430 at
startup.c:224
    frame #9: 0x000000010b600409 postgres`AuxiliaryProcessMain(argc=2,
argv=0x00007fff546ea190) + 1897 at bootstrap.c:416
    frame #10: 0x000000010b864998
postgres`StartChildProcess(type=StartupProcess) + 328 at postmaster.c:5090
    frame #11: 0x000000010b862aa9 postgres`PostmasterMain(argc=3,
argv=0x00007ffd40c04470) + 5401 at postmaster.c:1212
    frame #12: 0x000000010b7a9335 postgres`main(argc=3,
argv=0x00007ffd40c04470) + 773 at main.c:227

Also, I wanted to run the WAL replay facility to compare before and after
buffer images, but code crashed before... I am still planning to do so once
this code gets more stable though.
Regards,
--
Michael

#24Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Michael Paquier (#23)
1 attachment(s)
Re: WAL format and API changes (9.5)

Thanks for the review!

On 06/27/2014 09:12 AM, Michael Paquier wrote:

Looking at this patch, it does not compile when assertions are enabled
because of this assertion in heapam.c:
Assert(recdata == data + datalen);
It seems like a thinko as removing it makes the code compile.

Fixed.

Some comments about the code:
1) In src/backend/access/transam/README:
's/no further action is no required/no further action is required/g'

Fixed.

2) Description of XLogBeginInsert, XLogHasBlockRef, XLogGetBlockRef and
XLogGetBlockRefIds are missing in src/backend/access/transam/README.

Added descriptions of XLogBeginInsert and XLogHasBlockRef. Ḯ'm not sure
what to do with the latter two. They're not really intended for use by
redo routines, but for things like pg_xlogdump that want to extract
information about any record type.

3) There are a couple of code blocks marked as FIXME and BROKEN. There are
as well some NOT_USED blocks.

Yeah. The FIXMEs and BROKEN blocks are all in rm_desc routines. They
mostly stand for code that used to print block numbers or relfilenodes,
which doesn't make much sense to do in an ad hoc way in rmgr-specific
desc routines anymore. Will need to go through them all an decide what's
the important information to print for each record type.

The few NOT_USED blocks are around code in redo routines that extract
some information from the WAL record, that isn't needed anymore. We
could remove the fields from the WAL records altogether, but might be
useful to keep for debugging purposes.

4) Current patch crashes when running make check in contrib/test_decoding.
Looking closer, wal_level = logical is broken:
=# show wal_level;
wal_level
-----------
logical
(1 row)
=# create database foo;
The connection to the server was lost. Attempting reset: Failed.

Fixed.

Looking even closer, log_heap_new_cid is broken:

Fixed.

6) XLogRegisterData creates a XLogRecData entry and appends it in one of
the static XLogRecData entries if buffer_id >= 0 or to
rdata_head/rdata_tail if buffer_id = -1 (for permanent data related to the
WAL record). XLogRegisterXLogRecData does something similar, but with an
entry of XLogRecData already built. What do you think about removing
XLogRegisterXLogRecData and keeping only XLogRegisterData. This makes the
interface simpler, and XLogRecData could be kept private in xlog.c. This
would need more work especially on gin side where I am seeing for example
constructLeafRecompressWALData directly building a XLogRecData entry.

Hmm. I considered that, but punted because I didn't want to rewrite all
the places that currently use XLogRecDatas in less straightforward ways.
I kept XLogRegisterXLogRecData as an escape hatch for those.

But I bit the bullet now, and got rid of all the
XLogRegisterXLogRecData() calls. XLogRecData struct is now completely
private to xlog.c.

Another big change in the attached patch version is that
XLogRegisterData() now *copies* the data to a working area, instead of
merely adding a XLogRecData entry to the chain. This simplifies things
in xlog.c, now that the XLogRecData concept is not visible to the rest
of the system. This is a subtle semantic change for the callers: you can
no longer register a struct first, and fill the fields afterwards.

That adds a lot of memcpys to the critical path, which could be bad for
performance. In practice, I think it's OK, or even a win, because a
typical WAL record payload is very small. But I haven't measured that.
If it becomes an issue, we could try to optimize, and link larger data
blocks to the rdata chain, but memcpy small small chunks of data. There
are a few WAL record types that don't fit in a small statically
allocated buffer, so I provided a new function, XLogEnsureSpace(), that
can be called before XLogBegin() to allocate a large-enough working area.

These changes required changing a couple of places where WAL records are
created:

* ginPlaceToPage. This function has a strange split of responsibility
between ginPlaceToPage and dataPlaceToPage/entryPlaceToPage.
ginPlaceToPage calls dataPlaceToPage or entryPlaceToPage, depending on
what kind of a tree it is, and dataPlaceToPage or entryPlaceToPage
contributes some data to the WAL record. Then ginPlaceToPage adds some
more, and finally calls XLogInsert(). I simplified the SPLIT case so
that we now just create full page images of both page halves. We were
already logging all the data on both page halves, but we were doing it
in a "smarter" way, knowing what kind of pages they were. For example,
for an entry tree page, we included an IndexTuple struct for every tuple
on the page, but didn't include the item pointers. A straight full page
image takes more space, but not much, so I think in any real application
it's going to be lost in noise. (again, haven't measured it though)

8) What do you think about adding in the README a description about how to
retrieve the block list modified by a WAL record, for a flow similar to
that:
record = XLogReadRecord(...);
nblocks = XLogGetBlockRefIds(record, blockids);
for (i = 0; i < nblocks; i++)
{
bkpb = XLogGetBlockRef(record, id, NULL);
// stuff
}
pg_xlogdump is a good example of what can be done as well.

Yeah, that's probably in order. I don't actually like the way that works
currently very much. Passing the char[256] blockids array to the
function feels weird. Will have to think a bit harder about that..

9) In ginxlog.c, shouldn't this block of code use XLogGetPayload?
+               char       *payload = XLogRecGetData(record) +
sizeof(ginxlogInsert);
+#ifdef NOT_USED
leftChildBlkno = BlockIdGetBlockNumber((BlockId) payload);
+#endif

No. XLogRecGetData() is used to get data from the "main chunk", while
XLogGetPayload() is used to get data associated with a buffer. This does
the former.

10) Shouldn't the call to XLogBeginInsert come first here?
@@ -1646,7 +1647,16 @@ spgAddNodeAction(Relation index, SpGistState *state,
{
XLogRecPtr recptr;

-                       recptr = XLogInsert(RM_SPGIST_ID,
XLOG_SPGIST_ADD_NODE, rdata);
+                       XLogRegisterBuffer(0, saveCurrent.buffer,
REGBUF_STANDARD);     /* orig page */
+                       XLogRegisterBuffer(1, current->buffer,
REGBUF_STANDARD);                /* new page */
+                       if (xlrec.parentBlk == 2)
+                               XLogRegisterBuffer(2, parent->buffer,
REGBUF_STANDARD);
+
+                       XLogBeginInsert();
+                       XLogRegisterData(-1, (char *) &xlrec,
sizeof(xlrec));
+                       XLogRegisterData(-1, (char *) newInnerTuple,
newInnerTuple->size);
+
+                       recptr = XLogInsert(RM_SPGIST_ID,
XLOG_SPGIST_ADD_NODE);

Yeah. Apparently we have no regression test coverage of this codepath,
or it would've triggered an assertion. Fixed now, just by looking at the
code, but there's still no test coverage of this so it's likely still
broken in other ways. Before this is committed, I think we'll need to go
through the coverage report and make sure all the XLogInsert() codepaths
(and their redo routines) are covered.

11) Any reason for raising a PANIC here?
@@ -1126,7 +1126,7 @@ LWLockRelease(LWLock *l)
break;
}
if (i < 0)
-               elog(ERROR, "lock %s %d is not held", T_NAME(l), T_ID(l));
+               elog(PANIC, "lock %s %d is not held", T_NAME(l), T_ID(l));
num_held_lwlocks--;
for (; i < num_held_lwlocks; i++)

No, leftover from debugging. Fixed.

14) SPgist redo is crashing (during make installcheck run on a master, redo
done on a standby at recovery).
(lldb) bt
* thread #1: tid = 0x0000, 0x00007fff8a3c2866
libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
* frame #0: 0x00007fff8a3c2866 libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x00007fff8d37235c libsystem_pthread.dylib`pthread_kill + 92
frame #2: 0x00007fff9369db1a libsystem_c.dylib`abort + 125
frame #3: 0x000000010ba8d819
postgres`ExceptionalCondition(conditionName=0x000000010bb21ea9,
errorType=0x000000010baf019f, fileName=0x000000010bb212dd, lineNumber=65) +
137 at assert.c:54
frame #4: 0x000000010b5c3a50
postgres`addOrReplaceTuple(page=0x000000010bfee980,
tuple=0x00007ffd41822902, size=770059, offset=18) + 688 at spgxlog.c:65
frame #5: 0x000000010b5c031f postgres`spgRedoMoveLeafs(lsn=71708608,
record=0x00007ffd41822800) + 719 at spgxlog.c:263
frame #6: 0x000000010b5bf492 postgres`spg_redo(lsn=71708608,
record=0x00007ffd41822800) + 402 at spgxlog.c:988
frame #7: 0x000000010b5e774a postgres`StartupXLOG + 8474 at xlog.c:6780
frame #8: 0x000000010b86a5fe postgres`StartupProcessMain + 430 at
startup.c:224
frame #9: 0x000000010b600409 postgres`AuxiliaryProcessMain(argc=2,
argv=0x00007fff546ea190) + 1897 at bootstrap.c:416
frame #10: 0x000000010b864998
postgres`StartChildProcess(type=StartupProcess) + 328 at postmaster.c:5090
frame #11: 0x000000010b862aa9 postgres`PostmasterMain(argc=3,
argv=0x00007ffd40c04470) + 5401 at postmaster.c:1212
frame #12: 0x000000010b7a9335 postgres`main(argc=3,
argv=0x00007ffd40c04470) + 773 at main.c:227

Fixed.

Also, I wanted to run the WAL replay facility to compare before and after
buffer images, but code crashed before... I am still planning to do so once
this code gets more stable though.

I did that earlier, and did it now again. I got a single difference from
a sp-gist SPLIT_TUPLE operation. I didn't dig deeper right now, will
have to investigate it later.

- Heikki

Attachments:

wal-format-and-api-changes-4.patch.gzapplication/gzip; name=wal-format-and-api-changes-4.patch.gz
#25Fujii Masao
Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#24)
Re: WAL format and API changes (9.5)

On Tue, Jul 1, 2014 at 5:51 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Thanks for the review!

On 06/27/2014 09:12 AM, Michael Paquier wrote:

Looking at this patch, it does not compile when assertions are enabled
because of this assertion in heapam.c:
Assert(recdata == data + datalen);
It seems like a thinko as removing it makes the code compile.

Fixed.

Some comments about the code:
1) In src/backend/access/transam/README:
's/no further action is no required/no further action is required/g'

Fixed.

2) Description of XLogBeginInsert, XLogHasBlockRef, XLogGetBlockRef and
XLogGetBlockRefIds are missing in src/backend/access/transam/README.

Added descriptions of XLogBeginInsert and XLogHasBlockRef. Ḯ'm not sure
what to do with the latter two. They're not really intended for use by redo
routines, but for things like pg_xlogdump that want to extract information
about any record type.

3) There are a couple of code blocks marked as FIXME and BROKEN. There are
as well some NOT_USED blocks.

Yeah. The FIXMEs and BROKEN blocks are all in rm_desc routines. They mostly
stand for code that used to print block numbers or relfilenodes, which
doesn't make much sense to do in an ad hoc way in rmgr-specific desc
routines anymore. Will need to go through them all an decide what's the
important information to print for each record type.

The few NOT_USED blocks are around code in redo routines that extract some
information from the WAL record, that isn't needed anymore. We could remove
the fields from the WAL records altogether, but might be useful to keep for
debugging purposes.

4) Current patch crashes when running make check in
contrib/test_decoding.
Looking closer, wal_level = logical is broken:
=# show wal_level;
wal_level
-----------
logical
(1 row)
=# create database foo;
The connection to the server was lost. Attempting reset: Failed.

Fixed.

Looking even closer, log_heap_new_cid is broken:

Fixed.

6) XLogRegisterData creates a XLogRecData entry and appends it in one of
the static XLogRecData entries if buffer_id >= 0 or to
rdata_head/rdata_tail if buffer_id = -1 (for permanent data related to the
WAL record). XLogRegisterXLogRecData does something similar, but with an
entry of XLogRecData already built. What do you think about removing
XLogRegisterXLogRecData and keeping only XLogRegisterData. This makes the
interface simpler, and XLogRecData could be kept private in xlog.c. This
would need more work especially on gin side where I am seeing for example
constructLeafRecompressWALData directly building a XLogRecData entry.

Hmm. I considered that, but punted because I didn't want to rewrite all the
places that currently use XLogRecDatas in less straightforward ways. I kept
XLogRegisterXLogRecData as an escape hatch for those.

But I bit the bullet now, and got rid of all the XLogRegisterXLogRecData()
calls. XLogRecData struct is now completely private to xlog.c.

Another big change in the attached patch version is that XLogRegisterData()
now *copies* the data to a working area, instead of merely adding a
XLogRecData entry to the chain. This simplifies things in xlog.c, now that
the XLogRecData concept is not visible to the rest of the system. This is a
subtle semantic change for the callers: you can no longer register a struct
first, and fill the fields afterwards.

That adds a lot of memcpys to the critical path, which could be bad for
performance. In practice, I think it's OK, or even a win, because a typical
WAL record payload is very small. But I haven't measured that. If it becomes
an issue, we could try to optimize, and link larger data blocks to the rdata
chain, but memcpy small small chunks of data. There are a few WAL record
types that don't fit in a small statically allocated buffer, so I provided a
new function, XLogEnsureSpace(), that can be called before XLogBegin() to
allocate a large-enough working area.

These changes required changing a couple of places where WAL records are
created:

* ginPlaceToPage. This function has a strange split of responsibility
between ginPlaceToPage and dataPlaceToPage/entryPlaceToPage. ginPlaceToPage
calls dataPlaceToPage or entryPlaceToPage, depending on what kind of a tree
it is, and dataPlaceToPage or entryPlaceToPage contributes some data to the
WAL record. Then ginPlaceToPage adds some more, and finally calls
XLogInsert(). I simplified the SPLIT case so that we now just create full
page images of both page halves. We were already logging all the data on
both page halves, but we were doing it in a "smarter" way, knowing what kind
of pages they were. For example, for an entry tree page, we included an
IndexTuple struct for every tuple on the page, but didn't include the item
pointers. A straight full page image takes more space, but not much, so I
think in any real application it's going to be lost in noise. (again,
haven't measured it though)

8) What do you think about adding in the README a description about how to
retrieve the block list modified by a WAL record, for a flow similar to
that:
record = XLogReadRecord(...);
nblocks = XLogGetBlockRefIds(record, blockids);
for (i = 0; i < nblocks; i++)
{
bkpb = XLogGetBlockRef(record, id, NULL);
// stuff
}
pg_xlogdump is a good example of what can be done as well.

Yeah, that's probably in order. I don't actually like the way that works
currently very much. Passing the char[256] blockids array to the function
feels weird. Will have to think a bit harder about that..

9) In ginxlog.c, shouldn't this block of code use XLogGetPayload?
+               char       *payload = XLogRecGetData(record) +
sizeof(ginxlogInsert);
+#ifdef NOT_USED
leftChildBlkno = BlockIdGetBlockNumber((BlockId)
payload);
+#endif

No. XLogRecGetData() is used to get data from the "main chunk", while
XLogGetPayload() is used to get data associated with a buffer. This does the
former.

10) Shouldn't the call to XLogBeginInsert come first here?
@@ -1646,7 +1647,16 @@ spgAddNodeAction(Relation index, SpGistState
*state,
{
XLogRecPtr recptr;

-                       recptr = XLogInsert(RM_SPGIST_ID,
XLOG_SPGIST_ADD_NODE, rdata);
+                       XLogRegisterBuffer(0, saveCurrent.buffer,
REGBUF_STANDARD);     /* orig page */
+                       XLogRegisterBuffer(1, current->buffer,
REGBUF_STANDARD);                /* new page */
+                       if (xlrec.parentBlk == 2)
+                               XLogRegisterBuffer(2, parent->buffer,
REGBUF_STANDARD);
+
+                       XLogBeginInsert();
+                       XLogRegisterData(-1, (char *) &xlrec,
sizeof(xlrec));
+                       XLogRegisterData(-1, (char *) newInnerTuple,
newInnerTuple->size);
+
+                       recptr = XLogInsert(RM_SPGIST_ID,
XLOG_SPGIST_ADD_NODE);

Yeah. Apparently we have no regression test coverage of this codepath, or it
would've triggered an assertion. Fixed now, just by looking at the code, but
there's still no test coverage of this so it's likely still broken in other
ways. Before this is committed, I think we'll need to go through the
coverage report and make sure all the XLogInsert() codepaths (and their redo
routines) are covered.

11) Any reason for raising a PANIC here?
@@ -1126,7 +1126,7 @@ LWLockRelease(LWLock *l)
break;
}
if (i < 0)
-               elog(ERROR, "lock %s %d is not held", T_NAME(l), T_ID(l));
+               elog(PANIC, "lock %s %d is not held", T_NAME(l), T_ID(l));
num_held_lwlocks--;
for (; i < num_held_lwlocks; i++)

No, leftover from debugging. Fixed.

14) SPgist redo is crashing (during make installcheck run on a master,
redo
done on a standby at recovery).
(lldb) bt
* thread #1: tid = 0x0000, 0x00007fff8a3c2866
libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
* frame #0: 0x00007fff8a3c2866 libsystem_kernel.dylib`__pthread_kill +
10
frame #1: 0x00007fff8d37235c libsystem_pthread.dylib`pthread_kill +
92
frame #2: 0x00007fff9369db1a libsystem_c.dylib`abort + 125
frame #3: 0x000000010ba8d819
postgres`ExceptionalCondition(conditionName=0x000000010bb21ea9,
errorType=0x000000010baf019f, fileName=0x000000010bb212dd, lineNumber=65)
+
137 at assert.c:54
frame #4: 0x000000010b5c3a50
postgres`addOrReplaceTuple(page=0x000000010bfee980,
tuple=0x00007ffd41822902, size=770059, offset=18) + 688 at spgxlog.c:65
frame #5: 0x000000010b5c031f postgres`spgRedoMoveLeafs(lsn=71708608,
record=0x00007ffd41822800) + 719 at spgxlog.c:263
frame #6: 0x000000010b5bf492 postgres`spg_redo(lsn=71708608,
record=0x00007ffd41822800) + 402 at spgxlog.c:988
frame #7: 0x000000010b5e774a postgres`StartupXLOG + 8474 at
xlog.c:6780
frame #8: 0x000000010b86a5fe postgres`StartupProcessMain + 430 at
startup.c:224
frame #9: 0x000000010b600409 postgres`AuxiliaryProcessMain(argc=2,
argv=0x00007fff546ea190) + 1897 at bootstrap.c:416
frame #10: 0x000000010b864998
postgres`StartChildProcess(type=StartupProcess) + 328 at postmaster.c:5090
frame #11: 0x000000010b862aa9 postgres`PostmasterMain(argc=3,
argv=0x00007ffd40c04470) + 5401 at postmaster.c:1212
frame #12: 0x000000010b7a9335 postgres`main(argc=3,
argv=0x00007ffd40c04470) + 773 at main.c:227

Fixed.

Also, I wanted to run the WAL replay facility to compare before and after
buffer images, but code crashed before... I am still planning to do so
once
this code gets more stable though.

I did that earlier, and did it now again. I got a single difference from a
sp-gist SPLIT_TUPLE operation. I didn't dig deeper right now, will have to
investigate it later.

With the patch, when I ran the simple test which generates lots of
WAL, I got PANIC error.

=# create extension pgcrypto ;
CREATE EXTENSION
=# create extension pg_trgm ;
CREATE EXTENSION
=# create table hoge (col1 text);
CREATE TABLE
=# create index hogeidx on hoge using gin (col1 gin_trgm_ops);
CREATE INDEX
=# insert into hoge select gen_random_uuid() from generate_series(1,10000000);
PANIC: too many registered buffers
PANIC: too many registered buffers
The connection to the server was lost. Attempting reset: Failed.

Server log is
2014-07-01 07:12:37 JST PANIC: too many registered buffers
2014-07-01 07:12:37 JST STATEMENT: insert into hoge select
gen_random_uuid() from generate_series(1,10000000);
2014-07-01 07:12:40 JST LOG: server process (PID 64020) was
terminated by signal 6: Abort trap
2014-07-01 07:12:40 JST DETAIL: Failed process was running: insert
into hoge select gen_random_uuid() from generate_series(1,10000000);

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#26Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#24)
Re: WAL format and API changes (9.5)

On Mon, Jun 30, 2014 at 4:51 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Another big change in the attached patch version is that XLogRegisterData()
now *copies* the data to a working area, instead of merely adding a
XLogRecData entry to the chain. This simplifies things in xlog.c, now that
the XLogRecData concept is not visible to the rest of the system. This is a
subtle semantic change for the callers: you can no longer register a struct
first, and fill the fields afterwards.

That adds a lot of memcpys to the critical path, which could be bad for
performance. In practice, I think it's OK, or even a win, because a typical
WAL record payload is very small. But I haven't measured that. If it becomes
an issue, we could try to optimize, and link larger data blocks to the rdata
chain, but memcpy small small chunks of data. There are a few WAL record
types that don't fit in a small statically allocated buffer, so I provided a
new function, XLogEnsureSpace(), that can be called before XLogBegin() to
allocate a large-enough working area.

On the other hand, the patch I just committed
(9f03ca915196dfc871804a1f8aad26207f601fd6) demonstrates that even
copying relatively small amounts of data can be expensive if you do it
frequently, and certainly WAL insertions are frequent. Of course,
some of the overhead there is from palloc() and friends rather than
from the copying itself, but the copying itself isn't free, either.
And some WAL records can be really big.

Of course, it could still work out to a win if the simplifications in
xlog.c save enough. I don't know whether that's the case or not.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#27Michael Paquier
Michael Paquier
michael.paquier@gmail.com
In reply to: Fujii Masao (#25)
Re: WAL format and API changes (9.5)

On Tue, Jul 1, 2014 at 7:18 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Tue, Jul 1, 2014 at 5:51 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Thanks for the review!

On 06/27/2014 09:12 AM, Michael Paquier wrote:

Looking at this patch, it does not compile when assertions are enabled
because of this assertion in heapam.c:
Assert(recdata == data + datalen);
It seems like a thinko as removing it makes the code compile.

Fixed.

Some comments about the code:
1) In src/backend/access/transam/README:
's/no further action is no required/no further action is required/g'

Fixed.

2) Description of XLogBeginInsert, XLogHasBlockRef, XLogGetBlockRef and
XLogGetBlockRefIds are missing in src/backend/access/transam/README.

Added descriptions of XLogBeginInsert and XLogHasBlockRef. Ḯ'm not sure
what to do with the latter two. They're not really intended for use by

redo

routines, but for things like pg_xlogdump that want to extract

information

about any record type.

3) There are a couple of code blocks marked as FIXME and BROKEN. There

are

as well some NOT_USED blocks.

Yeah. The FIXMEs and BROKEN blocks are all in rm_desc routines. They

mostly

stand for code that used to print block numbers or relfilenodes, which
doesn't make much sense to do in an ad hoc way in rmgr-specific desc
routines anymore. Will need to go through them all an decide what's the
important information to print for each record type.

The few NOT_USED blocks are around code in redo routines that extract

some

information from the WAL record, that isn't needed anymore. We could

remove

the fields from the WAL records altogether, but might be useful to keep

for

debugging purposes.

4) Current patch crashes when running make check in
contrib/test_decoding.
Looking closer, wal_level = logical is broken:
=# show wal_level;
wal_level
-----------
logical
(1 row)
=# create database foo;
The connection to the server was lost. Attempting reset: Failed.

Fixed.

Looking even closer, log_heap_new_cid is broken:

Fixed.

6) XLogRegisterData creates a XLogRecData entry and appends it in one of
the static XLogRecData entries if buffer_id >= 0 or to
rdata_head/rdata_tail if buffer_id = -1 (for permanent data related to

the

WAL record). XLogRegisterXLogRecData does something similar, but with an
entry of XLogRecData already built. What do you think about removing
XLogRegisterXLogRecData and keeping only XLogRegisterData. This makes

the

interface simpler, and XLogRecData could be kept private in xlog.c. This
would need more work especially on gin side where I am seeing for

example

constructLeafRecompressWALData directly building a XLogRecData entry.

Hmm. I considered that, but punted because I didn't want to rewrite all

the

places that currently use XLogRecDatas in less straightforward ways. I

kept

XLogRegisterXLogRecData as an escape hatch for those.

But I bit the bullet now, and got rid of all the

XLogRegisterXLogRecData()

calls. XLogRecData struct is now completely private to xlog.c.

Another big change in the attached patch version is that

XLogRegisterData()

now *copies* the data to a working area, instead of merely adding a
XLogRecData entry to the chain. This simplifies things in xlog.c, now

that

the XLogRecData concept is not visible to the rest of the system. This

is a

subtle semantic change for the callers: you can no longer register a

struct

first, and fill the fields afterwards.

That adds a lot of memcpys to the critical path, which could be bad for
performance. In practice, I think it's OK, or even a win, because a

typical

WAL record payload is very small. But I haven't measured that. If it

becomes

an issue, we could try to optimize, and link larger data blocks to the

rdata

chain, but memcpy small small chunks of data. There are a few WAL record
types that don't fit in a small statically allocated buffer, so I

provided a

new function, XLogEnsureSpace(), that can be called before XLogBegin() to
allocate a large-enough working area.

These changes required changing a couple of places where WAL records are
created:

* ginPlaceToPage. This function has a strange split of responsibility
between ginPlaceToPage and dataPlaceToPage/entryPlaceToPage.

ginPlaceToPage

calls dataPlaceToPage or entryPlaceToPage, depending on what kind of a

tree

it is, and dataPlaceToPage or entryPlaceToPage contributes some data to

the

WAL record. Then ginPlaceToPage adds some more, and finally calls
XLogInsert(). I simplified the SPLIT case so that we now just create full
page images of both page halves. We were already logging all the data on
both page halves, but we were doing it in a "smarter" way, knowing what

kind

of pages they were. For example, for an entry tree page, we included an
IndexTuple struct for every tuple on the page, but didn't include the

item

pointers. A straight full page image takes more space, but not much, so I
think in any real application it's going to be lost in noise. (again,
haven't measured it though)

8) What do you think about adding in the README a description about how

to

retrieve the block list modified by a WAL record, for a flow similar to
that:
record = XLogReadRecord(...);
nblocks = XLogGetBlockRefIds(record, blockids);
for (i = 0; i < nblocks; i++)
{
bkpb = XLogGetBlockRef(record, id, NULL);
// stuff
}
pg_xlogdump is a good example of what can be done as well.

Yeah, that's probably in order. I don't actually like the way that works
currently very much. Passing the char[256] blockids array to the function
feels weird. Will have to think a bit harder about that..

9) In ginxlog.c, shouldn't this block of code use XLogGetPayload?
+               char       *payload = XLogRecGetData(record) +
sizeof(ginxlogInsert);
+#ifdef NOT_USED
leftChildBlkno = BlockIdGetBlockNumber((BlockId)
payload);
+#endif

No. XLogRecGetData() is used to get data from the "main chunk", while
XLogGetPayload() is used to get data associated with a buffer. This does

the

former.

10) Shouldn't the call to XLogBeginInsert come first here?
@@ -1646,7 +1647,16 @@ spgAddNodeAction(Relation index, SpGistState
*state,
{
XLogRecPtr recptr;

-                       recptr = XLogInsert(RM_SPGIST_ID,
XLOG_SPGIST_ADD_NODE, rdata);
+                       XLogRegisterBuffer(0, saveCurrent.buffer,
REGBUF_STANDARD);     /* orig page */
+                       XLogRegisterBuffer(1, current->buffer,
REGBUF_STANDARD);                /* new page */
+                       if (xlrec.parentBlk == 2)
+                               XLogRegisterBuffer(2, parent->buffer,
REGBUF_STANDARD);
+
+                       XLogBeginInsert();
+                       XLogRegisterData(-1, (char *) &xlrec,
sizeof(xlrec));
+                       XLogRegisterData(-1, (char *) newInnerTuple,
newInnerTuple->size);
+
+                       recptr = XLogInsert(RM_SPGIST_ID,
XLOG_SPGIST_ADD_NODE);

Yeah. Apparently we have no regression test coverage of this codepath,

or it

would've triggered an assertion. Fixed now, just by looking at the code,

but

there's still no test coverage of this so it's likely still broken in

other

ways. Before this is committed, I think we'll need to go through the
coverage report and make sure all the XLogInsert() codepaths (and their

redo

routines) are covered.

11) Any reason for raising a PANIC here?
@@ -1126,7 +1126,7 @@ LWLockRelease(LWLock *l)
break;
}
if (i < 0)
-               elog(ERROR, "lock %s %d is not held", T_NAME(l),

T_ID(l));

+ elog(PANIC, "lock %s %d is not held", T_NAME(l),

T_ID(l));

num_held_lwlocks--;
for (; i < num_held_lwlocks; i++)

No, leftover from debugging. Fixed.

14) SPgist redo is crashing (during make installcheck run on a master,
redo
done on a standby at recovery).
(lldb) bt
* thread #1: tid = 0x0000, 0x00007fff8a3c2866
libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
* frame #0: 0x00007fff8a3c2866 libsystem_kernel.dylib`__pthread_kill

+

10
frame #1: 0x00007fff8d37235c libsystem_pthread.dylib`pthread_kill +
92
frame #2: 0x00007fff9369db1a libsystem_c.dylib`abort + 125
frame #3: 0x000000010ba8d819
postgres`ExceptionalCondition(conditionName=0x000000010bb21ea9,
errorType=0x000000010baf019f, fileName=0x000000010bb212dd,

lineNumber=65)

+
137 at assert.c:54
frame #4: 0x000000010b5c3a50
postgres`addOrReplaceTuple(page=0x000000010bfee980,
tuple=0x00007ffd41822902, size=770059, offset=18) + 688 at spgxlog.c:65
frame #5: 0x000000010b5c031f

postgres`spgRedoMoveLeafs(lsn=71708608,

record=0x00007ffd41822800) + 719 at spgxlog.c:263
frame #6: 0x000000010b5bf492 postgres`spg_redo(lsn=71708608,
record=0x00007ffd41822800) + 402 at spgxlog.c:988
frame #7: 0x000000010b5e774a postgres`StartupXLOG + 8474 at
xlog.c:6780
frame #8: 0x000000010b86a5fe postgres`StartupProcessMain + 430 at
startup.c:224
frame #9: 0x000000010b600409 postgres`AuxiliaryProcessMain(argc=2,
argv=0x00007fff546ea190) + 1897 at bootstrap.c:416
frame #10: 0x000000010b864998
postgres`StartChildProcess(type=StartupProcess) + 328 at

postmaster.c:5090

frame #11: 0x000000010b862aa9 postgres`PostmasterMain(argc=3,
argv=0x00007ffd40c04470) + 5401 at postmaster.c:1212
frame #12: 0x000000010b7a9335 postgres`main(argc=3,
argv=0x00007ffd40c04470) + 773 at main.c:227

Fixed.

Also, I wanted to run the WAL replay facility to compare before and

after

buffer images, but code crashed before... I am still planning to do so
once
this code gets more stable though.

I did that earlier, and did it now again. I got a single difference from

a

sp-gist SPLIT_TUPLE operation. I didn't dig deeper right now, will have

to

investigate it later.

With the patch, when I ran the simple test which generates lots of
WAL, I got PANIC error.

=# create extension pgcrypto ;
CREATE EXTENSION
=# create extension pg_trgm ;
CREATE EXTENSION
=# create table hoge (col1 text);
CREATE TABLE
=# create index hogeidx on hoge using gin (col1 gin_trgm_ops);
CREATE INDEX
=# insert into hoge select gen_random_uuid() from
generate_series(1,10000000);
PANIC: too many registered buffers
PANIC: too many registered buffers
The connection to the server was lost. Attempting reset: Failed.

Yep, this is easily reproducible. I imagine that a call to
XLogEnsureRecordSpace before START_CRIT_SECTION() is missing in the gin
code path.
--
Michael

#28Michael Paquier
Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#24)
Re: WAL format and API changes (9.5)

On Tue, Jul 1, 2014 at 5:51 AM, Heikki Linnakangas <hlinnakangas@vmware.com>
wrote:

On 06/27/2014 09:12 AM, Michael Paquier wrote:

2) Description of XLogBeginInsert, XLogHasBlockRef, XLogGetBlockRef and
XLogGetBlockRefIds are missing in src/backend/access/transam/README.

Added descriptions of XLogBeginInsert and XLogHasBlockRef. Ḯ'm not sure
what to do with the latter two. They're not really intended for use by redo
routines, but for things like pg_xlogdump that want to extract information
about any record type.

That's definitely not part of the redo section or WAL construction section.
It could be a new section in the README describing how to get the list of
blocks touched by a WAL record. I'd rather have those APIs documented
somewhere than nowhere, and the README would be well-suited for that (as
well as in-code comments) as those APIs are aimed at developers.

3) There are a couple of code blocks marked as FIXME and BROKEN. There are

as well some NOT_USED blocks.

Yeah. The FIXMEs and BROKEN blocks are all in rm_desc routines. They
mostly stand for code that used to print block numbers or relfilenodes,
which doesn't make much sense to do in an ad hoc way in rmgr-specific desc
routines anymore. Will need to go through them all an decide what's the
important information to print for each record type.

The few NOT_USED blocks are around code in redo routines that extract some
information from the WAL record, that isn't needed anymore. We could remove
the fields from the WAL records altogether, but might be useful to keep for
debugging purposes.

They could be removed and be kept as a part of the git history... That's
not really a vital point btw.

6) XLogRegisterData creates a XLogRecData entry and appends it in one of

the static XLogRecData entries if buffer_id >= 0 or to
rdata_head/rdata_tail if buffer_id = -1 (for permanent data related to the
WAL record). XLogRegisterXLogRecData does something similar, but with an
entry of XLogRecData already built. What do you think about removing
XLogRegisterXLogRecData and keeping only XLogRegisterData. This makes the
interface simpler, and XLogRecData could be kept private in xlog.c. This
would need more work especially on gin side where I am seeing for example
constructLeafRecompressWALData directly building a XLogRecData entry.

Another big change in the attached patch version is that
XLogRegisterData() now *copies* the data to a working area, instead of
merely adding a XLogRecData entry to the chain. This simplifies things in
xlog.c, now that the XLogRecData concept is not visible to the rest of the
system. This is a subtle semantic change for the callers: you can no longer
register a struct first, and fill the fields afterwards.

That adds a lot of memcpys to the critical path, which could be bad for
performance. In practice, I think it's OK, or even a win, because a typical
WAL record payload is very small. But I haven't measured that. If it
becomes an issue, we could try to optimize, and link larger data blocks to
the rdata chain, but memcpy small small chunks of data. There are a few WAL
record types that don't fit in a small statically allocated buffer, so I
provided a new function, XLogEnsureSpace(), that can be called before
XLogBegin() to allocate a large-enough working area.

I just ran the following test on my laptop with your latest patch:
- Patched version:
=# CREATE TABLE foo AS SELECT generate_series(1,10000000) as a;
SELECT 10000000
Time: 32913.419 ms
=# CREATE TABLE foo2 AS SELECT generate_series(1,10000000) as a;
SELECT 10000000
Time: 32261.045 ms
- master (d97e98e):
=# CREATE TABLE foo AS SELECT generate_series(1,10000000) as a;
SELECT 10000000
Time: 29651.394 ms
=# CREATE TABLE foo2 AS SELECT generate_series(1,10000000) as a;
SELECT 10000000
Time: 29734.887 ms
10% difference... That was just a quick test though.

These changes required changing a couple of places where WAL records are

created:

* ginPlaceToPage. This function has a strange split of responsibility
between ginPlaceToPage and dataPlaceToPage/entryPlaceToPage.
ginPlaceToPage calls dataPlaceToPage or entryPlaceToPage, depending on what
kind of a tree it is, and dataPlaceToPage or entryPlaceToPage contributes
some data to the WAL record. Then ginPlaceToPage adds some more, and
finally calls XLogInsert(). I simplified the SPLIT case so that we now just
create full page images of both page halves. We were already logging all
the data on both page halves, but we were doing it in a "smarter" way,
knowing what kind of pages they were. For example, for an entry tree page,
we included an IndexTuple struct for every tuple on the page, but didn't
include the item pointers. A straight full page image takes more space, but
not much, so I think in any real application it's going to be lost in
noise. (again, haven't measured it though)

Thanks, the code looks cleaner this way. Having looked a bit at the way
ginPlaceToPage works, it seems a bit sensible to refactor
dataPlaceToPageLeaf and dataPlaceToPageInternal to make the WAL operations
smarter than they are now. I am not saying that this is impossible, just
not that straight-forward, especially considering that this code is like
that for a long time.
I actually tried to run some tests on the WAL difference it generates, but
got discouraged by the email Fujii-san sent reporting the crash in gin code
path.

10) Shouldn't the call to XLogBeginInsert come first here?

@@ -1646,7 +1647,16 @@ spgAddNodeAction(Relation index, SpGistState
*state,
{
XLogRecPtr recptr;

-                       recptr = XLogInsert(RM_SPGIST_ID,
XLOG_SPGIST_ADD_NODE, rdata);
+                       XLogRegisterBuffer(0, saveCurrent.buffer,
REGBUF_STANDARD);     /* orig page */
+                       XLogRegisterBuffer(1, current->buffer,
REGBUF_STANDARD);                /* new page */
+                       if (xlrec.parentBlk == 2)
+                               XLogRegisterBuffer(2, parent->buffer,
REGBUF_STANDARD);
+
+                       XLogBeginInsert();
+                       XLogRegisterData(-1, (char *) &xlrec,
sizeof(xlrec));
+                       XLogRegisterData(-1, (char *) newInnerTuple,
newInnerTuple->size);
+
+                       recptr = XLogInsert(RM_SPGIST_ID,
XLOG_SPGIST_ADD_NODE);

Yeah. Apparently we have no regression test coverage of this codepath, or
it would've triggered an assertion. Fixed now, just by looking at the code,
but there's still no test coverage of this so it's likely still broken in
other ways. Before this is committed, I think we'll need to go through the
coverage report and make sure all the XLogInsert() codepaths (and their
redo routines) are covered.

Hm. This is not really related to this patch as it means that even current
code is not tested properly. I wouldn't mind taking a closer look at the
test suite and come up new tests improving coverage of WAL replay and
record construction...

Also, I wanted to run the WAL replay facility to compare before and after

buffer images, but code crashed before... I am still planning to do so
once
this code gets more stable though.

I did that earlier, and did it now again. I got a single difference from a
sp-gist SPLIT_TUPLE operation. I didn't dig deeper right now, will have to
investigate it later.

OK, thanks.

I haven't looked at the patch in more details:
1) Noticed a small type => s/reprsent/represent/g
2) This time make installcheck passed for both master and standby, even on
contrib stuff...
3) I noticed a bug in gin redo code path when trying to use the WAL replay
facility though so I couldn't make a proper run with it.
--
Michael

#29Michael Paquier
Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#28)
Re: WAL format and API changes (9.5)

On Wed, Jul 2, 2014 at 4:09 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

3) I noticed a bug in gin redo code path when trying to use the WAL replay
facility.

Looking at the backtrace, it seems that a page for gin is corrupted. The
buffer capture patch does some sanity checks on the page format when
performing masking and it is failing in one of them on a standby when
kicking ginRedoUpdateMetapage:
if (pd_lower > pd_upper || pd_special < pd_upper ||
pd_lower < SizeOfPageHeaderData || pd_special > BLCKSZ)
{
elog(ERROR, "invalid page at %X/%08X\n",
((PageHeader) page)->pd_lsn.xlogid,
((PageHeader) page)->pd_lsn.xrecoff);
}

frame #4: 0x000000010437dec5 postgres`CheckForBufferLeaks + 165 at
bufmgr.c:1778
frame #5: 0x000000010437df1e postgres`AtProcExit_Buffers(code=1, arg=0)
+ 30 at bufmgr.c:1750
frame #6: 0x000000010438fded postgres`shmem_exit(code=1) + 301 at
ipc.c:263
frame #7: 0x000000010438fc1c postgres`proc_exit_prepare(code=1) + 124
at ipc.c:187
frame #8: 0x000000010438fb63 postgres`proc_exit(code=1) + 19 at
ipc.c:102
frame #9: 0x0000000104555b3c postgres`errfinish(dummy=0) + 1180 at
elog.c:555
frame #10: 0x00000001045590de postgres`elog_finish(elevel=20,
fmt=0x0000000104633d4f) + 830 at elog.c:1362
frame #11: 0x000000010437c1af
postgres`mask_unused_space(page=0x00007fff5bc20a70) + 159 at bufcapt.c:78
frame #12: 0x000000010437b53d
postgres`mask_heap_page(page=0x00007fff5bc20a70) + 29 at bufcapt.c:95
frame #13: 0x000000010437b1cd
postgres`buffer_capture_write(newcontent=0x0000000104ab3980, blkno=0) + 205
at bufcapt.c:329
frame #14: 0x000000010437bc7d postgres`buffer_capture_forget(buffer=82)
+ 349 at bufcapt.c:433
frame #15: 0x00000001043801c9 postgres`LockBuffer(buffer=82, mode=0) +
233 at bufmgr.c:2773
frame #16: 0x00000001043800c8 postgres`UnlockReleaseBuffer(buffer=82) +
24 at bufmgr.c:2554
frame #17: 0x0000000103fefb03
postgres`ginRedoUpdateMetapage(lsn=335350144, record=0x00007fb1740382f0) +
1843 at ginxlog.c:580
frame #18: 0x0000000103fede96 postgres`gin_redo(lsn=335350144,
record=0x00007fb1740382f0) + 534 at ginxlog.c:724
frame #19: 0x00000001040ad692 postgres`StartupXLOG + 8482 at xlog.c:6810
frame #20: 0x0000000104330e0e postgres`StartupProcessMain + 430 at
startup.c:224
frame #21: 0x00000001040c64d9 postgres`AuxiliaryProcessMain(argc=2,
argv=0x00007fff5bc231b0) + 1897 at bootstrap.c:416
frame #22: 0x000000010432b1a8
postgres`StartChildProcess(type=StartupProcess) + 328 at postmaster.c:5090
frame #23: 0x00000001043292b9 postgres`PostmasterMain(argc=3,
argv=0x00007fb173c044e0) + 5401 at postmaster.c:1212
frame #24: 0x000000010426f995 postgres`main(argc=3,
argv=0x00007fb173c044e0) + 773 at main.c:219
Note that I have never seen that with vanilla, only with this patch.
Regards,
--
Michael

#30Michael Paquier
Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#29)
Re: WAL format and API changes (9.5)

On Wed, Jul 2, 2014 at 4:23 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Jul 2, 2014 at 4:09 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

3) I noticed a bug in gin redo code path when trying to use the WAL replay
facility.

This patch has been on status "Waiting on author" for a little bit of
time, marking it now as "Returned with feedback" as there are still
issues with gin record construction and redo code paths.

Feel free to disagree if you don't think this is correct.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#31Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Michael Paquier (#29)
1 attachment(s)
Re: WAL format and API changes (9.5)

Here's a new version of this patch, please review.

I've cleaned up a lot of stuff, fixed all the bugs reported so far, and
a bunch of others I found myself while testing.

I'm not going to explain again what the patch does; the README and
comments should now be complete enough to explain how it works. If not,
please speak up.

In the previous version of this patch, I made the XLogRegisterData
function to copy the WAL data to a temporary buffer, instead of
constructing the XLogRecData chain. I decided to revert back to the old
way after all; I still think that copying would probably be OK from a
performance point of view, but that'd need more testing. We can still
switch to doing it that way later; the XLogRecData struct is no longer
exposed to the functions that generate the WAL records, so it would be a
very isolated change.

I moved the new functions for constructing the WAL record into a new
file, xlogconstruct.c. XLogInsert() now just calls a function called
XLogRecordAssemble(), which returns the full XLogRecData chain that
includes all the data and backup blocks, ready to be written to the WAL
buffer. All the code to construct the XLogRecData chain is now in
xlogrecord.c; this makes XLogInsert() simpler and more readable.

One change resulting from that worth mentioning is that XLogInsert() now
always re-constructs the XLogRecordData chain, if after acquiring the
WALInsertLock it sees that RedoRecPtr changed (i.e. a checkpoint just
started). Before, it would recheck the LSNs on the individual buffers to
see if it's necessary. This is simpler, and I don't think it makes any
difference to performance in practice.

I ran this through my WAL page comparison tool to verify that all the
WAL record types are replayed correctly (although I did some small
cleanup after that, so it's not impossible that I broke it again; will
re-test before committing).

- Heikki

Attachments:

wal-format-and-api-changes-5.patch.gzapplication/gzip; name=wal-format-and-api-changes-5.patch.gz
#32Michael Paquier
Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#31)
Re: WAL format and API changes (9.5)

On Sat, Aug 2, 2014 at 1:40 AM, Heikki Linnakangas <hlinnakangas@vmware.com>
wrote:

I ran this through my WAL page comparison tool to verify that all the WAL
record types are replayed correctly (although I did some small cleanup

after

that, so it's not impossible that I broke it again; will re-test before
committing).

I have run as well some tests on it with a master a a standby to check WAL
construction and replay:
- regression tests as well as tests from contrib, isolation
- WAL page comparison tool
- Some tests with pgbench
- When testing pg_xlogdump I found an assertion failure for record
XLOG_GIN_INSERT:
Assertion failed: (((bool) (((const void*)(&insertData->newitem.key) !=
((void*)0)) && ((&insertData->newitem.key)->ip_posid != 0)))), function
gin_desc, file gindesc.c, line 127.

Then, I had a look at the code, and here are some comments:
- This comment has been introduced with 96ef3b8 that has added page
checksums. Perhaps the authors can tell what was the original purpose of
this comment (Jeff, Simon). For now, it may be better to remove this FIXME
and to let this comment as is.
+                * FIXME: I didn't understand below comment. Is it still
relevant?
+                *
                 * This also means there is no corresponding API call for
this, so an
                 * smgr implementation has no need to implement anything.
Which means
                 * nothing is needed in md.c etc
- Would it be a win to add an assertion check for (CritSectionCount == 0)
in XLogEnsureRecordSpace?
- There is no mention of REGBUF_NO_IMAGE in transam/README.
- This patch adds a lot of blocks like that in the redo code:
+ if (BufferIsValid(buffer))
+                       UnlockReleaseBuffer(buffer);
What do you think about adding a new macro like UnlockBufferIfValid(buffer)?
- Don't we need a call to XLogBeginInsert() in XLogSaveBufferForHint():
+               int             flags = REGBUF_FORCE_IMAGE;
+               if (buffer_std)
+                       flags |= REGBUF_STANDARD;
+               XLogRegisterBuffer(0, buffer, flags);
[...]
-               recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI, rdata);
+               recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI);
- There is a typo in the header of xlogrecord.h, the "to" is not needed:
+ * Definitions for to the WAL record
format.

- There is still a TODO item in ValidXLogRecordHeader to check if block
data header is valid or not. Just mentioning.
- Just came across that: s/reference refes to/reference refers to
- XLogRecGetBlockRefIds needing an already-allocated array for *out is not
user-friendly. Cannot we just do all the work inside this function?
- All the functions in xlogconstruct.c could be defined in a separate
header xlogconstruct.h. What they do is rather independent from xlog.h. The
same applies to all the structures and functions of xlogconstruct.c in
xlog_internal.h: XLogRecordAssemble, XLogRecordAssemble,
InitXLogRecordConstruct. (worth noticing as well that the only reason
XLogRecData is defined externally of xlogconstruct.c is that it is being
used in XLogInsert()).

The patch is more solid and better structured than the previous versions.
It is in good shape.
Regards,
--
Michael

#33Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Michael Paquier (#32)
1 attachment(s)
Re: WAL format and API changes (9.5)

On 08/05/2014 03:50 PM, Michael Paquier wrote:

- When testing pg_xlogdump I found an assertion failure for record
XLOG_GIN_INSERT:
Assertion failed: (((bool) (((const void*)(&insertData->newitem.key) !=
((void*)0)) && ((&insertData->newitem.key)->ip_posid != 0)))), function
gin_desc, file gindesc.c, line 127.

I could not reproduce this. Do you have a test case?

- Would it be a win to add an assertion check for (CritSectionCount == 0)
in XLogEnsureRecordSpace?

Maybe; added.

- There is no mention of REGBUF_NO_IMAGE in transam/README.

Fixed

- This patch adds a lot of blocks like that in the redo code:
+ if (BufferIsValid(buffer))
+                       UnlockReleaseBuffer(buffer);
What do you think about adding a new macro like UnlockBufferIfValid(buffer)?

I don't think such a macro would be an improvement in readability, TBH.

- Don't we need a call to XLogBeginInsert() in XLogSaveBufferForHint():
+               int             flags = REGBUF_FORCE_IMAGE;
+               if (buffer_std)
+                       flags |= REGBUF_STANDARD;
+               XLogRegisterBuffer(0, buffer, flags);
[...]
-               recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI, rdata);
+               recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI);

Indeed, fixed. With that, "initdb -k" works, I had not tested the patch
with page checksums before.

- There is still a TODO item in ValidXLogRecordHeader to check if block
data header is valid or not. Just mentioning.

Removed.

Previously, we ValidXLogRecordHeader checked that the xl_tot_len of a
record doesn't exceed the size of header + xl_len + (space needed for
max number of bkp blocks). But the new record format has no limit on the
amount of data registered with a buffer, so such a test is not possible
anymore. I added the TODO item there to remind me that I need to check
if we need to do something else there instead, but I think it's fine as
it is. We still sanity-check the block data in ValidXLogRecord; the
ValidXLogRecordHeader() check happens before we have read the whole
record header in memory.

- XLogRecGetBlockRefIds needing an already-allocated array for *out is not
user-friendly. Cannot we just do all the work inside this function?

I agree it's not a nice API. Returning a palloc'd array would be nicer,
but this needs to work outside the backend too (e.g. pg_xlogdump). It
could return a malloc'd array in frontend code, but it's not as nice. On
the whole, do you think that be better than the current approach?

- All the functions in xlogconstruct.c could be defined in a separate
header xlogconstruct.h. What they do is rather independent from xlog.h. The
same applies to all the structures and functions of xlogconstruct.c in
xlog_internal.h: XLogRecordAssemble, XLogRecordAssemble,
InitXLogRecordConstruct. (worth noticing as well that the only reason
XLogRecData is defined externally of xlogconstruct.c is that it is being
used in XLogInsert()).

Hmm. I left the defines for xlogconstruct.c functions that are used to
build a WAL record in xlog.h, so that it's not necessary to include both
xlog.h and xlogconstruct.h in all files that write a WAL record
(XLogInsert() is defined in xlog.h).

But perhaps it would be better to move the prototype for XLogInsert to
xlogconstruct.h too? That might be a better division; everything needed
to insert a WAL record would be in xlogconstruct.h, and xlog.h would
left for more internal functions related to WAL. And perhaps rename the
files to xloginsert.c and xloginsert.h.

Here's a new patch with those little things fixed.

- Heikki

Attachments:

wal-format-and-api-changes-6.patch.gzapplication/gzip; name=wal-format-and-api-changes-6.patch.gz
#34Alvaro Herrera
Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#33)
Re: WAL format and API changes (9.5)

Heikki Linnakangas wrote:

On 08/05/2014 03:50 PM, Michael Paquier wrote:

- All the functions in xlogconstruct.c could be defined in a separate
header xlogconstruct.h. What they do is rather independent from xlog.h. The
same applies to all the structures and functions of xlogconstruct.c in
xlog_internal.h: XLogRecordAssemble, XLogRecordAssemble,
InitXLogRecordConstruct. (worth noticing as well that the only reason
XLogRecData is defined externally of xlogconstruct.c is that it is being
used in XLogInsert()).

Hmm. I left the defines for xlogconstruct.c functions that are used
to build a WAL record in xlog.h, so that it's not necessary to
include both xlog.h and xlogconstruct.h in all files that write a
WAL record (XLogInsert() is defined in xlog.h).

But perhaps it would be better to move the prototype for XLogInsert
to xlogconstruct.h too? That might be a better division; everything
needed to insert a WAL record would be in xlogconstruct.h, and
xlog.h would left for more internal functions related to WAL. And
perhaps rename the files to xloginsert.c and xloginsert.h.

Yes, IMO ideally modules that only construct WAL records to insert, but
are not directly involved with other XLog stuff, should only be using a
lean header file, not the whole xlog.h. I imagine xlogconstruct.h would
be such a file. (The patch you just posted doesn't have such a file --
AFAICS that stuff is all in xlog.h still).

No opinion on xlogconstruct vs. xloginsert as file names. Both seem
like good enough names to me. Unless others have stronger opinions, I
would left that decision to you.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#35Alvaro Herrera
Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#33)
Re: WAL format and API changes (9.5)

Heikki Linnakangas wrote:

On 08/05/2014 03:50 PM, Michael Paquier wrote:

- XLogRecGetBlockRefIds needing an already-allocated array for *out is not
user-friendly. Cannot we just do all the work inside this function?

I agree it's not a nice API. Returning a palloc'd array would be
nicer, but this needs to work outside the backend too (e.g.
pg_xlogdump). It could return a malloc'd array in frontend code, but
it's not as nice. On the whole, do you think that be better than the
current approach?

I don't see the issue. Right now, in frontend code you can call
palloc() and pfree(), and they behave like malloc() and free() (see
fe_memutils.c). Your module doesn't need to do anything special.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#36Andres Freund
Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#33)
Re: WAL format and API changes (9.5)

On 2014-08-12 12:44:22 +0300, Heikki Linnakangas wrote:

Here's a new patch with those little things fixed.

I've so far ignored this thread... I'm now taking a look. Unfortunately
I have to admit I'm not unequivocally happy.

* XLogRegisterData/XLogRegisterRecData imo don't really form a
consistent API namewise. What does 'Rec' mean here?

* I'm distinctly not a fan of passing around both the LSN and the struct
XLogRecord to functions. We should bit the bullet and use something
encapsulating both.

* XLogReplayBuffer imo isn't a very good name - the buffer isn't
replayed. If anything it's sometimes replaced by the backup block, but
the real replay still happens outside of that function.

* Why do we need XLogBeginInsert(). Right now this leads to uglyness
like duplicated if (RelationNeedsWAL()) wal checks and
XLogCancelInsert() of inserts that haven't been started. And if we
have Begin, why do we also need Cancel?

* "XXX: do we need to do this for every page?" - yes, and it's every
tuple, not every page... And It needs to be done *after* the tuple has
been put on the page, not before. Otherwise the location will be wrong.

* The patch mixes the API changes around WAL records with changes of how
individual actions are logged. That makes it rather hard to review -
and it's a 500kb patch already.

I realize it's hard to avoid because the new API changes which
information about blocks is available, but it does make it really hard
to see whether the old/new code is doing something
equivalent. It's hard to find more critical code than this :/

Have you done any performance evaluation?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#37Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#36)
Re: WAL format and API changes (9.5)

On 08/13/2014 01:04 AM, Andres Freund wrote:

On 2014-08-12 12:44:22 +0300, Heikki Linnakangas wrote:

Here's a new patch with those little things fixed.

I've so far ignored this thread... I'm now taking a look. Unfortunately
I have to admit I'm not unequivocally happy.

* XLogRegisterData/XLogRegisterRecData imo don't really form a
consistent API namewise. What does 'Rec' mean here?

The latter function is actually called XLogRegisterBufData. The README
was wrong, will fix.

* I'm distinctly not a fan of passing around both the LSN and the struct
XLogRecord to functions. We should bit the bullet and use something
encapsulating both.

You mean, the redo functions? Seems fine to me, and always it's been
like that...

* XLogReplayBuffer imo isn't a very good name - the buffer isn't
replayed. If anything it's sometimes replaced by the backup block, but
the real replay still happens outside of that function.

I agree, but haven't come up with a better name.

* Why do we need XLogBeginInsert(). Right now this leads to uglyness
like duplicated if (RelationNeedsWAL()) wal checks and
XLogCancelInsert() of inserts that haven't been started.

I like clearly marking the beginning of the insert, with
XLogBeginInsert(). We certainly could design it so that it's not needed,
and you could just start registering stuff without calling
XLogBeginInsert() first. But I believe it's more robust this way. For
example, you will get an error at the next XLogBeginInsert(), if a
previous operation had already registered some data without calling
XLogInsert().

And if we have Begin, why do we also need Cancel?

We could just automatically "cancel" any previous insertion when
XLogBeginInsert() is called, but again it seems like bad form to leave
references to buffers and data just lying there, if an operation bails
out after registering some stuff and doesn't finish the insertion.

* "XXX: do we need to do this for every page?" - yes, and it's every
tuple, not every page... And It needs to be done *after* the tuple has
been put on the page, not before. Otherwise the location will be wrong.

That comment is in heap_multi_insert(). All the inserted tuples have the
same command id, seems redundant to log multiple NEW_CID records for them.

* The patch mixes the API changes around WAL records with changes of how
individual actions are logged. That makes it rather hard to review -
and it's a 500kb patch already.

I realize it's hard to avoid because the new API changes which
information about blocks is available, but it does make it really hard
to see whether the old/new code is doing something
equivalent. It's hard to find more critical code than this :/

Yeah, I hear you. I considered doing this piecemeal, just adding the new
functions first so that you could still use the old XLogRecData API,
until all the functions have been converted. But in the end, I figured
it's not worth it, as sooner or later we'd want to convert all the
functions anyway.

Have you done any performance evaluation?

No, not yet.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#38Michael Paquier
Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#33)
Re: WAL format and API changes (9.5)

On Tue, Aug 12, 2014 at 6:44 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 08/05/2014 03:50 PM, Michael Paquier wrote:

- When testing pg_xlogdump I found an assertion failure for record
XLOG_GIN_INSERT:
Assertion failed: (((bool) (((const void*)(&insertData->newitem.key) !=
((void*)0)) && ((&insertData->newitem.key)->ip_posid != 0)))), function
gin_desc, file gindesc.c, line 127.

I could not reproduce this. Do you have a test case?

Indeed. I am not seeing anything now for this record. Perhaps I was
using an incorrect version of pg_xlogdump.

- Don't we need a call to XLogBeginInsert() in XLogSaveBufferForHint():
+               int             flags = REGBUF_FORCE_IMAGE;
+               if (buffer_std)
+                       flags |= REGBUF_STANDARD;
+               XLogRegisterBuffer(0, buffer, flags);
[...]
-               recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI, rdata);
+               recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI);

Indeed, fixed. With that, "initdb -k" works, I had not tested the patch with
page checksums before.

Thanks for the fix. Yes now this works.

- XLogRecGetBlockRefIds needing an already-allocated array for *out is not
user-friendly. Cannot we just do all the work inside this function?

I agree it's not a nice API. Returning a palloc'd array would be nicer, but
this needs to work outside the backend too (e.g. pg_xlogdump). It could
return a malloc'd array in frontend code, but it's not as nice. On the
whole, do you think that be better than the current approach?

A palloc'd array returned is nicer for the user. And I would rather
rewrite the API like that, having it return the array instead:
int *XLogRecGetBlockRefIds(XLogRecord *record, int *num_array)
Alvaro has also outlined that in the FRONTEND context pfree and palloc
would work as pg_free and pg_malloc (didn't recall it before he
mentioned it btw).

- All the functions in xlogconstruct.c could be defined in a separate
header xlogconstruct.h. What they do is rather independent from xlog.h.
The
same applies to all the structures and functions of xlogconstruct.c in
xlog_internal.h: XLogRecordAssemble, XLogRecordAssemble,
InitXLogRecordConstruct. (worth noticing as well that the only reason
XLogRecData is defined externally of xlogconstruct.c is that it is being
used in XLogInsert()).

Hmm. I left the defines for xlogconstruct.c functions that are used to build
a WAL record in xlog.h, so that it's not necessary to include both xlog.h
and xlogconstruct.h in all files that write a WAL record (XLogInsert() is
defined in xlog.h).

But perhaps it would be better to move the prototype for XLogInsert to
xlogconstruct.h too? That might be a better division; everything needed to
insert a WAL record would be in xlogconstruct.h, and xlog.h would left for
more internal functions related to WAL. And perhaps rename the files to
xloginsert.c and xloginsert.h.

Yes indeed. As XLogBeginInsert() is already part of xlogconstruct.c,
it is not weird. This approach has the merit to make XLogRecData
completely bundled with the insertion and construction of the WAL
records. Then for the name xloginsert.c is fine for me too.

Among the things changed since v5, v6 is introducing a safety limit to
the maximum number of backup blocks within a single WAL record:
#define XLR_MAX_BKP_BLOCKS 256
XLogRecordBlockData is updated to use uint8 instead of char to
identify the block id, and XLogEnsureRecordSpace fails if more than
XLR_MAX_BKP_BLOCKS are wanted by the caller. I am fine with this
change, just mentioning it.
Regards,
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#39Michael Paquier
Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#38)
Re: WAL format and API changes (9.5)

On Wed, Aug 13, 2014 at 4:49 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Yes indeed. As XLogBeginInsert() is already part of xlogconstruct.c,
it is not weird. This approach has the merit to make XLogRecData
completely bundled with the insertion and construction of the WAL
records. Then for the name xloginsert.c is fine for me too.

At the same time, renaming XLogInsert to XLogCompleteInsert or
XLogFinishInsert would be nice to make the difference with
XLogBeginInsert. This could include XLogCancel renamed tos
XLogCancelInsert.

Appending the prefix XLogInsert* to those functions would make things
more consistent as well.

But feel free to discard those ideas if you do not like that.
Regards,
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#40Andres Freund
Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#37)
Re: WAL format and API changes (9.5)

On 2014-08-13 02:36:59 +0300, Heikki Linnakangas wrote:

On 08/13/2014 01:04 AM, Andres Freund wrote:

* I'm distinctly not a fan of passing around both the LSN and the struct
XLogRecord to functions. We should bit the bullet and use something
encapsulating both.

You mean, the redo functions? Seems fine to me, and always it's been like
that...

Meh. By that argument we could just not do this patch :)

* XLogReplayBuffer imo isn't a very good name - the buffer isn't
replayed. If anything it's sometimes replaced by the backup block, but
the real replay still happens outside of that function.

I agree, but haven't come up with a better name.

XLogRestoreBuffer(), XLogRecoverBuffer(), XLogReadBufferForReplay(), ...?

* Why do we need XLogBeginInsert(). Right now this leads to uglyness
like duplicated if (RelationNeedsWAL()) wal checks and
XLogCancelInsert() of inserts that haven't been started.

I like clearly marking the beginning of the insert, with XLogBeginInsert().
We certainly could design it so that it's not needed, and you could just
start registering stuff without calling XLogBeginInsert() first. But I
believe it's more robust this way. For example, you will get an error at the
next XLogBeginInsert(), if a previous operation had already registered some
data without calling XLogInsert().

I'm coming around to that view - especially as accidentally nesting xlog
insertions is a real concern with the new API.

And if we have Begin, why do we also need Cancel?

We could just automatically "cancel" any previous insertion when
XLogBeginInsert() is called, but again it seems like bad form to leave
references to buffers and data just lying there, if an operation bails out
after registering some stuff and doesn't finish the insertion.

* "XXX: do we need to do this for every page?" - yes, and it's every
tuple, not every page... And It needs to be done *after* the tuple has
been put on the page, not before. Otherwise the location will be wrong.

That comment is in heap_multi_insert(). All the inserted tuples have the
same command id, seems redundant to log multiple NEW_CID records for them.

Yea, I know it's in heap_multi_insert(). They tuples have the same cid,
but they don't have the same tid. Or well, wouldn't if
log_heap_new_cid() were called after the PutHeapTuple(). Note that this
won't trigger for any normal user defined relations.

* The patch mixes the API changes around WAL records with changes of how
individual actions are logged. That makes it rather hard to review -
and it's a 500kb patch already.

I realize it's hard to avoid because the new API changes which
information about blocks is available, but it does make it really hard
to see whether the old/new code is doing something
equivalent. It's hard to find more critical code than this :/

Yeah, I hear you. I considered doing this piecemeal, just adding the new
functions first so that you could still use the old XLogRecData API, until
all the functions have been converted. But in the end, I figured it's not
worth it, as sooner or later we'd want to convert all the functions anyway.

I think it might be worthwile anyway. I'd be very surprised if there
aren't several significant bugs in the conversion. Your full page
checking tool surely helps to reduce the number, but it's not
foolproof. I can understand not wanting to do it though, it's a
significant amount of work.

Would you ask somebody else to do it in two steps?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#41Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#40)
1 attachment(s)
Re: WAL format and API changes (9.5)

On 08/13/2014 02:07 PM, Andres Freund wrote:

On 2014-08-13 02:36:59 +0300, Heikki Linnakangas wrote:

On 08/13/2014 01:04 AM, Andres Freund wrote:

* The patch mixes the API changes around WAL records with changes of how
individual actions are logged. That makes it rather hard to review -
and it's a 500kb patch already.

I realize it's hard to avoid because the new API changes which
information about blocks is available, but it does make it really hard
to see whether the old/new code is doing something
equivalent. It's hard to find more critical code than this :/

Yeah, I hear you. I considered doing this piecemeal, just adding the new
functions first so that you could still use the old XLogRecData API, until
all the functions have been converted. But in the end, I figured it's not
worth it, as sooner or later we'd want to convert all the functions anyway.

I think it might be worthwile anyway. I'd be very surprised if there
aren't several significant bugs in the conversion. Your full page
checking tool surely helps to reduce the number, but it's not
foolproof. I can understand not wanting to do it though, it's a
significant amount of work.

Would you ask somebody else to do it in two steps?

Hmm, thinking about this some more, there is one sensible way to split
this patch: We can add the XLogReplayBuffer() function and rewrite all
the redo routines to use it, without changing any WAL record formats or
anything in the way the WAL records are constructed. In the patch,
XLogReplayBuffer() takes one input arument, the block reference ID, and
it fetches the RelFileNode and BlockNumber of the block based on that.
Without the WAL format changes, the information isn't there in the
record, but we can require the callers to pass the RelFileNode and
BlockNumber. The final patch will remove those arguments from every
caller, but that's a very mechanical change.

As in the attached patch. I only modified the heapam redo routines to
use the new XLogReplayBuffer() idiom; the idea is to do that for every
redo routine.

After applying such a patch, the main WAL format changing patch becomes
much smaller, and makes it easier to see from the redo routines where
significant changes to the WAL record formats have been made. This also
allows us to split the bikeshedding; we can discuss the name of
XLogReplayBuffer() first :-).

- Heikki

Attachments:

xlogreplaybuffer-1.patchtext/x-diff; name=xlogreplaybuffer-1.patch
#42Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Heikki Linnakangas (#41)
1 attachment(s)
Re: WAL format and API changes (9.5)

On 08/13/2014 04:15 PM, Heikki Linnakangas wrote:

Hmm, thinking about this some more, there is one sensible way to split
this patch: We can add the XLogReplayBuffer() function and rewrite all
the redo routines to use it, without changing any WAL record formats or
anything in the way the WAL records are constructed. In the patch,
XLogReplayBuffer() takes one input arument, the block reference ID, and
it fetches the RelFileNode and BlockNumber of the block based on that.
Without the WAL format changes, the information isn't there in the
record, but we can require the callers to pass the RelFileNode and
BlockNumber. The final patch will remove those arguments from every
caller, but that's a very mechanical change.

As in the attached patch. I only modified the heapam redo routines to
use the new XLogReplayBuffer() idiom; the idea is to do that for every
redo routine.

After applying such a patch, the main WAL format changing patch becomes
much smaller, and makes it easier to see from the redo routines where
significant changes to the WAL record formats have been made. This also
allows us to split the bikeshedding; we can discuss the name of
XLogReplayBuffer() first :-).

Here's a full version of this refactoring patch, all the rmgr's have now
been updated to use XLogReplayBuffer(). I think this is a worthwhile
change on its own, even if we drop the ball on the rest of the WAL
format patch, because it makes the redo-routines more readable. I
propose to commit this as soon as someone has reviewed it, and we agree
on a for what's currently called XLogReplayBuffer(). I have tested this
with my page-image comparison tool.

- Heikki

Attachments:

xlogreplaybuffer-2.patchtext/x-diff; name=xlogreplaybuffer-2.patch
#43Alvaro Herrera
Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#42)
Re: WAL format and API changes (9.5)

Heikki Linnakangas wrote:

Here's a full version of this refactoring patch, all the rmgr's have
now been updated to use XLogReplayBuffer(). I think this is a
worthwhile change on its own, even if we drop the ball on the rest
of the WAL format patch, because it makes the redo-routines more
readable. I propose to commit this as soon as someone has reviewed
it, and we agree on a for what's currently called
XLogReplayBuffer(). I have tested this with my page-image comparison
tool.

XLogReadBufferForReplay ?
ReadBufferForXLogReplay ?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#44Alvaro Herrera
Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#42)
Re: WAL format and API changes (9.5)

Heikki Linnakangas wrote:

Here's a full version of this refactoring patch, all the rmgr's have
now been updated to use XLogReplayBuffer(). I think this is a
worthwhile change on its own, even if we drop the ball on the rest
of the WAL format patch, because it makes the redo-routines more
readable. I propose to commit this as soon as someone has reviewed
it, and we agree on a for what's currently called
XLogReplayBuffer(). I have tested this with my page-image comparison
tool.

What's with XLogReplayLSN and XLogReplayRecord? IMO the xlog code has
more than enough global variables already, it'd be good to avoid two
more if possible. Is there no other good way to get this info down to
whatever it is that needs them?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#45Michael Paquier
Michael Paquier
michael.paquier@gmail.com
In reply to: Alvaro Herrera (#44)
Re: WAL format and API changes (9.5)

On Thu, Aug 14, 2014 at 3:04 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Heikki Linnakangas wrote:
What's with XLogReplayLSN and XLogReplayRecord? IMO the xlog code has
more than enough global variables already, it'd be good to avoid two
more if possible. Is there no other good way to get this info down to
whatever it is that needs them?

Yep, they do not look necessary. Looking at the patch, you could get
rid of XLogReplayLSN and XLogReplayRecord by adding two extra argument
to XLogReplayBuffer: one for the LSN of the current record, and a
second for the record pointer. The code just saves those two variables
in the redo loop of StartupXLOG to only reuse them in
XLogReplayBufferExtended, and I saw no code paths in the redo routines
where XLogReplayBuffer is called at places without the LSN position
and the record pointer.

However I think that Heikki introduced those two variables to make the
move to the next patch easier.
Regards,
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#46Michael Paquier
Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#42)
Re: WAL format and API changes (9.5)

On Thu, Aug 14, 2014 at 2:05 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Here's a full version of this refactoring patch, all the rmgr's have now
been updated to use XLogReplayBuffer(). I think this is a worthwhile change
on its own, even if we drop the ball on the rest of the WAL format patch,
because it makes the redo-routines more readable. I propose to commit this
as soon as someone has reviewed it, and we agree on a for what's currently
called XLogReplayBuffer(). I have tested this with my page-image comparison
tool.

I have as well passed this patch through the page comparison tool and
found no problems, with both regression and isolation tests. I also
had a look at the redo routines that are changed and actually found
nothing. Now, what this patch does is not much complicated, it adds a
couple of status flags returned by XLogReplayBuffer. Then, roughly,
when BLK_NEEDS_REDO is returned the previous restore actions are done.
This has the merit to put the LSN check on current page to determine
if a page needs to be replayed inside XLogReplayBuffer.

A couple of minor comments though:
1) Why changing that from ERROR to PANIC?
        /* Caller specified a bogus block_index */
-       elog(ERROR, "failed to restore block_index %d", block_index);
+       elog(PANIC, "failed to restore block_index %d", block_index);
2) There are some whitespaces, like here:
+               {
+                       destPage = NULL;        /* don't do any page updates */
                }

Regards,
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#47Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Michael Paquier (#45)
Re: WAL format and API changes (9.5)

On 08/14/2014 10:29 AM, Michael Paquier wrote:

On Thu, Aug 14, 2014 at 3:04 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Heikki Linnakangas wrote:
What's with XLogReplayLSN and XLogReplayRecord? IMO the xlog code has
more than enough global variables already, it'd be good to avoid two
more if possible. Is there no other good way to get this info down to
whatever it is that needs them?

Yep, they do not look necessary. Looking at the patch, you could get
rid of XLogReplayLSN and XLogReplayRecord by adding two extra argument
to XLogReplayBuffer: one for the LSN of the current record, and a
second for the record pointer. The code just saves those two variables
in the redo loop of StartupXLOG to only reuse them in
XLogReplayBufferExtended, and I saw no code paths in the redo routines
where XLogReplayBuffer is called at places without the LSN position
and the record pointer.

However I think that Heikki introduced those two variables to make the
move to the next patch easier.

The next patch doesn't necessary require them either, you could always
pass the LSN and record as an argument. I wanted to avoid that, because
every redo function would just pass the current record being replayed,
so it seems nicer to pass that information "out-of-band". I guess if we
do that, though, we should remove those arguments from rm_redo interface
altogether, and always rely on the global variables to get the "current"
record or its LSN. I'm not wedded on this, I could be persuaded to go
either way...

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#48Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Michael Paquier (#46)
Re: WAL format and API changes (9.5)

On 08/14/2014 11:22 AM, Michael Paquier wrote:

1) Why changing that from ERROR to PANIC?
/* Caller specified a bogus block_index */
-       elog(ERROR, "failed to restore block_index %d", block_index);
+       elog(PANIC, "failed to restore block_index %d", block_index);

No reason, just a leftover from debugging. Please ignore.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#49Andres Freund
Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#47)
Re: WAL format and API changes (9.5)

On 2014-08-14 12:41:35 +0300, Heikki Linnakangas wrote:

On 08/14/2014 10:29 AM, Michael Paquier wrote:

On Thu, Aug 14, 2014 at 3:04 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Heikki Linnakangas wrote:
What's with XLogReplayLSN and XLogReplayRecord? IMO the xlog code has
more than enough global variables already, it'd be good to avoid two
more if possible. Is there no other good way to get this info down to
whatever it is that needs them?

Yep, they do not look necessary. Looking at the patch, you could get
rid of XLogReplayLSN and XLogReplayRecord by adding two extra argument
to XLogReplayBuffer: one for the LSN of the current record, and a
second for the record pointer. The code just saves those two variables
in the redo loop of StartupXLOG to only reuse them in
XLogReplayBufferExtended, and I saw no code paths in the redo routines
where XLogReplayBuffer is called at places without the LSN position
and the record pointer.

However I think that Heikki introduced those two variables to make the
move to the next patch easier.

The next patch doesn't necessary require them either, you could always pass
the LSN and record as an argument. I wanted to avoid that, because every
redo function would just pass the current record being replayed, so it seems
nicer to pass that information "out-of-band". I guess if we do that, though,
we should remove those arguments from rm_redo interface altogether, and
always rely on the global variables to get the "current" record or its LSN.
I'm not wedded on this, I could be persuaded to go either way...

I personally find the out of band transport really ugly.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#50Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#49)
2 attachment(s)
Re: WAL format and API changes (9.5)

On 08/14/2014 12:44 PM, Andres Freund wrote:

On 2014-08-14 12:41:35 +0300, Heikki Linnakangas wrote:

On 08/14/2014 10:29 AM, Michael Paquier wrote:

On Thu, Aug 14, 2014 at 3:04 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Heikki Linnakangas wrote:
What's with XLogReplayLSN and XLogReplayRecord? IMO the xlog code has
more than enough global variables already, it'd be good to avoid two
more if possible. Is there no other good way to get this info down to
whatever it is that needs them?

Yep, they do not look necessary. Looking at the patch, you could get
rid of XLogReplayLSN and XLogReplayRecord by adding two extra argument
to XLogReplayBuffer: one for the LSN of the current record, and a
second for the record pointer. The code just saves those two variables
in the redo loop of StartupXLOG to only reuse them in
XLogReplayBufferExtended, and I saw no code paths in the redo routines
where XLogReplayBuffer is called at places without the LSN position
and the record pointer.

However I think that Heikki introduced those two variables to make the
move to the next patch easier.

The next patch doesn't necessary require them either, you could always pass
the LSN and record as an argument. I wanted to avoid that, because every
redo function would just pass the current record being replayed, so it seems
nicer to pass that information "out-of-band". I guess if we do that, though,
we should remove those arguments from rm_redo interface altogether, and
always rely on the global variables to get the "current" record or its LSN.
I'm not wedded on this, I could be persuaded to go either way...

I personally find the out of band transport really ugly.

All rightey.

Here's the next version of this work. It now comes as two patches. The
first one refactors adds the XLogReplayBuffer() function and refactors
all the redo functions to use it. It doesn't change the WAL record
format in any way. The second patch applies over the first one, and
changes the WAL format, and all the WAL-creating functions to use the
new API for constructing WAL records. The second patch removes the
relfilenode and block number arguments from XLogReplayBuffer, because
they're no longer needed when that information is included in the record
format.

Todo:

* Performance testing. Do the new WAL construction functions add
overhead to WAL insertion?
* Compare WAL record sizes before and after. I've tried to keep it as
compact as possible, but I know that some records have gotten bigger.
Need to do a more thorough analysis.
* Rename XLogReplayBuffer. I don't particularly like any of the
suggestions so far, XLogReplayBuffer included. Discussion continues..
* Anything else?

- Heikki

Attachments:

0001-Refactor-redo-routines-to-use-XLogReplayBuffer.patch.gzapplication/gzip; name=0001-Refactor-redo-routines-to-use-XLogReplayBuffer.patch.gz
0002-Change-the-way-WAL-records-are-constructed.patch.gzapplication/gzip; name=0002-Change-the-way-WAL-records-are-constructed.patch.gz
#51Michael Paquier
Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#50)
Re: WAL format and API changes (9.5)

On Mon, Aug 18, 2014 at 10:55 PM, Heikki Linnakangas <
hlinnakangas@vmware.com> wrote:

All rightey.

Here's the next version of this work. It now comes as two patches. The

first

one refactors adds the XLogReplayBuffer() function and refactors all the
redo functions to use it. It doesn't change the WAL record format in any
way. The second patch applies over the first one, and changes the WAL
format, and all the WAL-creating functions to use the new API for
constructing WAL records. The second patch removes the relfilenode and

block

number arguments from XLogReplayBuffer, because they're no longer needed
when that information is included in the record format.

Todo:

* Performance testing. Do the new WAL construction functions add overhead

to

WAL insertion?
* Compare WAL record sizes before and after. I've tried to keep it as
compact as possible, but I know that some records have gotten bigger. Need
to do a more thorough analysis.
* Rename XLogReplayBuffer. I don't particularly like any of the

suggestions

so far, XLogReplayBuffer included. Discussion continues..
* Anything else?

Patch 1 looks good. The main difference with the v1 submitted a couple of
days back is that the global variable approach is replaced with additional
arguments for the LSN position and record pointer in XLogReplayBuffer. I
have as well run a couple of tests with the page comparison tool, done some
tests based on installcheck-world with a slave replaying WAL behind, and
found no issues with it.
Perhaps we could consider pushing it to facilitate the next work? Even if
the second patch is dropped it is still a win IMO to have backup block
replay managed within a single function (being it named XLogReplayBuffer in
latest patch), and having it return a status flag.

Regarding patch 2:
- The main differences with the latest version are the modifications for
XLogReplayBuffer having new arguments (LSN position and record pointer).
XLogRecGetBlockRefIds has been changed to return a palloc'd array of block
IDs. xloginsert.h, containing all the functions for xlog record
construction is introduced as well.
- Tiny thing, be aware of tab padding. Here is heapam.c:
page = BufferGetPage(buffer);
PageSetAllVisible(page);
MarkBufferDirty(buffer);
- XLogRecGetBlockRefIds is not described in
src/backend/access/transam/README. Btw, pg_xlogdump drops a core dump when
using it:
--Output:
Assertion failed: (n == *num_refs), function XLogRecGetBlockRefIds, file
xlogreader.c, line 1157.
rmgr: Heap len (rec/tot): 14/ 12912, tx: 3, lsn:
0/01000148, prev 0/01000120, Abort trap: 6 (core dumped)
-- Backtrace:
frame #4: 0x0000000103870363
pg_xlogdump`XLogRecGetBlockRefIds(record=0x00007ff38a003200,
num_refs=0x00007fff5c394464) + 435 at xlogreader.c:1157
frame #5: 0x000000010386d610
pg_xlogdump`XLogDumpDisplayRecord(config=0x00007fff5c3945c8,
ReadRecPtr=16777544, record=0x00007ff38a003200) + 272 at pg_xlogdump.c:357
frame #6: 0x000000010386cad8 pg_xlogdump`main(argc=2,
argv=0x00007fff5c394658) + 3160 at pg_xlogdump.c:749
In order to reproduce that, simply run regression tests, followed by
pg_xlogdump on one of the WAL files generated.
- This patch includes some diffs from pg_receivexlog.c taken from 52bffe3.
- I have run installcheck-world and compared the size of the WAL generated
on HEAD and the patch (any hints to improve such analysis are of course
welcome)
name | start | stop | diff
----------------+-----------+------------+-----------
HEAD (8605bc7) | 0/16C6808 | 0/11A2C670 | 271998568
Patch 1+2 | 0/16D45D8 | 0/1267A4B0 | 284843736
(2 rows)
So that's a diff of more or less 13MB for this test set.

Looking forward for some performance numbers as well as more precise
comparison of WAL record length.
--
Michael

#52Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Michael Paquier (#51)
Re: WAL format and API changes (9.5)

On 08/19/2014 11:07 AM, Michael Paquier wrote:

Patch 1 looks good. The main difference with the v1 submitted a couple of
days back is that the global variable approach is replaced with additional
arguments for the LSN position and record pointer in XLogReplayBuffer. I
have as well run a couple of tests with the page comparison tool, done some
tests based on installcheck-world with a slave replaying WAL behind, and
found no issues with it.

Thanks!

Perhaps we could consider pushing it to facilitate the next work? Even if
the second patch is dropped it is still a win IMO to have backup block
replay managed within a single function (being it named XLogReplayBuffer in
latest patch), and having it return a status flag.

Yeah, that was my plan.

Regarding the name, the following have been suggested so far:

XLogReplayBuffer
XLogRestoreBuffer
XLogRecoverBuffer
XLogReadBufferForReplay
ReadBufferForXLogReplay

One more idea:

XLogRedoBuffer (would be like three first options above, but would match
that we call the functions that call this "redo" functions)

I think XLogReadBufferForReplay is the most descriptive. Andres and
Alvaro both suggested it - independently I believe - so that seems to
come out naturally. But maybe make it XLogReadBufferForRedo, since we
call the redo functions redo functions and not replay functions.

Yet another option is to just call it XLogReadBuffer, and rename the
existing XLogReadBuffer to something else. With the 2nd patch, there are
only a few callers of XLogReadBuffer left. But is it too deceitful if
XLogReadBuffer doesn't merely read the page, but also sometimes replaces
it with a full-page image? Maybe it's OK..

Barring objections or better ideas, I'm leaning towards
XLogReadBufferForRedo.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#53Alvaro Herrera
Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#52)
Re: WAL format and API changes (9.5)

Heikki Linnakangas wrote:

Barring objections or better ideas, I'm leaning towards
XLogReadBufferForRedo.

WFM

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#54Andres Freund
Andres Freund
andres@2ndquadrant.com
In reply to: Alvaro Herrera (#53)
Re: WAL format and API changes (9.5)

On 2014-08-19 10:33:29 -0400, Alvaro Herrera wrote:

Heikki Linnakangas wrote:

Barring objections or better ideas, I'm leaning towards
XLogReadBufferForRedo.

WFM

for me too. Although we could imo strip the 'XLog' in the beginning if
we want to make it shorter. The ForRedo is saying that pretty much.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#55Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#54)
Re: WAL format and API changes (9.5)

On 08/19/2014 05:38 PM, Andres Freund wrote:

On 2014-08-19 10:33:29 -0400, Alvaro Herrera wrote:

Heikki Linnakangas wrote:

Barring objections or better ideas, I'm leaning towards
XLogReadBufferForRedo.

WFM

for me too. Although we could imo strip the 'XLog' in the beginning if
we want to make it shorter. The ForRedo is saying that pretty much.

I committed the redo-routine refactoring patch. I kept the XLog prefix
in the XLogReadBufferForRedo name; it's redundant, but all the other
similar functions in xlogutils.c use the XLog prefix so it would seem
inconsistent to not have it here.

I'll post a new version of the main patch shortly...

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#56Michael Paquier
Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#55)
Re: WAL format and API changes (9.5)

On Tue, Sep 2, 2014 at 9:23 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

I committed the redo-routine refactoring patch. I kept the XLog prefix in
the XLogReadBufferForRedo name; it's redundant, but all the other similar
functions in xlogutils.c use the XLog prefix so it would seem inconsistent
to not have it here.

Thanks! Even that will be helpful for a potential patch doing
consistency comparisons of FPW with current pages having WAL of a
record applied.

I'll post a new version of the main patch shortly...

Looking forward to seeing it.

Regards,
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#57Michael Paquier
Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#55)
Re: WAL format and API changes (9.5)

On Tue, Sep 2, 2014 at 9:23 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

I committed the redo-routine refactoring patch. I kept the XLog prefix in
the XLogReadBufferForRedo name; it's redundant, but all the other similar
functions in xlogutils.c use the XLog prefix so it would seem inconsistent
to not have it here.

Thanks! Even that will be helpful for a potential patch doing
consistency comparisons of FPW with current pages having WAL of a
record applied.

I'll post a new version of the main patch shortly...

Looking forward to seeing it.

Regards,
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#58Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Michael Paquier (#57)
3 attachment(s)
Re: WAL format and API changes (9.5)

On 09/04/2014 03:39 AM, Michael Paquier wrote:

On Tue, Sep 2, 2014 at 9:23 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

I committed the redo-routine refactoring patch. I kept the XLog prefix in
the XLogReadBufferForRedo name; it's redundant, but all the other similar
functions in xlogutils.c use the XLog prefix so it would seem inconsistent
to not have it here.

Thanks! Even that will be helpful for a potential patch doing
consistency comparisons of FPW with current pages having WAL of a
record applied.

I'll post a new version of the main patch shortly...

Looking forward to seeing it.

Here we go. I've split this again into two patches. The first patch is
just refactoring the current code. It moves XLogInsert into a new file,
xloginsert.c, and the definition of XLogRecord to new xlogrecord.h
header file. As a result, there is a a lot of churn in the #includes in
C files that generate WAL records, or contain redo routines. The number
of files that pull in xlog.h - directly or indirectly through other
headers - is greatly reduced.

The second patch contains the interesting changes.

I wrote a little benchmark kit to performance test this. I'm trying to
find out two things:

1) How much CPU overhead do the new XLogBeginInsert and XLogRegister*
functions add, compared to the current approach with XLogRecDatas.

2) How much extra WAL is generated with the patch. This affects the CPU
time spent in the tests, but it's also interesting to measure directly,
because WAL size affects many things like WAL archiving, streaming
replication etc.

Attached is the test kit I'm using. To run the battery of tests, use
"psql -f run.sql". To answer the question of WAL volume, it runs a bunch
of tests that exercise heap insert, update and delete, as well as b-tree
and GIN insertions. To answer the second test, it runs a heap insertion
test, with a tiny record size that's chosen so that it generates exactly
the same amount of WAL after alignment with and without the patch. The
test is repeated many times, and the median of runtimes is printed out.

Here are the results, comparing unpatched and patched versions. First,
the WAL sizes:

postgres=# \i compare.sql
description | wal_per_op (orig) | wal_per_op (patched) | %
--------------------------------+-------------------+----------------------+--------
heap insert 26 | 64 | 64 | 100.00
heap insert 27 | 64 | 72 | 112.50
heap insert 28 | 64 | 72 | 112.50
heap insert 29 | 64 | 72 | 112.50
heap insert 30 | 72 | 72 | 100.00
heap insert 31 | 72 | 72 | 100.00
heap insert 32 | 72 | 72 | 100.00
heap insert 33 | 72 | 72 | 100.00
heap insert 34 | 72 | 72 | 100.00
heap insert 35 | 72 | 80 | 111.11
heap update 26 | 80 | 80 | 100.00
heap update 27 | 80 | 88 | 110.00
heap update 28 | 107 | 88 | 82.24
heap update 29 | 88 | 88 | 100.00
heap update 30 | 88 | 108 | 122.73
heap update 31 | 88 | 88 | 100.00
heap update 32 | 105 | 88 | 83.81
heap update 33 | 88 | 88 | 100.00
heap update 34 | 88 | 102 | 115.91
heap update 35 | 88 | 96 | 109.09
hot update 26 | 112 | 80 | 71.43
hot update 27 | 80 | 88 | 110.00
hot update 28 | 80 | 94 | 117.50
hot update 29 | 88 | 88 | 100.00
hot update 30 | 105 | 88 | 83.81
hot update 31 | 88 | 105 | 119.32
hot update 32 | 88 | 88 | 100.00
hot update 33 | 88 | 88 | 100.00
hot update 34 | 124 | 88 | 70.97
hot update 35 | 88 | 111 | 126.14
heap + btree insert 26 | 149 | 157 | 105.37
heap + btree insert 27 | 161 | 161 | 100.00
heap + btree insert 28 | 177 | 178 | 100.56
heap + btree insert 29 | 177 | 185 | 104.52
heap + btree insert 30 | 178 | 185 | 103.93
heap + btree insert 31 | 185 | 188 | 101.62
heap + btree insert 32 | 202 | 202 | 100.00
heap + btree insert 33 | 205 | 211 | 102.93
heap + btree insert 34 | 202 | 210 | 103.96
heap + btree insert 35 | 211 | 210 | 99.53
heap + gin insert (fastupdate) | 12479 | 13182 | 105.63
heap + gin insert | 232547 | 236677 | 101.78
(42 rows)

A heap insertion records are 2 bytes larger with the patch. Due to
alignment, that makes for a 0 or 8 byte difference in the record sizes.
Other WAL records have a similar store; a few extra bytes but no big
regressions. There are a few outliers above where it appears that the
patched version takes less space. Not sure why that would be, probably
just a glitch in the test, autovacuum kicked in or something.

Now, for the CPU overhead:

description | dur_us (orig) | dur_us (patched) | %
----------------+---------------+------------------+--------
heap insert 30 | 0.7752835 | 0.831883 | 107.30
(1 row)

So, the patched version runs 7.3 % slower. That's disappointing :-(.

This are the result I got on my laptop today. Previously, the typical
result I've gotten has been about 5%, so that's a bit high.
Nevertheless, even a 5% slowdown is probably not acceptable.

While I've trying to nail down where that difference comes from, I've
seen a lot of strange phenomenon. At one point, the patched version was
10% slower, but I was able to bring the difference down to 5% if I added
a certain function in xloginsert.c, but never called it. It was very
repeatable at the time, I tried adding and removing it many times and
always got the same result, but I don't see it with the current HEAD and
patch version anymore. So I think 5% is pretty close to the margin of
error that arises from different compiler optimizations,
data/instruction cache effects etc.

Looking at the 'perf' profile, The new function calls only amount to
about 2% of overhead, so I'm not sure where the slowdown is coming from.
Here are explanations I've considered, but I haven't been able to prove
any of them:

* Function call overhead of the new functions. I've tried inlining them,
but found no big difference.

* The relation and block information are included as a separate
XLogRecData entry, so the chain that needs to be memcpy'd and CRCd is
one entry longer. I've tried hacking away the extra entry, but haven't
seen much difference.

* Even though the record size is the same after alignment, it's 2 bytes
longer without alignment, which happens to be about 5% of the total
record size. I've tried modifying the record to be 2 bytes smaller for
test purposes, but found no difference.

I'm out of ideas at the moment. Anyone else?

- Heikki

Attachments:

0001-Move-the-backup-block-logic-from-XLogInsert-to-xlogi.patchtext/x-diff; name=0001-Move-the-backup-block-logic-from-XLogInsert-to-xlogi.patch
0002-Change-the-way-WAL-records-are-constructed.patchtext/x-diff; name=0002-Change-the-way-WAL-records-are-constructed.patch
walperftest.tar.gzapplication/gzip; name=walperftest.tar.gz
#59Alvaro Herrera
Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#58)
Re: WAL format and API changes (9.5)

Heikki Linnakangas wrote:

Here we go. I've split this again into two patches. The first patch
is just refactoring the current code. It moves XLogInsert into a new
file, xloginsert.c, and the definition of XLogRecord to new
xlogrecord.h header file. As a result, there is a a lot of churn in
the #includes in C files that generate WAL records, or contain redo
routines. The number of files that pull in xlog.h - directly or
indirectly through other headers - is greatly reduced.

I think you should push the first patch for now and continue to
investigate the issues in the second patch. Some minor suggestions I
have for the first patch are that since xlog.h is no longer used by
other headers (but only by .c files), you should just #include
xloginsert.h instead of doing the forward declaration of struct
XLogRecData; and in xlog_internal.h, instead of

- * XLogRecord is defined in xlog.h, but we avoid #including that to keep
+ * XLogRecord is defined in xlogrecord.h, but we avoid #including that to keep

I would just suggest to #include xlogrecord.h, since it should just
work to include that file from frontend programs. It didn't work for
xlog.h because that one #includes fmgr.h IIRC and that one causes Datum
to appear, which makes compilation blow up. Shouldn't be the case here.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#60Michael Paquier
Michael Paquier
michael.paquier@gmail.com
In reply to: Alvaro Herrera (#59)
Re: WAL format and API changes (9.5)

On Mon, Sep 15, 2014 at 7:03 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Heikki Linnakangas wrote:

Here we go. I've split this again into two patches. The first patch
is just refactoring the current code. It moves XLogInsert into a new
file, xloginsert.c, and the definition of XLogRecord to new
xlogrecord.h header file. As a result, there is a a lot of churn in
the #includes in C files that generate WAL records, or contain redo
routines. The number of files that pull in xlog.h - directly or
indirectly through other headers - is greatly reduced.

I think you should push the first patch for now and continue to
investigate the issues in the second patch. Some minor suggestions I
have for the first patch are that since xlog.h is no longer used by
other headers (but only by .c files), you should just #include
xloginsert.h instead of doing the forward declaration of struct
XLogRecData; and in xlog_internal.h, instead of

- * XLogRecord is defined in xlog.h, but we avoid #including that to keep
+ * XLogRecord is defined in xlogrecord.h, but we avoid #including that to keep

I would just suggest to #include xlogrecord.h, since it should just
work to include that file from frontend programs. It didn't work for
xlog.h because that one #includes fmgr.h IIRC and that one causes Datum
to appear, which makes compilation blow up. Shouldn't be the case here.

Alvaro got faster than me... I was just looking at the first patch and
+1 on those comments. It is worth noting that the first patch, as it
does only a large refactoring, does not impact performance or size of
WAL records. Btw, a declaration of access/xlog.h in fd.c is forgotten.
Also, if somebody is interested in running the test suite, there are
comments on how to do it at the top of compare.sql.
-- 
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#61Michael Paquier
Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#60)
Re: WAL format and API changes (9.5)

On Mon, Sep 15, 2014 at 8:00 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Alvaro got faster than me... I was just looking at the first patch and
+1 on those comments. It is worth noting that the first patch, as it
does only a large refactoring, does not impact performance or size of
WAL records.

Some comments after a second and closer read of the first patch:
1) Wouldn't it be better to call GetFullPageWriteInfo directly in
XLogRecordAssemble, removing RedoRecPtr and doPageWrites from its
arguments?
2) XLogCheckBufferNeedsBackup is not used. It can be removed.
3) If I am following correctly, there are two code paths where
XLogInsertRecData can return InvalidXLogRecPtr, and then there is this
process in XLogInsert
+               {
+                       /*
+                        * Undo the changes we made to the rdata chain.
+                        *
+                        * XXX: This doesn't undo *all* the changes;
the XLogRecData
+                        * entries for buffers that we had already
decided to back up have
+                        * had their data-pointers cleared. That's OK,
as long as we
+                        * decide to back them up on the next
iteration as well. Hence,
+                        * don't allow "doPageWrites" value to go from
true to false after
+                        * we've modified the rdata chain.
+                        */
+                       bool            newDoPageWrites;
+
+                       GetFullPageWriteInfo(&RedoRecPtr, &newDoPageWrites);
+                       if (!doPageWrites && newDoPageWrites)
+                               doPageWrites = true;
+                       rdt_lastnormal->next = NULL;
+               }
Wouldn't it be better to keep that in XLogInsertRecData? This way
GetFullPageWriteInfo could be removed (I actually see little point in
keeping it as part of this patch, but does this change make its sense
in patch 2?).
4) This patch is in DOS encoding (?!)

That's all I have for now...
Regards,
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#62Andres Freund
Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#58)
Re: WAL format and API changes (9.5)

On 2014-09-15 15:41:22 +0300, Heikki Linnakangas wrote:

Here we go. I've split this again into two patches. The first patch is just
refactoring the current code. It moves XLogInsert into a new file,
xloginsert.c, and the definition of XLogRecord to new xlogrecord.h header
file. As a result, there is a a lot of churn in the #includes in C files
that generate WAL records, or contain redo routines. The number of files
that pull in xlog.h - directly or indirectly through other headers - is
greatly reduced.

The second patch contains the interesting changes.

I wrote a little benchmark kit to performance test this. I'm trying to find
out two things:

1) How much CPU overhead do the new XLogBeginInsert and XLogRegister*
functions add, compared to the current approach with XLogRecDatas.

2) How much extra WAL is generated with the patch. This affects the CPU time
spent in the tests, but it's also interesting to measure directly, because
WAL size affects many things like WAL archiving, streaming replication etc.

Attached is the test kit I'm using. To run the battery of tests, use "psql
-f run.sql". To answer the question of WAL volume, it runs a bunch of tests
that exercise heap insert, update and delete, as well as b-tree and GIN
insertions. To answer the second test, it runs a heap insertion test, with a
tiny record size that's chosen so that it generates exactly the same amount
of WAL after alignment with and without the patch. The test is repeated many
times, and the median of runtimes is printed out.

Here are the results, comparing unpatched and patched versions. First, the
WAL sizes:

A heap insertion records are 2 bytes larger with the patch. Due to
alignment, that makes for a 0 or 8 byte difference in the record sizes.
Other WAL records have a similar store; a few extra bytes but no big
regressions. There are a few outliers above where it appears that the
patched version takes less space. Not sure why that would be, probably just
a glitch in the test, autovacuum kicked in or something.

I've to admit, that's already not a painless amount of overhead.

Now, for the CPU overhead:

description | dur_us (orig) | dur_us (patched) | %
----------------+---------------+------------------+--------
heap insert 30 | 0.7752835 | 0.831883 | 107.30
(1 row)

So, the patched version runs 7.3 % slower. That's disappointing :-(.

This are the result I got on my laptop today. Previously, the typical result
I've gotten has been about 5%, so that's a bit high. Nevertheless, even a 5%
slowdown is probably not acceptable.

Yes, I definitely think it's not.

While I've trying to nail down where that difference comes from, I've seen a
lot of strange phenomenon. At one point, the patched version was 10% slower,
but I was able to bring the difference down to 5% if I added a certain
function in xloginsert.c, but never called it. It was very repeatable at the
time, I tried adding and removing it many times and always got the same
result, but I don't see it with the current HEAD and patch version anymore.
So I think 5% is pretty close to the margin of error that arises from
different compiler optimizations, data/instruction cache effects etc.

Looking at the 'perf' profile, The new function calls only amount to about
2% of overhead, so I'm not sure where the slowdown is coming from. Here are
explanations I've considered, but I haven't been able to prove any of them:

I'd suggest doing:
a) perf stat -vvv of both workloads. That will often tell you stuff already
b) Look at other events. Particularly stalled-cycles-frontend,
stalled-cycles-backend, cache-misses

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#63Michael Paquier
Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#61)
Re: WAL format and API changes (9.5)

On Mon, Sep 15, 2014 at 9:16 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

2) XLogCheckBufferNeedsBackup is not used. It can be removed.

Please ignore this comment, XLogCheckBufferNeedsBackup is used in heapam.c.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#64Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#62)
2 attachment(s)
Re: WAL format and API changes (9.5)

On 09/16/2014 01:21 PM, Andres Freund wrote:

While I've trying to nail down where that difference comes from, I've seen a
lot of strange phenomenon. At one point, the patched version was 10% slower,
but I was able to bring the difference down to 5% if I added a certain
function in xloginsert.c, but never called it. It was very repeatable at the
time, I tried adding and removing it many times and always got the same
result, but I don't see it with the current HEAD and patch version anymore.
So I think 5% is pretty close to the margin of error that arises from
different compiler optimizations, data/instruction cache effects etc.

Looking at the 'perf' profile, The new function calls only amount to about
2% of overhead, so I'm not sure where the slowdown is coming from. Here are
explanations I've considered, but I haven't been able to prove any of them:

I'd suggest doing:
a) perf stat -vvv of both workloads. That will often tell you stuff already
b) Look at other events. Particularly stalled-cycles-frontend,
stalled-cycles-backend, cache-misses

That didn't make me any wiser, unfortunately.

After a lot of experimentation, I figured out that the slowdown is
already apparent with the *first* patch, the one that just refactors
existing XLogInsert, without any changes to the WAL format or to the
callers. I fiddled with that for a long time, trying to tease apart the
change that makes the difference, and was finally able to narrow it down.

Attached are two patches. These are both just refactoring, with no
changes to the WAL format or APIs. The first is the same as the
refactoring patch I posted earlier, with only minor changes to
#includes, comments and such (per Alvaro's and Michael's suggestions -
thanks!). With the first patch, the test case I've been using to
performance test this becomes somewhere between 5% - 10% slower.
Applying the second patch on top of that restores the performance back
to what you get without these patches.

Strange. The second patch moves the CRC calculation from a separate loop
over the XLogRecDatas to the earlier loop, that also iterates through
the XLogRecDatas. The strange thing about this is that when I tried to
make the same change to current git master, without applying the first
patch, it didn't make any difference. The CRC calculation used to
integrated to the earlier loop in 9.1 and before, but in 9.2 it was
moved to a separate loop for simplicity, because it didn't make any
difference to performance.

So I now have a refactoring patch ready that I'd like to commit (the
attached two patches together), but to be honest, I have no idea why the
second patch is so essential to performance.

If someone else wants to try this out, the performance difference can be
seen with the test suite I posted earlier, or with an even simpler
pgbench test script:

-- create the test table once, truncate between pgbench runs:
create table test (id int4);

-- pgbench script:
insert into test select i from generate_series(1,100) i;

- Heikki

Attachments:

0001-Move-the-backup-block-logic-from-XLogInsert-to-a-new.patchtext/x-diff; name=0001-Move-the-backup-block-logic-from-XLogInsert-to-a-new.patch
0002-Move-the-CRC-calculations-into-the-other-loop-throug.patchtext/x-diff; name=0002-Move-the-CRC-calculations-into-the-other-loop-throug.patch
#65Andres Freund
Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#64)
Re: WAL format and API changes (9.5)

On 2014-10-03 15:51:37 +0300, Heikki Linnakangas wrote:

After a lot of experimentation, I figured out that the slowdown is already
apparent with the *first* patch, the one that just refactors existing
XLogInsert, without any changes to the WAL format or to the callers. I
fiddled with that for a long time, trying to tease apart the change that
makes the difference, and was finally able to narrow it down.

Attached are two patches. These are both just refactoring, with no changes
to the WAL format or APIs. The first is the same as the refactoring patch I
posted earlier, with only minor changes to #includes, comments and such (per
Alvaro's and Michael's suggestions - thanks!). With the first patch, the
test case I've been using to performance test this becomes somewhere between
5% - 10% slower. Applying the second patch on top of that restores the
performance back to what you get without these patches.

Strange. The second patch moves the CRC calculation from a separate loop
over the XLogRecDatas to the earlier loop, that also iterates through the
XLogRecDatas. The strange thing about this is that when I tried to make the
same change to current git master, without applying the first patch, it
didn't make any difference. The CRC calculation used to integrated to the
earlier loop in 9.1 and before, but in 9.2 it was moved to a separate loop
for simplicity, because it didn't make any difference to performance.

So I now have a refactoring patch ready that I'd like to commit (the
attached two patches together), but to be honest, I have no idea why the
second patch is so essential to performance.

Maybe a stupid question, but did you verify it's not caused by splitting
xloginsert over two translation units and/or not inlining the separate
function?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#66Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#65)
Re: WAL format and API changes (9.5)

On 10/03/2014 04:10 PM, Andres Freund wrote:

On 2014-10-03 15:51:37 +0300, Heikki Linnakangas wrote:

After a lot of experimentation, I figured out that the slowdown is already
apparent with the *first* patch, the one that just refactors existing
XLogInsert, without any changes to the WAL format or to the callers. I
fiddled with that for a long time, trying to tease apart the change that
makes the difference, and was finally able to narrow it down.

Attached are two patches. These are both just refactoring, with no changes
to the WAL format or APIs. The first is the same as the refactoring patch I
posted earlier, with only minor changes to #includes, comments and such (per
Alvaro's and Michael's suggestions - thanks!). With the first patch, the
test case I've been using to performance test this becomes somewhere between
5% - 10% slower. Applying the second patch on top of that restores the
performance back to what you get without these patches.

Strange. The second patch moves the CRC calculation from a separate loop
over the XLogRecDatas to the earlier loop, that also iterates through the
XLogRecDatas. The strange thing about this is that when I tried to make the
same change to current git master, without applying the first patch, it
didn't make any difference. The CRC calculation used to integrated to the
earlier loop in 9.1 and before, but in 9.2 it was moved to a separate loop
for simplicity, because it didn't make any difference to performance.

So I now have a refactoring patch ready that I'd like to commit (the
attached two patches together), but to be honest, I have no idea why the
second patch is so essential to performance.

Maybe a stupid question, but did you verify it's not caused by splitting
xloginsert over two translation units and/or not inlining the separate
function?

Hmph, now that I try this again, I don't see any difference with or
without the "second" patch - they both perform just as well as unpatched
master. So I'm totally baffled again, but I guess the patch is good to
go. I need a more stable test environment..

Yeah, I did try merging xlog.c and xloginsert.c into the same .c file at
some point, and marking the XLogInsertRecData function as static,
allowing the compiler to inline it, but I didn't see any difference. But
you have to take that with a grain of salt - I'm apparently not capable
of constructing a properly repeatable test case for this.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#67Michael Paquier
Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#64)
Re: WAL format and API changes (9.5)

On Fri, Oct 3, 2014 at 9:51 PM, Heikki Linnakangas <hlinnakangas@vmware.com>
wrote:

So I now have a refactoring patch ready that I'd like to commit (the

attached two patches together), but to be honest, I have no idea why the
second patch is so essential to performance.
Thanks. I did some more tests with master, master+patch1, master+patch1+CRC
refactoring, but I am not able to see any performance difference with
pgbench (--no-vacuum, -t) and the test suite you provided, just some noise
that barely changed performance. Note that fd.c uses
SYNC_METHOD_FSYNC_WRITETHROUGH, so it is necessary to include xlog.h in it.
--
Michael

#68Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Michael Paquier (#67)
1 attachment(s)
Re: WAL format and API changes (9.5)

On 10/06/2014 04:42 AM, Michael Paquier wrote:

On Fri, Oct 3, 2014 at 9:51 PM, Heikki Linnakangas <hlinnakangas@vmware.com>
wrote:

So I now have a refactoring patch ready that I'd like to commit (the

attached two patches together), but to be honest, I have no idea why the
second patch is so essential to performance.
Thanks. I did some more tests with master, master+patch1, master+patch1+CRC
refactoring, but I am not able to see any performance difference with
pgbench (--no-vacuum, -t) and the test suite you provided, just some noise
that barely changed performance.

Thanks for the confirmation. I'm really going crazy with benchmarking
this. Sometimes I see a big difference, the next day it's gone.

Note that fd.c uses
SYNC_METHOD_FSYNC_WRITETHROUGH, so it is necessary to include xlog.h in it.

Ah, thanks! You mentioned that before, but I didn't see it because my
laptop doesn't HAVE_FSYNC_WRITETHROUGH. Fixed now.

Attached is a new version of the first, refactoring-only patch. Notable
changes since last version:

* Fixed XLogSaveBufferForHint. It didn't initialize BkpBlock struct,
rendering it completely broken.

* Moved XLogSaveBufferForHint, log_newpage and log_newpage_buffer to
xloginsert.c. XLogSaveBufferForHint needs to share the XLogFillBkpBlock
function with the rest of the code in xloginsert.c, so this allows
keeping it static. And it makes sense anyway - those are high-level
functions, comparable to XLogInsert, not low-level stuff related to
managing WAL itself.

Barring objections, I'll commit this, and then continue benchmarking the
second patch with the WAL format and API changes.

- Heikki

Attachments:

0001-Move-the-backup-block-logic-from-XLogInsert-to-a-new.patchtext/x-diff; name=0001-Move-the-backup-block-logic-from-XLogInsert-to-a-new.patch
#69Andres Freund
Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#68)
Re: WAL format and API changes (9.5)

On 2014-10-06 14:19:39 +0300, Heikki Linnakangas wrote:

On 10/06/2014 04:42 AM, Michael Paquier wrote:

On Fri, Oct 3, 2014 at 9:51 PM, Heikki Linnakangas <hlinnakangas@vmware.com>
wrote:

So I now have a refactoring patch ready that I'd like to commit (the

attached two patches together), but to be honest, I have no idea why the
second patch is so essential to performance.
Thanks. I did some more tests with master, master+patch1, master+patch1+CRC
refactoring, but I am not able to see any performance difference with
pgbench (--no-vacuum, -t) and the test suite you provided, just some noise
that barely changed performance.

Thanks for the confirmation. I'm really going crazy with benchmarking this.
Sometimes I see a big difference, the next day it's gone.

A usual suspect for this is turbo mode and power control. It often
already helps to disable the latter to get much more reproducible
benchmark results.

Barring objections, I'll commit this, and then continue benchmarking the
second patch with the WAL format and API changes.

I'd like to have a look at it beforehand. I've not yet really looked,
but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't
make me happy...

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#70Michael Paquier
Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#68)
Re: WAL format and API changes (9.5)

On Mon, Oct 6, 2014 at 8:19 PM, Heikki Linnakangas <hlinnakangas@vmware.com>
wrote:

On 10/06/2014 04:42 AM, Michael Paquier wrote:

On Fri, Oct 3, 2014 at 9:51 PM, Heikki Linnakangas <
hlinnakangas@vmware.com>
wrote:

So I now have a refactoring patch ready that I'd like to commit (the

attached two patches together), but to be honest, I have no idea why the
second patch is so essential to performance.
Thanks. I did some more tests with master, master+patch1,
master+patch1+CRC
refactoring, but I am not able to see any performance difference with
pgbench (--no-vacuum, -t) and the test suite you provided, just some noise
that barely changed performance.

Thanks for the confirmation. I'm really going crazy with benchmarking
this. Sometimes I see a big difference, the next day it's gone.

The benchmark paradigms.

* Fixed XLogSaveBufferForHint. It didn't initialize BkpBlock struct,
rendering it completely broken.

Note for other reviewers: that's represented by this addition in
XLogSaveBufferForHint:
+               /* Make a BkpBlock struct representing the buffer */
+               XLogFillBkpBlock(buffer, buffer_std, &bkpb)

Regards,
--
Michael

#71Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#69)
1 attachment(s)
Re: WAL format and API changes (9.5)

On 10/06/2014 02:29 PM, Andres Freund wrote:

On 2014-10-06 14:19:39 +0300, Heikki Linnakangas wrote:

Barring objections, I'll commit this, and then continue benchmarking the
second patch with the WAL format and API changes.

I'd like to have a look at it beforehand.

Ping? Here's an rebased patch. I'd like to proceed with this.

I've not yet really looked,
but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't
make me happy...

Ok.. Can you elaborate?

- Heikki

Attachments:

0001-Move-the-backup-block-logic-from-XLogInsert-to-a-new-2.patchtext/x-diff; name=0001-Move-the-backup-block-logic-from-XLogInsert-to-a-new-2.patch
#72Andres Freund
Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#71)
Re: WAL format and API changes (9.5)

On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote:

On 10/06/2014 02:29 PM, Andres Freund wrote:

On 2014-10-06 14:19:39 +0300, Heikki Linnakangas wrote:

Barring objections, I'll commit this, and then continue benchmarking the
second patch with the WAL format and API changes.

I'd like to have a look at it beforehand.

Ping? Here's an rebased patch. I'd like to proceed with this.

Doing so.

I've not yet really looked,
but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't
make me happy...

Ok.. Can you elaborate?

To me the split between xloginsert.c doing some of the record assembly,
and xlog.c doing the lower level part of the assembly is just wrong.

And it leads to things like the already complicated logic to retry after
detecting missing fpws is now split across two files seems to confirm
that. What happens right now is that XLogInsert() (with some helper
routines) assembles the record. Then hands that off to
XLogInsertRecData(). Which sometimes returns InvalidXLogRecPtr, and
returns back to XLogInsert(), which reverses *some* of its work and then
retries. Brr.

Similary the 'page_writes_omitted' logic doesn't make me particularly
happy. Previously we retried when there actually was a page affected by
the different RedoRecPtr. Now we do it as soon as our copy of RedoRecPtr
is out of date? Won't that quite often spuriously trigger retries? Am I
missing something?
Arguably this doesn't happen often enough to matter, but it's still
something that we should explicitly discuss.

The implementation of the split seems to change the meaning of
TRACE_POSTGRESQL_XLOG_INSERT - it's now counting retries. I don't
particularly care about those tracepoints, but I see no simplification
due to changing its meaning.

Aside: In C11 function local statics become a bit more expensive - they
essentially require atomics and/or spinlocks on the first few calls.

Next thing: The patch doesn't actually compile. Misses an #include
storage/proc.h for MyPgXact.

I'm not a big fan of the naming for the new split. We have
* XLogInsert() which is the external interface
* XLogRecordAssemble() which builds the rdata chain that will be
inserted
* XLogInsertRecData() which actually inserts the data into the xl buffers.

To me that's pretty inconsistent.

I'm also somewhat confused about the new split of the headers. I'm not
adverse to splitting them, but it's getting a bit hard to understand:
* xlog.h - generic mishmash
* xlogdefs.h - Three typedefs and some #defines
* xlog_fn.h - SQL functions
* xlog_internal.h - Pretty random collection of stuff
* xlogutils.h - "Utilities for replaying WAL records"
* xlogreader.h - "generic XLog reading facility"
* xloginsert.h - "Functions for generating WAL records"
* xlogrecord.h - "Definitions for the WAL record format."

Isn't that a bit too much? Pretty much the only files that have a
recognizable separation to me are xlog_fn.h and xlogreader.h.

You've removed the
- *
- * NB: this routine feels free to scribble on the XLogRecData structs,
- * though not on the data they reference. This is OK since the
XLogRecData
- * structs are always just temporaries in the calling code.
comment, but we still do, no?

While not this patches blame, I see currently we finish the critical
section in XLogInsert/XLogInsertRecData pretty early:
END_CRIT_SECTION();

/*
* Update shared LogwrtRqst.Write, if we crossed page boundary.
*/
if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
{
SpinLockAcquire(&XLogCtl->info_lck);
/* advance global request to include new block(s) */
if (XLogCtl->LogwrtRqst.Write < EndPos)
XLogCtl->LogwrtRqst.Write = EndPos;
/* update local result copy while I have the chance */
LogwrtResult = XLogCtl->LogwrtResult;
SpinLockRelease(&XLogCtl->info_lck);
}
I don't think this is particularly critical, but it still looks wrong to me.

Hm. I think that's what I see for now.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#73Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#72)
Re: WAL format and API changes (9.5)

On 10/30/2014 06:02 PM, Andres Freund wrote:

On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote:

On 10/06/2014 02:29 PM, Andres Freund wrote:

I've not yet really looked,
but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't
make me happy...

Ok.. Can you elaborate?

To me the split between xloginsert.c doing some of the record assembly,
and xlog.c doing the lower level part of the assembly is just wrong.

Really? To me, that feels like a natural split. xloginsert.c is
responsible for constructing the final WAL record, with the backup
blocks and all. It doesn't access any shared memory (related to WAL; it
does look at Buffers, to determine what to back up). The function in
xlog.c inserts the assembled record to the WAL, as is. It handles the
locking and WAL buffer management related to that.

What would you suggest? I don't think putting everything in one
XLogInsert function, like we have today, is better. Note that the second
patch makes xloginsert.c a lot more complicated.

And it leads to things like the already complicated logic to retry after
detecting missing fpws is now split across two files seems to confirm
that. What happens right now is that XLogInsert() (with some helper
routines) assembles the record. Then hands that off to
XLogInsertRecData(). Which sometimes returns InvalidXLogRecPtr, and
returns back to XLogInsert(), which reverses *some* of its work and then
retries. Brr.

Modifying the chain that was already constructed is not pretty, but it's
not this patch's fault. I don't think splitting the functions makes that
any worse. (BTW, the follow-up patch to change the WAL format fixes
that; the per-buffer data is kept separately from the main data chain).

Similary the 'page_writes_omitted' logic doesn't make me particularly
happy. Previously we retried when there actually was a page affected by
the different RedoRecPtr. Now we do it as soon as our copy of RedoRecPtr
is out of date? Won't that quite often spuriously trigger retries? Am I
missing something?
Arguably this doesn't happen often enough to matter, but it's still
something that we should explicitly discuss.

The situation where it leads to a spurious retry is when the constructed
WAL record skipped the FPW of a buffer, and RedoRecPtr was updated, but
the buffer stilll doesn't need to be FPW according to the updated
RedoRecPtr. That only happens if the same buffer was updated (by another
backend) after the new checkpoint began. I believe that's extremely rare.

We could make that more fine-grained, though. Instead of passing a
boolean 'fpw_sensitive' flag to XLogInsertRecData, pass the lowest LSN
among the pages whose FPW was skipped. If that's less than the new
RedoRecPtr, then retry is needed. That would eliminate the spurious retries.

The implementation of the split seems to change the meaning of
TRACE_POSTGRESQL_XLOG_INSERT - it's now counting retries. I don't
particularly care about those tracepoints, but I see no simplification
due to changing its meaning.

I wasn't sure what to do about that. I don't use tracepoints, and I
don't know what others use them for.

I'm not a big fan of the naming for the new split. We have
* XLogInsert() which is the external interface
* XLogRecordAssemble() which builds the rdata chain that will be
inserted
* XLogInsertRecData() which actually inserts the data into the xl buffers.

To me that's pretty inconsistent.

Got a better suggestion? I wanted to keep the name XLogInsert()
unchanged, to avoid code churn, and because it's still a good name for
that. XLogRecordAssemble is pretty descriptive for what it does,
although "XLogRecord" implies that it construct an XLogRecord struct. It
does fill in that too, but the main purpose is to build the XLogRecData
chain. Perhaps XLogAssembleRecord() would be better.

I'm not very happy with XLogInsertRecData myself. XLogInsertRecord?

I'm also somewhat confused about the new split of the headers. I'm not
adverse to splitting them, but it's getting a bit hard to understand:
* xlog.h - generic mishmash
* xlogdefs.h - Three typedefs and some #defines
* xlog_fn.h - SQL functions
* xlog_internal.h - Pretty random collection of stuff
* xlogutils.h - "Utilities for replaying WAL records"
* xlogreader.h - "generic XLog reading facility"
* xloginsert.h - "Functions for generating WAL records"
* xlogrecord.h - "Definitions for the WAL record format."

Isn't that a bit too much? Pretty much the only files that have a
recognizable separation to me are xlog_fn.h and xlogreader.h.

This is a matter of taste, I guess. I find xlog.h, xlogdefs.h and
xlog_internal.h fairly random, while the rest are have a clear function.
xlogutils.h is needed by redo functions, and xloginsert.h is needed for
generating WAL.

Note that the second patch adds more stuff to xloginsert.h and xlogrecord.h.

You've removed the
- *
- * NB: this routine feels free to scribble on the XLogRecData structs,
- * though not on the data they reference. This is OK since the
XLogRecData
- * structs are always just temporaries in the calling code.
comment, but we still do, no?

True. XLogInsertRecData doesn't scribble on its input, but XLogInsert
still does. (this is moot after the second patch though)

While not this patches blame, I see currently we finish the critical
section in XLogInsert/XLogInsertRecData pretty early:
END_CRIT_SECTION();

/*
* Update shared LogwrtRqst.Write, if we crossed page boundary.
*/
if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
{
SpinLockAcquire(&XLogCtl->info_lck);
/* advance global request to include new block(s) */
if (XLogCtl->LogwrtRqst.Write < EndPos)
XLogCtl->LogwrtRqst.Write = EndPos;
/* update local result copy while I have the chance */
LogwrtResult = XLogCtl->LogwrtResult;
SpinLockRelease(&XLogCtl->info_lck);
}
I don't think this is particularly critical, but it still looks wrong to me.

Updating LogwrtRqst.Write is not critical. If it's not done for some
reason, everything still works. Keeping the critical section as small as
possible is a good idea, no?

Hm. I think that's what I see for now.

Thanks!

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#74Andres Freund
Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#58)
Re: WAL format and API changes (9.5)

Hi,

On 2014-09-15 15:41:22 +0300, Heikki Linnakangas wrote:

The second patch contains the interesting changes.

Heikki's pushed the newest version of this to the git tree.

Some things I noticed while reading the patch:
* potential mismerge:
+++ b/src/bin/pg_basebackup/pg_receivexlog.c
@@ -408,7 +408,7 @@ main(int argc, char **argv)
        }
    }
-   while ((c = getopt_long(argc, argv, "D:d:h:p:U:s:S:nF:wWv",
+   while ((c = getopt_long(argc, argv, "D:d:h:p:U:s:nF:wWv",
                            long_options, &option_index)) != -1)
    {
        switch (c)

* Can't you just have REGBUF_WILL_INIT include REGBUF_NO_IMAGE? Then you
don't need to test both.

* Is it really a good idea to separate XLogRegisterBufData() from
XLogRegisterBuffer() the way you've done it ? If we ever actually get
a record with a large numbers of blocks touched this issentially is
O(touched_buffers*data_entries).

* At least in Assert mode XLogRegisterBuffer() should ensure we're not
registering the same buffer twice. It's a pretty easy to make mistake
to do it twice for some kind of actions (think UPDATE), and the
consequences will be far from obvious afaics.

* There's lots of functions like XLogRecHasBlockRef() that dig through
the wal record. A common pattern is something like:
if (XLogRecHasBlockRef(record, 1))
XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk);
else
oldblk = newblk;

I think doing that repeatedly is quite a bad idea. We should parse the
record once and then use it in a sensible format. Not do it in pieces,
over and over again. It's not like we ignore backup blocks - so doing
this lazily doesn't make sense to me.
Especially as ValidXLogRecord() *already* has parsed the whole damn
thing once.

* Maybe XLogRegisterData() and XLogRegisterBufData() should accept void*
instead of char*? Right now there's lots of pointless casts in callers
needed that don't add any safety. The one additional line that's then
needed in these functions seems well worth it.

* There's a couple record types (e.g. XLOG_SMGR_TRUNCATE) that only
refer to the relation, but not to the block number. These still log
their rnode manually. Shouldn't we somehow deal with those in a
similar way explicit block references are dealt with?

* You've added an assert in RestoreBackupBlockContents() that ensures
the page isn't new. That wasn't there before, instead there was a
special case for it. I can't immediately think how that previously
could happen, but I do wonder why it was there.

* The following comment in +RestoreBackupBlock probably is two lines to
  early
+   /* Found it, apply the update */
+   if (!(bkpb->fork_flags & BKPBLOCK_HAS_IMAGE))
+       return InvalidBuffer;
  This new InvalidBuffer case probably could use a comment in either
  XLogReadBufferForRedoExtended or here.

* Most of the commentary above RestoreBackupBlock() probably doesn't
belong there anymore.

* Maybe XLogRecordBlockData.fork_flags should be renamed? Maybe just
into flags_fork? Right now it makes at least me think that it's fork
specific flags, which really isn't the actual meaning.

* I wonder if it wouldn't be worthwile, for the benefit of the FPI
compression patch, to keep the bkp block data after *all* the
"headers". That'd make it easier to just compress the data.

* +InitXLogRecordConstruct() seems like a remainder of the
xlogconstruct.c naming. I'd just go for InitXLogInsert()

* Hm. At least WriteMZeroPageXlogRec (and probably the same for all the
other slru stuff) doesn't include a reference to the page. Isn't that
bad? Shouldn't we make XLogRegisterBlock() usable for that case?
Otherwise I fail to see how pg_rewind like tools can sanely deal with this?

* Why did you duplicate the identical cases in btree_desc()? I guess due
to rebasing ontop the xlogdump stats series?

* I haven't actually looked at the xlogdump output so I might be
completely wrong, but won't the output of e.g. updates be harder to
read now because it's not clear which of the references old/new
buffers are? Not that it matters that much.

* s/recdataend/recdata_end/? The former is somewhat hard to parse for
me.

* I think heap_xlog_update is buggy for wal_level=logical because it
computes the length of the tuple using
tuplen = recdataend - recdata;
But the old primary key/old tuple value might be stored there as
well. Afaics that code has to continue to use xl_heap_header_len.

* It looks to me like insert wal logging could just use REGBUF_KEEP_DATA
  to get rid of:
+       /*
+        * The new tuple is normally stored as buffer 0's data. But if
+        * XLOG_HEAP_CONTAINS_NEW_TUPLE flag is set, it's part of the main
+        * data, after the xl_heap_insert struct.
+        */
+       if (xlrec->flags & XLOG_HEAP_CONTAINS_NEW_TUPLE)
+       {
+           data = XLogRecGetData(record) + SizeOfHeapInsert;
+           datalen = record->xl_len - SizeOfHeapInsert;
+       }
+       else
+           data = XLogRecGetBlockData(record, 0, &datalen);
 or have I misunderstood how that works?
* spurious reindent
@@ -7306,9 +7120,9 @@ heap_xlog_visible(XLogRecPtr lsn, XLogRecord *record)
         * XLOG record's LSN, we mustn't mark the page all-visible, because
         * the subsequent update won't be replayed to clear the flag.
         */
-       page = BufferGetPage(buffer);
-       PageSetAllVisible(page);
-       MarkBufferDirty(buffer);
+           page = BufferGetPage(buffer);
+           PageSetAllVisible(page);
+           MarkBufferDirty(buffer);
    }

* Somewhat long line:
XLogRegisterBuffer(1, heap_buffer, XLogHintBitIsNeeded() ? REGBUF_STANDARD : (REGBUF_STANDARD | REGBUF_NO_IMAGE));

Man. This page is friggin huge. Now I'm tired ;)

This patch needs at the very least:
* Careful benchmarking of both primary and standbys
* Very careful review of the many changed XLogInsert/replay sites. I'd
be very surprised if there weren't bugs hidden somewhere. I've lost
the energy to read them all now.
* A good amount of consideration about the increase in WAL volume. I
think there's some cases where that's not inconsiderable.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#75Michael Paquier
Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#73)
Re: WAL format and API changes (9.5)

On Fri, Oct 31, 2014 at 2:52 AM, Heikki Linnakangas <hlinnakangas@vmware.com

wrote:

On 10/30/2014 06:02 PM, Andres Freund wrote:

On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote:

On 10/06/2014 02:29 PM, Andres Freund wrote:

I've not yet really looked,
but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't
make me happy...

Ok.. Can you elaborate?

To me the split between xloginsert.c doing some of the record assembly,
and xlog.c doing the lower level part of the assembly is just wrong.

Really? To me, that feels like a natural split. xloginsert.c is
responsible for constructing the final WAL record, with the backup blocks
and all. It doesn't access any shared memory (related to WAL; it does look
at Buffers, to determine what to back up). The function in xlog.c inserts
the assembled record to the WAL, as is. It handles the locking and WAL
buffer management related to that.

FWIW, I tend to the same opinion here.

What would you suggest? I don't think putting everything in one XLogInsert

function, like we have today, is better. Note that the second patch makes
xloginsert.c a lot more complicated.

I recall some time ago seeing complaints that xlog.c is too complicated and
should be refactored. Any effort in this area is a good thing IMO, and this
split made sense when I went through the code.

I'm not a big fan of the naming for the new split. We have

* XLogInsert() which is the external interface
* XLogRecordAssemble() which builds the rdata chain that will be
inserted
* XLogInsertRecData() which actually inserts the data into the xl buffers.

To me that's pretty inconsistent.

Got a better suggestion? I wanted to keep the name XLogInsert() unchanged,
to avoid code churn, and because it's still a good name for that.
XLogRecordAssemble is pretty descriptive for what it does, although
"XLogRecord" implies that it construct an XLogRecord struct. It does fill
in that too, but the main purpose is to build the XLogRecData chain.
Perhaps XLogAssembleRecord() would be better.

I'm not very happy with XLogInsertRecData myself. XLogInsertRecord?

+1 for XLogInsertRecord.
-- 
Michael
#76Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#72)
1 attachment(s)
Re: WAL format and API changes (9.5)

On 10/30/2014 06:02 PM, Andres Freund wrote:

On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote:

On 10/06/2014 02:29 PM, Andres Freund wrote:

On 2014-10-06 14:19:39 +0300, Heikki Linnakangas wrote:

Barring objections, I'll commit this, and then continue benchmarking the
second patch with the WAL format and API changes.

I'd like to have a look at it beforehand.

Ping? Here's an rebased patch. I'd like to proceed with this.

Doing so.

Here's a new version of this refactoring patch. It fixes all the
concrete issues you pointed out.

I've not yet really looked,
but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't
make me happy...

Ok.. Can you elaborate?

To me the split between xloginsert.c doing some of the record assembly,
and xlog.c doing the lower level part of the assembly is just wrong.

I moved the checks for bootstrap mode and xl_len == 0, from the current
XLogInsert to the new XLogInsert() function. I don't imagine that to be
enough address your feelings about the split between xlog.c and
xloginsert.c, but makes it a tiny bit clearer IMHO. I don't know what
else to do about it, as it feels very sensible to me as it is now. So
unless you object more loudly, I'm going to just go ahead with this and
commit, probably tomorrow.

And it leads to things like the already complicated logic to retry after
detecting missing fpws is now split across two files seems to confirm
that. What happens right now is that XLogInsert() (with some helper
routines) assembles the record. Then hands that off to
XLogInsertRecData(). Which sometimes returns InvalidXLogRecPtr, and
returns back to XLogInsert(), which reverses *some* of its work and then
retries. Brr.

Similary the 'page_writes_omitted' logic doesn't make me particularly
happy. Previously we retried when there actually was a page affected by
the different RedoRecPtr. Now we do it as soon as our copy of RedoRecPtr
is out of date? Won't that quite often spuriously trigger retries? Am I
missing something?
Arguably this doesn't happen often enough to matter, but it's still
something that we should explicitly discuss.

I replaced the boolean with an XLogRecPtr of the smallest LSN among the
pages that were modified but not backed up. That brings the behavior
back to the way it is currently; no more spurious retries. Thanks for
making me to think harder about that, I think this way is also clearer
than the boolean.

The implementation of the split seems to change the meaning of
TRACE_POSTGRESQL_XLOG_INSERT - it's now counting retries. I don't
particularly care about those tracepoints, but I see no simplification
due to changing its meaning.

Fixed.

Next thing: The patch doesn't actually compile. Misses an #include
storage/proc.h for MyPgXact.

Fixed.

I'm not a big fan of the naming for the new split. We have
* XLogInsert() which is the external interface
* XLogRecordAssemble() which builds the rdata chain that will be
inserted
* XLogInsertRecData() which actually inserts the data into the xl buffers.

To me that's pretty inconsistent.

Renamed XLogInsertRecData() to XLogInsertRecord(). Not sure if it makes
it more consistent, but it sounds better anyway.

You've removed the
- *
- * NB: this routine feels free to scribble on the XLogRecData structs,
- * though not on the data they reference. This is OK since the
XLogRecData
- * structs are always just temporaries in the calling code.
comment, but we still do, no?

Fixed.

- Heikki

Attachments:

0001-Move-the-backup-block-logic-from-XLogInsert-to-a-new-3.patchtext/x-diff; name=0001-Move-the-backup-block-logic-from-XLogInsert-to-a-new-3.patch
#77Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Heikki Linnakangas (#76)
Re: WAL format and API changes (9.5)

On Tue, Nov 4, 2014 at 10:03 PM, Heikki Linnakangas <hlinnakangas@vmware.com>
wrote:

On 10/30/2014 06:02 PM, Andres Freund wrote:

On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote:

On 10/06/2014 02:29 PM, Andres Freund wrote:

On 2014-10-06 14:19:39 +0300, Heikki Linnakangas wrote:

Barring objections, I'll commit this, and then continue benchmarking

the

second patch with the WAL format and API changes.

I'd like to have a look at it beforehand.

Ping? Here's an rebased patch. I'd like to proceed with this.

Doing so.

Here's a new version of this refactoring patch. It fixes all the concrete

issues you pointed out.

I've not yet really looked,
but on a quick readthrough XLogInsertRecData() staying in xlog.c

doesn't

make me happy...

Ok.. Can you elaborate?

To me the split between xloginsert.c doing some of the record assembly,
and xlog.c doing the lower level part of the assembly is just wrong.

I moved the checks for bootstrap mode and xl_len == 0, from the current

XLogInsert to the new XLogInsert() function. I don't imagine that to be
enough address your feelings about the split between xlog.c and
xloginsert.c, but makes it a tiny bit clearer IMHO. I don't know what else
to do about it, as it feels very sensible to me as it is now. So unless you
object more loudly, I'm going to just go ahead with this and commit,
probably tomorrow.

Few observations while reading the latest patch:

1.
+XLogRecPtr
+XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
{
..
+ /* info's high bits are reserved for use by me */
+ if (info & XLR_INFO_MASK)
+ elog(PANIC, "invalid xlog info mask %02X", info);
..
}

Earlier before this check, we use to check XLogInsertAllowed()
which is moved to XLogInsertRecord(), isn't it better to keep
the check in beginning of the function XLogInsert()?

2.
XLogRecPtr
XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
{
..
if (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr && doPageWrites)
{
/*
* Oops, some buffer now needs to be backed up that the caller
* didn't back up. Start over.
*/
WALInsertLockRelease();
END_CRIT_SECTION();
return InvalidXLogRecPtr;
}
..
}

IIUC, there can be 4 states for doPageWrites w.r.t when record is getting
assembled (XLogRecordAssemble) and just before actual insert (
XLogInsertRecord)

I think the behaviour for the state when doPageWrites is true during
XLogRecordAssemble and false during XLogInsertRecord (just before
actual insert) is different as compare to HEAD.

In the patch, it will not retry if doPageWrites is false when we are
about to insert even though fpw_lsn <= RedoRecPtr whereas in HEAD it
will detect this during assembly of record and retry, isn't this a
problem?

3. There are couple of places where *XLogInsert* is used in wal.sgml
and it seems to me some of those needs change w.r.t this patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#78Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit Kapila (#77)
Re: WAL format and API changes (9.5)

On 11/05/2014 09:06 AM, Amit Kapila wrote:

1.
+XLogRecPtr
+XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
{
..
+ /* info's high bits are reserved for use by me */
+ if (info & XLR_INFO_MASK)
+ elog(PANIC, "invalid xlog info mask %02X", info);
..
}

Earlier before this check, we use to check XLogInsertAllowed()
which is moved to XLogInsertRecord(), isn't it better to keep
the check in beginning of the function XLogInsert()?

Doesn't matter much, but I think it's cleanest the way I had it in the
latest patch. XLogInsert() is responsible for building a valid WAL
record to insert, and XLogInsertRecord() handles all the locking, buffer
management etc. to actually insert it. The high-bit check quoted above
is more related to building a valid record, so it belongs in
XLogInsert(), while the XLogInsertAllowed() check is more related to the
actual insertion of the record, so it belongs in XLogInsertRecord().

2.
XLogRecPtr
XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
{
..
if (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr && doPageWrites)
{
/*
* Oops, some buffer now needs to be backed up that the caller
* didn't back up. Start over.
*/
WALInsertLockRelease();
END_CRIT_SECTION();
return InvalidXLogRecPtr;
}
..
}

IIUC, there can be 4 states for doPageWrites w.r.t when record is getting
assembled (XLogRecordAssemble) and just before actual insert (
XLogInsertRecord)

I think the behaviour for the state when doPageWrites is true during
XLogRecordAssemble and false during XLogInsertRecord (just before
actual insert) is different as compare to HEAD.

In the patch, it will not retry if doPageWrites is false when we are
about to insert even though fpw_lsn <= RedoRecPtr whereas in HEAD it
will detect this during assembly of record and retry, isn't this a
problem?

So the scenario is that:

* XLogRecordAssemble decides that a page doesn't need to be backed up
* both RedoRecPtr and doPageWrites change while building the record.
doPageWrites goes from true to false.

Without the patch, we would retry, because first check RedoRecPtr has
changed. With the patch, we notice that even though RedoRecPtr has
changed, doPageWrites is now off, so no FPWs are required regardless of
RedoRecPtr, and not retry. Yeah, you're right, the behavior changes in
that case. However, the new behavior is correct; the retry is
unnecessary in that scenario.

3. There are couple of places where *XLogInsert* is used in wal.sgml
and it seems to me some of those needs change w.r.t this patch.

Hmm. It's a bit strange to mention XLogInsert and XLogFlush functions by
name in that text. It's otherwise admin-level stuff, on how to set
WAL-related settings. Up to 9.2 that manual page actually called them
"LogInsert" and "LogFlush". But I guess you're right; if we are to
mention the function by name, it should say XLogInsertRecord.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#79Andres Freund
Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#76)
Re: WAL format and API changes (9.5)

On 2014-11-04 18:33:34 +0200, Heikki Linnakangas wrote:

On 10/30/2014 06:02 PM, Andres Freund wrote:

On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote:

On 10/06/2014 02:29 PM, Andres Freund wrote:

I've not yet really looked,
but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't
make me happy...

Ok.. Can you elaborate?

To me the split between xloginsert.c doing some of the record assembly,
and xlog.c doing the lower level part of the assembly is just wrong.

I moved the checks for bootstrap mode and xl_len == 0, from the current
XLogInsert to the new XLogInsert() function. I don't imagine that to be
enough address your feelings about the split between xlog.c and
xloginsert.c, but makes it a tiny bit clearer IMHO. I don't know what else
to do about it, as it feels very sensible to me as it is now. So unless you
object more loudly, I'm going to just go ahead with this and commit,
probably tomorrow.

I'm still not particularly happy with the split. But I think this patch
is enough of a win anyhow. So go ahead.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#80Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#74)
Re: WAL format and API changes (9.5)

On 10/30/2014 09:19 PM, Andres Freund wrote:

Some things I noticed while reading the patch:

A lot of good comments, but let me pick up just two that are related:

* There's a couple record types (e.g. XLOG_SMGR_TRUNCATE) that only
refer to the relation, but not to the block number. These still log
their rnode manually. Shouldn't we somehow deal with those in a
similar way explicit block references are dealt with?

* Hm. At least WriteMZeroPageXlogRec (and probably the same for all the
other slru stuff) doesn't include a reference to the page. Isn't that
bad? Shouldn't we make XLogRegisterBlock() usable for that case?
Otherwise I fail to see how pg_rewind like tools can sanely deal with this?

Yeah, there are still operations that modify relation pages, but don't
store the information about the modified pages in the standard format.
That includes XLOG_SMGR_TRUNCATE that you spotted, and XLOG_SMGR_CREATE,
and also XLOG_DBASE_CREATE/DROP. And then there are updates to
non-relation files, like all the slru stuff, relcache init files, etc.
And updates to the FSM and VM bypass the full-page write mechanism too.

To play it safe, pg_rewind copies all non-relation files as is. That
includes all SLRUs, FSM and VM files, and everything else whose filename
doesn't match the (main fork of) a relation file. Of course, that's a
fair amount of copying to do, so we might want to optimize that in the
future, but I want to nail the relation files first. They are usually an
order of magnitude larger than the other files, after all.

Unfortunately pg_rewind still needs to recognize and parse the special
WAL records like XLOG_SMGR_CREATE/TRUNCATE, that modify relation files
outside the normal block registration system. I've been thinking that we
should add another flag to the WAL record format to mark such records.
pg_rewind will still need to understand the record format of such
records, but the flag will help to catch bugs of omission. If pg_rewind
or another such tool sees a record that's flagged as "special", but
doesn't recognize the record type, it can throw an error.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#81Andres Freund
Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#80)
Re: WAL format and API changes (9.5)

On 2014-11-05 23:08:31 +0200, Heikki Linnakangas wrote:

On 10/30/2014 09:19 PM, Andres Freund wrote:

Some things I noticed while reading the patch:

A lot of good comments, but let me pick up just two that are related:

* There's a couple record types (e.g. XLOG_SMGR_TRUNCATE) that only
refer to the relation, but not to the block number. These still log
their rnode manually. Shouldn't we somehow deal with those in a
similar way explicit block references are dealt with?

* Hm. At least WriteMZeroPageXlogRec (and probably the same for all the
other slru stuff) doesn't include a reference to the page. Isn't that
bad? Shouldn't we make XLogRegisterBlock() usable for that case?
Otherwise I fail to see how pg_rewind like tools can sanely deal with this?

Yeah, there are still operations that modify relation pages, but don't store
the information about the modified pages in the standard format. That
includes XLOG_SMGR_TRUNCATE that you spotted, and XLOG_SMGR_CREATE, and also
XLOG_DBASE_CREATE/DROP. And then there are updates to non-relation files,
like all the slru stuff, relcache init files, etc. And updates to the FSM
and VM bypass the full-page write mechanism too.

That's a awful number of special cases. I see little reason not to
invent something that can reference at least the relation in those
cases. Then we can at least only resync the relations if there were
changes. E.g. for vm's that not necessarily all that likely in an insert
mostly case.
I'm not that worried about things like create/drop database, but fsm,
vm, and the various slru's are really somewhat essential.

To play it safe, pg_rewind copies all non-relation files as is. That
includes all SLRUs, FSM and VM files, and everything else whose filename
doesn't match the (main fork of) a relation file. Of course, that's a fair
amount of copying to do, so we might want to optimize that in the future,
but I want to nail the relation files first. They are usually an order of
magnitude larger than the other files, after all.

That's fair enough.

Unfortunately pg_rewind still needs to recognize and parse the special WAL
records like XLOG_SMGR_CREATE/TRUNCATE, that modify relation files outside
the normal block registration system. I've been thinking that we should add
another flag to the WAL record format to mark such records.

So everytime pg_rewind comes across a record with that flag set which it
doesn't have special case code it'd balk? Sounds sensible.

Personally I think we should at least have a generic format to refer to
entire relations without a specific block number. And one to refer to
SLRUs. I don't think we necessarily need to implement them now, but we
should make sure there's bit space left to denote them.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#82Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Heikki Linnakangas (#78)
Re: WAL format and API changes (9.5)

On Wed, Nov 5, 2014 at 2:56 PM, Heikki Linnakangas <hlinnakangas@vmware.com>
wrote:

On 11/05/2014 09:06 AM, Amit Kapila wrote:

2.
XLogRecPtr
XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)

So the scenario is that:

* XLogRecordAssemble decides that a page doesn't need to be backed up
* both RedoRecPtr and doPageWrites change while building the record.

doPageWrites goes from true to false.

Without the patch, we would retry, because first check RedoRecPtr has

changed. With the patch, we notice that even though RedoRecPtr has changed,
doPageWrites is now off, so no FPWs are required regardless of RedoRecPtr,
and not retry. Yeah, you're right, the behavior changes in that case.
However, the new behavior is correct; the retry is unnecessary in that
scenario.

How does it interact with backup, basically in stop backup we first
change forcePageWrite to false and then get the stop wal location
by inserting XLOG_BACKUP_END, so it seems to me that it is quite
possible that the record which backend is inserting using XLogInsert()
will be considered in backup. Now shouldn't this record contain FPW
if forcePageWrite was true when XLogInsert() started to avoid any torn
page taken in backup?

3. There are couple of places where *XLogInsert* is used in wal.sgml
and it seems to me some of those needs change w.r.t this patch.

Hmm. It's a bit strange to mention XLogInsert and XLogFlush functions by

name in that text. It's otherwise admin-level stuff, on how to set
WAL-related settings. Up to 9.2 that manual page actually called them
"LogInsert" and "LogFlush". But I guess you're right; if we are to mention
the function by name, it should say XLogInsertRecord.

Agreed.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#83Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit Kapila (#82)
Re: WAL format and API changes (9.5)

On 11/06/2014 07:57 AM, Amit Kapila wrote:

On Wed, Nov 5, 2014 at 2:56 PM, Heikki Linnakangas <hlinnakangas@vmware.com>
wrote:

On 11/05/2014 09:06 AM, Amit Kapila wrote:

2.
XLogRecPtr
XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)

So the scenario is that:

* XLogRecordAssemble decides that a page doesn't need to be backed up
* both RedoRecPtr and doPageWrites change while building the record.
doPageWrites goes from true to false.

Without the patch, we would retry, because first check RedoRecPtr has
changed. With the patch, we notice that even though RedoRecPtr has changed,
doPageWrites is now off, so no FPWs are required regardless of RedoRecPtr,
and not retry. Yeah, you're right, the behavior changes in that case.
However, the new behavior is correct; the retry is unnecessary in that
scenario.

How does it interact with backup, basically in stop backup we first
change forcePageWrite to false and then get the stop wal location
by inserting XLOG_BACKUP_END, so it seems to me that it is quite
possible that the record which backend is inserting using XLogInsert()
will be considered in backup. Now shouldn't this record contain FPW
if forcePageWrite was true when XLogInsert() started to avoid any torn
page taken in backup?

It doesn't matter what doPageWrites and RedoRecPtr were when XLogInsert
started. It's the state when the insertion actually happens that matters.

It's a bit weird that pg_stop_backup() first turns off forcePagesWrites,
and only then writes the WAL record. But it's OK because when
pg_stop_backup() is called, the backup process has already finished
copying all the blocks. Any pages that are written to disk between
turning forcePageWrites off and writing the WAL record cannot be torn.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#84Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#79)
Re: WAL format and API changes (9.5)

On 11/05/2014 11:30 AM, Andres Freund wrote:

On 2014-11-04 18:33:34 +0200, Heikki Linnakangas wrote:

On 10/30/2014 06:02 PM, Andres Freund wrote:

On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote:

On 10/06/2014 02:29 PM, Andres Freund wrote:

I've not yet really looked,
but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't
make me happy...

Ok.. Can you elaborate?

To me the split between xloginsert.c doing some of the record assembly,
and xlog.c doing the lower level part of the assembly is just wrong.

I moved the checks for bootstrap mode and xl_len == 0, from the current
XLogInsert to the new XLogInsert() function. I don't imagine that to be
enough address your feelings about the split between xlog.c and
xloginsert.c, but makes it a tiny bit clearer IMHO. I don't know what else
to do about it, as it feels very sensible to me as it is now. So unless you
object more loudly, I'm going to just go ahead with this and commit,
probably tomorrow.

I'm still not particularly happy with the split. But I think this patch
is enough of a win anyhow. So go ahead.

Committed. Now, to the main patch...

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#85Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#74)
Re: WAL format and API changes (9.5)

Replying to some of your comments below. The rest were trivial issues
that I'll just fix.

On 10/30/2014 09:19 PM, Andres Freund wrote:

* Is it really a good idea to separate XLogRegisterBufData() from
XLogRegisterBuffer() the way you've done it ? If we ever actually get
a record with a large numbers of blocks touched this issentially is
O(touched_buffers*data_entries).

Are you worried that the linear search in XLogRegisterBufData(), to find
the right registered_buffer struct, might be expensive? If that ever
becomes a problem, a simple fix would be to start the linear search from
the end, and make sure that when you touch a large number of blocks, you
do all the XLogRegisterBufData() calls right after the corresponding
XLogRegisterBuffer() call.

I've also though about having XLogRegisterBuffer() return the pointer to
the struct, and passing it as argument to XLogRegisterBufData. So the
pattern in WAL generating code would be like:

registered_buffer *buf0;

buf0 = XLogRegisterBuffer(0, REGBUF_STANDARD);
XLogRegisterBufData(buf0, data, length);

registered_buffer would be opaque to the callers. That would have
potential to turn XLogRegisterBufData into a macro or inline function,
to eliminate the function call overhead. I played with that a little
bit, but the difference in performance was so small that it didn't seem
worth it. But passing the registered_buffer pointer, like above, might
not be a bad thing anyway.

* There's lots of functions like XLogRecHasBlockRef() that dig through
the wal record. A common pattern is something like:
if (XLogRecHasBlockRef(record, 1))
XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk);
else
oldblk = newblk;

I think doing that repeatedly is quite a bad idea. We should parse the
record once and then use it in a sensible format. Not do it in pieces,
over and over again. It's not like we ignore backup blocks - so doing
this lazily doesn't make sense to me.
Especially as ValidXLogRecord() *already* has parsed the whole damn
thing once.

Hmm. Adding some kind of a parsed XLogRecord representation would need a
fair amount of new infrastructure. Vast majority of WAL records contain
one, maybe two, block references, so it's not that expensive to find the
right one, even if you do it several times.

I ran a quick performance test on WAL replay performance yesterday. I
ran pgbench for 1000000 transactions with WAL archiving enabled, and
measured the time it took to replay the generated WAL. I did that with
and without the patch, and I didn't see any big difference in replay
times. I also ran "perf" on the startup process, and the profiles looked
identical. I'll do more comprehensive testing later, with different
index types, but I'm convinced that there is no performance issue in
replay that we'd need to worry about.

If it mattered, a simple optimization to the above pattern would be to
have XLogRecGetBlockTag return true/false, indicating if the block
reference existed at all. So you'd do:

if (!XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))
oldblk != newblk;

On the other hand, decomposing the WAL record into parts, and passing
the decomposed representation to the redo routines would allow us to
pack the WAL record format more tightly, as accessing the different
parts of the on-disk format wouldn't then need to be particularly fast.
For example, I've been thinking that it would be nice to get rid of the
alignment padding in XLogRecord, and between the per-buffer data
portions. We could copy the data to aligned addresses as part of the
decomposition or parsing of the WAL record, so that the redo routines
could still assume aligned access.

* I wonder if it wouldn't be worthwile, for the benefit of the FPI
compression patch, to keep the bkp block data after *all* the
"headers". That'd make it easier to just compress the data.

Maybe. If we do that, I'd also be inclined to move all the bkp block
headers to the beginning of the WAL record, just after the XLogInsert
struct. Somehow it feels weird to have a bunch of header structs
sandwiched between the rmgr-data and per-buffer data. Also, 4-byte
alignment is enough for the XLogRecordBlockData struct, so that would be
an easy way to make use of the space currently wasted for alignment
padding in XLogRecord.

* I think heap_xlog_update is buggy for wal_level=logical because it
computes the length of the tuple using
tuplen = recdataend - recdata;
But the old primary key/old tuple value might be stored there as
well. Afaics that code has to continue to use xl_heap_header_len.

No, the old primary key or tuple is stored in the main data portion.
That tuplen computation is done on backup block 0's data.

* It looks to me like insert wal logging could just use REGBUF_KEEP_DATA
to get rid of:
+       /*
+        * The new tuple is normally stored as buffer 0's data. But if
+        * XLOG_HEAP_CONTAINS_NEW_TUPLE flag is set, it's part of the main
+        * data, after the xl_heap_insert struct.
+        */
+       if (xlrec->flags & XLOG_HEAP_CONTAINS_NEW_TUPLE)
+       {
+           data = XLogRecGetData(record) + SizeOfHeapInsert;
+           datalen = record->xl_len - SizeOfHeapInsert;
+       }
+       else
+           data = XLogRecGetBlockData(record, 0, &datalen);
or have I misunderstood how that works?

Ah, you're right. Actually, the code that writes the WAL record *does*
use REGBUF_KEEP_DATA. That was a bug in the redo routine, it should
always look into buffer 0's data.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#86Andres Freund
Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#85)
Re: WAL format and API changes (9.5)

On 2014-11-06 17:32:33 +0200, Heikki Linnakangas wrote:

Replying to some of your comments below. The rest were trivial issues that
I'll just fix.

On 10/30/2014 09:19 PM, Andres Freund wrote:

* Is it really a good idea to separate XLogRegisterBufData() from
XLogRegisterBuffer() the way you've done it ? If we ever actually get
a record with a large numbers of blocks touched this issentially is
O(touched_buffers*data_entries).

Are you worried that the linear search in XLogRegisterBufData(), to find the
right registered_buffer struct, might be expensive? If that ever becomes a
problem, a simple fix would be to start the linear search from the end, and
make sure that when you touch a large number of blocks, you do all the
XLogRegisterBufData() calls right after the corresponding
XLogRegisterBuffer() call.

Yes, that was what I was (mildly) worried about. Since you specified a
high limit of buffers I'm sure someone will come up with a use case for
it ;)

I've also though about having XLogRegisterBuffer() return the pointer to the
struct, and passing it as argument to XLogRegisterBufData. So the pattern in
WAL generating code would be like:

registered_buffer *buf0;

buf0 = XLogRegisterBuffer(0, REGBUF_STANDARD);
XLogRegisterBufData(buf0, data, length);

registered_buffer would be opaque to the callers. That would have potential
to turn XLogRegisterBufData into a macro or inline function, to eliminate
the function call overhead. I played with that a little bit, but the
difference in performance was so small that it didn't seem worth it. But
passing the registered_buffer pointer, like above, might not be a bad thing
anyway.

Yes, that was roughly what I was thinking of as well. It's not all that
pretty, but it generally does seem like a good idea to me anyay.

* There's lots of functions like XLogRecHasBlockRef() that dig through
the wal record. A common pattern is something like:
if (XLogRecHasBlockRef(record, 1))
XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk);
else
oldblk = newblk;

I think doing that repeatedly is quite a bad idea. We should parse the
record once and then use it in a sensible format. Not do it in pieces,
over and over again. It's not like we ignore backup blocks - so doing
this lazily doesn't make sense to me.
Especially as ValidXLogRecord() *already* has parsed the whole damn
thing once.

Hmm. Adding some kind of a parsed XLogRecord representation would need a
fair amount of new infrastructure.

True.

Vast majority of WAL records contain one,
maybe two, block references, so it's not that expensive to find the right
one, even if you do it several times.

I'm not convinced. It's not an infrequent thing these days to hear
people being bottlenecked by replay. And grovelling repeatedly through
larger records isn't *that* cheap.

I ran a quick performance test on WAL replay performance yesterday. I ran
pgbench for 1000000 transactions with WAL archiving enabled, and measured
the time it took to replay the generated WAL. I did that with and without
the patch, and I didn't see any big difference in replay times. I also ran
"perf" on the startup process, and the profiles looked identical. I'll do
more comprehensive testing later, with different index types, but I'm
convinced that there is no performance issue in replay that we'd need to
worry about.

Interesting. What checkpoint_segments/timeout and what scale did you
use? Since that heavily influences the average size of the record that's
quite relevant...

If it mattered, a simple optimization to the above pattern would be to have
XLogRecGetBlockTag return true/false, indicating if the block reference
existed at all. So you'd do:

if (!XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))
oldblk != newblk;

That sounds like a good idea anyway?

On the other hand, decomposing the WAL record into parts, and passing the
decomposed representation to the redo routines would allow us to pack the
WAL record format more tightly, as accessing the different parts of the
on-disk format wouldn't then need to be particularly fast. For example, I've
been thinking that it would be nice to get rid of the alignment padding in
XLogRecord, and between the per-buffer data portions. We could copy the data
to aligned addresses as part of the decomposition or parsing of the WAL
record, so that the redo routines could still assume aligned access.

Right. I think it'd generally give us a bit more flexibility.

* I wonder if it wouldn't be worthwile, for the benefit of the FPI
compression patch, to keep the bkp block data after *all* the
"headers". That'd make it easier to just compress the data.

Maybe. If we do that, I'd also be inclined to move all the bkp block headers
to the beginning of the WAL record, just after the XLogInsert struct.

Somehow it feels weird to have a bunch of header structs sandwiched between
the rmgr-data and per-buffer data. Also, 4-byte alignment is enough for the
XLogRecordBlockData struct, so that would be an easy way to make use of the
space currently wasted for alignment padding in XLogRecord.

Hm, ok. Don't really care ;)

* I think heap_xlog_update is buggy for wal_level=logical because it
computes the length of the tuple using
tuplen = recdataend - recdata;
But the old primary key/old tuple value might be stored there as
well. Afaics that code has to continue to use xl_heap_header_len.

No, the old primary key or tuple is stored in the main data portion. That
tuplen computation is done on backup block 0's data.

Ah. Still living in the "old world" apparently ;). Will look at it again.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#87Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#86)
2 attachment(s)
Re: WAL format and API changes (9.5)

Attached is a new version. It's rebased on current git master, including
BRIN. I've also fixed the laundry list of small things you reported, as
well as a bunch of bugs I uncovered during my own testing.

Alvaro: you still have the BRIN WAL-logging code fresh in your memory,
so could you take a look at that part of this patch to check that I
didn't break it? And also, how do you like the new way of writing that,
over what you had to in HEAD ?

More below.

On 11/09/2014 11:47 PM, Andres Freund wrote:

On 2014-11-06 17:32:33 +0200, Heikki Linnakangas wrote:

Replying to some of your comments below. The rest were trivial issues that
I'll just fix.

On 10/30/2014 09:19 PM, Andres Freund wrote:

* Is it really a good idea to separate XLogRegisterBufData() from
XLogRegisterBuffer() the way you've done it ? If we ever actually get
a record with a large numbers of blocks touched this issentially is
O(touched_buffers*data_entries).

Are you worried that the linear search in XLogRegisterBufData(), to find the
right registered_buffer struct, might be expensive? If that ever becomes a
problem, a simple fix would be to start the linear search from the end, and
make sure that when you touch a large number of blocks, you do all the
XLogRegisterBufData() calls right after the corresponding
XLogRegisterBuffer() call.

Yes, that was what I was (mildly) worried about. Since you specified a
high limit of buffers I'm sure someone will come up with a use case for
it ;)

I've also though about having XLogRegisterBuffer() return the pointer to the
struct, and passing it as argument to XLogRegisterBufData. So the pattern in
WAL generating code would be like:

registered_buffer *buf0;

buf0 = XLogRegisterBuffer(0, REGBUF_STANDARD);
XLogRegisterBufData(buf0, data, length);

registered_buffer would be opaque to the callers. That would have potential
to turn XLogRegisterBufData into a macro or inline function, to eliminate
the function call overhead. I played with that a little bit, but the
difference in performance was so small that it didn't seem worth it. But
passing the registered_buffer pointer, like above, might not be a bad thing
anyway.

Yes, that was roughly what I was thinking of as well. It's not all that
pretty, but it generally does seem like a good idea to me anyay.

I ended up doing something different: The block_id is now used as the
index into the registered_buffers array. So looking up the right struct
is now just &registered_buffers[block_id]. That obviously gets rid of
the linear search. It doesn't seem to make any difference in the quick
performance testing I've been doing.

* There's lots of functions like XLogRecHasBlockRef() that dig through
the wal record. A common pattern is something like:
if (XLogRecHasBlockRef(record, 1))
XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk);
else
oldblk = newblk;

I think doing that repeatedly is quite a bad idea. We should parse the
record once and then use it in a sensible format. Not do it in pieces,
over and over again. It's not like we ignore backup blocks - so doing
this lazily doesn't make sense to me.
Especially as ValidXLogRecord() *already* has parsed the whole damn
thing once.

Hmm. Adding some kind of a parsed XLogRecord representation would need a
fair amount of new infrastructure.

True.

Vast majority of WAL records contain one,
maybe two, block references, so it's not that expensive to find the right
one, even if you do it several times.

I'm not convinced. It's not an infrequent thing these days to hear
people being bottlenecked by replay. And grovelling repeatedly through
larger records isn't *that* cheap.

I haven't done anything about this. We might want to do that, but I'd
like to get this committed now, with all the changes to the AM's, and
then spend some time trying to further optimize the WAL format. I might
add the "deparsed WAL record" infrastructure as part of that
optimization work.

I ran a quick performance test on WAL replay performance yesterday. I ran
pgbench for 1000000 transactions with WAL archiving enabled, and measured
the time it took to replay the generated WAL. I did that with and without
the patch, and I didn't see any big difference in replay times. I also ran
"perf" on the startup process, and the profiles looked identical. I'll do
more comprehensive testing later, with different index types, but I'm
convinced that there is no performance issue in replay that we'd need to
worry about.

Interesting. What checkpoint_segments/timeout and what scale did you
use? Since that heavily influences the average size of the record that's
quite relevant...

Scale factor 5, checkpoint_segments=10, checkpoint_timeout=30s.
pg_xlogdump --stats says:

Type N (%) Record size (%) FPI size (%) Combined size (%)
---- - --- ----------- --- -------- --- ------------- ---
XLOG 1434 ( 0.01) 48912 ( 0.00) 10149668 ( 0.42) 10198580 ( 0.29)
Transaction 4000767 ( 16.24) 176048028 ( 16.75) 0 ( 0.00) 176048028 ( 5.04)
Storage 232 ( 0.00) 11136 ( 0.00) 0 ( 0.00) 11136 ( 0.00)
CLOG 123 ( 0.00) 4428 ( 0.00) 0 ( 0.00) 4428 ( 0.00)
Database 2 ( 0.00) 96 ( 0.00) 0 ( 0.00) 96 ( 0.00)
Tablespace 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
MultiXact 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
RelMap 14 ( 0.00) 7784 ( 0.00) 0 ( 0.00) 7784 ( 0.00)
Standby 94 ( 0.00) 6016 ( 0.00) 0 ( 0.00) 6016 ( 0.00)
Heap2 3563327 ( 14.47) 142523126 ( 13.56) 1332675761 ( 54.52) 1475198887 ( 42.20)
Heap 16880705 ( 68.54) 726291821 ( 69.09) 972350846 ( 39.78) 1698642667 ( 48.59)
Btree 182603 ( 0.74) 6342386 ( 0.60) 129238796 ( 5.29) 135581182 ( 3.88)
Hash 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
Gin 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
Gist 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
Sequence 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
SPGist 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
BRIN 0 ( 0.00) 0 ( 0.00) 0 ( 0.00) 0 ( 0.00)
-------- -------- -------- --------
Total 24629301 1051283733 [30.07%] 2444415071 [69.93%] 3495698804 [100%]

So full-page writes indeed form most of the volume.

I run this again with full_page_writes=off. With that, the patched
version generated about 3% more WAL than master. Replay seems to be a
few percentage points slower, too, which can be explained by the
increased WAL volume.

Attached is a shell script I used to create the WAL.

My next steps for this are:

1. Do more correctness testing of WAL replay routines. I've spent some
time looking at the coverage report generated by "make coverage-html",
after running "make installcheck" on a master-server setup, and crafting
test cases to hit the redo routines that are not otherwise covered by
our regression tests. That has already uncovered a couple of bugs in the
patch that I've now fixed. (That's also how I found the bug I just fixed
in 1961b1c1)

2. Reduce the WAL volume. Any increase in WAL volume is painful on its
own, and also reduces performance in general. This seems to add 1-5%,
depending on the workload. There's some padding bytes the WAL format
that I could use. I've refrained from using them up for now, because I
consider that cheating: we could have kept the current WAL format and
used the padding bytes to make that even smaller/faster. But if we
accept that this patch is worth it, even if it leads to some WAL bloat,
we'll want to buy that back if we can so that users won't see a regression.

If I can find some ways to reduce the WAL volume, I might still commit
this patch more or less as it is first, and commit the optimizations
separately later. Depends on how what those optimizations look like.

- Heikki

Attachments:

wal-format-and-api-changes-7.patch.gzapplication/gzip; name=wal-format-and-api-changes-7.patch.gz
replay-perf-test.shapplication/x-shellscript; name=replay-perf-test.sh
#88Michael Paquier
Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#87)
Re: WAL format and API changes (9.5)

On Tue, Nov 11, 2014 at 4:29 AM, Heikki Linnakangas <hlinnakangas@vmware.com

wrote:

Attached is a new version. It's rebased on current git master, including
BRIN. I've also fixed the laundry list of small things you reported, as
well as a bunch of bugs I uncovered during my own testing.

This patch needs a small rebase, it has been broken by a590f266 that fixed
WAL replay for brin indexes:
patching file src/backend/access/brin/brin_xlog.c
Hunk #2 FAILED at 42.
Hunk #3 FAILED at 91.
This will facilitate testing as well.
Regards
--
Michael

#89Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Michael Paquier (#88)
1 attachment(s)
Re: WAL format and API changes (9.5)

On 11/11/2014 09:39 AM, Michael Paquier wrote:

On Tue, Nov 11, 2014 at 4:29 AM, Heikki Linnakangas <hlinnakangas@vmware.com

wrote:

Attached is a new version. It's rebased on current git master, including
BRIN. I've also fixed the laundry list of small things you reported, as
well as a bunch of bugs I uncovered during my own testing.

This patch needs a small rebase, it has been broken by a590f266 that fixed
WAL replay for brin indexes:
patching file src/backend/access/brin/brin_xlog.c
Hunk #2 FAILED at 42.
Hunk #3 FAILED at 91.
This will facilitate testing as well.

Here's a rebased patch. No other changes.

- Heikki

Attachments:

wal-format-and-api-changes-8.patch.gzapplication/gzip; name=wal-format-and-api-changes-8.patch.gz
#90Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Heikki Linnakangas (#89)
1 attachment(s)
Re: WAL format and API changes (9.5)

On Tue, Nov 11, 2014 at 2:15 PM, Heikki Linnakangas <hlinnakangas@vmware.com>
wrote:z

On 11/11/2014 09:39 AM, Michael Paquier wrote:

On Tue, Nov 11, 2014 at 4:29 AM, Heikki Linnakangas <

hlinnakangas@vmware.com

wrote:

Attached is a new version. It's rebased on current git master, including
BRIN. I've also fixed the laundry list of small things you reported, as
well as a bunch of bugs I uncovered during my own testing.

This patch needs a small rebase, it has been broken by a590f266 that

fixed

WAL replay for brin indexes:
patching file src/backend/access/brin/brin_xlog.c
Hunk #2 FAILED at 42.
Hunk #3 FAILED at 91.
This will facilitate testing as well.

Here's a rebased patch. No other changes.

I have done some performance testing of this patch using attached
script and data is as below:

Performance Data
----------------------------
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB
checkpoint_segments=300
checkpoint_timeout =30min
full_page_writes = off

HEAD (commit - ae667f7) -

testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 398510104 | 11.993057012558
two short fields, no change | 396984168 | 10.3508098125458
two short fields, no change | 396983624 | 10.4196150302887
two short fields, one changed | 437106016 | 10.650365114212
two short fields, one changed | 437102456 | 10.756795167923
two short fields, one changed | 437103000 | 10.6824090480804
two short fields, both changed | 438131520 | 11.1027519702911
two short fields, both changed | 437112248 | 11.0681071281433
two short fields, both changed | 437107168 | 11.0370399951935
one short and one long field, no change | 76044176 | 2.90529608726501
one short and one long field, no change | 76047432 | 2.86844515800476
one short and one long field, no change | 76042664 | 2.84641098976135
ten tiny fields, all changed | 478210904 | 13.1221458911896
ten tiny fields, all changed | 477244448 | 12.8017349243164
ten tiny fields, all changed | 477217832 | 12.9907081127167
hundred tiny fields, all changed | 180353400 | 6.03354096412659
hundred tiny fields, all changed | 180349536 | 5.99893379211426
hundred tiny fields, all changed | 180350064 | 6.00634717941284
hundred tiny fields, half changed | 180348480 | 6.03162407875061
hundred tiny fields, half changed | 180348440 | 6.08004903793335
hundred tiny fields, half changed | 180349056 | 6.03126788139343
hundred tiny fields, half nulled | 101369936 | 5.43424606323242
hundred tiny fields, half nulled | 101660160 | 5.06207084655762
hundred tiny fields, half nulled | 100119184 | 5.51814889907837
9 short and 1 long, short changed | 108142168 | 3.26260495185852
9 short and 1 long, short changed | 108138144 | 3.20250797271729
9 short and 1 long, short changed | 109761240 | 3.21728992462158
(27 rows)

Patch-

testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 398692848 | 11.2538840770721
two short fields, no change | 396988648 | 11.4972231388092
two short fields, no change | 396988416 | 11.0026741027832
two short fields, one changed | 437104840 | 11.1911199092865
two short fields, one changed | 437103832 | 11.1138079166412
two short fields, one changed | 437101872 | 11.3141660690308
two short fields, both changed | 437133648 | 11.5836050510406
two short fields, both changed | 437106640 | 11.5675358772278
two short fields, both changed | 437104944 | 11.6147150993347
one short and one long field, no change | 77226464 | 3.01720094680786
one short and one long field, no change | 77645248 | 2.9660210609436
one short and one long field, no change | 76045256 | 3.03326010704041
ten tiny fields, all changed | 477396616 | 13.3963651657104
ten tiny fields, all changed | 477219840 | 13.3838939666748
ten tiny fields, all changed | 477223960 | 13.5736708641052
hundred tiny fields, all changed | 180742136 | 5.87386584281921
hundred tiny fields, all changed | 180398320 | 6.12072706222534
hundred tiny fields, all changed | 180354080 | 6.16246104240417
hundred tiny fields, half changed | 180351456 | 6.17317008972168
hundred tiny fields, half changed | 180348096 | 6.13790798187256
hundred tiny fields, half changed | 183369688 | 6.19142913818359
hundred tiny fields, half nulled | 100111888 | 5.55706000328064
hundred tiny fields, half nulled | 100111952 | 5.55145716667175
hundred tiny fields, half nulled | 100119424 | 5.53378510475159
9 short and 1 long, short changed | 116821904 | 3.42408394813538
9 short and 1 long, short changed | 116160920 | 3.37269878387451
9 short and 1 long, short changed | 116160552 | 3.40163683891296
(27 rows)

It seems to me that there is a regression of (4 ~ 8%) for small records,
refer two short fields tests.

2.
Compilation errors on windows

1>E:\WorkSpace\PostgreSQL\master\postgresql\src\include\access/nbtree.h(296):
error C2016: C requires that a struct or union has at least one member
1>E:\WorkSpace\PostgreSQL\master\postgresql\src\include\access/nbtree.h(303):
error C2016: C requires that a struct or union has at least one member
1> nbtree.c

They are due to following structure declarations.
typedef struct xl_btree_split_left
{
/*
* In the _L variants, next is the new item.
* (In the _R variants, the new item is one of the right page's tuples.)
*
* If level > 0, an IndexTuple representing the HIKEY of the left page
* follows. We don't need this on leaf pages, because it's the same as
* the leftmost key in
the new right page.
*/
} xl_btree_split_left;

typedef struct xl_btree_split_right
{
/*
* Last are the right page's tuples in the form used by _bt_restore_page.
*/
} xl_btree_split_right;

3.
Patch is failed to apply against Head:
patching file src/backend/access/gin/ginfast.c
Hunk #1 succeeded at 108 (offset 2 lines).
Hunk #2 FAILED at 215.
Hunk #3 succeeded at 276 (offset 3 lines).
Hunk #4 succeeded at 311 (offset 3 lines).
Hunk #5 succeeded at 324 (offset 3 lines).
Hunk #6 succeeded at 339 (offset 3 lines).
Hunk #7 succeeded at 348 (offset 3 lines).
Hunk #8 succeeded at 379 (offset 3 lines).
Hunk #9 succeeded at 396 (offset 3 lines).
Hunk #10 succeeded at 523 (offset 5 lines).
Hunk #11 succeeded at 550 (offset 5 lines).
Hunk #12 succeeded at 587 (offset 5 lines).
1 out of 12 hunks FAILED -- saving rejects to file
src/backend/access/gin/ginfast.c.rej

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

wal-update-testsuite.shapplication/x-sh; name=wal-update-testsuite.sh
#91Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit Kapila (#90)
2 attachment(s)
Re: WAL format and API changes (9.5)

On 11/11/2014 04:42 PM, Amit Kapila wrote:

I have done some performance testing of this patch using attached
script and data is as below:

...

It seems to me that there is a regression of (4 ~ 8%) for small records,
refer two short fields tests.

Thanks for the testing!

Here's a new version, with big changes again to the record format. Have
a look at xlogrecord.h for the details, but in a nutshell:

1. The overall format is now: XLogRecord, per-block headers, header for
main data portion, per-block data, main data.

2. I removed xl_len field from XLogRecord and rearranged the fields, to
shrink the XLogRecord struct from 32 to 24 bytes. (instead, there's a
new 2- or 5-byte header for the "main data", after the block headers).

3. No alignment padding. (the data chunks are copied to aligned buffers
at replay, so redo functions can still assume aligned access)

In quick testing, this new WAL format is somewhat more compact than the
9.4 format. That also seems to have more than bought back the
performance regression I saw earlier. Here are results from my laptop,
using the wal-update-testsuite.sh script:

master:

testname | wal_generated | duration

-----------------------------------------+---------------+------------------
two short fields, no change | 396982984 | 7.73713994026184
two short fields, no change | 398531152 | 7.72360110282898
two short fields, no change | 397228552 | 7.90237998962402
two short fields, one changed | 437108464 | 8.03014206886292
two short fields, one changed | 438368456 | 8.17672896385193
two short fields, one changed | 437105232 | 7.89896702766418
two short fields, both changed | 437100544 | 7.98763203620911
two short fields, both changed | 437107032 | 8.0971851348877
two short fields, both changed | 437105368 | 8.1279079914093
one short and one long field, no change | 76552752 | 2.47367906570435
one short and one long field, no change | 76043608 | 2.54243588447571
one short and one long field, no change | 76042576 | 2.6014678478241
ten tiny fields, all changed | 477221488 | 9.41646003723145
ten tiny fields, all changed | 477224080 | 9.37260103225708
ten tiny fields, all changed | 477220944 | 9.41951704025269
hundred tiny fields, all changed | 180889992 | 4.72576093673706
hundred tiny fields, all changed | 180348224 | 4.50496411323547
hundred tiny fields, all changed | 181347504 | 4.78004717826843
hundred tiny fields, half changed | 180379760 | 4.53589606285095
hundred tiny fields, half changed | 181773832 | 4.85075807571411
hundred tiny fields, half changed | 180348160 | 4.65349197387695
hundred tiny fields, half nulled | 100114832 | 3.70726609230042
hundred tiny fields, half nulled | 100116840 | 3.88224697113037
hundred tiny fields, half nulled | 100118848 | 4.00612688064575
9 short and 1 long, short changed | 108140640 | 2.63146805763245
9 short and 1 long, short changed | 108508784 | 2.76349496841431
9 short and 1 long, short changed | 108137144 | 2.79056811332703
(27 rows)

wal-format-and-api-changes-9.patch:

testname | wal_generated | duration

-----------------------------------------+---------------+------------------
two short fields, no change | 356865216 | 6.81889986991882
two short fields, no change | 356871304 | 7.0333080291748
two short fields, no change | 356869520 | 6.62423706054688
two short fields, one changed | 356867824 | 7.09969711303711
two short fields, one changed | 356866480 | 7.07576990127563
two short fields, one changed | 357987080 | 7.25394797325134
two short fields, both changed | 396996096 | 7.13484597206116
two short fields, both changed | 396990184 | 7.08063006401062
two short fields, both changed | 396987192 | 7.04641604423523
one short and one long field, no change | 70858376 | 2.2726149559021
one short and one long field, no change | 68024232 | 2.21982789039612
one short and one long field, no change | 69258192 | 2.4696249961853
ten tiny fields, all changed | 396987896 | 8.25723004341125
ten tiny fields, all changed | 396983768 | 8.24221706390381
ten tiny fields, all changed | 397012600 | 8.60816693305969
hundred tiny fields, all changed | 172327416 | 4.57576704025269
hundred tiny fields, all changed | 174669320 | 4.52080512046814
hundred tiny fields, all changed | 172696944 | 4.65672993659973
hundred tiny fields, half changed | 172323720 | 4.57278800010681
hundred tiny fields, half changed | 172330232 | 4.63164114952087
hundred tiny fields, half changed | 172326864 | 4.74219608306885
hundred tiny fields, half nulled | 85597408 | 3.78670310974121
hundred tiny fields, half nulled | 84742808 | 3.82968688011169
hundred tiny fields, half nulled | 84066936 | 3.86192607879639
9 short and 1 long, short changed | 100113080 | 2.54274320602417
9 short and 1 long, short changed | 100119440 | 2.4966151714325
9 short and 1 long, short changed | 100115960 | 2.63230085372925
(27 rows)

Aside from the WAL record format changes, this patch adds the "decoded
WAL record" infrastructure that we talked about with Andres. XLogReader
now has a new function, DecodeXLogRecord, which parses the block headers
etc. from the WAL record, and copies the data chunks to aligned buffers.
The redo routines are passed a pointer to the XLogReaderState, instead
of the plain XLogRecord, and the redo routines can use macros and
functions defined xlogreader.h to access the already-decoded WAL record.
The new WAL record format is difficult to parse in a piece-meal fashion,
so it really needs this separate decoding pass to be efficient.

Thoughts on this new WAL record format? I've attached the xlogrecord.h
file here separately for easy reading, if you want to take a quick look
at just that without applying the whole patch.

- Heikki

Attachments:

wal-format-and-api-changes-9.patch.gzapplication/gzip; name=wal-format-and-api-changes-9.patch.gz
xlogrecord.htext/x-chdr; name=xlogrecord.h
#92Andres Freund
Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#91)
Re: WAL format and API changes (9.5)

On 2014-11-13 15:33:44 +0200, Heikki Linnakangas wrote:

Here's a new version, with big changes again to the record format. Have a
look at xlogrecord.h for the details, but in a nutshell:

1. The overall format is now: XLogRecord, per-block headers, header for main
data portion, per-block data, main data.

2. I removed xl_len field from XLogRecord and rearranged the fields, to
shrink the XLogRecord struct from 32 to 24 bytes. (instead, there's a new 2-
or 5-byte header for the "main data", after the block headers).

3. No alignment padding. (the data chunks are copied to aligned buffers at
replay, so redo functions can still assume aligned access)

Whoa. That's a large amount of changes... Not bad at all.

In quick testing, this new WAL format is somewhat more compact than the 9.4
format.

Well, that's mixing different kinds of changes together to some
degree. Most of this could have been done independently as well. I'm not
generally objecting to doing that, but the space/performance comparison
isn't entirely fair. One could (I'm not!) argue that we should just do a
refactoring like you suggest above, without the additional block
information.

That also seems to have more than bought back the performance
regression I saw earlier.

I think we really need to do the performance test with a different CRC
implementation. So far all the tests in this thread seem to be
correlating pretty linearly to the size of the record.

Here are results from my laptop, using the wal-update-testsuite.sh script:

Would be nice if that'd also compute the differences in wal_generated
and duration between two runs... :)

Aside from the WAL record format changes, this patch adds the "decoded WAL
record" infrastructure that we talked about with Andres.

Ah, cool.

XLogReader now has
a new function, DecodeXLogRecord, which parses the block headers etc. from
the WAL record, and copies the data chunks to aligned buffers. The redo
routines are passed a pointer to the XLogReaderState, instead of the plain
XLogRecord, and the redo routines can use macros and functions defined
xlogreader.h to access the already-decoded WAL record. The new WAL record
format is difficult to parse in a piece-meal fashion, so it really needs
this separate decoding pass to be efficient.

Hm. I'm not sure if there's really a point in exposing
DecodeXLogRecord() outside of xlogreader.c. In which case would a
xlogreader user not want to decode the record? I.e. couldn't we just
always do that and not bother exposing the function?

Sure, you could skip decoding if the record is of a "uninteresting"
rmgr, but since all validation but CRC now happens only during decoding
I'm not sure that's a good idea. You e.g. have a block

+		if (!DecodeXLogRecord(xlogreader_state, &errormsg))
+		{
+			fprintf(stderr, "error in WAL record at %X/%X: %s\n",
+						(uint32) (xlogreader_state->ReadRecPtr >> 32),
+						(uint32) xlogreader_state->ReadRecPtr,
+						errormsg);
+			/*
+			 * Parsing the record failed, but it passed CRC check, so in
+			 * theory we can continue reading.
+			 */
+			continue;
+		}
+

I think that's quite the bad idea. The CRC is only a part of the error
detection, not its entirety.

Thoughts on this new WAL record format? I've attached the xlogrecord.h file
here separately for easy reading, if you want to take a quick look at just
that without applying the whole patch.

After a quick skim I like it in general. There's some details I'd rather
change, but I think it's quite the step forward.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#93Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Heikki Linnakangas (#91)
Re: WAL format and API changes (9.5)

On Thu, Nov 13, 2014 at 7:03 PM, Heikki Linnakangas <hlinnakangas@vmware.com>
wrote:

On 11/11/2014 04:42 PM, Amit Kapila wrote:

I have done some performance testing of this patch using attached
script and data is as below:

...

It seems to me that there is a regression of (4 ~ 8%) for small records,
refer two short fields tests.

Thanks for the testing!

Thoughts on this new WAL record format? I've attached the xlogrecord.h

file here separately for easy reading, if you want to take a quick look at
just that without applying the whole patch.

Apart from changes in XLogRecord to remove xl_len and padding, the new
format for block headers seems to be quite succinct and then by making
data length as variable for actual record the overall WAL record size
becomes smaller. However the increase in unaligned size (as mentined by
you up-thread as 2 bytes) seems to be still remain same.
Is the difference in size is due to additional block reference id this patch
requires or something else?

In new record format the Record header always start at aligned
boundary, however the actual record data doesn't necessarily be
at aligned boundary, can this make difference as currently we copy
both of these separately?

Overall, I think this format is a net improvement.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#94Michael Paquier
Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#91)
Re: WAL format and API changes (9.5)

On Thu, Nov 13, 2014 at 10:33 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

In quick testing, this new WAL format is somewhat more compact than the 9.4
format. That also seems to have more than bought back the performance
regression I saw earlier. Here are results from my laptop, using the
wal-update-testsuite.sh script:
master:
[results]
(27 rows)
wal-format-and-api-changes-9.patch:
[results]
(27 rows)

So based on your series of tests, you are saving 6% up to 10%.
That's... Really cool!

Aside from the WAL record format changes, this patch adds the "decoded WAL
record" infrastructure that we talked about with Andres. XLogReader now has
a new function, DecodeXLogRecord, which parses the block headers etc. from
the WAL record, and copies the data chunks to aligned buffers. The redo
routines are passed a pointer to the XLogReaderState, instead of the plain
XLogRecord, and the redo routines can use macros and functions defined
xlogreader.h to access the already-decoded WAL record. The new WAL record
format is difficult to parse in a piece-meal fashion, so it really needs
this separate decoding pass to be efficient.

Thoughts on this new WAL record format? I've attached the xlogrecord.h file
here separately for easy reading, if you want to take a quick look at just
that without applying the whole patch.

The new format is neat, and that's a lot of code.. Grouping together
the blocks of data is a good thing for the FPW compression patch as
well.

Note that this patch conflicts with the recent commits 81c4508 and
34402ae. installcheck-world is passing, and standbys are able to
replay records, at least without crashes.

Here are some more comments:
1) with assertions enabled this does not compile because of a small
typo in xlogreader.h here:
+#define XLogRecHasAnyBlockRefs(decoder) ((decoder)->max_block id >= 0)
2) In xlogreader.c, XLogRecGetBlockData should return char * but a
boolean is returned here:
+XLogRecGetBlockData(XLogReaderState *record, uint8 block_id, Size *len)
+{
+       DecodedBkpBlock *bkpb;
+
+       if (!record->blocks[block_id].in_use)
+               return false;
As no blocks are in use in this case this should be NULL.
3) pg_xlogdump does not seem to work:
$ pg_xlogdump 00000001000000000000000D
pg_xlogdump: FATAL:  could not find a valid record after 0/D000000
4) A couple of NOT_USED blocks could be removed, no?
+#ifdef NOT_USED
        BlockNumber leftChildBlkno = InvalidBlockNumber;
+#endif
5) Here why not using the 2nd block instead of the 3rd (@_bt_getroot)?
+                       XLogBeginInsert();
+                       XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT);
+                       XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT);
6) There are FIXME blocks:
+               // FIXME
+               //if ((bkpb->fork_flags & BKPBLOCK_WILL_INIT) != 0 &&
mode != RBM_ZERO)
+               //      elog(PANIC, "block with WILL_INIT flag in WAL
record must be zeroed by redo routine");]
And that:
+               /* FIXME: translation? Although this shouldn't happen.. */
+               ereport(ERROR,
+                               (errmsg("error decoding WAL record"),
+                                errdetail("%s", errormsg)));
Regards,
-- 
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#95Michael Paquier
Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#94)
Re: WAL format and API changes (9.5)

On Fri, Nov 14, 2014 at 5:31 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

3) pg_xlogdump does not seem to work:
$ pg_xlogdump 00000001000000000000000D
pg_xlogdump: FATAL: could not find a valid record after 0/D000000

This one is a bad manipulation from my side. Please forget this comment.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#96Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Michael Paquier (#94)
Re: WAL format and API changes (9.5)

On 11/14/2014 10:31 AM, Michael Paquier wrote:

5) Here why not using the 2nd block instead of the 3rd (@_bt_getroot)?
+                       XLogBeginInsert();
+                       XLogRegisterBuffer(0, rootbuf, REGBUF_WILL_INIT);
+                       XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT);

See the comment of the xl_btree_newroot struct. It explains the record
format of the BTREE_NEWROOT record type:

* Backup Blk 0: new root page (2 tuples as payload, if splitting old root)
* Backup Blk 1: left child (if splitting an old root)
* Backup Blk 2: metapage

When _bt_getroot creates a new root, there is no old root, but the same
record type is used in _bt_newroot, which uses block ID 1 to refer to
the old root page.

(Thanks for the comments, again! I'll post a new version soon)

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#97Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#92)
1 attachment(s)
Re: WAL format and API changes (9.5)

On 11/13/2014 05:04 PM, Andres Freund wrote:

On 2014-11-13 15:33:44 +0200, Heikki Linnakangas wrote:

In quick testing, this new WAL format is somewhat more compact than the 9.4
format.

Well, that's mixing different kinds of changes together to some
degree. Most of this could have been done independently as well. I'm not
generally objecting to doing that, but the space/performance comparison
isn't entirely fair. One could (I'm not!) argue that we should just do a
refactoring like you suggest above, without the additional block
information.

Yep. It would be interesting to just do the minimal changes to the
current record format, to get a similar reduction in WAL volume, and
then compare the performance and WAL volume of that vs. the big patch.
But I don't think it's really worth it; we do want the bigger changes,
anyway, even if it costs a little.

That also seems to have more than bought back the performance
regression I saw earlier.

I think we really need to do the performance test with a different CRC
implementation. So far all the tests in this thread seem to be
correlating pretty linearly to the size of the record.

Will do.

XLogReader now has
a new function, DecodeXLogRecord, which parses the block headers etc. from
the WAL record, and copies the data chunks to aligned buffers. The redo
routines are passed a pointer to the XLogReaderState, instead of the plain
XLogRecord, and the redo routines can use macros and functions defined
xlogreader.h to access the already-decoded WAL record. The new WAL record
format is difficult to parse in a piece-meal fashion, so it really needs
this separate decoding pass to be efficient.

Hm. I'm not sure if there's really a point in exposing
DecodeXLogRecord() outside of xlogreader.c. In which case would a
xlogreader user not want to decode the record? I.e. couldn't we just
always do that and not bother exposing the function?

Sure, you could skip decoding if the record is of a "uninteresting"
rmgr, but since all validation but CRC now happens only during decoding
I'm not sure that's a good idea.

Well, extracting all the backup block data etc. is not free, so it would
be nice to be able to filter first. I'm thinking of logical decoding in
particular; it could avoid parsing index AM records. Then again, the
parsing isn't very expensive anyway.

I'll change the patch to decode as part of the XLogReadRecord call for
now; it does things a little bit simpler. We can change it back later if
it seems worthwhile.

I nevertheless kept the DecodeXLogRecord function. It can be used to
decode an XLogRecord that's not read from disk, using the xlogreader,
but already in memory. That was needed to support WAL_DEBUG at
insertion; it needs to rm_desc() a record that's not previously read
from disk with an xlogreader.

Thoughts on this new WAL record format? I've attached the xlogrecord.h file
here separately for easy reading, if you want to take a quick look at just
that without applying the whole patch.

After a quick skim I like it in general. There's some details I'd rather
change, but I think it's quite the step forward.

Glad to hear that.

Here is an updated version of the patch. It's rebased over current
master, and fixes a bunch of issues you and Michael pointed out, and a
ton of other cleanup.

I'll do more comprehensive performance testing with this, and post results.

BTW, I'm planning to replace the malloc() calls in xlogreader.c with
palloc(). It currently uses malloc because xlogreader.c needs to be
usable in frontend code, but nowadays we map palloc automatically to
malloc when building frontend code. A consequence of that is that you
get a straight exit() on out-of-memory, but I think that's acceptable.

- Heikki

Attachments:

wal-format-and-api-changes-10.patch.gzapplication/gzip; name=wal-format-and-api-changes-10.patch.gz
#98Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Heikki Linnakangas (#97)
1 attachment(s)
Re: WAL format and API changes (9.5)

On 11/17/2014 03:22 PM, Heikki Linnakangas wrote:

Here is an updated version of the patch. It's rebased over current
master, and fixes a bunch of issues you and Michael pointed out, and a
ton of other cleanup.

That patch had two extra, unrelated patches applied to it. One to use
the Intel CRC instruction for the CRC calculation, and another to speed
up PageRepairFragmentation, by replacing the pq_qsort() call with a
custom sort algorithm. Sorry about that. (the latter is something that
showed up very high in a perf profile of replaying the WAL generated by
pgbench - I'll start a new thread about that)

Here is the same patch without those extra things mixed in.

- Heikki

Attachments:

wal-format-and-api-changes-11.patch.gzapplication/gzip; name=wal-format-and-api-changes-11.patch.gz
#99Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#92)
3 attachment(s)
Re: WAL format and API changes (9.5)

On 11/13/2014 05:04 PM, Andres Freund wrote:

On 2014-11-13 15:33:44 +0200, Heikki Linnakangas wrote:

That also seems to have more than bought back the performance
regression I saw earlier.

I think we really need to do the performance test with a different CRC
implementation. So far all the tests in this thread seem to be
correlating pretty linearly to the size of the record.

Ok, I ran some tests comparing this patch, plus a quick & dirty patch to
use the Intel CRC instruction for CRC calculation. The quick & dirty is
attached. It's based on code you sent me over IM some weeks ago, and
needs -msse4.2 to compile. The point of the CRC patch is to just "hide"
the effect of simply reducing the WAL volume, to concentrate on the
possible overhead on constructing and replaying the new format.

WAL insertion performance
=========================

To measure the performance of generating WAL, I ran the
wal-update-testsuite.sh that Amit also ran earlier. The cluster was
initialized with:

shared_buffers=512MB
checkpoint_segments=30
fsync=off
autovacuum=off
full_page_writes=off

master + intel-crc.patch
------------------------

testname | wal_generated | duration

-----------------------------------------+---------------+------------------
two short fields, no change | 388523496 | 5.98275089263916
two short fields, no change | 384400536 | 5.81247496604919
two short fields, no change | 394318384 | 5.83137106895447
two short fields, one changed | 424519456 | 5.94383692741394
two short fields, one changed | 427103960 | 6.0239098072052
two short fields, one changed | 431826936 | 5.99587488174438
two short fields, both changed | 424524232 | 6.24820590019226
two short fields, both changed | 424520616 | 6.27542090415955
two short fields, both changed | 429219024 | 6.24310803413391
one short and one long field, no change | 78215672 | 1.69482898712158
one short and one long field, no change | 79275744 | 1.70809984207153
one short and one long field, no change | 76444464 | 1.6900041103363
ten tiny fields, all changed | 468263328 | 7.33222103118896
ten tiny fields, all changed | 475685120 | 7.44711399078369
ten tiny fields, all changed | 466329808 | 7.44984602928162
hundred tiny fields, all changed | 181802520 | 3.48464608192444
hundred tiny fields, all changed | 169941080 | 3.57416915893555
hundred tiny fields, all changed | 170178472 | 3.51270413398743
hundred tiny fields, half changed | 184539560 | 3.50752401351929
hundred tiny fields, half changed | 184539488 | 3.51076292991638
hundred tiny fields, half changed | 182090680 | 3.50708913803101
hundred tiny fields, half nulled | 104020856 | 3.43046092987061
hundred tiny fields, half nulled | 100607288 | 3.2797839641571
hundred tiny fields, half nulled | 97194600 | 3.26747107505798
9 short and 1 long, short changed | 99243976 | 1.94056487083435
9 short and 1 long, short changed | 95552912 | 1.94364786148071
9 short and 1 long, short changed | 100634736 | 1.94284915924072
(27 rows)

wal-format-and-api-changes-12.patch + crc-intel.patch
-----------------------------------------------------

testname | wal_generated | duration

-----------------------------------------+---------------+------------------
two short fields, no change | 344717720 | 6.01146483421326
two short fields, no change | 351999048 | 5.94603586196899
two short fields, no change | 359280088 | 5.92386198043823
two short fields, one changed | 349783736 | 5.98722791671753
two short fields, one changed | 357064632 | 5.95086097717285
two short fields, one changed | 347568640 | 5.92789196968079
two short fields, both changed | 388404496 | 5.9715678691864
two short fields, both changed | 389122792 | 5.99439883232117
two short fields, both changed | 389840936 | 5.98658204078674
one short and one long field, no change | 56844552 | 1.69335913658142
one short and one long field, no change | 67099160 | 1.66509008407593
one short and one long field, no change | 65557168 | 1.67773389816284
ten tiny fields, all changed | 387205416 | 7.0555100440979
ten tiny fields, all changed | 396668040 | 7.05817604064941
ten tiny fields, all changed | 389353592 | 6.98230600357056
hundred tiny fields, all changed | 163675416 | 3.4382529258728
hundred tiny fields, all changed | 173227296 | 3.4594509601593
hundred tiny fields, all changed | 172800520 | 3.44062900543213
hundred tiny fields, half changed | 174070216 | 3.51688480377197
hundred tiny fields, half changed | 169965336 | 3.47681999206543
hundred tiny fields, half changed | 175331128 | 3.45153713226318
hundred tiny fields, half nulled | 84784272 | 3.26745700836182
hundred tiny fields, half nulled | 83732928 | 3.22160506248474
hundred tiny fields, half nulled | 80339600 | 3.22326803207397
9 short and 1 long, short changed | 98087384 | 2.00620484352112
9 short and 1 long, short changed | 101273968 | 1.99480700492859
9 short and 1 long, short changed | 93123424 | 1.99506211280823
(27 rows)

Summary: No meaningful difference in runtime.

WAL replay performance
======================

To test WAL replay performance, I ran pgbench with WAL archiving
enabled, and timed the replay of the generated WAL. I used the attached
script, replay-perf-test.sh for that. full_page_writes were disabled,
because replaying full page images is quite different from replaying
other records. (Performance of full-page images is interesting too, but
it's not expected that these WAL format changes make much difference to
that).

(This is the same test setup I used in 54611209.705@vmware.com.)

I repeated the replay three times, and grepped for the "redo starts" and
"redo done" messages in the log. The results:

master + crc-intel.patch
------------------------

2014-11-17 20:40:10.175 EET LOG: redo starts at 0/2000090
2014-11-17 20:41:04.327 EET LOG: redo done at 0/7A0000C8

2014-11-17 20:42:46.319 EET LOG: redo starts at 0/2000090
2014-11-17 20:43:37.973 EET LOG: redo done at 0/7A0000C8

2014-11-17 20:44:14.889 EET LOG: redo starts at 0/2000090
2014-11-17 20:45:07.161 EET LOG: redo done at 0/7A0000C8

wal-format-and-api-changes-12.patch + crc-intel.patch
-----------------------------------------------------

2014-11-17 20:20:54.198 EET LOG: redo starts at 0/2000090
2014-11-17 20:21:49.073 EET LOG: redo done at 0/6A0000C8

2014-11-17 20:23:09.438 EET LOG: redo starts at 0/2000090
2014-11-17 20:24:04.744 EET LOG: redo done at 0/6A0000C8

2014-11-17 20:25:24.076 EET LOG: redo starts at 0/2000090
2014-11-17 20:26:18.830 EET LOG: redo done at 0/6A0000C8

In summary, there is no significant difference in replay performance.
The amount of WAL generated is much smaller with the patch.

For completeness, attached is the version of the patch that I used for
these tests. It's the same as patch version 11, but I fixed one trivial
bug in a b-tree redo function.

This concludes my performance testing, until someone wants to see some
other scenario being tested. I'm happy with the results.

- Heikki

Attachments:

replay-perf-test.shapplication/x-shellscript; name=replay-perf-test.sh
crc-intel.patchtext/x-diff; name=crc-intel.patch
wal-format-and-api-changes-12.patch.gzapplication/gzip; name=wal-format-and-api-changes-12.patch.gz
#100Michael Paquier
Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#99)
Re: WAL format and API changes (9.5)

On Tue, Nov 18, 2014 at 4:31 AM, Heikki Linnakangas <hlinnakangas@vmware.com>
wrote:

WAL insertion performance
=========================
To measure the performance of generating WAL, I ran the
wal-update-testsuite.sh that Amit also ran earlier. The cluster was
initialized with:

shared_buffers=512MB
checkpoint_segments=30
fsync=off
autovacuum=off
full_page_writes=off

[results]
Summary: No meaningful difference in runtime.

If I am seeing that correctly, WAL generated is reduced for all the tests,
except for the case of "hundred tiny fields" where more WAL is generated.
Now the duration time seems to be generally reduced, some noise (?) making
it sometimes higher.

WAL replay performance
======================

To test WAL replay performance, I ran pgbench with WAL archiving enabled,
and timed the replay of the generated WAL. I used the attached script,
replay-perf-test.sh for that. full_page_writes were disabled, because
replaying full page images is quite different from replaying other

records.

(Performance of full-page images is interesting too, but it's not expected
that these WAL format changes make much difference to that).

In summary, there is no significant difference in replay performance. The
amount of WAL generated is much smaller with the patch.

This concludes my performance testing, until someone wants to see some

other

scenario being tested. I'm happy with the results.

I think you can, that's a great study, and this proves to be a gain on many
fields.

If this goes in, it is going to be one of the largest patches committed
ever.
$ git diff --stat | tail -n1
91 files changed, 3895 insertions(+), 4305 deletions(-)

There are still some XXX blocks here and there in the code.. But nothing
really huge, like here:
-                * checks could go wrong too.
+                * checks could go wrong too. XXX does this comment still
make sense?
                 */
-               Assert(xldata->blkno != xldata->blknoNew);
+               Assert(blkno != blknoNew);

Btw, did you do a run with the buffer capture facility and checked for page
differences?

Except that, after going through the code once again, ISTM that the patch
is in a nice state. It may be better to wait for some input from Andres, he
may catch some issues I haven't spotted.
Regards,
--
Michael

#101Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Michael Paquier (#100)
Re: WAL format and API changes (9.5)

On 11/18/2014 10:28 AM, Michael Paquier wrote:

Btw, did you do a run with the buffer capture facility and checked for page
differences?

Yes. That's how I bumped into the bug fixed in c73669c0. That tool has
been tremendously helpful.

Except that, after going through the code once again, ISTM that the patch
is in a nice state. It may be better to wait for some input from Andres, he
may catch some issues I haven't spotted.

Yeah, I'm sure there are tons of small things. I'll have to take a small
break and read through the patch myself after a few days. But I'd like
to get this committed pretty soon. I believe everyone agrees with the
big picture, even if there's some little tweaking left to do. It's a
pain to maintain as a patch, and I believe the FPW compression patch is
waiting for this to land.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#102Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Heikki Linnakangas (#99)
Re: WAL format and API changes (9.5)

As you may have noticed, I committed this (after some more cleanup). Of
course, feel free to still review it, and please point out any issues
you may find.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#103Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Heikki Linnakangas (#102)
Re: WAL format and API changes (9.5)

On Thu, Nov 20, 2014 at 10:36 PM, Heikki Linnakangas <
hlinnakangas@vmware.com> wrote:

As you may have noticed, I committed this (after some more cleanup). Of

course, feel free to still review it, and please point out any issues you
may find.

Few minor observations:
1. Readme

void XLogResetInsertion(void)

Clear any currently registered data and buffers from the WAL record
construction workspace. This is only needed if you have already
called XLogBeginInsert(), but decide to not insert the record after all.

I think above sentence could be slightly rephrased as the this function is
also getting called at end of XLogInsert().

2.
shiftList()
{
..
XLogEnsureRecordSpace(data.ndeleted + 1, 0);
..
}

Shouldn't above function call to XLogEnsureRecordSpace() be done
under if (RelationNeedsWAL(rel))?

3.
XLogInsert(RmgrId rmid, uint8 info)
{
XLogRecPtr EndPos;

/* XLogBeginInsert() must have been called.
*/
if (!begininsert_called)
elog(ERROR, "XLogBeginInsert was not called");

As we are in critical section at this moment, so is it okay to have
elog(ERROR,). I think this can only happen due to some coding
mistake, but still not sure if elog(ERROR) is okay.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#104Michael Paquier
Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#102)
Re: WAL format and API changes (9.5)

On Fri, Nov 21, 2014 at 2:06 AM, Heikki Linnakangas <hlinnakangas@vmware.com

wrote:

As you may have noticed, I committed this (after some more cleanup). Of
course, feel free to still review it, and please point out any issues you
may find.

This comment on top of XLogRecordAssemble is not adapted as
page_writes_omitted does not exist. Perhaps this is a remnant of some older
version of the patch?
* If there are any registered buffers, and a full-page image was not taken
* of all them, *page_writes_omitted is set to true. This signals that the
* assembled record is only good for insertion on the assumption that the
* RedoRecPtr and doPageWrites values were up-to-date.
*/
--
Michael

#105Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Michael Paquier (#104)
Re: WAL format and API changes (9.5)

On 11/21/2014 09:19 AM, Michael Paquier wrote:

On Fri, Nov 21, 2014 at 2:06 AM, Heikki Linnakangas <hlinnakangas@vmware.com

wrote:

As you may have noticed, I committed this (after some more cleanup). Of
course, feel free to still review it, and please point out any issues you
may find.

This comment on top of XLogRecordAssemble is not adapted as
page_writes_omitted does not exist. Perhaps this is a remnant of some older
version of the patch?
* If there are any registered buffers, and a full-page image was not taken
* of all them, *page_writes_omitted is set to true. This signals that the
* assembled record is only good for insertion on the assumption that the
* RedoRecPtr and doPageWrites values were up-to-date.
*/

Yep, fixed thanks.
- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#106Heikki Linnakangas
Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit Kapila (#103)
Re: WAL format and API changes (9.5)

On 11/21/2014 05:58 AM, Amit Kapila wrote:

On Thu, Nov 20, 2014 at 10:36 PM, Heikki Linnakangas <
hlinnakangas@vmware.com> wrote:

As you may have noticed, I committed this (after some more cleanup). Of

course, feel free to still review it, and please point out any issues you
may find.

Few minor observations:
1. Readme

void XLogResetInsertion(void)

Clear any currently registered data and buffers from the WAL record
construction workspace. This is only needed if you have already
called XLogBeginInsert(), but decide to not insert the record after all.

I think above sentence could be slightly rephrased as the this function is
also getting called at end of XLogInsert().

Well, yeah, but that's internal to the xloginsert machinery, not
relevant for someone learning about how to use it.

2.
shiftList()
{
..
XLogEnsureRecordSpace(data.ndeleted + 1, 0);
..
}

Shouldn't above function call to XLogEnsureRecordSpace() be done
under if (RelationNeedsWAL(rel))?

True, fixed. (It doesn't hurt to call it, but it's clearly unnecessary)

3.
XLogInsert(RmgrId rmid, uint8 info)
{
XLogRecPtr EndPos;

/* XLogBeginInsert() must have been called.
*/
if (!begininsert_called)
elog(ERROR, "XLogBeginInsert was not called");

As we are in critical section at this moment, so is it okay to have
elog(ERROR,). I think this can only happen due to some coding
mistake, but still not sure if elog(ERROR) is okay.

In a critical section, the ERROR will be promoted automatically to a
PANIC. There isn't much else we can do if that happens; it is a coding
mistake as you say.

BTW, not all XLogInsert() calls are inside a critical section. All the
ones that modify data pages are, but there are others.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers