WAL format changes

Started by Heikki Linnakangasalmost 14 years ago31 messageshackers
Jump to latest
#1Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

As I threatened earlier
(http://archives.postgresql.org/message-id/4FD0B1AB.3090405@enterprisedb.com),
here are three patches that change the WAL format. The goal is to change
the format so that when you're inserting a WAL record of a given size,
you know exactly how much space it requires in the WAL.

1. Use a 64-bit segment number, instead of the log/seg combination. And
don't waste the last segment on each logical 4 GB log file. The concept
of a "logical log file" is now completely gone. XLogRecPtr is unchanged,
but it should now be understood as a plain 64-bit value, just split into
two 32-bit integers for historical reasons. On disk, this means that
there will be log files ending in FF, those were skipped before.

2. Always include the xl_rem_len field, used for continuation records,
in the xlog page header. A continuation log record only contained that
one field, it's now included straight in the page header, so the concept
of a continuation record doesn't exist anymore. Because of alignment,
this wastes 4 bytes on every page that contains continued data from a
previous record, and 8 bytes on pages that don't. That's not very much,
and the next step will buy that back:

3. Allow WAL record header to be split across pages. Per Tom's
suggestion, move xl_tot_len to be the first field in XLogRecord, so that
even if the header is split, xl_tot_len is always on the first page.
xl_crc is moved to be the last field, and xl_prev is the second to last.
This has the advantage that you can calculate the CRC for all the other
fields before acquiring WALInsertLock. For xl_prev, you need to know
where exactly the record is inserted, so it's handy that it's the last
field before CRC. This patch doesn't try to take advantage of that,
however, and I'm not sure if that makes any difference once I finish the
patch to make XLogInsert scale better, which is the ultimate goal of all
this.

Those are the three patches I'd like to get committed in this
commitfest. To see where all this is leading to, I've included a rough
WIP version of the XLogInsert scaling patch. This version is quite
different from the one I posted in spring, it takes advantage of the WAL
format changes, and I'm also experimenting with a different method of
tracking how far each WAL insertion has progressed. But more on that later.

(Note to self: remember to bump XLOG_PAGE_MAGIC)

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

1-use-uint64-got-segno.patchtext/x-diff; name=1-use-uint64-got-segno.patchDownload+419-563
2-move-continuation-record-to-page-header.patchtext/x-diff; name=2-move-continuation-record-to-page-header.patchDownload+27-37
3-allow-wal-record-header-to-be-split.patchtext/x-diff; name=3-allow-wal-record-header-to-be-split.patchDownload+216-157
4-WIP-xloginsert-scale.patchtext/x-diff; name=4-WIP-xloginsert-scale.patchDownload+1268-464
#2Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#1)
Re: WAL format changes

On Thursday, June 14, 2012 11:01:42 PM Heikki Linnakangas wrote:

As I threatened earlier
(http://archives.postgresql.org/message-id/4FD0B1AB.3090405@enterprisedb.co
m), here are three patches that change the WAL format. The goal is to
change the format so that when you're inserting a WAL record of a given
size, you know exactly how much space it requires in the WAL.

I fear the patches need rebasing after the pgindent run... Even before that
(60801944fa105252b48ea5688d47dfc05c695042) it only applies with offsets?

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#3Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#1)
Re: WAL format changes

On Thursday, June 14, 2012 11:01:42 PM Heikki Linnakangas wrote:

As I threatened earlier
(http://archives.postgresql.org/message-id/4FD0B1AB.3090405@enterprisedb.co
m), here are three patches that change the WAL format. The goal is to
change the format so that when you're inserting a WAL record of a given
size, you know exactly how much space it requires in the WAL.

1. Use a 64-bit segment number, instead of the log/seg combination. And
don't waste the last segment on each logical 4 GB log file. The concept
of a "logical log file" is now completely gone. XLogRecPtr is unchanged,
but it should now be understood as a plain 64-bit value, just split into
two 32-bit integers for historical reasons. On disk, this means that
there will be log files ending in FF, those were skipped before.

Whats the reason for keeping that awkward split now? There aren't that many
users of xlogid/xcrecoff and many of those would be better served by using
helper macros.
API compatibility isn't a great argument either as code manually playing
around with those needs to be checked anyway. I think there might be some code
around that does XLogRecPtr addition manuall and such.

2. Always include the xl_rem_len field, used for continuation records,
in the xlog page header. A continuation log record only contained that
one field, it's now included straight in the page header, so the concept
of a continuation record doesn't exist anymore. Because of alignment,
this wastes 4 bytes on every page that contains continued data from a
previous record, and 8 bytes on pages that don't. That's not very much,
and the next step will buy that back:

3. Allow WAL record header to be split across pages. Per Tom's
suggestion, move xl_tot_len to be the first field in XLogRecord, so that
even if the header is split, xl_tot_len is always on the first page.
xl_crc is moved to be the last field, and xl_prev is the second to last.
This has the advantage that you can calculate the CRC for all the other
fields before acquiring WALInsertLock. For xl_prev, you need to know
where exactly the record is inserted, so it's handy that it's the last
field before CRC. This patch doesn't try to take advantage of that,
however, and I'm not sure if that makes any difference once I finish the
patch to make XLogInsert scale better, which is the ultimate goal of all
this.

Those are the three patches I'd like to get committed in this
commitfest. To see where all this is leading to, I've included a rough
WIP version of the XLogInsert scaling patch. This version is quite
different from the one I posted in spring, it takes advantage of the WAL
format changes, and I'm also experimenting with a different method of
tracking how far each WAL insertion has progressed. But more on that later.

(Note to self: remember to bump XLOG_PAGE_MAGIC)

Will review.

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#4Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#3)
Re: WAL format changes

On Thu, Jun 14, 2012 at 5:58 PM, Andres Freund <andres@2ndquadrant.com> wrote:

1. Use a 64-bit segment number, instead of the log/seg combination. And
don't waste the last segment on each logical 4 GB log file. The concept
of a "logical log file" is now completely gone. XLogRecPtr is unchanged,
but it should now be understood as a plain 64-bit value, just split into
two 32-bit integers for historical reasons. On disk, this means that
there will be log files ending in FF, those were skipped before.

Whats the reason for keeping that awkward split now? There aren't that many
users of xlogid/xcrecoff and many of those would be better served by using
helper macros.

I wondered that, too. There may be a good reason for keeping it split
up that way, but we at least oughta think about it a bit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#4)
Re: WAL format changes

On 18.06.2012 21:00, Robert Haas wrote:

On Thu, Jun 14, 2012 at 5:58 PM, Andres Freund<andres@2ndquadrant.com> wrote:

1. Use a 64-bit segment number, instead of the log/seg combination. And
don't waste the last segment on each logical 4 GB log file. The concept
of a "logical log file" is now completely gone. XLogRecPtr is unchanged,
but it should now be understood as a plain 64-bit value, just split into
two 32-bit integers for historical reasons. On disk, this means that
there will be log files ending in FF, those were skipped before.

Whats the reason for keeping that awkward split now? There aren't that many
users of xlogid/xcrecoff and many of those would be better served by using
helper macros.

I wondered that, too. There may be a good reason for keeping it split
up that way, but we at least oughta think about it a bit.

The page header contains an XLogRecPtr (LSN), so if we change it we'll
have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
around as the on-disk representation, and convert between the 64-bit
integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
many xlog calculations would admittedly be simpler if it was an uint64.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#6Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#5)
Re: WAL format changes

On Monday, June 18, 2012 08:08:14 PM Heikki Linnakangas wrote:

On 18.06.2012 21:00, Robert Haas wrote:

On Thu, Jun 14, 2012 at 5:58 PM, Andres Freund<andres@2ndquadrant.com>

wrote:

1. Use a 64-bit segment number, instead of the log/seg combination. And
don't waste the last segment on each logical 4 GB log file. The concept
of a "logical log file" is now completely gone. XLogRecPtr is
unchanged, but it should now be understood as a plain 64-bit value,
just split into two 32-bit integers for historical reasons. On disk,
this means that there will be log files ending in FF, those were
skipped before.

Whats the reason for keeping that awkward split now? There aren't that
many users of xlogid/xcrecoff and many of those would be better served
by using helper macros.

I wondered that, too. There may be a good reason for keeping it split
up that way, but we at least oughta think about it a bit.

The page header contains an XLogRecPtr (LSN), so if we change it we'll
have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
around as the on-disk representation, and convert between the 64-bit
integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
many xlog calculations would admittedly be simpler if it was an uint64.

I am out of my depth here, not having read any of the relevant code, but
couldn't we simply replace the lsn from disk with InvalidXLogRecPtr without
marking the page dirty?

There is the valid argument that you would loose some information when pages
with hint bits are written out again, but on the other hand you would also
gain the information that it was a hint-bit write...

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#7Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#5)
Re: WAL format changes

On Mon, Jun 18, 2012 at 2:08 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 18.06.2012 21:00, Robert Haas wrote:

On Thu, Jun 14, 2012 at 5:58 PM, Andres Freund<andres@2ndquadrant.com>
 wrote:

1. Use a 64-bit segment number, instead of the log/seg combination. And
don't waste the last segment on each logical 4 GB log file. The concept
of a "logical log file" is now completely gone. XLogRecPtr is unchanged,
but it should now be understood as a plain 64-bit value, just split into
two 32-bit integers for historical reasons. On disk, this means that
there will be log files ending in FF, those were skipped before.

Whats the reason for keeping that awkward split now? There aren't that
many
users of xlogid/xcrecoff and many of those would be better served by
using
helper macros.

I wondered that, too.  There may be a good reason for keeping it split
up that way, but we at least oughta think about it a bit.

The page header contains an XLogRecPtr (LSN), so if we change it we'll have
to deal with pg_upgrade. I guess we could still keep XLogRecPtr around as
the on-disk representation, and convert between the 64-bit integer and
XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out - many xlog
calculations would admittedly be simpler if it was an uint64.

Ugh. Well, that's a good reason for thinking twice before changing
it, if not abandoning the idea altogether.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#6)
Re: WAL format changes

On 18.06.2012 21:13, Andres Freund wrote:

On Monday, June 18, 2012 08:08:14 PM Heikki Linnakangas wrote:

The page header contains an XLogRecPtr (LSN), so if we change it we'll
have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
around as the on-disk representation, and convert between the 64-bit
integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
many xlog calculations would admittedly be simpler if it was an uint64.

I am out of my depth here, not having read any of the relevant code, but
couldn't we simply replace the lsn from disk with InvalidXLogRecPtr without
marking the page dirty?

There is the valid argument that you would loose some information when pages
with hint bits are written out again, but on the other hand you would also
gain the information that it was a hint-bit write...

Sorry, I don't understand that. Where would you "replace the LSN from
disk with InvalidXLogRecPtr" ? (and no, it probably won't work ;-) )

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#9Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#8)
Re: WAL format changes

On Monday, June 18, 2012 08:32:54 PM Heikki Linnakangas wrote:

On 18.06.2012 21:13, Andres Freund wrote:

On Monday, June 18, 2012 08:08:14 PM Heikki Linnakangas wrote:

The page header contains an XLogRecPtr (LSN), so if we change it we'll
have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
around as the on-disk representation, and convert between the 64-bit
integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
many xlog calculations would admittedly be simpler if it was an uint64.

I am out of my depth here, not having read any of the relevant code, but
couldn't we simply replace the lsn from disk with InvalidXLogRecPtr
without marking the page dirty?

There is the valid argument that you would loose some information when
pages with hint bits are written out again, but on the other hand you
would also gain the information that it was a hint-bit write...

Sorry, I don't understand that. Where would you "replace the LSN from
disk with InvalidXLogRecPtr" ? (and no, it probably won't work ;-) )

In ReadBuffer_common or such, after reading a page from disk and verifying
that the page has a valid header it seems to be possible to replace pd_lsn *in
memory* by InvalidXLogRecPtr without marking the page dirty.
If the page isn't modified the lsn on disk won't be changed so you don't loose
debugging information in that case. If will be zero if it has been written by
a hint-bit write and thats arguable a win.

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#10Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#9)
Re: WAL format changes

On 18.06.2012 21:45, Andres Freund wrote:

On Monday, June 18, 2012 08:32:54 PM Heikki Linnakangas wrote:

On 18.06.2012 21:13, Andres Freund wrote:

On Monday, June 18, 2012 08:08:14 PM Heikki Linnakangas wrote:

The page header contains an XLogRecPtr (LSN), so if we change it we'll
have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
around as the on-disk representation, and convert between the 64-bit
integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
many xlog calculations would admittedly be simpler if it was an uint64.

I am out of my depth here, not having read any of the relevant code, but
couldn't we simply replace the lsn from disk with InvalidXLogRecPtr
without marking the page dirty?

There is the valid argument that you would loose some information when
pages with hint bits are written out again, but on the other hand you
would also gain the information that it was a hint-bit write...

Sorry, I don't understand that. Where would you "replace the LSN from
disk with InvalidXLogRecPtr" ? (and no, it probably won't work ;-) )

In ReadBuffer_common or such, after reading a page from disk and verifying
that the page has a valid header it seems to be possible to replace pd_lsn *in
memory* by InvalidXLogRecPtr without marking the page dirty.
If the page isn't modified the lsn on disk won't be changed so you don't loose
debugging information in that case. If will be zero if it has been written by
a hint-bit write and thats arguable a win.

We use the LSN to decide whether a full-page image need to be xlogged if
the page is modified. If you reset LSN every time you read in a page,
you'll be doing full page writes every time a page is read from disk and
modified, whether or not it's the first time the page is modified after
the last checkpoint.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#11Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#10)
Re: WAL format changes

On Monday, June 18, 2012 09:19:48 PM Heikki Linnakangas wrote:

On 18.06.2012 21:45, Andres Freund wrote:

On Monday, June 18, 2012 08:32:54 PM Heikki Linnakangas wrote:

On 18.06.2012 21:13, Andres Freund wrote:

On Monday, June 18, 2012 08:08:14 PM Heikki Linnakangas wrote:

The page header contains an XLogRecPtr (LSN), so if we change it we'll
have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
around as the on-disk representation, and convert between the 64-bit
integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
many xlog calculations would admittedly be simpler if it was an
uint64.

I am out of my depth here, not having read any of the relevant code,
but couldn't we simply replace the lsn from disk with
InvalidXLogRecPtr without marking the page dirty?

There is the valid argument that you would loose some information when
pages with hint bits are written out again, but on the other hand you
would also gain the information that it was a hint-bit write...

Sorry, I don't understand that. Where would you "replace the LSN from
disk with InvalidXLogRecPtr" ? (and no, it probably won't work ;-) )

In ReadBuffer_common or such, after reading a page from disk and
verifying that the page has a valid header it seems to be possible to
replace pd_lsn *in memory* by InvalidXLogRecPtr without marking the page
dirty.
If the page isn't modified the lsn on disk won't be changed so you don't
loose debugging information in that case. If will be zero if it has been
written by a hint-bit write and thats arguable a win.

We use the LSN to decide whether a full-page image need to be xlogged if
the page is modified. If you reset LSN every time you read in a page,
you'll be doing full page writes every time a page is read from disk and
modified, whether or not it's the first time the page is modified after
the last checkpoint.

Yea, I somehow disregarded that hint-bit writes would make a problem with full
page writes in that case. Normal writes wouldn't be a problem...
This should be fixable but the result would be too ugly. So its either
converting the on-disk representation or keeping the duplicated
representation.

pd_lsn seems to be well enough abstracted, doing the conversion in
PageSet/GetLSN seems to be easy enough.

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#12Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#5)
Re: WAL format changes

On 18.06.2012 21:08, Heikki Linnakangas wrote:

On 18.06.2012 21:00, Robert Haas wrote:

On Thu, Jun 14, 2012 at 5:58 PM, Andres Freund<andres@2ndquadrant.com>
wrote:

1. Use a 64-bit segment number, instead of the log/seg combination. And
don't waste the last segment on each logical 4 GB log file. The concept
of a "logical log file" is now completely gone. XLogRecPtr is
unchanged,
but it should now be understood as a plain 64-bit value, just split
into
two 32-bit integers for historical reasons. On disk, this means that
there will be log files ending in FF, those were skipped before.

Whats the reason for keeping that awkward split now? There aren't
that many
users of xlogid/xcrecoff and many of those would be better served by
using
helper macros.

I wondered that, too. There may be a good reason for keeping it split
up that way, but we at least oughta think about it a bit.

The page header contains an XLogRecPtr (LSN), so if we change it we'll
have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
around as the on-disk representation, and convert between the 64-bit
integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
many xlog calculations would admittedly be simpler if it was an uint64.

Well, that was easier than I thought. Attached is a patch to make
XLogRecPtr a uint64, on top of my other WAL format patches. I think we
should go ahead with this.

The LSNs on pages are still stored in the old format, to avoid changing
the on-disk format and breaking pg_upgrade. The XLogRecPtrs stored the
control file and WAL are changed, however, so an initdb (or at least
pg_resetxlog) is required.

Should we keep the old representation in the replication protocol
messages? That would make it simpler to write a client that works with
different server versions (like pg_receivexlog). Or, while we're at it,
perhaps we should mandate network-byte order for all the integer and
XLogRecPtr fields in the replication protocol. That would make it easier
to write a client that works across different architectures, in >= 9.3.
The contents of the WAL would of course be architecture-dependent, but
it would be nice if pg_receivexlog and similar tools could nevertheless
be architecture-independent.

I kept the %X/%X representation in error messages etc. I'm quite used to
that output, so reluctant to change it, although it's a bit silly now
that it represents just 64-bit value. Using UINT64_FORMAT would also
make the messages harder to translate.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

xlogrecptr-uint64-1.patchtext/x-diff; name=xlogrecptr-uint64-1.patchDownload+278-351
#13Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#12)
Re: WAL format changes

On Tuesday, June 19, 2012 10:14:08 AM Heikki Linnakangas wrote:

On 18.06.2012 21:08, Heikki Linnakangas wrote:

On 18.06.2012 21:00, Robert Haas wrote:

On Thu, Jun 14, 2012 at 5:58 PM, Andres Freund<andres@2ndquadrant.com>

wrote:

1. Use a 64-bit segment number, instead of the log/seg combination.
And don't waste the last segment on each logical 4 GB log file. The
concept of a "logical log file" is now completely gone. XLogRecPtr is
unchanged,
but it should now be understood as a plain 64-bit value, just split
into
two 32-bit integers for historical reasons. On disk, this means that
there will be log files ending in FF, those were skipped before.

Whats the reason for keeping that awkward split now? There aren't
that many
users of xlogid/xcrecoff and many of those would be better served by
using
helper macros.

I wondered that, too. There may be a good reason for keeping it split
up that way, but we at least oughta think about it a bit.

The page header contains an XLogRecPtr (LSN), so if we change it we'll
have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
around as the on-disk representation, and convert between the 64-bit
integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
many xlog calculations would admittedly be simpler if it was an uint64.

Well, that was easier than I thought. Attached is a patch to make
XLogRecPtr a uint64, on top of my other WAL format patches. I think we
should go ahead with this.

Cool. You plan to merge XLogSegNo with XLogRecPtr in that case? I am not sure
if having two representations which just have a constant factor inbetween
really makes sense.

The LSNs on pages are still stored in the old format, to avoid changing
the on-disk format and breaking pg_upgrade. The XLogRecPtrs stored the
control file and WAL are changed, however, so an initdb (or at least
pg_resetxlog) is required.

Sounds sensible.

Should we keep the old representation in the replication protocol
messages? That would make it simpler to write a client that works with
different server versions (like pg_receivexlog). Or, while we're at it,
perhaps we should mandate network-byte order for all the integer and
XLogRecPtr fields in the replication protocol.

The replication protocol uses pq_sendint for integers which should take care
of converting to big endian already. I don't think anything but the wal itself
is otherwise transported in a binary fashion? So I don't think there is any
such architecture dependency in the protocol currently?

I don't really see a point in keeping around a backward-compatible
representation just for the sake of running such tools on multiple versions. I
might not be pragmatic enough, but: Why would you want to do that *at the
moment*? Many of the other tools are already version specific, so...
When the protocol starts to be used by more tools, maybe, but imo were not
there yet.

But then its not hard to convert to the old representation for that.

I kept the %X/%X representation in error messages etc. I'm quite used to
that output, so reluctant to change it, although it's a bit silly now
that it represents just 64-bit value. Using UINT64_FORMAT would also
make the messages harder to translate.

No opinion on that. Its easier to see for me whether two values are exactly
the same or very similar with the 64bit representation but its harder to gauge
bigger differences. So ...

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#14Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#12)
Re: WAL format changes

On Tue, Jun 19, 2012 at 4:14 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Well, that was easier than I thought. Attached is a patch to make XLogRecPtr
a uint64, on top of my other WAL format patches. I think we should go ahead
with this.

+1.

The LSNs on pages are still stored in the old format, to avoid changing the
on-disk format and breaking pg_upgrade. The XLogRecPtrs stored the control
file and WAL are changed, however, so an initdb (or at least pg_resetxlog)
is required.

Seems fine.

Should we keep the old representation in the replication protocol messages?
That would make it simpler to write a client that works with different
server versions (like pg_receivexlog). Or, while we're at it, perhaps we
should mandate network-byte order for all the integer and XLogRecPtr fields
in the replication protocol. That would make it easier to write a client
that works across different architectures, in >= 9.3. The contents of the
WAL would of course be architecture-dependent, but it would be nice if
pg_receivexlog and similar tools could nevertheless be
architecture-independent.

I share Andres' question about how we're doing this already. I think
if we're going to break this, I'd rather do it in 9.3 than 5 years
from now. At this point it's just a minor annoyance, but it'll
probably get worse as people write more tools that understand WAL.

I kept the %X/%X representation in error messages etc. I'm quite used to
that output, so reluctant to change it, although it's a bit silly now that
it represents just 64-bit value. Using UINT64_FORMAT would also make the
messages harder to translate.

I could go either way on this one, but I have no problem with the way
you did it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#13)
Re: WAL format changes

On 19.06.2012 18:46, Andres Freund wrote:

On Tuesday, June 19, 2012 10:14:08 AM Heikki Linnakangas wrote:

Well, that was easier than I thought. Attached is a patch to make
XLogRecPtr a uint64, on top of my other WAL format patches. I think we
should go ahead with this.

Cool. You plan to merge XLogSegNo with XLogRecPtr in that case? I am not sure
if having two representations which just have a constant factor inbetween
really makes sense.

I wasn't planning to, it didn't even occur to me that we might be able
to get rid of XLogSegNo to be honest. There's places that deal whole
segments, rather than with specific byte positions in the WAL, so I
think XLogSegNo makes more sense in that context. Take
XLogArchiveNotifySeg(), for example. It notifies the archiver that a
given segment is ready for archiving, so we pass an XLogSegNo to
identify that segment as an argument. I suppose we could pass an
XLogRecPtr that points to the beginning of the segment instead, but it
doesn't really feel like an improvement to me.

Should we keep the old representation in the replication protocol
messages? That would make it simpler to write a client that works with
different server versions (like pg_receivexlog). Or, while we're at it,
perhaps we should mandate network-byte order for all the integer and
XLogRecPtr fields in the replication protocol.

The replication protocol uses pq_sendint for integers which should take care
of converting to big endian already. I don't think anything but the wal itself
is otherwise transported in a binary fashion? So I don't think there is any
such architecture dependency in the protocol currently?

We only use pg_sendint() for the few values exchanged in the handshake
before we start replicating, but once we begin, we just send structs
around. For example, in ProcessStandbyReplyMessage():

static void
ProcessStandbyReplyMessage(void)
{
StandbyReplyMessage reply;

pq_copymsgbytes(&reply_message, (char *) &reply, sizeof(StandbyReplyMessage));
...

After that, we just the fields in the reply struct like in any other struct.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#16Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#15)
Re: WAL format changes

Hi,

On Wednesday, June 20, 2012 12:24:54 AM Heikki Linnakangas wrote:

On 19.06.2012 18:46, Andres Freund wrote:

On Tuesday, June 19, 2012 10:14:08 AM Heikki Linnakangas wrote:

Well, that was easier than I thought. Attached is a patch to make
XLogRecPtr a uint64, on top of my other WAL format patches. I think we
should go ahead with this.

Cool. You plan to merge XLogSegNo with XLogRecPtr in that case? I am not
sure if having two representations which just have a constant factor
inbetween really makes sense.

I wasn't planning to, it didn't even occur to me that we might be able
to get rid of XLogSegNo to be honest. There's places that deal whole
segments, rather than with specific byte positions in the WAL, so I
think XLogSegNo makes more sense in that context. Take
XLogArchiveNotifySeg(), for example. It notifies the archiver that a
given segment is ready for archiving, so we pass an XLogSegNo to
identify that segment as an argument. I suppose we could pass an
XLogRecPtr that points to the beginning of the segment instead, but it
doesn't really feel like an improvement to me.

I am not sure its a win either, was just wondering because they now are that
similar.

Should we keep the old representation in the replication protocol
messages? That would make it simpler to write a client that works with
different server versions (like pg_receivexlog). Or, while we're at it,
perhaps we should mandate network-byte order for all the integer and
XLogRecPtr fields in the replication protocol.

The replication protocol uses pq_sendint for integers which should take
care of converting to big endian already. I don't think anything but the
wal itself is otherwise transported in a binary fashion? So I don't
think there is any such architecture dependency in the protocol
currently?

We only use pg_sendint() for the few values exchanged in the handshake
before we start replicating, but once we begin, we just send structs

around. For example, in ProcessStandbyReplyMessage():

static void
ProcessStandbyReplyMessage(void)
{

StandbyReplyMessage reply;

pq_copymsgbytes(&reply_message, (char *) &reply,
sizeof(StandbyReplyMessage));

...

After that, we just the fields in the reply struct like in any other
struct.

Yes, forgot that, true. I guess the best fix would be to actually send normal
messages instead of CopyData ones? Much more to type though...

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#17Magnus Hagander
magnus@hagander.net
In reply to: Robert Haas (#14)
Re: WAL format changes

On Tue, Jun 19, 2012 at 5:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jun 19, 2012 at 4:14 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Well, that was easier than I thought. Attached is a patch to make XLogRecPtr
a uint64, on top of my other WAL format patches. I think we should go ahead
with this.

+1.

The LSNs on pages are still stored in the old format, to avoid changing the
on-disk format and breaking pg_upgrade. The XLogRecPtrs stored the control
file and WAL are changed, however, so an initdb (or at least pg_resetxlog)
is required.

Seems fine.

Should we keep the old representation in the replication protocol messages?
That would make it simpler to write a client that works with different
server versions (like pg_receivexlog). Or, while we're at it, perhaps we
should mandate network-byte order for all the integer and XLogRecPtr fields
in the replication protocol. That would make it easier to write a client
that works across different architectures, in >= 9.3. The contents of the
WAL would of course be architecture-dependent, but it would be nice if
pg_receivexlog and similar tools could nevertheless be
architecture-independent.

I share Andres' question about how we're doing this already.  I think
if we're going to break this, I'd rather do it in 9.3 than 5 years
from now.  At this point it's just a minor annoyance, but it'll
probably get worse as people write more tools that understand WAL.

If we are looking at breaking it, and we are especially concerned
about something like pg_receivexlog... Is it something we could/should
change in the protocl *now* for 9.2, to make it non-broken in any
released version? As in, can we extract just the protocol change and
backpatch that to 9.2beta?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

#18Fujii Masao
masao.fujii@gmail.com
In reply to: Magnus Hagander (#17)
Re: WAL format changes

On Wed, Jun 20, 2012 at 8:19 PM, Magnus Hagander <magnus@hagander.net> wrote:

On Tue, Jun 19, 2012 at 5:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jun 19, 2012 at 4:14 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Well, that was easier than I thought. Attached is a patch to make XLogRecPtr
a uint64, on top of my other WAL format patches. I think we should go ahead
with this.

+1.

The LSNs on pages are still stored in the old format, to avoid changing the
on-disk format and breaking pg_upgrade. The XLogRecPtrs stored the control
file and WAL are changed, however, so an initdb (or at least pg_resetxlog)
is required.

Seems fine.

Should we keep the old representation in the replication protocol messages?
That would make it simpler to write a client that works with different
server versions (like pg_receivexlog). Or, while we're at it, perhaps we
should mandate network-byte order for all the integer and XLogRecPtr fields
in the replication protocol. That would make it easier to write a client
that works across different architectures, in >= 9.3. The contents of the
WAL would of course be architecture-dependent, but it would be nice if
pg_receivexlog and similar tools could nevertheless be
architecture-independent.

I share Andres' question about how we're doing this already.  I think
if we're going to break this, I'd rather do it in 9.3 than 5 years
from now.  At this point it's just a minor annoyance, but it'll
probably get worse as people write more tools that understand WAL.

If we are looking at breaking it, and we are especially concerned
about something like pg_receivexlog... Is it something we could/should
change in the protocl *now* for 9.2, to make it non-broken in any
released version? As in, can we extract just the protocol change and
backpatch that to 9.2beta?

pg_receivexlog in 9.2 cannot handle correctly the WAL location "FF"
(which was skipped in 9.2 or before). For example, pg_receivexlog calls
XLByteAdvance() which always skips "FF". So even if we change the protocol,
ISTM pg_receivexlog in 9.2 cannot work well with the server in 9.3 which
might send "FF". No?

Regards,

--
Fujii Masao

#19Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#18)
Re: WAL format changes

On 20.06.2012 20:43, Fujii Masao wrote:

On Wed, Jun 20, 2012 at 8:19 PM, Magnus Hagander<magnus@hagander.net> wrote:

On Tue, Jun 19, 2012 at 5:57 PM, Robert Haas<robertmhaas@gmail.com> wrote:

On Tue, Jun 19, 2012 at 4:14 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Well, that was easier than I thought. Attached is a patch to make XLogRecPtr
a uint64, on top of my other WAL format patches. I think we should go ahead
with this.

+1.

The LSNs on pages are still stored in the old format, to avoid changing the
on-disk format and breaking pg_upgrade. The XLogRecPtrs stored the control
file and WAL are changed, however, so an initdb (or at least pg_resetxlog)
is required.

Seems fine.

Should we keep the old representation in the replication protocol messages?
That would make it simpler to write a client that works with different
server versions (like pg_receivexlog). Or, while we're at it, perhaps we
should mandate network-byte order for all the integer and XLogRecPtr fields
in the replication protocol. That would make it easier to write a client
that works across different architectures, in>= 9.3. The contents of the
WAL would of course be architecture-dependent, but it would be nice if
pg_receivexlog and similar tools could nevertheless be
architecture-independent.

I share Andres' question about how we're doing this already. I think
if we're going to break this, I'd rather do it in 9.3 than 5 years
from now. At this point it's just a minor annoyance, but it'll
probably get worse as people write more tools that understand WAL.

If we are looking at breaking it, and we are especially concerned
about something like pg_receivexlog... Is it something we could/should
change in the protocl *now* for 9.2, to make it non-broken in any
released version? As in, can we extract just the protocol change and
backpatch that to 9.2beta?

pg_receivexlog in 9.2 cannot handle correctly the WAL location "FF"
(which was skipped in 9.2 or before). For example, pg_receivexlog calls
XLByteAdvance() which always skips "FF". So even if we change the protocol,
ISTM pg_receivexlog in 9.2 cannot work well with the server in 9.3 which
might send "FF". No?

Yeah, you can't use pg_receivexlog from 9.2 against a 9.3 server. We
can't really promise compatibility when using an older client against a
newer server, but we can try to be backwards-compatible in the other
direction. I'm thinking of using a 9.3 pg_receivexlog against a 9.2 server.

But I guess Robert is right and we shouldn't worry about
backwards-compatibility at this point. Instead, let's try to get the
protocol right, so that we can more easily provide
backwards-compatibility in the future. Like, using a 9.4 pg_receivexlog
against a 9.3 server.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#20Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#14)
Re: WAL format changes

Ok, committed all the WAL format changes now.

On 19.06.2012 18:57, Robert Haas wrote:

Should we keep the old representation in the replication protocol messages?
That would make it simpler to write a client that works with different
server versions (like pg_receivexlog). Or, while we're at it, perhaps we
should mandate network-byte order for all the integer and XLogRecPtr fields
in the replication protocol. That would make it easier to write a client
that works across different architectures, in>= 9.3. The contents of the
WAL would of course be architecture-dependent, but it would be nice if
pg_receivexlog and similar tools could nevertheless be
architecture-independent.

I share Andres' question about how we're doing this already. I think
if we're going to break this, I'd rather do it in 9.3 than 5 years
from now. At this point it's just a minor annoyance, but it'll
probably get worse as people write more tools that understand WAL.

I didn't touch the replication protocol yet, but I think we should do it
some time during 9.3.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#21Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#20)
#22Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#20)
#23Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#20)
#24Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#23)
#25Robert Haas
robertmhaas@gmail.com
In reply to: Fujii Masao (#24)
#26Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#25)
#27Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#26)
#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#26)
#29Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#28)
#30Peter Eisentraut
peter_e@gmx.net
In reply to: Heikki Linnakangas (#1)
#31Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#1)