xlog.c: WALInsertLock vs. WALWriteLock

Started by fazool mein about 15 years ago, 17 messages
#1 fazool mein
fazoolmein@gmail.com

Hello guys,

I'm writing a function that will read data from the buffer in xlog (i.e.
from XLogCtl->pages and XLogCtl->xlblocks). I want to make sure that I am
doing it correctly.
For reading from the buffer, do I need to lock WALInsertLock or
WALWriteLock? Also, can you explain a bit the usage of 'LW_SHARED'. Can we
use it for read purposes?

Thanks a lot.

#2 David Fetter
david@fetter.org
In reply to: fazool mein (#1)
Re: xlog.c: WALInsertLock vs. WALWriteLock

On Fri, Oct 22, 2010 at 12:08:54PM -0700, fazool mein wrote:

Hello guys,

I'm writing a function that will read data from the buffer in xlog
(i.e. from XLogCtl->pages and XLogCtl->xlblocks). I want to make
sure that I am doing it correctly.

Got an example of what the function might look like?

For reading from the buffer, do I need to lock WALInsertLock or
WALWriteLock? Also, can you explain a bit the usage of 'LW_SHARED'.
Can we use it for read purposes?

Help me understand. Do you foresee some kind of concurrency issue,
and if so, what?

Cheers,
David.

Thanks a lot.

--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#3 Tallat Mahmood
tallat.mahmood@gmail.com
In reply to: David Fetter (#2)
Re: xlog.c: WALInsertLock vs. WALWriteLock

I'm writing a function that will read data from the buffer in xlog
(i.e. from XLogCtl->pages and XLogCtl->xlblocks). I want to make
sure that I am doing it correctly.

Got an example of what the function might look like?

Say something like this:

bool ReadLogFromBuffer(char *buf, int len, XLogRecPtr p)

which would read len bytes of log records, starting at position p, into
buf.

For reading from the buffer, do I need to lock WALInsertLock or
WALWriteLock? Also, can you explain a bit the usage of 'LW_SHARED'.
Can we use it for read purposes?

Help me understand. Do you foresee some kind of concurrency issue,
and if so, what?

Yes. For example, while a process is reading from the buffer, another
process may insert new records into the buffer. To give a specific example,
walsender might want to read data from the buffer instead of reading log
from disk. In parallel, there might be transactions on the server that
modify the buffer.

Regards,
Tallat

#4 Robert Haas
robertmhaas@gmail.com
In reply to: fazool mein (#1)
Re: xlog.c: WALInsertLock vs. WALWriteLock

On Fri, Oct 22, 2010 at 3:08 PM, fazool mein <fazoolmein@gmail.com> wrote:

I'm writing a function that will read data from the buffer in xlog (i.e.
from XLogCtl->pages and XLogCtl->xlblocks). I want to make sure that I am
doing it correctly.
For reading from the buffer, do I need to lock WALInsertLock or
WALWriteLock? Also, can you explain a bit the usage of 'LW_SHARED'. Can we
use it for read purposes?

Holding WALInsertLock in shared mode prevents other processes from
inserting WAL, or in other words it keeps the "end" position from
moving, while holding WALWriteLock in shared mode prevents other
processes from writing the WAL from the buffers out to the operating
system, or in other words it keeps the "start" position from moving.
So you could probably take WALInsertLock in shared mode, figure out
the current end of WAL position, release the lock; then take
WALWriteLock in shared mode, read any WAL before the end of WAL
position, and release the lock. But note that this wouldn't guarantee
that you read all WAL as it's generated....

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
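
For illustration, here is a minimal sketch of the two-step sequence described above. It is a single-threaded toy model, not PostgreSQL code: ToyLock, toy_lock_try, toy_lock_release, ToyWal and toy_wal_read are hypothetical names, and the try-style lock returns false where a real lightweight lock would block. It only shows which mode is compatible with which, and the order in which the two locks are taken.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a lightweight lock; NOT PostgreSQL's LWLock. */
typedef enum { TOY_SHARED, TOY_EXCLUSIVE } ToyLockMode;

typedef struct {
    int  shared_holders;   /* number of current shared holders */
    bool exclusive_held;   /* true while an exclusive holder exists */
} ToyLock;

/* Try to take the lock; returns false where a real lock would block. */
static bool toy_lock_try(ToyLock *lock, ToyLockMode mode)
{
    if (lock->exclusive_held)
        return false;                  /* exclusive excludes everyone */
    if (mode == TOY_EXCLUSIVE) {
        if (lock->shared_holders > 0)
            return false;              /* shared holders exclude a writer */
        lock->exclusive_held = true;
    } else {
        lock->shared_holders++;        /* shared holders coexist */
    }
    return true;
}

static void toy_lock_release(ToyLock *lock, ToyLockMode mode)
{
    if (mode == TOY_EXCLUSIVE) {
        assert(lock->exclusive_held);
        lock->exclusive_held = false;
    } else {
        assert(lock->shared_holders > 0);
        lock->shared_holders--;
    }
}

typedef struct {
    ToyLock insert_lock;   /* stands in for WALInsertLock */
    ToyLock write_lock;    /* stands in for WALWriteLock */
    size_t  start;         /* bytes already written out ("start") */
    size_t  end;           /* bytes inserted so far ("end") */
    char    data[64];
} ToyWal;

/*
 * Two-step read: snapshot "end" under the insert lock, release it, then
 * copy [start, end) under the write lock.  Returns the number of bytes
 * copied; 0 means nothing to read or a lock was unavailable.
 */
static size_t toy_wal_read(ToyWal *wal, char *out, size_t cap)
{
    size_t end, n;

    if (!toy_lock_try(&wal->insert_lock, TOY_SHARED))
        return 0;
    end = wal->end;
    toy_lock_release(&wal->insert_lock, TOY_SHARED);

    if (!toy_lock_try(&wal->write_lock, TOY_SHARED))
        return 0;
    assert(end <= sizeof wal->data);
    n = (end > wal->start) ? end - wal->start : 0;
    if (n > cap)
        n = cap;
    memcpy(out, wal->data + wal->start, n);
    toy_lock_release(&wal->write_lock, TOY_SHARED);
    return n;
}
```

In this model two shared holders can coexist while an exclusive request is refused, which is the sense in which a shared lock can serve "read purposes". Because the insert lock is released before the copy, the snapshot of "end" can be stale by the time the data is read.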

#5 Jeff Janes
jeff.janes@gmail.com
In reply to: Robert Haas (#4)
Re: xlog.c: WALInsertLock vs. WALWriteLock

On Mon, Oct 25, 2010 at 6:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Oct 22, 2010 at 3:08 PM, fazool mein <fazoolmein@gmail.com> wrote:

I'm writing a function that will read data from the buffer in xlog (i.e.
from XLogCtl->pages and XLogCtl->xlblocks). I want to make sure that I am
doing it correctly.
For reading from the buffer, do I need to lock WALInsertLock or
WALWriteLock? Also, can you explain a bit the usage of 'LW_SHARED'. Can we
use it for read purposes?

Holding WALInsertLock in shared mode prevents other processes from
inserting WAL, or in other words it keeps the "end" position from
moving, while holding WALWriteLock in shared mode prevents other
processes from writing the WAL from the buffers out to the operating
system, or in other words it keeps the "start" position from moving.
So you could probably take WALInsertLock in shared mode, figure out
the current end of WAL position, release the lock;

Once you release the WALInsertLock, someone else can grab it and
overwrite the part of the buffer you think you are reading.
So I think you have to hold WALInsertLock throughout the duration of
the operation.

Of course it couldn't be overwritten if the WAL record itself has not
yet been written from the buffer to the OS/disk. But since you are not yet
holding the WALWriteLock, this could be happening at any time.

then take
WALWriteLock in shared mode, read any WAL before the end of WAL
position, and release the lock.  But note that this wouldn't guarantee
that you read all WAL as it's generated....

I don't think that holding WALWriteLock accomplishes much. It
prevents part of the buffer from being written out to OS/disk, and
thus becoming eligible for being overwritten in the buffer, but the
WALInsertLock prevents it from actually being overwritten. And what
if the part of the buffer you want to read was already eligible for
overwriting but not yet actually overwritten? WALWriteLock won't
allow you to safely access it, but WALInsertLock will (assuming you
have a safe way to identify the record in the first place). For
either case, holding it in shared mode would be sufficient.

Jeff

#6 Alvaro Herrera
alvherre@commandprompt.com
In reply to: Jeff Janes (#5)
Re: xlog.c: WALInsertLock vs. WALWriteLock

Excerpts from Jeff Janes's message of Tue Oct 26 12:22:38 -0300 2010:

I don't think that holding WALWriteLock accomplishes much. It
prevents part of the buffer from being written out to OS/disk, and
thus becoming eligible for being overwritten in the buffer, but the
WALInsertLock prevents it from actually being overwritten. And what
if the part of the buffer you want to read was already eligible for
overwriting but not yet actually overwritten? WALWriteLock won't
allow you to safely access it, but WALInsertLock will (assuming you
have a safe way to identify the record in the first place). For
either case, holding it in shared mode would be sufficient.

And horrible for performance, I imagine. Those locks are highly trafficked.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#6)
Re: xlog.c: WALInsertLock vs. WALWriteLock

Alvaro Herrera <alvherre@commandprompt.com> writes:

Excerpts from Jeff Janes's message of Tue Oct 26 12:22:38 -0300 2010:

I don't think that holding WALWriteLock accomplishes much. It
prevents part of the buffer from being written out to OS/disk, and
thus becoming eligible for being overwritten in the buffer, but the
WALInsertLock prevents it from actually being overwritten. And what
if the part of the buffer you want to read was already eligible for
overwriting but not yet actually overwritten? WALWriteLock won't
allow you to safely access it, but WALInsertLock will (assuming you
have a safe way to identify the record in the first place). For
either case, holding it in shared mode would be sufficient.

And horrible for performance, I imagine. Those locks are highly trafficked.

I think you might actually need *both* locks to ensure the WAL buffers
aren't changing underneath you. If you don't have the walwriter locked
out, it is free to change the state of a buffer from "dirty" to
"written" and then to "prepared to receive next page of WAL". If the
latter doesn't involve changing the content of the buffer today, it
still could tomorrow.

And on top of all that, there remains the problem that the piece of WAL
you want might already be gone from the buffers.

Might I suggest adopting the same technique walsender does, ie just read
the data back from disk? There's a reason why we gave up trying to have
walsender read directly from the buffers.

regards, tom lane
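
To make the "already gone from the buffers" problem concrete, here is a minimal single-threaded sketch of a page ring. It is an assumption-laden toy, not the xlog.c layout: ToyWalRing, toy_ring_insert_page and toy_ring_read_page are hypothetical names, pages[] only plays the role of XLogCtl->pages, page_end[] only plays the role of XLogCtl->xlblocks, and positions are plain byte offsets.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 8      /* tiny pages keep the example readable */
#define TOY_NUM_PAGES 4

/* Hypothetical model of the WAL buffer ring; NOT the real XLogCtl. */
typedef struct {
    char     pages[TOY_NUM_PAGES][TOY_PAGE_SIZE];
    uint64_t page_end[TOY_NUM_PAGES]; /* end position of WAL held in slot */
    uint64_t next_page;               /* number of pages inserted so far */
} ToyWalRing;

/* Insert one full page, recycling the oldest slot once the ring is full. */
static void toy_ring_insert_page(ToyWalRing *ring, const char *page)
{
    size_t slot = (size_t) (ring->next_page % TOY_NUM_PAGES);

    memcpy(ring->pages[slot], page, TOY_PAGE_SIZE);
    ring->next_page++;
    ring->page_end[slot] = ring->next_page * TOY_PAGE_SIZE;
}

/*
 * Copy the page starting at position pos into out.  Returns false when
 * the slot no longer holds that page (never filled, or since recycled).
 */
static bool toy_ring_read_page(const ToyWalRing *ring, uint64_t pos, char *out)
{
    size_t slot;

    assert(pos % TOY_PAGE_SIZE == 0);
    slot = (size_t) ((pos / TOY_PAGE_SIZE) % TOY_NUM_PAGES);
    if (ring->page_end[slot] != pos + TOY_PAGE_SIZE)
        return false;                 /* the wanted page is gone */
    memcpy(out, ring->pages[slot], TOY_PAGE_SIZE);
    return true;
}
```

With four slots, the page at position 0 is readable until a fifth page is inserted; after that the lookup fails and the caller would have to fall back to another source, such as the file on disk. In a concurrent setting the tag check and the copy would also have to be protected against a slot being recycled in between, which this toy does not model.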

#8 fazool mein
fazoolmein@gmail.com
In reply to: Tom Lane (#7)
Re: xlog.c: WALInsertLock vs. WALWriteLock

Might I suggest adopting the same technique walsender does, ie just read
the data back from disk? There's a reason why we gave up trying to have
walsender read directly from the buffers.

That is exactly what I do not want to do, i.e. read from disk, as long as
the piece of WAL is available in the buffers. Can you please describe why
walsender reading directly from the buffers was given up? To avoid a lot of
locking?
The locking issue might not be a problem considering synchronous
replication. In synchronous replication, the primary will wait anyway for
the standby to send a confirmation before it can do more WAL inserts. Hence,
reading from buffers might be better in this case.

So, as I understand from the emails, we need to lock both WALWriteLock and
WALInsertLock in exclusive mode for reading from buffers. Agreed?

Thanks.

#9 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: fazool mein (#8)
Re: xlog.c: WALInsertLock vs. WALWriteLock

On 26.10.2010 21:03, fazool mein wrote:

Might I suggest adopting the same technique walsender does, ie just read
the data back from disk? There's a reason why we gave up trying to have
walsender read directly from the buffers.

That is exactly what I do not want to do, i.e. read from disk, as long as
the piece of WAL is available in the buffers.

Why not? If the reason is performance, I'd like to see some performance
numbers to show that it's worth the trouble. You could perhaps do a
quick and dirty hack that doesn't do the locking 100% correctly first,
and do some benchmarking on that to get a ballpark number of how much
potential there is. Or run oprofile on the current walsender
implementation to see how much time is currently spent reading WAL from
the kernel buffers.

Can you please describe why
walsender reading directly from the buffers was given up? To avoid a lot of
locking?

To avoid locking yes, and complexity in general.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
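
As a rough starting point for the kind of ballpark measurement suggested above, here is a minimal sketch that times a positioned read from a file. It is not walsender code: timed_read_at is a hypothetical helper, the path and offset are whatever the caller supplies, and a real benchmark would repeat the read many times and look at the distribution rather than at one sample.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: read up to len bytes at offset from path into buf
 * and report the elapsed wall-clock time in nanoseconds.  Returns the
 * number of bytes read (short at end of file) or -1 on error.
 */
static ssize_t timed_read_at(const char *path, off_t offset,
                             char *buf, size_t len, int64_t *elapsed_ns)
{
    struct timespec t0, t1;
    ssize_t total = 0;
    int fd;

    assert(buf != NULL && elapsed_ns != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    while ((size_t) total < len) {
        ssize_t n = pread(fd, buf + total, len - (size_t) total,
                          offset + total);
        if (n < 0) {
            close(fd);
            return -1;
        }
        if (n == 0)
            break;                    /* end of file */
        total += n;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    *elapsed_ns = (int64_t) (t1.tv_sec - t0.tv_sec) * 1000000000LL
                  + (t1.tv_nsec - t0.tv_nsec);
    return total;
}
```

If the data was written recently, the read is likely to be served from the kernel's page cache, so the number mostly reflects system-call and copy overhead rather than disk latency.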

#10 Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#9)
Re: xlog.c: WALInsertLock vs. WALWriteLock

On Tue, Oct 26, 2010 at 2:13 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 26.10.2010 21:03, fazool mein wrote:

Might I suggest adopting the same technique walsender does, ie just read
the data back from disk?  There's a reason why we gave up trying to have
walsender read directly from the buffers.

That is exactly what I do not want to do, i.e. read from disk, as long as
the piece of WAL is available in the buffers.

Why not? If the reason is performance, I'd like to see some performance
numbers to show that it's worth the trouble. You could perhaps do a quick
and dirty hack that doesn't do the locking 100% correctly first, and do some
benchmarking on that to get a ballpark number of how much potential there
is. Or run oprofile on the current walsender implementation to see how much
time is currently spent reading WAL from the kernel buffers.

Can you please describe why
walsender reading directly from the buffers was given up? To avoid a lot of
locking?

To avoid locking yes, and complexity in general.

And the fact that it might allow the standby to get ahead of the
master, leading to silent database corruption.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11 fazool mein
fazoolmein@gmail.com
In reply to: Robert Haas (#10)
Re: xlog.c: WALInsertLock vs. WALWriteLock

On Tue, Oct 26, 2010 at 11:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 26, 2010 at 2:13 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Can you please describe why
walsender reading directly from the buffers was given up? To avoid a lot of
locking?

To avoid locking yes, and complexity in general.

And the fact that it might allow the standby to get ahead of the
master, leading to silent database corruption.

I agree that the standby might get ahead, but this doesn't necessarily lead
to database corruption. Here, the interesting case is what happens when the
primary fails, which can lead to *either* of the following two cases:
1) The standby, due to some triggering mechanism, becomes the new primary.
In this case, even if the standby was ahead, it's fine.
2) The primary comes back as primary. In this case, the standby will connect
again to the primary. At this point, *if* somehow we are able to detect that
the standby is ahead, then we should abort the standby and create a standby
from scratch.

I agree with Heikki that going through all this trouble only makes sense if
there is a huge performance boost.

#12 Robert Haas
robertmhaas@gmail.com
In reply to: fazool mein (#11)
Re: xlog.c: WALInsertLock vs. WALWriteLock

On Tue, Oct 26, 2010 at 2:57 PM, fazool mein <fazoolmein@gmail.com> wrote:

On Tue, Oct 26, 2010 at 11:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 26, 2010 at 2:13 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Can you please describe why
walsender reading directly from the buffers was given up? To avoid a lot of
locking?

To avoid locking yes, and complexity in general.

And the fact that it might allow the standby to get ahead of the
master, leading to silent database corruption.

I agree that the standby might get ahead, but this doesn't necessarily lead
to database corruption. Here, the interesting case is what happens when the
primary fails, which can lead to *either* of the following two cases:
1) The standby, due to some triggering mechanism, becomes the new primary.
In this case, even if the standby was ahead, it's fine.

True.

2) The primary comes back as primary. In this case, the standby will connect
again to the primary. At this point, *if* somehow we are able to detect that
the standby is ahead, then we should abort the standby and create a standby
from scratch.

Unless you set restart_after_crash=off, the master could
crash and restart before you can do anything about it. But that
setting doesn't exist in the 9.0 branch.

I agree with Heikki that going through all this trouble only makes sense if
there is a huge performance boost.

There's probably quite a large performance boost in the sync rep case
from allowing the master and standby to fsync() in parallel, but first
we need to get something that works at all.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13 Josh Berkus
josh@agliodbs.com
In reply to: fazool mein (#11)
Re: xlog.c: WALInsertLock vs. WALWriteLock

I agree that the standby might get ahead, but this doesn't necessarily
lead to database corruption. Here, the interesting case is what happens
when the primary fails, which can lead to *either* of the following two
cases:
1) The standby, due to some triggering mechanism, becomes the new
primary. In this case, even if the standby was ahead, it's fine.
2) The primary comes back as primary. In this case, the standby will
connect again to the primary. At this point, *if* somehow we are able to
detect that the standby is ahead, then we should abort the standby and
create a standby from scratch.

Yes. And we weren't able to implement that for 9.0. It's worth
revisiting for 9.1. In fact, the issue of "is the standby ahead of the
master" has come up repeatedly in potential failure scenarios; I think
we're going to need a fairly bulletproof method to determine this.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#14 Robert Haas
robertmhaas@gmail.com
In reply to: Josh Berkus (#13)
Re: xlog.c: WALInsertLock vs. WALWriteLock

On Tue, Oct 26, 2010 at 3:00 PM, Josh Berkus <josh@agliodbs.com> wrote:

I agree that the standby might get ahead, but this doesn't necessarily
lead to database corruption. Here, the interesting case is what happens
when the primary fails, which can lead to *either* of the following two
cases:
1) The standby, due to some triggering mechanism, becomes the new
primary. In this case, even if the standby was ahead, it's fine.
2) The primary comes back as primary. In this case, the standby will
connect again to the primary. At this point, *if* somehow we are able to
detect that the standby is ahead, then we should abort the standby and
create a standby from scratch.

Yes.  And we weren't able to implement that for 9.0.  It's worth
revisiting for 9.1.  In fact, the issue of "is the standby ahead of the
master" has come up repeatedly in potential failure scenarios; I think
we're going to need a fairly bulletproof method to determine this.

Agreed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15 Markus Wanner
markus@bluegap.ch
In reply to: Alvaro Herrera (#6)
Re: xlog.c: WALInsertLock vs. WALWriteLock

On 10/26/2010 05:52 PM, Alvaro Herrera wrote:

And horrible for performance, I imagine. Those locks are highly trafficked.

Note, however, that offloading this to the file-system just moves
congestion there. So we are effectively saying that we expect
filesystems to do a better job (in that aspect) than our WAL implementation.

(Note that I'm not claiming that is or is not true - I didn't measure).

Regards

Markus Wanner

#16 Alvaro Herrera
alvherre@commandprompt.com
In reply to: Markus Wanner (#15)
Re: xlog.c: WALInsertLock vs. WALWriteLock

Excerpts from Markus Wanner's message of Wed Oct 27 11:44:20 -0300 2010:

On 10/26/2010 05:52 PM, Alvaro Herrera wrote:

And horrible for performance, I imagine. Those locks are highly trafficked.

Note, however, that offloading this to the file-system just moves
congestion there. So we are effectively saying that we expect
filesystems to do a better job (in that aspect) than our WAL implementation.

Well, you can just read at your pace from the filesystem; the data is
going to stay there for a long time. WAL buffers are constantly moving,
and aren't as big.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#17 Fujii Masao
masao.fujii@gmail.com
In reply to: fazool mein (#8)
Re: xlog.c: WALInsertLock vs. WALWriteLock

On Wed, Oct 27, 2010 at 3:03 AM, fazool mein <fazoolmein@gmail.com> wrote:

Might I suggest adopting the same technique walsender does, ie just read
the data back from disk?  There's a reason why we gave up trying to have
walsender read directly from the buffers.

That is exactly what I do not want to do, i.e. read from disk, as long as
the piece of WAL is available in the buffers.

I previously implemented a patch that makes walsender read WAL from the
buffer without holding either WALInsertLock or WALWriteLock. That might be
helpful for you. Please see the following post.
http://archives.postgresql.org/pgsql-hackers/2010-06/msg00661.php

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center