CeBit
Is anyone on this list in Hannover for CeBit? Maybe we could arrange a
meeting.
Michael
--
Michael Meskes
Michael@Fam-Meskes.De
Go SF 49ers! Go Rhein Fire!
Use Debian GNU/Linux! Use PostgreSQL!
Hello,
maybe I missed something, but over the last few days I have been thinking
about how I would write my own SQL server. I got several ideas, and because
they are not used in PG they are probably bad, but I can't figure out why.
1) WAL
We have a buffer manager, ok. So why not make WAL part of it and, instead
of logging INSERT/UPDATE/DELETE xlog records, log the changes to buffer
pages directly? When someone dirties a page, it has to inform the bmgr about
the dirty region, and the bmgr would formulate the xlog record. The record
could be, for example, a fixed bitmap where each bit corresponds to a part
of the page (of size pgsize/no-of-bits) which was changed, followed by the
changed regions themselves.
Multiple writes (by multiple backends) can be coalesced as long as their
transactions overlap and there is enough memory to keep the changed
buffer pages in memory.
Pros: upper layers can assume that buffers are always safe/logged, and
there is no special handling for indices; very simple/fast redo
Cons: can't implement undo - but with non-overwriting storage it is not needed (?)
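To make the proposed record format concrete, here is a minimal Python sketch of a bitmap-plus-regions record and its redo. The 8192-byte page, the 64-bit bitmap, and the names make_record and redo are assumptions chosen for illustration; nothing here is taken from PostgreSQL's actual xlog code.

```python
import struct

PAGE_SIZE = 8192              # assumed page size
NBITS = 64                    # assumed bitmap width: one bit per region
REGION = PAGE_SIZE // NBITS   # 128 bytes per region


def make_record(page, dirty):
    """Build a record from a page image and the (offset, length) ranges
    that callers reported as dirty: 8-byte bitmap, then changed regions."""
    bitmap = 0
    for offset, length in dirty:
        first = offset // REGION
        last = (offset + length - 1) // REGION
        for i in range(first, last + 1):
            bitmap |= 1 << i
    chunks = [page[i * REGION:(i + 1) * REGION]
              for i in range(NBITS) if (bitmap >> i) & 1]
    return struct.pack("<Q", bitmap) + b"".join(chunks)


def redo(page, record):
    """Apply a record to a page image (bytearray) in place."""
    (bitmap,) = struct.unpack_from("<Q", record, 0)
    pos = 8
    for i in range(NBITS):
        if (bitmap >> i) & 1:
            page[i * REGION:(i + 1) * REGION] = record[pos:pos + REGION]
            pos += REGION


old = bytes(PAGE_SIZE)
new = bytearray(old)
new[10:14] = b"abcd"          # falls in region 0
new[5000:5003] = b"xyz"       # falls in region 39 (5000 // 128)
record = make_record(bytes(new), [(10, 4), (5000, 3)])

recovered = bytearray(old)    # page image as found after a crash
redo(recovered, record)
```

Redo here is a plain byte copy with no knowledge of tuples or indices, which is the simplicity claimed under "Pros". The record carries only after-images, which is why undo is not possible from it.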
2) SHM vs. MMAP
Why not use mmap to share pages (instead of shm)? There would be no
problem with tuning pg's buffer cache size: it is balanced by the OS.
When using SHM there are often two copies of a page: one in the OS page
cache and one in SHM (a waste of memory).
When using mmap the data goes (almost) directly from the HDD into your
memory page; at present you need to copy it from the OS page to PG's page.
There is one problem: how to ensure that a dirtied page is not flushed
before its xlog. One can use mlock, but you often need root privileges to
use it. Another way is to implement our own COW (copy on write) to create
intermediate buffers used only until the xlog is flushed.
Are these considerations correct?
regards, devik
2) SHM vs. MMAP
Why not use mmap to share pages (instead of shm)? There would be no
problem with tuning pg's buffer cache size: it is balanced by the OS.
When using SHM there are often two copies of a page: one in the OS page
cache and one in SHM (a waste of memory).
When using mmap the data goes (almost) directly from the HDD into your
memory page; at present you need to copy it from the OS page to PG's page.
There is one problem: how to ensure that a dirtied page is not flushed
before its xlog. One can use mlock, but you often need root privileges to
use it. Another way is to implement our own COW (copy on write) to create
intermediate buffers used only until the xlog is flushed.
This was brought up a week ago, and I consider it an interesting idea.
The only problem is that we would no longer have control over which
pages made it to disk. The OS would perhaps write pages as we modified
them. Not sure how important that is.
The good news is that most/all OSes are smart enough that if two
processes mmap() the same file, they see each other's changes, so in a
sense it is shared memory, but a much larger, smarter pool of shared
memory than what we have now. We would still need buffer headers and
stuff because we need to synchronize access to the buffers.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
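The sharing behaviour described in this message can be observed with a small Python sketch. It uses two shared mappings of one temporary file inside a single process as a stand-in for two backends; the temporary file and the variable names are hypothetical.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

# Hypothetical stand-in for a relation file: one zero-filled page.
fd, path = tempfile.mkstemp()
os.write(fd, bytes(PAGE))

# Two independent shared, writable mappings of the same file.
a = mmap.mmap(fd, PAGE, access=mmap.ACCESS_WRITE)
b = mmap.mmap(fd, PAGE, access=mmap.ACCESS_WRITE)

a[0:5] = b"hello"           # "backend A" modifies the page
seen_by_b = bytes(b[0:5])   # "backend B" reads it; no write() or flush involved

a.close()
b.close()
os.close(fd)
os.remove(path)
```

The change is visible through the second mapping immediately, but nothing in the sketch says when the kernel writes the page back to the file, which is the loss of control mentioned above.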
This was brought up a week ago, and I consider it an interesting idea.
The only problem is that we would no longer have control over which
pages made it to disk. The OS would perhaps write pages as we modified
them. Not sure how important that is.
Yes. As I work on the Linux kernel I know something about it. When a page
is accessed the CPU sets a bit in the PTE. The OS writes the page when it
needs the page frame. It also tries to launder pages periodically, but the
actual algorithm changes too often in recent kernels ;-)
Also, a page write is not atomic: several buffer heads are filled for the
page and asynchronously posted for write. The elevator then sorts and
coalesces these buffer heads and creates the actual scsi/ide write
requests. But there is no guarantee that buffer heads from one page will
be coalesced into one write request ...
You can call mlock (VirtualLock on Win32) to lock a page in memory. You can
postpone the write using it. It is ok under Win32 and many unices, but under
Linux only the admin or a process with CAP_IPC_LOCK can mlock.
The good news is that most/all OSes are smart enough that if two
processes mmap() the same file, they see each other's changes, so in a
sense it is shared memory, but a much larger, smarter pool of shared
memory than what we have now.
Yes, when using the SHARED flag to mmap, IMHO it is mandatory for an OS.
We would still need buffer headers and
stuff because we need to synchronize access to the buffers.
Also some smart algorithm which tries to mmap several pages in one
contiguous block. You can mmap each page on its own, but OSes store mmap
information per page range. You need to minimize the number of such ranges.
devik
Bruce Momjian <pgman@candle.pha.pa.us> writes:
The only problem is that we would no longer have control over which
pages made it to disk. The OS would perhaps write pages as we modified
them. Not sure how important that is.
Unfortunately, this alone is a *fatal* objection. See nearby
discussions about WAL behavior: we must be able to control the relative
timing of WAL write/flush and data page writes.
regards, tom lane
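The constraint referred to here is usually stated as the write-ahead rule: a data page may go to disk only after the log records describing its changes are durable. Below is a minimal Python sketch of how a buffer manager can enforce that rule when it issues the page writes itself. The Log and Page classes, modify, write_page, and the events list are hypothetical and are not PostgreSQL's API.

```python
class Log:
    def __init__(self):
        self.records = []
        self.flushed_lsn = 0      # records with LSN <= this are durable

    def append(self, payload):
        self.records.append(payload)
        return len(self.records)  # LSN = 1-based position in the log

    def flush(self, upto):
        # Stands in for write() + fsync() of the log up to `upto`.
        self.flushed_lsn = max(self.flushed_lsn, upto)


class Page:
    def __init__(self):
        self.data = b""
        self.lsn = 0              # LSN of the last record that changed the page


events = []                       # order in which things reach "disk"


def modify(log, page, data):
    page.lsn = log.append(data)   # log first, remember the LSN on the page
    page.data = data


def write_page(log, page):
    # Write-ahead rule: the log must be durable up to page.lsn first.
    if log.flushed_lsn < page.lsn:
        log.flush(page.lsn)
        events.append(("log", page.lsn))
    events.append(("page", page.lsn))


log, page = Log(), Page()
modify(log, page, b"v1")
write_page(log, page)
```

With mmap the kernel may write a dirty page on its own schedule without passing through anything like write_page, so the ordering check cannot be applied. That is the objection raised above.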
Bruce Momjian <pgman@candle.pha.pa.us> writes:
The only problem is that we would no longer have control over which
pages made it to disk. The OS would perhaps write pages as we modified
them. Not sure how important that is.
Unfortunately, this alone is a *fatal* objection. See nearby
discussions about WAL behavior: we must be able to control the relative
timing of WAL write/flush and data page writes.
Bummer.
--
Bruce Momjian
On Wed, Mar 07, 2001 at 11:21:37AM -0500, Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
The only problem is that we would no longer have control over which
pages made it to disk. The OS would perhaps write pages as we modified
them. Not sure how important that is.
Unfortunately, this alone is a *fatal* objection. See nearby
discussions about WAL behavior: we must be able to control the relative
timing of WAL write/flush and data page writes.
Not so fast!
It is possible to build a logging system so that you mostly don't care
when the data blocks get written; a particular data block on disk is
considered garbage until the next checkpoint, so that you might as well
allow the blocks to be written any time, even before the log entry.
Letting the OS manage sharing of disk block images via mmap should be
an enormous win vs. a fixed shm and manual scheduling by PG. If that
requires changes in the logging protocol, it's worth it.
(What supported platforms don't have mmap?)
Nathan Myers
ncm@zembu.com
It is possible to build a logging system so that you mostly don't care
when the data blocks get written; a particular data block on disk is
considered garbage until the next checkpoint, so that you
How to know if a particular data page was modified if there is no
log record for that modification?
(Ie how to know where is garbage? -:))
might as well allow the blocks to be written any time,
even before the log entry.
And what to do with index tuples pointing to unupdated heap pages
after that?
Vadim
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> wrote in message
news:8F4C99C66D04D4118F580090272A7A234D32FA@sectorbase1.sectorbase.com...
It is possible to build a logging system so that you mostly don't care
when the data blocks get written; a particular data block on disk is
considered garbage until the next checkpoint, so that you
How to know if a particular data page was modified if there is no
log record for that modification?
(Ie how to know where is garbage? -:))
You could store a log sequence number in the data page header that indicates
the log address of the last log record that was applied to the page. This is
described in Bernstein and Newcomer's book (sec 8.5 operation logging).
Sorry if I'm misunderstanding the question. Back to lurking mode...
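A minimal Python sketch of how a page LSN is typically used during redo, assuming a simple in-memory model: a log record is applied only if its LSN is newer than the LSN stored in the page header. The dict layout and the redo function are illustrative assumptions, not code from the book or from PostgreSQL.

```python
def redo(pages, log):
    """pages: {page_no: {"lsn": int, "data": str}}
    log: list of (lsn, page_no, data) in LSN order.
    Returns the LSNs that actually had to be re-applied."""
    applied = []
    for lsn, page_no, data in log:
        page = pages[page_no]
        if page["lsn"] >= lsn:
            continue              # this change reached disk before the crash
        page["data"] = data
        page["lsn"] = lsn
        applied.append(lsn)
    return applied


# State found on disk after a crash.
pages = {
    1: {"lsn": 2, "data": "b"},   # was written after log record 2
    2: {"lsn": 0, "data": ""},    # never written
}
log = [(1, 1, "a"), (2, 1, "b"), (3, 2, "c"), (4, 1, "d")]
applied = redo(pages, log)
```

Records 1 and 2 are skipped because page 1 already carries LSN 2, and records 3 and 4 are replayed. The check only works for pages that the log mentions, so it does not by itself identify a page written with no log record at all.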
1) WAL
We have buffer manager, ok. So why not to use WAL as part of
it and don't log INSERT/UPDATE/DELETE xlog records but directly
changes into buffer pages ? When someone dirties page it has to
inform bmgr about dirty region and bmgr would formulate xlog record.
The record could be for example fixed bitmap where each bit corresponds
to part of page (of size pgsize/no-of-bits) which was changed. These
changed regions follows. Multiple writes (by multiple backends) can be
coalesced together as long as their transactions overlaps and there is
enough memory to keep changed buffer pages in memory.
Pros: upper layers can think that buffers are always safe/logged and
there is no special handling for indices; very simple/fast redo
Cons: can't implement undo - but in non-overwriting is not needed (?)
But needed if we want to get rid of vacuum and have savepoints.
Vadim
Bruce Momjian <pgman@candle.pha.pa.us> writes:
The only problem is that we would no longer have control over which
pages made it to disk. The OS would perhaps write pages as we modified
them. Not sure how important that is.
Unfortunately, this alone is a *fatal* objection. See nearby
discussions about WAL behavior: we must be able to control the relative
timing of WAL write/flush and data page writes.
Bummer.
BTW, what means "bummer" ?
But for many OSes you CAN control when to write data - you can mlock
individual pages.
On Thu, 8 Mar 2001, Martin Devera wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Unfortunately, this alone is a *fatal* objection. See nearby
discussions about WAL behavior: we must be able to control the relative
timing of WAL write/flush and data page writes.
Bummer.
BTW, what means "bummer" ?
It's a Postgres-specific extension to the SQL standard. It means "I am
disappointed". As far as I can tell, you _may_ use it as a column or table
name. :-)
Tim
--
-----------------------------------------------
Tim Allen tim@proximity.com.au
Proximity Pty Ltd http://www.proximity.com.au/
http://www4.tpg.com.au/users/rita_tim/
Bruce Momjian <pgman@candle.pha.pa.us> writes:
The only problem is that we would no longer have control over which
pages made it to disk. The OS would perhaps write pages as we modified
them. Not sure how important that is.
Unfortunately, this alone is a *fatal* objection. See nearby
discussions about WAL behavior: we must be able to control the relative
timing of WAL write/flush and data page writes.
Bummer.
BTW, what means "bummer" ?
Sorry, it means, "Oh, I am disappointed."
But for many OSes you CAN control when to write data - you can mlock
individual pages.
mlock() controls locking in physical memory. I don't see it controlling
write().
--
Bruce Momjian
BTW, what means "bummer" ?
Sorry, it means, "Oh, I am disappointed."
thanks :)
But for many OSes you CAN control when to write data - you can mlock
individual pages.
mlock() controls locking in physical memory. I don't see it controlling
write().
When you mmap, you don't use write()!
mlock actually locks the page in memory, and as long as the page is locked
the OS doesn't attempt to store the dirty page. It is also intended for
security apps, to ensure that sensitive data is not written to insecure
storage (hdd). That is the definition of mlock, so you can probably rely
on it.
There is a way to do it without mlock (fallback):
You definitely need some kind of page headers. The header should say
whether the page can be mmapped or is in the "dirty pool". Pages in the
dirty pool are pages which are dirty but not yet written, and are waiting
for the appropriate log record to be flushed. When the log is flushed, the
data in the dirty pool can be copied to its regular mmap location and
discarded. If the dirty pool grows too large, simply sync the log and the
whole pool can be discarded.
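The fallback described above can be sketched in Python under some simplifying assumptions: pages are identified by number, a plain dict stands in for the mmap'ed file, and log_flushed is called by whoever syncs the log. The DirtyPool class and all of its method names are hypothetical.

```python
class DirtyPool:
    def __init__(self, limit):
        self.limit = limit
        self.pending = {}       # page_no -> (lsn, private copy of the page)
        self.flushed_lsn = 0
        self.mapped = {}        # stands in for the mmap'ed file pages

    def write(self, page_no, lsn, data):
        # Modifications go to a private copy, never straight to the mapping,
        # so the OS cannot write them out before the log record is flushed.
        self.pending[page_no] = (lsn, data)
        if len(self.pending) > self.limit:
            # Pool too large: this call stands in for syncing the log,
            # after which every pending page may move to the mapping.
            self.log_flushed(max(lsn_ for lsn_, _ in self.pending.values()))

    def read(self, page_no):
        if page_no in self.pending:
            return self.pending[page_no][1]
        return self.mapped.get(page_no)

    def log_flushed(self, lsn):
        # The log is durable up to `lsn`: release pages covered by it.
        self.flushed_lsn = max(self.flushed_lsn, lsn)
        ready = [p for p, (lsn_, _) in self.pending.items()
                 if lsn_ <= self.flushed_lsn]
        for page_no in ready:
            _, data = self.pending.pop(page_no)
            self.mapped[page_no] = data


pool = DirtyPool(limit=2)
pool.write(1, 10, b"A")
pool.write(2, 11, b"B")
before_flush = dict(pool.mapped)   # nothing has reached the mapping yet
pool.log_flushed(10)
after_flush = dict(pool.mapped)    # only page 1 is covered by the flush
pool.write(3, 12, b"C")
pool.write(4, 13, b"D")            # exceeds the limit, forces a log sync
```

Readers must check the pool before the mapping, which is the role the page header plays in the description above.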
The mmap version could be faster when loading data from the hdd and will
result in better utilization of memory (because you are working directly
with data in the OS page cache instead of having duplicates in pg's buffer
cache). Also, page cache expiration is handled by the OS, which allows pg to
use as much memory as is available (no need to specify the buffer cache size).
devik
Pros: upper layers can think that buffers are always safe/logged and
there is no special handling for indices; very simple/fast redo
Cons: can't implement undo - but in non-overwriting is not needed (?)
But needed if we want to get rid of vacuum and have savepoints.
Hmm. How do you implement savepoints? When there is a rollback to a
savepoint, do you use the xlog to undo all changes which that transaction
has made? Hmmm, it seems nice ... these records are locked by that
transaction, so it can safely undo them :-)
Am I right?
But how can you use the xlog to get rid of vacuum? Do you treat all delete
log records as candidates for free space?
regards, devik
But needed if we want to get rid of vacuum and have savepoints.
Hmm. How do you implement savepoints ? When there is rollback
to savepoint do you use xlog to undo all changes which the particular
transaction has done ? Hmmm it seems nice ... these records are locked by
such transaction so that it can safely undo them :-)
Am I right ?
Yes, but there are no savepoints in 7.1 - hopefully in 7.2
But how can you use xlog to get rid of vacuum ? Do you treat
all delete log records as candidates for free space ?
Vacuum removes deleted records *and* records inserted by aborted
transactions - the latter will be removed by UNDO.
Vadim
When you mmap, you don't use write()! mlock actually locks the page in
memory, and as long as the page is locked the OS doesn't attempt to
store the dirty page. It is also intended for security apps, to
ensure that sensitive data is not written to insecure storage
(hdd). That is the definition of mlock, so you can probably rely
on it.
News to me ... can you please point to such a definition? I see no
reference to such semantics in the mlock() manual page in UNIX98
(Single Unix Standard, version 2).
mlock() guarantees that the locked address space is in memory. This
doesn't imply that updates are not written to the backing file.
I would expect an OS that doesn't have a unified buffer cache but
tries to keep a consistent view for mmap() and read()/write() to
update the file.
Regards,
Giles
When you mmap, you don't use write()! mlock actually locks the page in
memory, and as long as the page is locked the OS doesn't attempt to
store the dirty page. It is also intended for security apps, to
ensure that sensitive data is not written to insecure storage
(hdd). That is the definition of mlock, so you can probably rely
on it.
News to me ... can you please point to such a definition? I see no
reference to such semantics in the mlock() manual page in UNIX98
(Single Unix Standard, version 2).
Sorry, maybe I'm biased toward Linux. The statement above is from Linux's
man page, and as I looked into the mm code it seems to be right.
I'm not sure about other unices.
mlock() guarantees that the locked address space is in memory. This
doesn't imply that updates are not written to the backing file.
yes, probably it depends on OS in question. In Linux kernel the page is
not written when mlocked (but I'm not sure about msync here).
I would expect an OS that doesn't have a unified buffer cache but
tries to keep a consistent view for mmap() and read()/write() to
update the file.
Hmm, but why mlock the page then? Only to be sure the page is not swapped
out?
regards, devik
It is possible to build a logging system so that you
mostly don't care when the data blocks get written;
a particular data block on disk is considered garbage
until the next checkpoint, so that you
How to know if a particular data page was modified if there is no
log record for that modification?
(Ie how to know where is garbage? -:))
You could store a log sequence number in the data page header
that indicates the log address of the last log record that was
applied to the page.
We do. But how to know at the time of recovery that there is
a page in multi-Gb index file with tuple pointing to uninserted
table row?
Well, actually we could make some improvements in this area:
a buffer without "first after checkpoint" modification could be
written without flushing log records: entire block will be
rewritten on recovery. Not sure how much we get, though -:)
Vadim
Sorry for taking so long to reply...
On Wed, Mar 07, 2001 at 01:27:34PM -0800, Mikheev, Vadim wrote:
Nathan wrote:
It is possible to build a logging system so that you mostly don't care
when the data blocks get written
[after being changed, as long as they get written by an fsync];
a particular data block on disk is
considered garbage until the next checkpoint, so that you
How to know if a particular data page was modified if there is no
log record for that modification?
(Ie how to know where is garbage? -:))
In such a scheme, any block on disk not referenced up to (and including)
the last checkpoint is garbage, and is either blank or reflects a recent
logged or soon-to-be-logged change. Everything written (except in the
log) after the checkpoint thus has to happen in blocks not otherwise
referenced from on-disk -- except in other post-checkpoint blocks.
During recovery, the log contents get written to those pages during
startup. Blocks that actually got written before the crash are not
changed by being overwritten from the log, but that's ok. If they got
written before the corresponding log entry, too, nothing references
them, so they are considered blank.
might as well allow the blocks to be written any time,
even before the log entry.
And what to do with index tuples pointing to unupdated heap pages
after that?
Maybe index pages are cached in shm and copied to mmapped blocks
after it is ok for them to be written.
What platforms does PG run on that don't have mmap()?
Nathan Myers
ncm@zembu.com
Giles Lean <giles@nemeton.com.au> wrote:
When you mmap, you don't use write()! mlock actually locks the page in
memory, and as long as the page is locked the OS doesn't attempt to
store the dirty page. It is also intended for security apps, to
ensure that sensitive data is not written to insecure storage
(hdd). That is the definition of mlock, so you can probably rely
on it.
News to me ... can you please point to such a definition? I see no
reference to such semantics in the mlock() manual page in UNIX98
(Single Unix Standard, version 2).
mlock() guarantees that the locked address space is in memory. This
doesn't imply that updates are not written to the backing file.
I've wondered about this myself. It _is_ true on Linux that mlock prevents
writes to the backing store, and this is used as a security feature for
cryptography software. The code for gnupg assumes that if you have mlock()
on any operating system, it does mean this--which doesn't mean it's true,
but perhaps whoever wrote it does have good reason to think so.
But I don't know about other systems. Does anybody know what the POSIX.1b
standard says?
It was even suggested to me on the linux-fsdev mailing list that mlock() was
a good way to ensure the write-ahead condition.
Ken Hirsch
On Tue, 13 Mar 2001, Ken Hirsch wrote:
mlock() guarantees that the locked address space is in memory. This
doesn't imply that updates are not written to the backing file.
I've wondered about this myself. It _is_ true on Linux that mlock
prevents writes to the backing store,
I don't believe that this is true. The manpage offers no
such promises, and the semantics are not useful.
and this is used as a security feature for cryptography software.
mlock() is used to prevent pages being swapped out. Its
use for crypto software is essentially restricted to anon
memory (allocated via brk() or mmap() of /dev/zero).
If my understanding is accurate, before 2.4 Linux would
never swap out pages which had a backing store. It would
simply write them back or drop them (if clean). (This is
why you need around twice as much swap with 2.4.)
The code for gnupg assumes that if you have mlock() on any operating
system, it does mean this--which doesn't mean it's true, but perhaps
whoever wrote it does have good reason to think so.
strace on gpg startup says:
mmap(0, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40015000
getuid() = 500
mlock(0x40015000) = -1 EPERM (Operation not permitted)
so whatever the authors think, it does not require this semantic.
Matthew.
* Matthew Kirkwood <matthew@hairy.beasts.org> [010313 13:12] wrote:
On Tue, 13 Mar 2001, Ken Hirsch wrote:
mlock() guarantees that the locked address space is in memory. This
doesn't imply that updates are not written to the backing file.
I've wondered about this myself. It _is_ true on Linux that mlock
prevents writes to the backing store,
I don't believe that this is true. The manpage offers no
such promises, and the semantics are not useful.
Afaik FreeBSD's Linux emulator:
revision 1.13
date: 2001/02/28 04:30:27; author: dillon; state: Exp; lines: +3 -1
Linux does not filesystem-sync file-backed writable mmap pages on
a regular basis. Adjust our linux emulation to conform. This will
cause more dirty pages to be left for the pagedaemon to deal with,
but our new low-memory handling code can deal with it. The linux
way appears to be a trend, and we may very well make MAP_NOSYNC the
default for FreeBSD as well (once we have reasonable sequential
write-behind heuristics for random faults).
(will be MFC'd prior to 4.3 freeze)
Suggested by: Andrew Gallatin
Basically any mmap'd data doesn't seem to get sync()'d out on
a regular basis.
and this is used as a security feature for cryptography software.
mlock() is used to prevent pages being swapped out. Its
use for crypto software is essentially restricted to anon
memory (allocated via brk() or mmap() of /dev/zero).
What about userland device drivers that want to send parts
of a disk backed file to a driver's dma routine?
--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/
On Tue, 13 Mar 2001, Alfred Perlstein wrote:
[..]
Linux does not filesystem-sync file-backed writable mmap pages on a
regular basis.
Very interesting. I'm not sure that is necessarily the case in
2.4, though -- my understanding is that the new all-singing,
all-dancing page cache makes very little distinction between
mapped and unmapped dirty pages.
Basically any mmap'd data doesn't seem to get sync()'d out on
a regular basis.
Hmm.. I'd call that a bug, anyway.
and this is used as a security feature for cryptography software.
mlock() is used to prevent pages being swapped out. Its
use for crypto software is essentially restricted to anon
memory (allocated via brk() or mmap() of /dev/zero).
What about userland device drivers that want to send parts
of a disk backed file to a driver's dma routine?
And realtime software. I'm not disputing that mlock is useful,
but what it can do for security software is not that huge. The
Linux manpage says:
Memory locking has two main applications: real-time algorithms
and high-security data processing.
Matthew.
Michael Meskes wrote:
Is anyone on this list in Hannover for CeBit? Maybe we could arrange a
meeting.
It looks like I'll still be in Hamburg by then. Which
days did you plan?
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #