Re: [QUESTIONS] Does Storage Manager support >2GB tables?

Started by The Hermit Hacker, almost 28 years ago · 20 messages
#1 The Hermit Hacker
scrappy@hub.org

Redirected to 'the proper list' - pgsql-hackers@postgresql.org

On Wed, 11 Mar 1998, Chris Albertson wrote:

Also, is anyone working on storage managers? I was thinking that
a raw partition manager would be good to have. Could be faster
than one that uses the file system. Give it two partitions and
it could do striping and gain some real speed.

striping can be done at the operating system level to give you
that 'boost'...and Oracle, in fact, moved away from the raw partition
level to just using the Unix file system...I believe it would
overcomplicate the backend, and give a negligible boost in performance, if
we had to build a 'low level drive interface'...

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

#2 Bruce Momjian
maillist@candle.pha.pa.us
In reply to: The Hermit Hacker (#1)

I looked at the Magnetic Disk Storage manager code and it appears
that it at least tries to support tables larger than the maximum
file size. I heard some talk, however, about this feature being either
broken or untested for many years. Anyone know the real status?

It is broken, but we are not sure how. Someone recently got over 2GB,
and the table got messed up.

Yes I can test it myself but I need to buy another drive first so
I can have 6GB of free space.

Also, is anyone working on storage managers? I was thinking that
a raw partition manager would be good to have. Could be faster
than one that uses the file system. Give it two partitions and
it could do striping and gain some real speed.

No one is working on storage managers that I know of. The code works,
but it just uses the default storage manager.

Next question: is the storage manager switch working?

Yes, I believe that code has not changed recently.

I'll need this to work in about a year. I'll do the work
myself if that option is best but I would not want to put in the
effort unless others would use it.

No one has discussed working on that aspect, but some people have
suggested a raw storage manager.

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
#3 Chris Albertson
chris@topdog.pas1.logicon.com
In reply to: The Hermit Hacker (#1)

The Hermit Hacker wrote:

Redirected to 'the proper list' - pgsql-hackers@postgresql.org

On Wed, 11 Mar 1998, Chris Albertson wrote:

Also, is anyone working on storage managers? I was thinking that
a raw partition manager would be good to have. Could be faster
than one that uses the file system. Give it two partitions and
it could do striping and gain some real speed.

striping can be done at the operating system level to give you
that 'boost'...and Oracle, in fact, moved away from the raw partition
level to just using the Unix file system...I believe it would
overcomplicate the backend, and give a negligible boost in performance, if
we had to build a 'low level drive interface'...

I know you must have looked at far more PostgreSQL code than I have, but
I was browsing the storage manager. Apparently it is fairly easy to
assign a class to a manager, as each class is tagged in the system catalog
with a storage method. What I really want is a >2GB table. I was trying
to see if this was supported by reading the source. Looks like it may be.
The note in the To Do list includes testing. I would test it but for
lack of disk space. (I'll have more in a while.)

I need the >2GB feature badly enough that I'd implement it myself. My thought
was that it may be easier to write a new manager than to understand and fix
a broken one. A manager is just given a class name and block number and
told to either fetch or store it. (Well, not quite so simple, but close.)

I don't think it needs to look inside the 8K (adjustable now) blocks.
Anyway, if I wrote such a beast my real motivation would be to have big
tables. Faster big tables would be just a plus. What I really hope for is
that somebody else fixes the existing code :-)
--
--Chris Albertson

chris@topdog.logicon.com Voice: 626-351-0089 X127
Logicon RDA, Pasadena California Fax: 626-351-0699

#4 Noname
dg@illustra.com
In reply to: The Hermit Hacker (#1)
Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

Redirected to 'the proper list' - pgsql-hackers@postgresql.org

On Wed, 11 Mar 1998, Chris Albertson wrote:

Also, is anyone working on storage managers? I was thinking that
a raw partition manager would be good to have. Could be faster
than one that uses the file system. Give it two partitions and
it could do striping and gain some real speed.

striping can be done at the operating system level to give you
that 'boost'...and Oracle, in fact, moved away from the raw partition
level to just using the Unix file system...I believe it would
overcomplicate the backend, and give a negligible boost in performance, if
we had to build a 'low level drive interface'...

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

I have had the pleasure to work on the guts of one of the major databases
raw partition storage managers over the last ten years (hint, not my
current domain), and guess what? It implements a file system. And, not a
particularly good filesystem at that. Think about something like "FAT",
but not quite that nice. It was also a major source of pain in that it
was complex, heavily concurrent, and any errors show up as massive data
loss or corruption. Be careful what you wish for.

Most of the supposed benefit comes from integrating the buffer cache
management and the write-ahead log so that you can defer or avoid I/O (as
long as the log records get to disk, there is no reason to ever write the
data page unless you need the buffer for something else). You can also
convert random I/O to semi-sequential I/O if most writes are done by a page
cleaner or by a checkpoint, as this gives you lots of I/O to sort.

I don't know the current state of Postgres so I cannot comment on it, but at
least with Illustra, the lack of a traditional write-ahead-log style
transaction system was a major performance hit, as it forced an fsync at
every commit. A good WAL system gets many commits per log I/O, but
Illustra was stuck with many writes per transaction. If Postgres still does
this (and the recent elimination of time travel suggests that it might not),
it would be well worth fixing.
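The "many commits per log I/O" behavior dg describes is what is now usually called group commit: transactions append their commit records to a shared log buffer, and a single fsync of that buffer commits every record in it. A toy sketch of the accounting (names and structure are illustrative, not Illustra or Postgres code):

```c
/* Group-commit sketch: one log flush serves every transaction that
 * queued a commit record since the previous flush, so the cost of the
 * log I/O is amortized over all of them. */
#define MAXBATCH 64

struct log_buffer {
    int pending[MAXBATCH];  /* xids waiting for the next log flush */
    int npending;
    int flushes;            /* how many log I/Os were actually issued */
};

void log_commit_record(struct log_buffer *lb, int xid)
{
    lb->pending[lb->npending++] = xid;
}

/* Returns how many transactions this single flush durably committed.
 * In a real system the body would be an fsync() of the log file. */
int flush_log(struct log_buffer *lb)
{
    int committed = lb->npending;
    lb->flushes++;
    lb->npending = 0;
    return committed;
}
```

Three transactions committing between flushes cost one log I/O instead of three fsyncs, which is the gap dg is pointing at.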

A last point: the raw-disk, implement-our-own-filesystem architecture used
by some systems is much more compelling if the filesystems are slow and
inflexible and the filesystem caching is ineffective. These things were
more true back in the early 80's when these systems were being designed.
Things are not as bad now; in particular, ext2 has quite good performance.

Sorry for the ramble...

-dg

David Gould dg@illustra.com 510.628.3783 or 510.305.9468
Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
- I realize now that irony has no place in business communications.

#5 Zeugswetter Andreas
andreas.zeugswetter@telecom.at
In reply to: Noname (#4)
AW: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

Redirected to 'the proper list' - pgsql-hackers@postgresql.org

On Wed, 11 Mar 1998, Chris Albertson wrote:

Also, is anyone working on storage managers? I was thinking that
a raw partition manager would be good to have. Could be faster
than one that uses the file system. Give it two partitions and
it could do striping and gain some real speed.

striping can be done at the operating system level to give you
that 'boost'...and Oracle, in fact, moved away from the raw partition
level to just using the Unix file system...I believe it would
overcomplicate the backend, and give a negligible boost in performance, if
we had to build a 'low level drive interface'...

Actually, Oracle *is* suggesting the use of raw devices for heavy-I/O databases.
Raw devices have the advantage that the operating system does not cache
the pages. This can at once be a drastic disadvantage if the database does not
have its own read/write optimization (only read/write whole pages, or better, multiples thereof),
an intelligent buffer cache, and intelligent read-ahead (here only the db engine can be really intelligent about it).
If you use file system files and have the above features in the database engine,
then all pages will be in memory twice, once in the file system buffer cache and
once in the database cache. This results in only half of the pages being cached and also
adds the additional overhead of two memcpy's.

I think the first step in the direction of raw devices needs to be the implementation of the
tablespace idea, i.e. more than one table or index in one file. On the driver side,
if there is a logical volume manager, then accessing a raw device is the same as
accessing a file system file (read/write/seek).

Andreas

#6 Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Noname (#4)
Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

I have had the pleasure to work on the guts of one of the major databases
raw partition storage managers over the last ten years (hint, not my
current domain), and guess what? It implements a file system. And, not a
particularly good filesystem at that. Think about something like "FAT",
but not quite that nice. It was also a major source of pain in that it
was complex, heavily concurrent, and any errors show up as massive data
loss or corruption. Be careful what you wish for.

Interesting.

Most of the supposed benefit comes from integrating the buffer cache
management and the write-ahead log so that you can defer or avoid I/O (as
long as the log records get to disk, there is no reason to ever write the
data page unless you need the buffer for something else). You can also
convert random I/O to semi-sequential I/O if most writes are done by a page
cleaner or by a checkpoint, as this gives you lots of I/O to sort.

Yes, I have heard that the standard file system read-ahead is often
useless for a database. On a raw partition you know the next block
that is going to be requested, so you can prefetch there rather than
having the file system prefetch the next sequential block.

It is also nice to control what gets written to disk/fsync'ed and what
doesn't get fsync'ed.

I don't know the current state of Postgres so I cannot comment on it, but at
least with Illustra, the lack of a traditional write-ahead-log style
transaction system was a major performance hit, as it forced an fsync at
every commit. A good WAL system gets many commits per log I/O, but
Illustra was stuck with many writes per transaction. If Postgres still does
this (and the recent elimination of time travel suggests that it might not),
it would be well worth fixing.

Our idea is to control when pg_log gets written to disk. We keep active
pg_log pages in shared memory, and every 30-60 seconds, we make a memory
copy of the current pg_log active pages, do a system sync() (which
happens anyway at that interval), update the pg_log file with the saved
changes, and fsync() the pg_log pages to disk. That way, after a crash,
the current database only shows transactions as committed where we are
sure all the data has made it to disk.

I have a more detailed posting if you are interested.
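The flush cycle Bruce describes (copy the active pg_log pages, sync the data files, then persist the commit bits) can be sketched in a few lines. This is an illustrative simulation under assumed names, not the actual backend code; the key property is the ordering: data pages reach disk *before* the commit bits that would make their transactions visible after a crash.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define LOGPAGE 512

/* Sketch of the proposed periodic flush:
 *   1. snapshot the active in-memory pg_log pages (under the lock),
 *   2. sync() so dirty data pages go to disk,
 *   3. write the snapshot to the pg_log file,
 *   4. fsync() so the commit bits themselves are durable. */
int flush_pg_log(const char *path, const char *active_pages, size_t len)
{
    char snapshot[4 * LOGPAGE];
    if (len > sizeof snapshot)
        return -1;
    memcpy(snapshot, active_pages, len);   /* 1. copy while locked   */

    sync();                                /* 2. push data pages out */

    FILE *f = fopen(path, "r+b");
    if (f == NULL)
        f = fopen(path, "w+b");
    if (f == NULL)
        return -1;
    fwrite(snapshot, 1, len, f);           /* 3. record the commits  */
    fflush(f);
    fsync(fileno(f));                      /* 4. make them durable   */
    fclose(f);
    return 0;
}
```

Because the snapshot was taken before the sync(), any transaction it marks committed had all its data in the sync pass, which is exactly the crash guarantee the paragraph above claims.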

A last point: the raw-disk, implement-our-own-filesystem architecture used
by some systems is much more compelling if the filesystems are slow and
inflexible and the filesystem caching is ineffective. These things were
more true back in the early 80's when these systems were being designed.
Things are not as bad now; in particular, ext2 has quite good performance.

Sorry for the ramble...

No ramble at all. It is not every day we get someone with real-world
experience in changing from a filesystem to a raw partition database
storage manager.

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
#7 Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Noname (#4)
Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

Here is an archive of the pg_log discussion.

---------------------------------------------------------------------------

From: Bruce Momjian <maillist@candle.pha.pa.us>
Message-Id: <199711170542.AAA24561@candle.pha.pa.us>
Subject: [HACKERS] Buffered logging/pg_log
To: hackers@postgreSQL.org (PostgreSQL-development)
Date: Mon, 17 Nov 1997 00:42:18 -0500 (EST)
Cc: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)

Here is my current idea for doing buffered logging; it sits between
the normal fsync-on-every-transaction and no-fsync options. I believe
it will be very popular, because it mimics the Unix file system
reliability structure.

---------------------------------------------------------------------------

On startup, the postmaster makes a copy of pg_log, called pg_log_live.
Each postgres backend mmap()s this new file into its address space. A
lock is gotten to make changes to the file. All backends use pg_log_live
rather than pg_log. Only the postmaster writes to pg_log. (I will
someday remove the exec() from the postmaster, so backends will get this
address space automatically.)

The first 512 bytes of pg_log and pg_log_live are used for log management
information. We add a new field to pg_log_live called min_xid_commit,
which records the lowest transaction id that any backend has committed
since the start of the last sync pass of the postmaster. We also add
fields to record the current pg_variable oid and xid at the same time. (xid
may have to be moved into pg_variable so backends can fsync it (see
below).)

Every 60-90 seconds, the postmaster gets a write lock on pg_log, takes
the minimum xid set in the start of pg_log, and resets its value. It
records the current oid and xid from pg_variable. It then clears the
lock, and starts reading from the minimum recorded xid changed to the
end of pg_log_live, and copies it into allocated memory. It then does a
sync (twice?), waits for completion, and then writes the pg_log_live
partial copy it made to pg_log. We update the copies of oid and xid we
saved before the sync to the bottom of pg_log_live.

We can change the 60-90 seconds to be longer, but the system does it
every 30 seconds anyway.

When the postmaster stops, it does this same operation before shutting
down, and pg_log_live is removed.

We make a copy of the current xid and oid in the front of pg_log_live,
so that if the postmaster starts up, and pg_log_live exists, the
postmaster adds 10,000 to xid and oid of pg_variable, so no previously
used but unsynced values are used.

We know that the current values of pg_variable could not have been
exceeded by 10,000, because each backend consults the pg_log copies of
these variables to make sure they do not exceed 10,000 from the value
before the last sync. They exceed those values only by fsync'ing every
10,000 increments.

Said another way, if a postgres backend exceeds the pg_log last xid or
oid of pg_log, or any 10,000 multiple, it must fsync the change to
pg_variable. This way, a crash skips over any unsynced oid/xid's used,
and this is done without having to keep fsyncing pg_variable. In most
cases, the 10,000 will never be exceeded by a backend before the
postmaster does a sync and increases the last xid/oid again.
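The 10,000-step rule in the two paragraphs above reduces to two small functions. This is a sketch with illustrative names, not the proposed patch; the constant is the one from the text.

```c
/* Sketch of the proposed rule: a backend may consume xids freely up to
 * 10,000 past the value last recorded in pg_log; going further forces
 * an fsync of pg_variable first.  Therefore, after a crash, adding
 * 10,000 to the last recorded value provably skips every xid that may
 * have been handed out but never synced. */
#define XID_SYNC_STEP 10000UL

int must_fsync_pg_variable(unsigned long xid, unsigned long last_saved_xid)
{
    return xid >= last_saved_xid + XID_SYNC_STEP;
}

unsigned long restart_xid_after_crash(unsigned long last_saved_xid)
{
    return last_saved_xid + XID_SYNC_STEP;  /* skip possibly-used values */
}
```

The same arithmetic applies unchanged to oids, which is why the text treats the two together.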

I think this is a very clean way to give us no-fsync performance with
full-rollback buffered logging. The specification is clean and almost
complete enough for coding.

I think this gives us what we need, by having a mmap'ed() pg_log_live,
which backends can use, and a postmaster-controlled pg_log, which is
used on startup, with xid/oid controls in a crash situation to skip over
partially committed transactions.

Comments?

--
Bruce Momjian
maillist@candle.pha.pa.us

---------------------------------------------------------------------------

Message-ID: <346FF895.167EB0E7@sable.krasnoyarsk.su>
Date: Mon, 17 Nov 1997 14:56:05 +0700
From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
Organization: ITTS (Krasnoyarsk)
To: Bruce Momjian <maillist@candle.pha.pa.us>
CC: PostgreSQL-development <hackers@postgreSQL.org>,
"Vadim B. Mikheev" <vadim@post.krasnet.ru>
Subject: Re: Buffered logging/pg_log
References: <199711170542.AAA24561@candle.pha.pa.us>

Bruce Momjian wrote:

On startup, the postmaster makes a copy of pg_log, called pg_log_live.
Each postgres backend mmap()s this new file into its address space. A
lock is gotten to make changes to the file. All backends use pg_log_live
rather than pg_log. Only the postmaster writes to pg_log. (I will
someday remove the exec() from the postmaster, so backends will get this
address space automatically.)

What are the advantages of mmap'ing the entire pg_log over keeping just the
"online" pg_log pages? pg_log may be very big (tens of MB) - why do we have
to spend process address space on tens of MB of mostly unused data?
Also, do all systems have mmap?

Every 60-90 seconds, the postmaster gets a write lock on pg_log, takes
the minimum xid set in the start of pg_log, and resets its value. It
records the current oid and xid from pg_variable. It then clears the
lock, and starts reading from the minimum recorded xid changed to the
end of pg_log_live, and copies it into allocated memory. It then does a
sync (twice?), waits for completion, and then writes the pg_log_live

^^^^^
man sync:

The sync() function forces a write of dirty (modified) buffers in the
^^^^^^
block buffer cache out to disk...
...

BUGS
Sync() may return before the buffers are completely flushed.

Vadim

---------------------------------------------------------------------------

From: Bruce Momjian <maillist@candle.pha.pa.us>
Message-Id: <199711171346.IAA01964@candle.pha.pa.us>
Subject: [HACKERS] Re: Buffered logging/pg_log
To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
Date: Mon, 17 Nov 1997 08:46:29 -0500 (EST)
Cc: hackers@postgreSQL.org (PostgreSQL-development)
In-Reply-To: <346FF895.167EB0E7@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 17, 97 02:56:05 pm

Bruce Momjian wrote:

On startup, the postmaster makes a copy of pg_log, called pg_log_live.
Each postgres backend mmap()s this new file into its address space. A
lock is gotten to make changes to the file. All backends use pg_log_live
rather than pg_log. Only the postmaster writes to pg_log. (I will
someday remove the exec() from the postmaster, so backends will get this
address space automatically.)

What are the advantages of mmap'ing the entire pg_log over keeping just the
"online" pg_log pages? pg_log may be very big (tens of MB) - why do we have
to spend process address space on tens of MB of mostly unused data?
Also, do all systems have mmap?

I believe you are correct that it would be better keeping the last few
pages of pg_log in shared memory rather than using mmap().

I think the important new ideas are keeping track of the oid/xid before
sync so we can accurately add 10,000 after a crash.

I am a little foggy on race conditions of growing the pg_log region while
other backends are running, and modifying non-shared memory pages, but
you seem to have a handle on it.

We don't need pg_log_live if only the postmaster writes those last two
pages to pg_log, and if we keep track of a crash status somewhere else,
perhaps at the start of pg_log.

Every 60-90 seconds, the postmaster gets a write lock on pg_log, takes
the minimum xid set in the start of pg_log, and resets its value. It
records the current oid and xid from pg_variable. It then clears the
lock, and starts reading from the minimum recorded xid changed to the
end of pg_log_live, and copies it into allocated memory. It then does a
sync (twice?), waits for completion, and then writes the pg_log_live

^^^^^
man sync:

The sync() function forces a write of dirty (modified) buffers in the
^^^^^^
block buffer cache out to disk...
...

BUGS
Sync() may return before the buffers are completely flushed.

Vadim

My BSD/OS doesn't mention this, but twice is a good idea.

--
Bruce Momjian
maillist@candle.pha.pa.us

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
#8 Noname
ocie@paracel.com
In reply to: Bruce Momjian (#6)
Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

Bruce Momjian wrote:

I have had the pleasure to work on the guts of one of the major databases
raw partition storage managers over the last ten years (hint, not my
current domain), and guess what? It implements a file system. And, not a
particularly good filesystem at that. Think about something like "FAT",
but not quite that nice. It was also a major source of pain in that it
was complex, heavily concurrent, and any errors show up as massive data
loss or corruption. Be careful what you wish for.

Interesting.

Perhaps we could:

a) Incorporate an existing filesystem into the code (ext2?). By
incorporate, I mean that we would just take the latest version of the
code and link it into the executable, or into a library, and make calls
to some of the lower level access and allocation routines.

b) suggest that for higher performance, the user should format the
disk partition with ext2 (or whatever) and turn off caching and set the
block size to the maximum possible.

I know for a fact that ext2 lets the user select the block size, and
it looks like Linux at least supports a sync mount option which makes
all I/O to this FS synchronous (which I assume would turn off write
caching at least). If caching could be disabled, then option b would
seem to provide performance equivalent to a.

Ocie Mitchell

#9 Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Noname (#8)
Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

Bruce Momjian wrote:

I have had the pleasure to work on the guts of one of the major databases
raw partition storage managers over the last ten years (hint, not my
current domain), and guess what? It implements a file system. And, not a
particularly good filesystem at that. Think about something like "FAT",
but not quite that nice. It was also a major source of pain in that it
was complex, heavily concurrent, and any errors show up as massive data
loss or corruption. Be careful what you wish for.

Interesting.

Perhaps we could:

a) Incorporate an existing filesystem into the code (ext2?). By
incorporate, I mean that we would just take the latest version of the
code and link it into the executable, or into a library, and make calls
to some of the lower level access and allocation routines.

b) suggest that for higher performance, the user should format the
disk partition with ext2 (or whatever) and turn off caching and set the
block size to the maximum possible.

I know for a fact that ext2 lets the user select the block size, and
it looks like Linux at least supports a sync mount option which makes
all I/O to this FS synchronous (which I assume would turn off write
caching at least). If caching could be disabled, then option b would
seem to provide performance equivalent to a.

I checked, and under BSD/OS, the readahead call for ufs looks like:

error = breadn(vp, lbn, size,
(daddr_t *)&nextlbn, &nextsize, 1, NOCRED, &bp);

The '1' is requesting a read-ahead of one block past the requested
block. Clearly this is not tunable, though a read-ahead of one is not a
significant performance problem. In most cases, the block was already
read as part of the disk scan, but this gives us the next block in cases
where we are reading sequentially.

The sync option is not really desired because we do our own syncs on
transaction completion. Don't want a sync on every write. Don't think
you can disable caching.
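The distinction Bruce draws maps directly onto plain POSIX calls: a sync mount behaves as if every file were opened with O_SYNC (each write blocks until it reaches disk), while the database only wants one explicit fsync() at transaction commit. A hedged sketch under assumed helper names:

```c
#include <fcntl.h>
#include <unistd.h>

/* Every write() on this descriptor is synchronous -- the behavior a
 * "sync" mount imposes on all files, wanted or not. */
int open_sync(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_SYNC, 0600);
}

/* What the backend actually wants: buffered writes during the
 * transaction... */
int open_buffered(const char *path)
{
    return open(path, O_WRONLY | O_CREAT, 0600);
}

/* ...and a single forced disk write when the transaction commits. */
int commit(int fd)
{
    return fsync(fd);
}
```

With the second approach, a transaction that dirties a hundred pages pays one fsync rather than a hundred synchronous writes, which is why the sync mount option is "not really desired" here.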

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
#10 Noname
dg@illustra.com
In reply to: Bruce Momjian (#6)
Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

Bruce Momjian writes:

Yes, I have heard that the standard file system read-ahead is often
useless for a database, so on a raw partition you know the next block
that is going to be requested, so you can prefetch there rather than
having the file system prefetch the next sequential block.

At least on the systems I am intimately familiar with, the prefetch that the
OS does (assuming a modern OS like Linux) is pretty hard to beat. If you have
a table that was bulk loaded in key order, a sequential scan is going to
result in a sequential access pattern to the underlying file, and the OS
prefetch does the right thing. If you have an unindexed table with rows
inserted at the end, the OS prefetch still works. If you are using a secondary
index on some sort of chopped-up table with rows inserted willy-nilly,
then it may be worth doing async reads in a burst and letting the disk request
sort make the best of it.

As far as I am aware, Postgres does not do async I/O. Perhaps it should.

It is also nice to control what gets written to disk/fsync'ed and what
doesn't get fsync'ed.

This is really the big win.

Our idea is to control when pg_log gets written to disk. We keep active
pg_log pages in shared memory, and every 30-60 seconds, we make a memory
copy of the current pg_log active pages, do a system sync() (which
happens anyway at that interval), update the pg_log file with the saved
changes, and fsync() the pg_log pages to disk. That way, after a crash,
the current database only shows transactions as committed where we are
sure all the data has made it to disk.

OK as far as it goes, but probably bad for concurrency if I have understood
you.

I have a more detailed posting if you are interested.

Thanks, I will read it. Probably should hold more comments until after that ;-)

-dg

David Gould dg@illustra.com 510.628.3783 or 510.305.9468
Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
- I realize now that irony has no place in business communications.

#11 Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Noname (#10)
Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

At least on the systems I am intimately familiar with, the prefetch that the
OS does (assuming a modern OS like Linux) is pretty hard to beat. If you have
a table that was bulk loaded in key order, a sequential scan is going to
result in a sequential access pattern to the underlying file, and the OS
prefetch does the right thing. If you have an unindexed table with rows
inserted at the end, the OS prefetch still works. If you are using a secondary
index on some sort of chopped-up table with rows inserted willy-nilly,
then it may be worth doing async reads in a burst and letting the disk request
sort make the best of it.

As far as I am aware, Postgres does not do async I/O. Perhaps it should.

I am adding this to the TODO list:

* Do async I/O to do better read-ahead of data

Because we are not threaded, we really can't do anything else while we
are waiting for I/O, but we can pre-request data we know we will need.

Also nice so you can control what gets written to disk/fsync'ed and what doesn't
get fsync'ed.

This is really the big win.

Yep, and this is what we are trying to work around in our buffered
pg_log change. Because we have the transaction ids all compact in one
place, this seems like a workable solution to our lack of write-to-disk
control. We just control the pg_log writes.

Our idea is to control when pg_log gets written to disk. We keep active
pg_log pages in shared memory, and every 30-60 seconds, we make a memory
copy of the current pg_log active pages, do a system sync() (which
happens anyway at that interval), update the pg_log file with the saved
changes, and fsync() the pg_log pages to disk. That way, after a crash,
the current database only shows transactions as committed where we are
sure all the data has made it to disk.

OK as far as it goes, but probably bad for concurrency if I have understood
you.

Interested in hearing your comments.

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
#12 Noname
dg@illustra.com
In reply to: Bruce Momjian (#11)
Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

Bruce Momjian writes:

I am adding this to the TODO list:

* Do async I/O to do better read-ahead of data

Good.

Because we are not threaded, we really can't do anything else while we
are waiting for I/O, but we can pre-request data we know we will need.

Threading is a bit like raw devices. It sounds like a really good idea,
particularly with M$ banging the "NT, now with threads" drum, but in real
life there are some very good reasons not to thread. Particularly with an
extensible product like Postgres where J-Random routine gets loaded at
runtime. In a threaded system, J-Random routine needs to be pretty well
perfect or the whole system comes down. In a process based system, unless
it trashes something in the shared memory, only the one connection instance
needs to come down. My experience with Illustra says that this is fairly
important.

The other big problem with threading is that now the stacks and all dynamically
allocated data are in the shared memory and are not easily extendable. So,
if some recursive procedure (like in the rewriter) uses a bit of extra stack
some other thread gets its stack trashed. This is never pretty.

Or if some user function loads a giant large object (like an mpeg say), that
memory has to come out of the shared memory, now if they leak that memory,
it is gone for good. In a per process system, it just ends up using a little
more swap space.

The other thing threading does is introduce new synchronization requirements
into things that never needed it before. For example, Postgres is pretty
free with memory allocation and deallocation (think Nodes!). With threading
each palloc() and pfree() is going to have to take a round trip through a
mutex. This makes an expensive operation even more costly. By and large,
the dbms's that are threaded have had pretty static (ie pre-allocate every
thing in arrays at boot time) memory models. Postgres does not fit this
picture very well.

Ultimately, threading may buy some performance, but not very much compared
to how hard it is to get right and how brittle it tends to make the system.

Unless I have misunderstood the state of Postgres, there is a vast amount
of performance improvement to be had without even thinking about threading.
If it were me, I would pick up the easy stuff, then the middle-hard stuff
with the really big wins like a proper transaction log, and leave the very
hard stuff like threading until last.

-dg

David Gould dg@illustra.com 510.628.3783 or 510.305.9468
Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
"Of course, someone who knows more about this will correct me if I'm wrong,
and someone who knows less will correct me if I'm right."
--David Palmer (palmer@tybalt.caltech.edu)

#13Noname
ocie@paracel.com
In reply to: Noname (#12)
Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

David Gould wrote:

Bruce Momjian writes:

I am adding this to the TODO list:

* Do async I/O to do better read-ahead of data

Good.

Because we are not threaded, we really can't do anything else while we
are waiting for I/O, but we can pre-request data we know we will need.

Threading is a bit like raw devices. It sounds like a really good idea,
particularly with M$ banging the "NT, now with threads" drum, but in real
life there are some very good reasons not to thread. Particularly with an
extensible product like Postgres where J-Random routine gets loaded at
runtime. In a threaded system, J-Random routine needs to be pretty well
perfect or the whole system comes down. In a process based system, unless
it trashes something in the shared memory, only the one connection instance
needs to come down. My experience with Illustra says that this is fairly
important.

I tend to agree. I think threads are more useful in application
software, where the programmer has at least some idea of how the
program will be used.

On a more practical/implementation note, it might be nice to have one
process that performs prefetches for the others: i.e., if I know
that I will need page X, I queue a request to this process (in shared
memory?) and it reads the page in for us (or even writes a page to
disk for us).

Asynchronous I/O is interesting, but I don't know how widely it has
been implemented (Linux has none, unless the development version is
working on it, and Solaris has aio, but not the POSIX flavor yet).
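The aio interface Ocie mentions looks roughly like this; prefetch_block is a
hypothetical helper, and this is a minimal sketch of POSIX aio rather than
anything in the Postgres tree:

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Issue an asynchronous read and wait for it; returns bytes read or -1. */
ssize_t prefetch_block(int fd, void *buf, size_t len, off_t offset)
{
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = offset;

    if (aio_read(&cb) != 0)       /* request is queued; we get control back */
        return -1;

    /* In real use the backend would do other work here instead of waiting. */
    const struct aiocb *list[1] = { &cb };
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);

    return aio_return(&cb);       /* completion status of the request */
}
```

The win comes from issuing aio_read() for blocks you will need soon and only
waiting when you actually touch them.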

Ocie Mitchell

#14Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Noname (#12)
Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

Threading is a bit like raw devices. It sounds like a really good idea,
particularly with M$ banging the "NT, now with threads" drum, but in real
life there are some very good reasons not to thread. Particularly with an
extensible product like Postgres where J-Random routine gets loaded at
runtime. In a threaded system, J-Random routine needs to be pretty well
perfect or the whole system comes down. In a process based system, unless
it trashes something in the shared memory, only the one connection instance
needs to come down. My experience with Illustra says that this is fairly
important.

Yes, the threading topic has come up before, and I have never considered
it a big win. We want to remove the exec() from the startup, so we just
do a fork. Will save 0.001 seconds of startup.

That is a very easy win for us. I hadn't considered the synchronization
problems with palloc/pfree, and that could be a real problem.

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
#15Michal Mosiewicz
mimo@interdata.com.pl
In reply to: Bruce Momjian (#14)
Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

Bruce Momjian wrote:

Yes, the threading topic has come up before, and I have never considered
it a big win. We want to remove the exec() from the startup, so we just
do a fork. Will save 0.001 seconds of startup.

As I have always been biased toward threading, I would note that in many
cases it is a big win. First of all, today it's the IO which is usually
the slowest part of the database. Most good IO designs have
optimisations like tagged queuing etc. But to make use of all this stuff
you have to generate your IO requests as soon as possible. If you
process your query in one thread, most of the time your processor waits
for the IO, but sometimes your IO waits for the processor. If you
populate your IO system with a bunch of IO requests in parallel,
controllers may requeue them to optimise head movements etc.
However, if you do some IO, then some processing, then some IO... you
lose the ability to optimise your requests, unless your system is
heavily loaded, in which case it actually doesn't matter how you
schedule IO requests. But if your system is not loaded too heavily, it's
good to parallelize IO tasks, and the easiest way to accomplish that is
to use threads for parallel execution of tasks.

But I notice that many people still think of threads as a replacement
for fork. Of course, in that case it's pretty useless, since fork is fast
enough. But the key to success is to parallelize single queries, not
only to leverage processor usage but also to push IO to its maximum.

That is a very easy win for us. I hadn't considered the synchronization
problems with palloc/pfree, and that could be a real problem.

A few months ago I was thinking about it. Actually, I don't see many
problems with things like palloc/pfree. I don't see any problems with
any heap data that is used locally. But it is a big problem when you
look at global variables and global data that are accessed and
modified in many places. This is a potential source of trouble.

Mike

--
WWW: http://www.lodz.pdi.net/~mimo tel: Int. Acc. Code + 48 42 148340
add: Michal Mosiewicz * Bugaj 66 m.54 * 95-200 Pabianice * POLAND

#16Noname
dg@illustra.com
In reply to: Michal Mosiewicz (#15)
Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

Michal Mosiewicz writes:

As I have always been biased toward threading, I would note that in many
cases it is a big win. First of all, today it's the IO which is usually
the slowest part of the database. ...

...

However, if you do some IO, then some processing, then some IO... you
lose the ability to optimise your requests. ...
... But if your system is not loaded too heavily, it's
good to parallelize IO tasks, and the easiest way to accomplish that is
to use threads for parallel execution of tasks.

Agreed, but what you are talking about here is decomposing a query into
its parallel components and executing them in parallel. This is a win,
of course, but the optimizer and executor have to support it. Also, you
start to want things like table fragmentation across devices to make this
work. A big job. As a shortcut, you can just do some lookahead on index
scans and do prefetch. Doesn't buy as much, but could probably be done
very quickly.

But I notice that many people still think of threads as a replacement
for fork. Of course, in that case it's pretty useless, since fork is fast
enough. But the key to success is to parallelize single queries, not
only to leverage processor usage but also to push IO to its maximum.

This is indeed what I was thinking about. The process-per-connection
scheme of Postgres is often criticised versus a thread-per-connection
scheme as in Sybase, for example. I was responding to that criticism.

That is a very easy win for us. I hadn't considered the synchonization
problems with palloc/pfree, and that could be a real problem.

A few months ago I was thinking about it. Actually, I don't see many
problems with things like palloc/pfree. I don't see any problems with

If you have multiple threads each allocing memory at the same time, the
allocator data structures have to be protected.

any heap data that is used locally. But it is a big problem when you
look at global variables and global data that are accessed and
modified in many places. This is a potential source of trouble.

Too right.

-dg

David Gould dg@illustra.com 510.628.3783 or 510.305.9468
Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
- I realize now that irony has no place in business communications.

#17Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Noname (#16)
Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

Agreed, but what you are talking about here is decomposing a query into
its parallel components and executing them in parallel. This is a win,
of course, but the optimizer and executor have to support it. Also, you
start to want things like table fragmentation across devices to make this
work. A big job. As a shortcut, you can just do some lookahead on index
scans and do prefetch. Doesn't buy as much, but could probably be done
very quickly.

Yes, this is what I was thinking. Could be done pretty easily, and
hence the TODO addition.

If you have multiple threads each allocing memory at the same time, the
allocator data structures have to be protected.

any heap data that is used locally. But it is a big problem when you
look at global variables and global data that are accessed and
modified in many places. This is a potential source of trouble.

Too right.

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
#18Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Noname (#16)
Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

Yes, the threading topic has come up before, and I have never considered
it a big win. We want to remove the exec() from the startup, so we just
do a fork. Will save 0.001 seconds of startup.

That is a very easy win for us. I hadn't considered the synchronization
problems with palloc/pfree, and that could be a real problem.

I was wrong here. Removing exec() will save 0.01 seconds, not 0.001
seconds. Typical backend startup and a single query is 0.08 seconds.
Removal of exec() will take this down to 0.07 seconds.

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
#19Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Bruce Momjian (#18)
Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

I have not had a chance yet to browse the postgres code, some silliness about
my day job keeps interfering, so it may be a few weeks before I get to this.

My thought is that committing transactions can't release locks until the
log is written. So, if the log is only flushed every so often, locks are
held longer, harming concurrency.

Of course, I am probably thinking about "normal" WAL-type transactions,
not the "unique" Postgres log thing, which, truth to tell, I may never
have really understood.

You have a good point here. We are telling an application that the
transaction is committed, even though if there is a crash in the next 10
seconds, on reboot it will show as not committed.

It is my understanding that all buffered-logging database systems have
this problem, so I think we will just have to live with it. I believe
we will still continue to offer the original fsync()-after-commit
behavior.
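The trade-off can be sketched as follows; commit_txn and flush_every_commit
are hypothetical names for illustration, not Postgres code:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Append a commit record to the log.  With flush_every_commit set, the
 * transaction is on disk before we report success (the original fsync
 * behavior); without it, a crash in the next few seconds can lose a
 * commit the application was already told about. */
int commit_txn(int log_fd, const char *rec, int flush_every_commit)
{
    if (write(log_fd, rec, strlen(rec)) < 0)
        return -1;
    if (flush_every_commit && fsync(log_fd) != 0)
        return -1;
    return 0;   /* only now tell the client "committed" */
}
```

Locks also cannot safely be released until the fsync() (or the periodic
flush) happens, which is the concurrency cost David describes.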

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
#20Vadim B. Mikheev
vadim@sable.krasnoyarsk.su
In reply to: The Hermit Hacker (#1)
Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

Chris Albertson wrote:

I know you must have looked at far more Postgres code than I have, but
I was browsing the storage manager. Apparently it is fairly easy to
assign a class to a manager as each class is tagged in the system catalog
with a storage method. What I really want is a >2GB table. I was trying
to see if this was supported by reading the source. Looks like it may be.
The note in the To Do list includes testing. I would test it but for
lack of disk space. (I'll have more in a while.)

^^^^^^^^^^^^^^^^^^

src/backend/storage/smgr/md.c:

#define RELSEG_SIZE ...

- just decrease it!
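To make Vadim's suggestion concrete: md.c splits a relation into OS files of
RELSEG_SIZE blocks each ("table", "table.1", "table.2", ...), so no single
file exceeds the OS limit. A sketch of the arithmetic, with illustrative
constants rather than the exact 1998 defaults; decreasing RELSEG_SIZE lets
you exercise segment rollover without 2GB of free disk:

```c
#include <assert.h>

#define BLCKSZ      8192        /* bytes per disk block */
#define RELSEG_SIZE 262144L     /* blocks per segment file: 2GB / 8K here */

/* Which segment file a block lives in, and the byte offset within it. */
int  seg_for_block(long blkno) { return (int) (blkno / RELSEG_SIZE); }
long seg_offset(long blkno)    { return (blkno % RELSEG_SIZE) * BLCKSZ; }
```

With RELSEG_SIZE lowered to, say, 16 blocks, a table of a few megabytes
already spans several segment files and tests the >2GB code path.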

Vadim