Problem with the numbers I reported yesterday
I ran my performance tests a few more times, and it seems the numbers are not
really comparable. When I run PostgreSQL without -F, I get a sync after every
insert. With -F, I get no sync at all, as all inserts fit easily into the
buffer. However, Oracle, in comparison, does sync: just from listening to the
disk access, it seems to sync every two or three seconds.
Does anyone know a way to really check both DBMSs?
Michael
--
Dr. Michael Meskes, Project-Manager | topsystem Systemhaus GmbH
meskes@topsystem.de | Europark A2, Adenauerstr. 20
meskes@debian.org | 52146 Wuerselen
Go SF49ers! Go Rhein Fire! | Tel: (+49) 2405/4670-44
Use Debian GNU/Linux! | Fax: (+49) 2405/4670-10
I ran my performance tests a few more times, and it seems the numbers are not
really comparable. When I run PostgreSQL without -F, I get a sync after every
insert. With -F, I get no sync at all, as all inserts fit easily into the
buffer. However, Oracle, in comparison, does sync: just from listening to the
disk access, it seems to sync every two or three seconds.
Does anyone know a way to really check both DBMSs?
Many DBMSs do buffered logging, that is, they sync after the buffer gets
full or after a minute or so. We have the logic to add buffered logging
to PostgreSQL and will be doing it later. Right now, we only have
non-buffered logging, and no logging.
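To make that concrete, here is a minimal sketch (not PostgreSQL's actual code; the buffer size and flush interval are arbitrary) of what buffered logging amounts to: commit records collect in a memory buffer that is written and fsync'd only when it fills up or a flush interval expires, so many commits share one sync.

    /* Hypothetical sketch of buffered logging: commit records accumulate in
     * a memory buffer and hit the disk only when the buffer fills or a
     * timer expires, instead of one fsync per insert. */
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define LOGBUF_SIZE    8192
    #define FLUSH_INTERVAL 60           /* seconds, "after a minute or so" */

    static char   logbuf[LOGBUF_SIZE];
    static size_t logbuf_used = 0;
    static time_t last_flush  = 0;

    static void log_flush(int fd)
    {
        if (logbuf_used > 0)
        {
            (void) write(fd, logbuf, logbuf_used);  /* error handling omitted */
            fsync(fd);                              /* one sync for many commits */
            logbuf_used = 0;
        }
        last_flush = time(NULL);
    }

    /* Append one commit record; flush only when the buffer is full
     * or the flush interval has passed. */
    static void log_commit(int fd, const char *record, size_t len)
    {
        if (logbuf_used + len > LOGBUF_SIZE)
            log_flush(fd);
        memcpy(logbuf + logbuf_used, record, len);
        logbuf_used += len;
        if (time(NULL) - last_flush >= FLUSH_INTERVAL)
            log_flush(fd);
    }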
--
Bruce Momjian
maillist@candle.pha.pa.us
Bruce Momjian wrote:
I ran my performance tests a few more times, and it seems the numbers are not
really comparable. When I run PostgreSQL without -F, I get a sync after every
insert. With -F, I get no sync at all, as all inserts fit easily into the
buffer. However, Oracle, in comparison, does sync: just from listening to the
disk access, it seems to sync every two or three seconds.
Does anyone know a way to really check both DBMSs?
Many DBMSs do buffered logging, that is, they sync after the buffer gets
full or after a minute or so. We have the logic to add buffered logging
to PostgreSQL and will be doing it later. Right now, we only have
non-buffered logging, and no logging.
Will there be (or is there) a provision for using raw disk devices,
rather than going through the UNIX filesystem? This might be able to
reduce some of the overhead. Also, if a transaction log were appended
serially to a separate raw disk, there would be very little seeking
needed and this (the log) could easily be fsync'd for each transaction
while the data itself is buffered.
Of course not everyone has two raw disks to devote to a database (or
even one raw disk), but for those who do, this could provide speed and
reliability.
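A rough sketch of that scheme, assuming a dedicated log device or file (the path, the record format, and the function names here are placeholders): each transaction appends its log record and fsyncs only the log, while the data files stay in the buffer cache.

    /* Sketch: append commit records serially to a dedicated log device or
     * file and fsync only that per transaction; data pages stay buffered. */
    #include <fcntl.h>
    #include <unistd.h>

    int log_open(const char *path)
    {
        /* Appending to a single device keeps the head from seeking. */
        return open(path, O_WRONLY | O_APPEND | O_CREAT, 0600);
    }

    /* One fsync per transaction, but only on the log device. */
    int log_commit_txn(int log_fd, const char *rec, size_t len)
    {
        if (write(log_fd, rec, len) != (ssize_t) len)
            return -1;
        return fsync(log_fd);
    }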
Ocie Mitchell
Hi,
perhaps a stupid question:
What will happen if you start PostgreSQL with -F and write a script
that loops and fsyncs every 2 seconds? Would this lead to a
database that is "almost" correct? Does this seem like a good
compromise?
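For concreteness, the "script" being asked about amounts to little more than this loop (a sketch only; note that sync() schedules every dirty buffer in the OS for writing, not just PostgreSQL's, and gives no ordering guarantees):

    /* Force the OS to write out dirty buffers every 2 seconds while the
     * backends run with -F. */
    #include <unistd.h>

    int main(void)
    {
        for (;;)
        {
            sync();     /* flush all dirty OS buffers to disk */
            sleep(2);
        }
    }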
Well, you really have to sync the data pages BEFORE syncing pg_log.
Our buffered logging idea for post-6.3 will do exactly that. I don't
think your script is a safe workaround: you could get pg_log to disk
saying a transaction is complete without getting all the data pages to
disk if the crash happened during the sync.
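In other words, the commit path has to order its flushes roughly like this sketch (the file descriptors and the pg_log update callback are placeholders, not PostgreSQL's real interfaces):

    /* Flush data pages first; only then mark the transaction committed in
     * pg_log and flush that.  If the machine crashes before pg_log is
     * flushed, the transaction merely looks aborted; the reverse order
     * could declare a commit whose data pages never reached disk. */
    #include <unistd.h>

    int commit_transaction(int data_fd, int pg_log_fd,
                           int (*mark_committed_in_pg_log)(int fd))
    {
        if (fsync(data_fd) != 0)                      /* 1. data pages to disk */
            return -1;
        if (mark_committed_in_pg_log(pg_log_fd) != 0) /* 2. record "committed" */
            return -1;
        return fsync(pg_log_fd);                      /* 3. make it durable */
    }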
--
Bruce Momjian
maillist@candle.pha.pa.us
Hi,
perhaps a stupid question:
What will happen if you start PostgreSQL with -F and write a script
that loops and fsyncs every 2 seconds? Would this lead to a
database that is "almost" correct? Does this seem like a good
compromise?
I ran my performance tests a few more times, and it seems the numbers are not
really comparable. When I run PostgreSQL without -F, I get a sync after every
insert. With -F, I get no sync at all, as all inserts fit easily into the
buffer. However, Oracle, in comparison, does sync: just from listening to the
disk access, it seems to sync every two or three seconds.
Does anyone know a way to really check both DBMSs?
Many DBMSs do buffered logging, that is, they sync after the buffer gets
full or after a minute or so. We have the logic to add buffered logging
to PostgreSQL and will be doing it later. Right now, we only have
non-buffered logging, and no logging.
--
Bruce Momjian
maillist@candle.pha.pa.us
Ciao
Das Boersenspielteam.
---------------------------------------------------------------------------
http://www.boersenspiel.de
The stock-market game on the Internet
*Realistic* *Free* *Over 6000 players*
---------------------------------------------------------------------------
Bruce Momjian wrote:
Hi,
perhaps a stupid question:
What will happen if you start PostgreSQL with -F and write a script
that loops and fsyncs every 2 seconds? Would this lead to a
database that is "almost" correct? Does this seem like a good
compromise?
Well, you really have to sync the data pages BEFORE syncing pg_log.
Why should this be necessary? If the transaction is considered
committed once the log has been written, and the system crashes before
the data are written, then postgres can look at the data and logs when
it is next started up and apply all the transactions that were logged
but not committed to the data pages.
Am I missing something? It seems to me if you sync the data pages
first, then what good is the log? (other than being able to retrace
your steps)
Ocie Mitchell
Bruce Momjian wrote:
Hi,
perhaps a stupid question:
What will happen if you start PostgreSQL with -F and write a script
that loops and fsyncs every 2 seconds? Would this lead to a
database that is "almost" correct? Does this seem like a good
compromise?
Well, you really have to sync the data pages BEFORE syncing pg_log.
Why should this be necessary? If the transaction is considered
committed once the log has been written, and the system crashes before
the data are written, then postgres can look at the data and logs when
it is next started up and apply all the transactions that were logged
but not committed to the data pages.
No, on restart, you can't identify the old/new data. Remember, pg_log
is just a transaction id and a flag. The superseded and new rows are
mixed on the data pages, with transaction ids as markers.
Am I missing something? It seems to me if you sync the data pages
first, then what good is the log? (other than being able to retrace
your steps)
Again, the log is just a list of transaction ids, and their statuses.
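A toy illustration of that (the real pg_log layout may differ; two status bits per transaction id is an assumption made here for the example): visibility checks just look up the status of the transaction ids stored in each row.

    /* pg_log viewed as nothing but a status per transaction id. */
    #include <stdint.h>

    enum xstatus { XID_IN_PROGRESS = 0, XID_COMMITTED = 1, XID_ABORTED = 2 };

    #define XIDS_PER_BYTE 4              /* 2 status bits per transaction id */

    static enum xstatus xid_status(const uint8_t *pg_log, uint32_t xid)
    {
        uint8_t byte  = pg_log[xid / XIDS_PER_BYTE];
        int     shift = (int) (xid % XIDS_PER_BYTE) * 2;

        return (enum xstatus) ((byte >> shift) & 0x3);
    }

    /* A row version is visible only if its inserting xid committed and its
     * deleting xid (if any) did not; the rows themselves stay on the data
     * pages, marked with those xids. */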
--
Bruce Momjian
maillist@candle.pha.pa.us
"ocie" == ocie <ocie@paracel.com> writes:
Bruce Momjian wrote:
Hi,
perhaps a stupid question:
What will happen if you start PostgreSQL with -F and write a script
that loops and fsyncs every 2 seconds? Would this lead to a
database that is "almost" correct? Does this seem like a good
compromise?
Well, you really have to sync the data pages BEFORE syncing
pg_log.
Why should this be necessary? If the transaction is considered
committed once the log has been written, and the system crashes
before the data are written, then postgres can look at the data
and logs when it is next started up and apply all the
transactions that were logged but not committed to the data
pages.
Am I missing something? It seems to me if you sync the data
pages first, then what good is the log? (other than being able
to retrace your steps)
I do not think that pg_log is used like a normal 'log' device in other
databases. From my quick look at the code, pg_log only has a
list of transactions and not the actual data blocks. Notice that
TransRecover is commented out in backend/access/transam/transam.c.
Most database logs have the before and after images of any page
that has been modified in a transaction, followed by a commit/abort
record. This allows only this one file to need syncing. The
rest of the database can float (generally checkpoints are done every
so often to reduce recovery time). The method of recovering from a
crash is to replay the log from the last checkpoint until the end of
the log by applying the before/after images (as needed, based on
whether the transaction committed) to the actual database relations.
I would appreciate anyone correcting any mistakes in my understanding
of how postgres works.
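As a sketch of that conventional scheme (the record layout and the helper callbacks are invented for illustration, not taken from any real DBMS): each record carries a page's before and after images, and recovery scans forward from the last checkpoint applying the appropriate image.

    /* Page-image logging and replay, heavily simplified. */
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE 8192

    typedef struct
    {
        uint32_t xid;                    /* owning transaction */
        uint32_t rel_id;                 /* which relation */
        uint32_t page_no;                /* which page */
        uint8_t  before[PAGE_SIZE];      /* page image before the change */
        uint8_t  after[PAGE_SIZE];       /* page image after the change */
    } log_record;

    /* Replay from the last checkpoint to the end of the log: redo committed
     * changes, undo uncommitted ones.  Commit status is assumed to come
     * from commit/abort records seen elsewhere in the log. */
    void recover(FILE *log,
                 int  (*xid_committed)(uint32_t xid),
                 void (*write_page)(uint32_t rel, uint32_t page,
                                    const uint8_t *image))
    {
        log_record rec;

        while (fread(&rec, sizeof rec, 1, log) == 1)
        {
            if (xid_committed(rec.xid))
                write_page(rec.rel_id, rec.page_no, rec.after);   /* redo */
            else
                write_page(rec.rel_id, rec.page_no, rec.before);  /* undo */
        }
    }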
Ocie Mitchell
Kent S. Gordon
Architect
iNetSpace Co.
voice: (972)851-3494 fax:(972)702-0384 e-mail:kgor@inetspace.com
This is 100% correct. See my other posting describing the issues.
"ocie" == ocie <ocie@paracel.com> writes:
Bruce Momjian wrote:
Hi, > > perhaps a stupid question: > > What will happen,
if you start PostgreSQL with -F and write a script > which is a
loop that fsyncs every 2 seconds? Does this lead to a >
database which is "almost" correct? Does this seem like a goodcompromise?
Well, you really have to sync the data pages BEFORE sync'ing
pg_log.Why should this be necessary? If the transaction is considered
committed once the log has be written, and the system crashes
before the data are written, then postgres can look at the data
and logs when it is next started up and apply all the
transactions that were logged but not committed to the data
pages.Am I missing something? It seems to me if you sync the data
pages first, then what good is the log? (other than being able
to retrace your steps)I do not think that pg_log is used like a normal 'log' device in other
databases. My quick look at the code looks like pg_log only has a
list of transactions and not the actual data blocks. Notice that
TransRecover is commented out in backent/access/transam/transam.c.Most database log has the before images and after images of any page
that has been modified in a transaction followed by commit/abort
record. This allows for only this file to have to be synced. The
rest of the database can float (generally checkpoints are done every
so often to reduce recover time). The method of recovering from a
crash is to replay the log from the last checkpoint until the end of
the log by applying the before/after images (as needed based on
weather the transaction commited) to the actual database relations.I would appreciate anyone correcting any mistakes in my understanding
of how postgres works.Ocie Mitchell
Kent S. Gordon
Architect
iNetSpace Co.
voice: (972)851-3494 fax:(972)702-0384 e-mail:kgor@inetspace.com
--
Bruce Momjian
maillist@candle.pha.pa.us
Kent wrote:
I do not think that pg_log is used like a normal 'log' device in other
databases. From my quick look at the code, pg_log only has a
list of transactions and not the actual data blocks. Notice that
TransRecover is commented out in backend/access/transam/transam.c.
Most database logs have the before and after images of any page
that has been modified in a transaction, followed by a commit/abort
record. This allows only this one file to need syncing. The
rest of the database can float (generally checkpoints are done every
so often to reduce recovery time). The method of recovering from a
crash is to replay the log from the last checkpoint until the end of
the log by applying the before/after images (as needed, based on
whether the transaction committed) to the actual database relations.
I would appreciate anyone correcting any mistakes in my understanding
of how postgres works.
Ocie Mitchell
Kent S. Gordon
Architect
iNetSpace Co.
voice: (972)851-3494 fax:(972)702-0384 e-mail:kgor@inetspace.com
Totally right, PostgreSQL doesn't have a log mechanism that
collects all the information to recover a corrupted database
from a backup.
I hacked around on that a little bit.
When doing complete after-image logging, that is, recording all
the tuples that are stored on insert/update, the tuple ids of
deletes, plus the information about which transaction ids commit,
the regression tests produce more log data than the size of the
final regression database. The performance increase when syncing
only the log and control files (2 control files on different
devices, and the logfile on a device separate from the database
files) and running the backends with -F is about 15-20% for the
regression test.
I thought this was far too much logging data, so I didn't
spend much time trying to implement a recovery. But as far as
I got, I can tell that the updates to system catalogs and
keeping the indices up to date will be really tricky.
Another possible log mechanism I'll try sometime after the v6.3
release is to log the queries and the data from copy commands,
along with information about Oid and Tid allocations.
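Something like the following record header could describe the log data Jan mentions; the layout is hypothetical, not what he actually implemented.

    /* Hypothetical after-image log record: full tuple for insert/update,
     * tuple id only for delete, plus commit records for transaction ids. */
    #include <stdint.h>

    typedef enum { LOG_INSERT, LOG_UPDATE, LOG_DELETE, LOG_COMMIT } log_kind;

    typedef struct
    {
        log_kind kind;
        uint32_t xid;                /* transaction id */
        uint32_t rel_oid;            /* target relation */
        uint32_t block;              /* tuple id: block number ... */
        uint16_t offset;             /* ... and line pointer offset */
        uint32_t tuple_len;          /* 0 for LOG_DELETE and LOG_COMMIT */
        /* tuple_len bytes of tuple data follow for insert/update */
    } log_rec_header;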
Until later, Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#======================================== jwieck@debis.com (Jan Wieck) #
"jwieck" == Jan Wieck <jwieck@debis.com> writes:
Kent wrote:
description of current logging method deleted.
Totally right, PostgreSQL doesn't have a log mechanism that
collects all the information to recover a corrupted database
from a backup.
I hacked around on that a little bit.
When doing complete after-image logging, that is, recording all
the tuples that are stored on insert/update, the tuple ids of
deletes, plus the information about which transaction ids commit,
the regression tests produce more log data than the size of the
final regression database. The performance increase when syncing
only the log and control files (2 control files on different
devices, and the logfile on a device separate from the database
files) and running the backends with -F is about 15-20% for the
regression test.
Log files do get very big with image logging. I would not expect a
huge win in performance unless the log device is a raw device. On a
cooked device (file system), buffer cache effects are very large: all
disk data is being buffered both by PostgreSQL and by the OS buffer
cache. The buffer cache is actually harmful in this case, since the data
is not reused and the writes are synced. The writes to the
log also flush other buffers out of the cache, leading to even more
I/O. If a system does not have raw devices (linux, NT), it would be
very useful if a flag existed to tell the OS that the file will be read
sequentially, like the madvise() call for mmap. Is your code
available anywhere?
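On systems without raw devices, the hint Kent mentions could look roughly like this (a sketch; whether madvise() actually helps for a log read this way is platform dependent):

    /* Map a log file and tell the OS it will be read sequentially. */
    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    void *map_log_sequential(const char *path, size_t *len_out)
    {
        struct stat st;
        void       *p;
        int         fd = open(path, O_RDONLY);

        if (fd < 0)
            return NULL;
        if (fstat(fd, &st) != 0)
        {
            close(fd);
            return NULL;
        }
        p = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);                                        /* mapping stays valid */
        if (p == MAP_FAILED)
            return NULL;
        madvise(p, (size_t) st.st_size, MADV_SEQUENTIAL); /* read-ahead hint */
        *len_out = (size_t) st.st_size;
        return p;
    }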
I thought this was far too much logging data, so I didn't
spend much time trying to implement a recovery. But as far as I
got, I can tell that the updates to system catalogs and
keeping the indices up to date will be really tricky.
I have not looked at this area of the code. Do the system catalogs
have a separate storage manager? I do not see why they could not be
handled like any other data, except for keeping the buffer in the cache.
Kent S. Gordon
Architect
iNetSpace Co.
voice: (972)851-3494 fax:(972)702-0384 e-mail:kgor@inetspace.com
Kent S. Gordon wrote:
[SNIP]
Log files do get very big with image logging. I would not expect a
huge win in performance unless the log device is a raw device. On a
cooked device (file system) buffer cache effects are very large (all
disk data is being buffered both by postgresql and the OS buffer
cache. The buffer cache is actual harmfully in this case, since data
is not reused, and the writes are synced. The number of writes to the
log also flush out other buffer from the cache leading to even more
io.). If a system does not have raw devices (linux, NT), it would be
^^^^^
What exactly do you mean by "raw devices" that are not offered by
Linux? If I take a partition on one of my hard drives and don't
make a filesystem on it, I can perform reads and writes on the "raw
device" /dev/hd?? or /dev/sd??. I didn't think these writes were
buffered (if that's what you're referring to), but I could be wrong.
Ocie Mitchell.
What exactly do you mean by "raw devices" that are not offered by
Linux? If I take a partition on one of my hard drives and don't
make a filesystem on it, I can perform reads and writes on the "raw
device" /dev/hd?? or /dev/sd??. I didn't think these writes were
buffered (if that's what you're referring to), but I could be wrong.
Your /dev/hd* goes through the buffer cache; the raw versions,
/dev/rhd*, do not.
--
Bruce Momjian
maillist@candle.pha.pa.us
Bruce Momjian wrote:
What exactly do you mean by "raw devices" that are not offered by
Linux? If I take a partition on one of my hard drives and don't
make a filesystem on it, I can perform reads and writes on the "raw
device" /dev/hd?? or /dev/sd??. I didn't think these writes were
buffered (if that's what you're referring to), but I could be wrong.
Your /dev/hd* goes through the buffer cache; the raw versions,
/dev/rhd*, do not.
Actually, on Linux there is no raw/cooked drive interface as in
Solaris. In Solaris, the /dev/dsk/ devices are buffered by the OS,
while their counterparts in /dev/rdsk are not. Linux only has the one
interface to the partition, which is raw. Code which uses these raw
devices (the filesystem code) must supply its own buffering.
Anyway, I don't want to prolong this tangential topic. Linux does
provide raw devices, but does not (as Solaris does) also provide
buffered or cooked access to disks.
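For clarity, "using the partition directly" means something like the sketch below (the device name is only an example; whether the OS buffer cache sits underneath such writes is exactly the platform difference being discussed here):

    /* Read/write raw blocks on a device node with no filesystem on it. */
    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    int write_block(const char *dev, off_t blockno, const char *buf, size_t blksz)
    {
        int fd = open(dev, O_WRONLY);      /* e.g. "/dev/sdb2", no mkfs needed */

        if (fd < 0)
            return -1;
        if (lseek(fd, blockno * (off_t) blksz, SEEK_SET) < 0 ||
            write(fd, buf, blksz) != (ssize_t) blksz ||
            fsync(fd) != 0)                /* push it past any OS caching */
        {
            close(fd);
            return -1;
        }
        return close(fd);
    }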
Ocie Mitchell
Kent wrote:
"jwieck" == Jan Wieck <jwieck@debis.com> writes:
When doing complete after-image logging, that is, recording all
the tuples that are stored on insert/update, the tuple ids of
deletes, plus the information about which transaction ids commit,
the regression tests produce more log data than the size of the
final regression database. The performance increase when syncing
only the log and control files (2 control files on different
devices, and the logfile on a device separate from the database
files) and running the backends with -F is about 15-20% for the
regression test.
Log files do get very big with image logging. I would not expect a
huge win in performance unless the log device is a raw device. On a
cooked device (file system), buffer cache effects are very large: all
disk data is being buffered both by PostgreSQL and by the OS buffer
cache. The buffer cache is actually harmful in this case, since the data
is not reused and the writes are synced. The writes to the
log also flush other buffers out of the cache, leading to even more
I/O. If a system does not have raw devices (linux, NT), it would be
very useful if a flag existed to tell the OS that the file will be read
sequentially, like the madvise() call for mmap. Is your code
available anywhere?
I don't have that code any more. It wasn't that much, so I can
redo it, at least if you would like to help on that topic. But
since this will be a new feature, we should wait for the 6.3
release before touching anything.
I thought this was far too much logging data, so I didn't
spend much time trying to implement a recovery. But as far as I
got, I can tell that the updates to system catalogs and
keeping the indices up to date will be really tricky.
I have not looked at this area of the code. Do the system catalogs
have a separate storage manager? I do not see why they could not be
handled like any other data, except for keeping the buffer in the cache.
I just had some problems with the system catalogs (maybe due to
the system caching). I think it can be handled somehow.
There are other details of the logging we should care about
when we implement it.
The logging should be configurable per database, so some
databases have logging enabled while others are unprotected.
It must be possible to do point-in-time recovery (restore the
database from a backup and recover up to an absolute time or
transaction ID).
The previous two points create a problem for shared system
relations: if a backend running on an unlogged database
updates pg_user, for example, that change must still go into the
log!
We should give query logging instead of image logging a try.
Until later, Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#======================================== jwieck@debis.com (Jan Wieck) #