Delete Performance
Tom Lane suggested I look at EXPLAIN output, which showed
that both the catalog (fast delete case) and toasted text
table (slow delete case) were using sequential scans when
deleting any significant number of records.
But even with sequential scans, the catalog entries are
deleted quickly (30K records in just a couple of seconds),
versus slow deletes (2 per second) for the toasted text.
The catalog entries are about 200 bytes (integers, timestamps,
a couple of short fixed length strings), while the toasted
text table has one short text field, one timestamp, and one
long (2K to 20K bytes) toasted text field.
Both will use index scans when a very small fraction (< 1%) of records
would be selected, but the relative delete performance stays the same.
--
P. J. "Josh" Rovero Sonalysts, Inc.
Email: rovero@sonalysts.com www.sonalysts.com 215 Parkway North
Work: (860)326-3671 or 442-4355 Waterford CT 06385
***********************************************************************
"P.J. \"Josh\" Rovero" <rovero@sonalysts.com> writes:
[ complains that deletes are slow in table containing toasted data ]
I did some experimentation here and found a rather surprising
dependency: the time to delete a bunch of data is pretty much
directly proportional to the disk space it occupies. This says
that we're paying through the nose for having XLOG make copies
of about-to-be-modified pages.
I did:
create table foo (f1 text);
insert into foo values ('a short entry');
insert into foo select * from foo;
-- repeat the doubling insert enough times (15 in all) to build up 32K rows total
delete from foo;
The "delete" took about 2 seconds. I then did it over with the
'value' being a 5K chunk of text, which according to octet_length
got compressed to 3900 bytes. (This'd require two rows in the TOAST
table.) This time the delete took 127 seconds. I was expecting
about a 3X penalty since we needed to delete three rows not one,
but what I got was a 60X penalty.
Trying to understand this, I did some profiling and found that most
of the time was going into XLogInsert and XLOG I/O. That's when I
remembered that the actual data volume involved is considerably
different in the two cases. Allowing for tuple header overhead and
so forth, the small-data case involves about 1.8MB, the large-data
case about 131MB, or about 70 times as much data.
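For illustration, the rough arithmetic behind those figures (assuming the
same 32768 rows in both cases):
small rows: 32768 x ~55 bytes (short text plus tuple header) = ~1.8 MB
wide rows: 32768 x ~4000 bytes (heap tuple plus two TOAST tuples) = ~131 MB
131 / 1.8 = ~70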
I believe this indicates that what's determining the runtime is the fact
that the XLOG code writes out an image of each page modified in the
transaction. These page images will be the bulk of the XLOG traffic
for the TOAST table (since there are only four or so tuples on each
TOAST page, the actual XLOG delete records take little space by
comparison).
I've worried for some time that the decision to XLOG page images was
costing us a lot more performance than could be justified...
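For context, a minimal sketch in C of the rule being described: the first
modification of a page after a checkpoint puts the whole page image into
XLOG, and later modifications log only small records. All names here are
invented for illustration; this mirrors the concept, not the actual
PostgreSQL source.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Invented names throughout; conceptual sketch only. */
typedef struct PageHeader { uint64_t page_lsn; } PageHeader;

static uint64_t redo_ptr = 1000;  /* WAL position of the last checkpoint */

static bool needs_full_page_image(const PageHeader *pg)
{
    /* Page not yet imaged since the checkpoint: a crash could tear the
     * next write, so XLOG must carry the whole image to restore it. */
    return pg->page_lsn <= redo_ptr;
}

int main(void)
{
    PageHeader pg = { 900 };    /* last written before the checkpoint */
    printf("%s\n", needs_full_page_image(&pg)
                   ? "log full page image" : "log small record only");
    pg.page_lsn = 1200;         /* already imaged since the checkpoint */
    printf("%s\n", needs_full_page_image(&pg)
                   ? "log full page image" : "log small record only");
    return 0;
}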
One trick we could perhaps pull is to postpone deletion of TOAST tuples
until VACUUM, so that the bulk of the work is done in a noncritical path
(from the point of view of the application anyway). I'm not sure how
this interacts with the way that we re-use a TOAST entry when other
fields in the row are updated, however. It might be too difficult for
VACUUM to tell when to delete a TOAST item.
regards, tom lane
Tom Lane wrote:
I did some experimentation here and found a rather surprising
dependency: the time to delete a bunch of data is pretty much
directly proportional to the disk space it occupies. This says
that we're paying through the nose for having XLOG make copies
of about-to-be-modified pages.
At least now I know I wasn't imagining things.... :-)
Which brings up the question, what is the best way to deal with many
thousands of variable-length binary chunks. Net input == net output
over the course of a day. The new vacuum should help (both lo_ and
toasted tables take a long time to vacuum full), but I'm running into
the "Hotel California" situation. Data goes in fast, but can't be
deleted fast enough to keep the database from continuously growing
in size.
The "delete" took about 2 seconds. I then did it over with the
'value' being a 5K chunk of text, which according to octet_length
got compressed to 3900 bytes. (This'd require two rows in the TOAST
table.) This time the delete took 127 seconds. I was expecting
about a 3X penalty since we needed to delete three rows not one,
but what I got was a 60X penalty.
Wow. Can someone remind me why we take page images on delete? We
aren't really writing anything special to the page except a transaction
id.
I've worried for some time that the decision to XLOG page images was
costing us a lot more performance than could be justified...
Is it because we take a snapshot of the page before we write it in case
we only write part of the page?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Is it because we take a snapshot of the page before we write it in case
we only write part of the page?
AFAIR, the partial-page-write problem is the entire reason for doing it.
If we could be certain that writes to datafile pages were atomic, we'd
not need this.
Of course we can't be certain of that. But I'm wondering if there isn't
a cheaper solution.
regards, tom lane
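As an editorial aside: one family of cheaper approaches is to detect a torn
page rather than prevent it, e.g. a per-page checksum verified when the page
is read back. Detection alone repairs nothing -- an intact copy would still
have to live somewhere, such as a double-write area -- so it is at best half
an answer. A minimal C sketch of the detection half, with an invented page
layout:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 8192

/* Plain bitwise CRC-32 (polynomial 0xEDB88320).  A table-driven or
 * hardware CRC would be used in practice; this keeps the sketch short. */
static uint32_t crc32_buf(const uint8_t *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Invented layout: the first 4 bytes of each page hold a checksum over
 * the remaining PAGE_SIZE - 4 bytes, set just before the page goes out. */
static void page_set_checksum(uint8_t *page)
{
    uint32_t crc = crc32_buf(page + 4, PAGE_SIZE - 4);
    memcpy(page, &crc, 4);
}

static int page_is_intact(const uint8_t *page)
{
    uint32_t stored;
    memcpy(&stored, page, 4);
    return stored == crc32_buf(page + 4, PAGE_SIZE - 4);
}

int main(void)
{
    static uint8_t page[PAGE_SIZE];
    memset(page, 0xAB, sizeof page);
    page_set_checksum(page);
    printf("after full write: %s\n", page_is_intact(page) ? "intact" : "torn");
    page[PAGE_SIZE / 2] ^= 1;   /* simulate losing half of the write */
    printf("after torn write: %s\n", page_is_intact(page) ? "intact" : "torn");
    return 0;
}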
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Is it because we take a snapshot of the page before we write it in case
we only write part of the page?
AFAIR, the partial-page-write problem is the entire reason for doing it.
If we could be certain that writes to datafile pages were atomic, we'd
not need this. Of course we can't be certain of that. But I'm wondering
if there isn't a cheaper solution.
Could we add code to detect a partial write when we recover from one
using WAL so we can know if these partial writes are ever
happening?
I am with you on this. There has to be a better way.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Could we add code to detect a partial write when we recover from one
using WAL so we can know if these partial writes are ever
happening?
What's your point? It clearly *can* happen during power-failure
scenarios. All the monitoring in the world won't disprove that.
regards, tom lane
"P.J. \"Josh\" Rovero" <rovero@sonalysts.com> writes:
But even with sequential scan, the catalog entries are
deleted quickly (30K records in just a couple of seconds),
vice slow deletes (2 per second) for the toasted text.
The catalog entries are about 200 bytes (integers, timestamps,
a couple of short fixed length strings), while the toasted
text table has one short text field, one timestamp, and one
long (2K to 20K bytes) toasted text field.
I observed over in pg-hackers that deletion speed seems to be
proportional to total volume of data deleted, but that's not enough
to explain your results. You're reporting a 10000X speed difference
with only 10-100X difference in data volume, so there's still a large
factor to be accounted for.
Are you sure you don't have any rules, triggers, foreign keys involving
the slower table?
regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Could we add code to detect a partial write when we recover from one
using WAL so we can know if these partial writes are ever
happening?
What's your point? It clearly *can* happen during power-failure
scenarios. All the monitoring in the world won't disprove that.
My point is uh, um, eh, I think it is a very important point that I
should make ... um. :-)
Seriously, how do OS's handle partial page write, especially to
directories?
Another item I was considering is that INSERT and UPDATE, because they
append to the tables, don't really cause lots of pre-page writes, while
DELETE could affect all pages in a table and would require pre-page
writes on all of them.
However, deletes are only marking the XID status of the rows.
Unfortunately I can't think of a way of recording those new XID's in WAL
and preventing a possible failure while the XID's are written to the
page. Can someone help me here?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Seriously, how do OS's handle partial page write, especially to
directories?
... fsck ...
regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Seriously, how do OS's handle partial page write, especially to
directories?
... fsck ...
But how can it handle partial writes to a directory when many files
exist in that single block?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Seriously, how do OS's handle partial page write, especially to
directories?
I realize UPDATE also requires pre-page writes for the old tuples. What
bothers me is that unlike INSERT and UPDATE of new rows, DELETE and
UPDATE of old rows is not writing new data but just setting transaction
ID's. I wish there was a way to store those XID's somewhere else so the
page wouldn't have to be modified.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Tom Lane wrote:
I observed over in pg-hackers that deletion speed seems to be
proportional to total volume of data deleted, but that's not enough
to explain your results. You're reporting a 10000X speed difference
with only 10-100X difference in data volume, so there's still a large
factor to be accounted for.
Are you sure you don't have any rules, triggers, foreign keys involving
the slower table?
Hmm, there is a foreign key defined in the "fast" table:
CREATE TABLE grib_catalog (
    edition           INTEGER NOT NULL CHECK (edition IN (1, 2)),
    discipline        INTEGER,
    generating_center INTEGER NOT NULL CHECK (generating_center BETWEEN 7 AND 99),
    sub_center        INTEGER NOT NULL,
    scale_factor      INTEGER,
    grib_product_id   INTEGER REFERENCES grib_product,
    prod_category     INTEGER CHECK (prod_category BETWEEN 0 AND 19),
    grib_model_id     INTEGER REFERENCES grib_model,
    run_time          TIMESTAMP NOT NULL,
    fcst_time         INTEGER NOT NULL CHECK (fcst_time >= 0),
    grib_region_id    INTEGER REFERENCES grib_region,
    level             INTEGER NOT NULL,
    level_units       CHAR(8) NOT NULL,
    projection        CHAR(16) NOT NULL,
    bmp_usage         BOOLEAN NOT NULL,
    wx_usage          BOOLEAN NOT NULL,
    gds_usage         BOOLEAN NOT NULL,
    file_name         TEXT,
    parse_time        TIMESTAMP,
    gds_offset        INTEGER CHECK (gds_offset >= 0),
    pds_offset        INTEGER NOT NULL CHECK (pds_offset >= 0),
    drs_offset        INTEGER CHECK (drs_offset >= 0),
    ds_offset         INTEGER NOT NULL CHECK (ds_offset >= 0),
    bms_offset        INTEGER CHECK (bms_offset >= 0),
    PRIMARY KEY (discipline, generating_center, sub_center, grib_product_id,
                 grib_model_id, run_time, fcst_time, grib_region_id, level,
                 bmp_usage, gds_usage),
    FOREIGN KEY (file_name, parse_time) REFERENCES grib_file
);
which results in pg_dump reporting an unnamed delete trigger. I guess this
means that a delete on grib_file refers back to grib_catalog:
CREATE CONSTRAINT TRIGGER "<unnamed>" AFTER DELETE ON "grib_file" FROM
"grib_catalog" NOT DEFERRABLE INITIALLY IMMEDIATE FOR EACH ROW EXECUTE
PROCEDURE "RI_FKey_noaction_del" ('<unnamed>', 'grib_catalog',
'grib_file', 'UNSPECIFIED', 'file_name', 'name', 'parse_time',
'parse_time');
Will reformulate without the foreign key and see if this helps.
--
P. J. "Josh" Rovero Sonalysts, Inc.
Email: rovero@sonalysts.com www.sonalysts.com 215 Parkway North
Work: (860)326-3671 or 442-4355 Waterford CT 06385
***********************************************************************
Josh Rovero wrote:
Tom Lane wrote:
I did some experimentation here and found a rather surprising
dependency: the time to delete a bunch of data is pretty much
directly proportional to the disk space it occupies. This says
that we're paying through the nose for having XLOG make copies
of about-to-be-modified pages.
Can't we somehow WAL only metadata and not the actual pages for
DELETEs? As a delete is essentially (though currently not technically)
just metadata, it should be a possible thing to do.
------------------
Hannu
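For scale, here is a sketch in C of roughly what such a metadata-only
delete record could contain. The field names are invented for illustration
and this is not PostgreSQL's actual WAL record layout, though as noted
above the real delete records are already small -- the cost is the page
image logged alongside them.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical metadata-only delete record: just enough to re-mark the
 * tuple dead at redo time, with no page image attached. */
typedef struct DeleteWalRecord
{
    uint32_t xid;        /* deleting transaction id            */
    uint32_t rel_node;   /* which relation file                */
    uint32_t block_no;   /* heap page holding the victim tuple */
    uint16_t offset_no;  /* line pointer within that page      */
} DeleteWalRecord;

int main(void)
{
    /* 14 bytes of payload (16 with padding) vs. an 8192-byte page image. */
    printf("delete record: %zu bytes\n", sizeof(DeleteWalRecord));
    return 0;
}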
On Tue, 2001-11-20 at 10:11, Hannu Krosing wrote:
Tom Lane wrote:
I did some experimentation here and found a rather surprising
dependency: the time to delete a bunch of data is pretty much
directly proportional to the disk space it occupies. This says
that we're paying through the nose for having XLOG make copies
of about-to-be-modified pages.
Can't we somehow WAL only metadata and not the actual pages for
DELETEs? As a delete is essentially (though currently not technically)
just metadata, it should be a possible thing to do.
Is it possible to do ordered writes, the way ext3 does?
http://www-106.ibm.com/developerworks/linux/library/l-fs7/
is an interesting article discussing the approach.
Regards,
Andrew.
--
--------------------------------------------------------------------
Andrew @ Catalyst .Net.NZ Ltd, PO Box 11-053, Manners St, Wellington
WEB: http://catalyst.net.nz/ PHYS: Level 2, 150-154 Willis St
DDI: +64(4)916-7201 MOB: +64(21)635-694 OFFICE: +64(4)499-2267
Andrew McMillan wrote:
On Tue, 2001-11-20 at 10:11, Hannu Krosing wrote:
Tom Lane wrote:
I did some experimentation here and found a rather surprising
dependency: the time to delete a bunch of data is pretty much
directly proportional to the disk space it occupies. This says
that we're paying through the nose for having XLOG make copies
of about-to-be-modified pages.
Can't we somehow WAL only metadata and not the actual pages for
DELETEs? As a delete is essentially (though currently not technically)
just metadata, it should be a possible thing to do.
Is it possible to do ordered writes, the way ext3 does?
I remember it being discussed on this list that you have very little
control over write ordering if you operate above the filesystem/cache level.
I guess that is the article that sparked the idea of journalling only
metadata for deletes (including the delete half of update).
Using the Journaling Block Device described there could actually be
a good (though currently not portable) solution if you run Linux.
-------------
Hannu
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Could we add code to detect a partial write when we recover from one
using WAL so we can know if these partial writes are ever
happening?
What's your point? It clearly *can* happen during power-failure
scenarios. All the monitoring in the world won't disprove that.
What bothers me about this is that we have the original page with the
old data. It would be nice if we could write the new page in a
different location, make the new page active and recycle the old page at
some later time.
We are storing the pre-page image in WAL, but it seems like a waste
because we already have a pre-image. It is just that we are overwriting
it.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
What bothers me about this is that we have the original page with the
old data. It would be nice if we could write the new page in a
different location, make the new page active and recycle the old page at
some later time.
I don't see how that reduces the total amount of disk traffic?
It's also kind of unclear how to do it without doubling (or worse) the
amount of table space used in many common scenarios. I doubt many
people will be happy if "DELETE FROM foo" requires transient space equal
to twice the original size of foo.
regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
What bothers me about this is that we have the original page with the
old data. It would be nice if we could write the new page in a
different location, make the new page active and recycle the old page at
some later time.
I don't see how that reduces the total amount of disk traffic?
It's also kind of unclear how to do it without doubling (or worse) the
amount of table space used in many common scenarios. I doubt many
people will be happy if "DELETE FROM foo" requires transient space equal
to twice the original size of foo.
Well, right now we write the pre-image to WAL, then write the new page
over the old one. In my case, you just write the new, and somewhere
record that the old page is no longer active. Sounds a little like
VACUUM, but for pages.
With DELETE FROM foo, let's suppose you have 10 pages in the table. To
modify page 1, you write to page 11, then record in WAL that page 1 is
inactive. To write page 2, you write to page 1 and record page 2 as
inactive, etc. You basically are writing your new data one behind.
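For illustration, a minimal C sketch of the indirection being described,
assuming a hypothetical per-relation page map; none of this is actual
PostgreSQL code:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE  8192
#define N_LOGICAL  10   /* the ten-page table from the example        */
#define N_PHYSICAL 16   /* physical slots, with spares to write into  */

static uint8_t  storage[N_PHYSICAL][PAGE_SIZE];
static uint32_t page_map[N_LOGICAL];   /* logical page -> physical slot */
static uint32_t free_slots[N_PHYSICAL];
static int      n_free;

/* Modify a logical page by copying it to a spare physical slot, applying
 * the change there, and swinging the map entry.  WAL would then log only
 * (logical, old_slot, new_slot) instead of a page image. */
static void modify_page(int logical, void (*change)(uint8_t *))
{
    uint32_t old_slot = page_map[logical];
    uint32_t new_slot = free_slots[--n_free];

    memcpy(storage[new_slot], storage[old_slot], PAGE_SIZE);
    change(storage[new_slot]);
    page_map[logical] = new_slot;     /* must become durable atomically  */
    free_slots[n_free++] = old_slot;  /* only safe to reuse *after* that */
}

static void mark_tuples_dead(uint8_t *page) { page[0] = 1; /* toy edit */ }

int main(void)
{
    for (int i = 0; i < N_LOGICAL; i++)
        page_map[i] = (uint32_t) i;
    for (int i = N_PHYSICAL - 1; i >= N_LOGICAL; i--)
        free_slots[n_free++] = (uint32_t) i;

    modify_page(0, mark_tuples_dead);
    printf("logical page 0 now lives in physical slot %u\n", page_map[0]);
    return 0;
}

The two commented lines are where the scheme gets hard: making the map swing
durable without logging a page image, and knowing when the old slot may
safely be reused -- essentially the objections raised in the replies below.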
One problem I see is that you don't really know the pages are on disk so
I am not sure how to be safe when over-writing the inactive pages.
Of course, I am just throwing out ideas, looking for a solution. Help!
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I don't see how that reduces the total amount of disk traffic?
Well, right now we write the pre-image to WAL, then write the new page
over the old one. In my case, you just write the new, and somewhere
record that the old page is no longer active.
The devil is in the details of that last little bit. How is "mark a
page inactive" cheaper than "mark a tuple dead"? More specifically,
how do you propose to avoid WAL-logging the page you are going to do
this marking in? Seems you still end up with a WAL page image for
something.
regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I don't see how that reduces the total amount of disk traffic?
Well, right now we write the pre-image to WAL, then write the new page
over the old one. In my case, you just write the new, and somewhere
record that the old page is no longer active.
The devil is in the details of that last little bit. How is "mark a
page inactive" cheaper than "mark a tuple dead"? More specifically,
how do you propose to avoid WAL-logging the page you are going to do
this marking in? Seems you still end up with a WAL page image for
something.
I was thinking of just throwing the inactive page number into WAL. Much
smaller than the entire page image. You don't touch the page. Does
that help?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I was thinking of just throwing the inactive page number into WAL. Much
smaller than the entire page image. You don't touch the page. Does
that help?
I don't think so. Somehow you have to tell the other backends that that
page is dead; merely recording it in WAL doesn't do that.
More to the point, you can't recycle (overwrite) that page until you've
checkpointed or WAL-logged the replacement page; so you still end up
with disk I/O for the replacement.
regards, tom lane
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I don't see how that reduces the total amount of disk traffic?
Well, right now we write the pre-image to WAL, then write the new page
over the old one. In my case, you just write the new, and somewhere
record that the old page is no longer active.
The devil is in the details of that last little bit. How is "mark a
page inactive" cheaper than "mark a tuple dead"? More specifically,
how do you propose to avoid WAL-logging the page you are going to do
this marking in? Seems you still end up with a WAL page image for
something.
Assuming that we WAL with the granularity of a disk sector (512b), I think
that restructuring the database heap page (8kb) would be a big win for
delete/update.
The idea is to move the metadata (oid,tableoid,xmin,cmin,xmax,cmax,ctid)
to the beginning of the heap page, into the same space as the tuple pointers.
It's easy (<grin>) as all of it is fixed length.
Then a change in metadata, like setting xmax for a deleted/updated tuple,
will dirty only the first sectors of the page and not all of them.
The new structure of ItemId would be roughly (n is the pointer width,
left symbolic as in the original sketch):

    typedef struct ItemIdData
    {
        unsigned        lp_off   : n;       /* item pointer */
        unsigned        lp_flags : 32 - n;  /* item flags   */
        Oid             oid;
        Oid             tableoid;
        TransactionId   xmin;
        CommandId       cmin;
        TransactionId   xmax;
        CommandId       cmax;
        ItemPointerData ctid;
    } ItemIdData;
Assuming that we keep account of dirty pages and WAL with the granularity
of a database page, we may get a big win just by moving to a smaller
granularity.
The win from finer granularity was not very big before WAL, as the
database pages are contiguous on disk, but it will be significant now that
we have to log all dirty pages.
------------------
Hannu
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
What bothers me about this is that we have the original page with the
old data. It would be nice if we could write the new page in a
different location, make the new page active and recycle the old page at
some later time.I don't see how that reduces the total amount of disk traffic?
It's also kind of unclear how to do it without doubling (or worse) the
amount of table space used in many common scenarios. I doubt many
people will be happy if "DELETE FROM foo" requires transient space equal
to twice the original size of foo.
IIRC the double space requirement is what has kept us from implementing
DROP COLUMN.
-----------
Hannu
With DELETE FROM foo, let's suppose you have 10 pages in the table. To
modify page 1, you write to page 11.
But what about the indexes? They would all need to be modified
accordingly.
If you did something like chaining, then before long all tuples would be
chained, even those that were not touched.
If you really want to avoid the page writes to WAL, imho the best way
would be to revive the original PG page design, where the physical
positions of slots in a heap page were only changed by vacuum.
Then, a heap page that was only partly written would only be a problem
iff the hardware wrote wrong data, not if it only skipped part of the
write. Reasonable hardware does detect such corrupted pages.
E.g. on AIX, if you reduce the PG pagesize to 4k, an only partly written
page that stays undetected can be ruled out.
Then you would only need to write index pages to WAL, but not heap
pages.
Maybe a better idea would be to only conditionally write pages to WAL if
slot positions changed. In the "delete" example, heap slot positions
certainly do not need to change.
To be extra safe it would probably be necessary to not split tuple
headers (at least the xact info) across physical pages. Then it would
also be safe to use a pg pagesize that is a multiple of the physical
page size.
or so ? ...
Andreas
It's also kind of unclear how to do it without doubling (or worse) the
amount of table space used in many common scenarios. I doubt many
people will be happy if "DELETE FROM foo" requires transient space equal
to twice the original size of foo.
IIRC the double space requirement is what has kept us from implementing
DROP COLUMN.
The correct solution then, according (methinks) to my old Human Computer
Interaction lecturer, is to implement the feature anyway, and warn the DBA
what the consequences are. That way, the DBA can do it if she wants, unlike
the current situation where it's next to impossible (with lots of
referencing foreign keys).
Chris
It's also kind of unclear how to do it without doubling (or worse) the
amount of table space used in many common scenarios. I doubt many
people will be happy if "DELETE FROM foo" requires transient space equal
to twice the original size of foo.
IIRC the double space requirement is what has kept us from implementing
DROP COLUMN.
The correct solution then, according (methinks) to my old Human Computer
Interaction lecturer, is to implement the feature anyway, and warn the DBA
what the consequences are. That way, the DBA can do it if she wants, unlike
the current situation where it's next to impossible (with lots of
referencing foreign keys).
Yes, I personally am going to try this for 7.3, as well as fix CLUSTER.
I think someone has already started on CLUSTER anyway.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Is it because we take a snapshot of the page before we write it in case
we only write part of the page?
AFAIR, the partial-page-write problem is the entire reason for doing it.
If we could be certain that writes to datafile pages were atomic, we'd
not need this. Of course we can't be certain of that. But I'm wondering
if there isn't a cheaper solution.
I have added these TODO items to summarize this discussion:
* Reduce number of pre-page WAL writes; they exist only to guard against
  partial page writes
* Turn off pre-page writes if fsync is disabled (?)
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
If "pre-page WAL write" means the value of the page before the current
changes, then there is generally another reason for writing it out.
Imagine this sequence of events:
1. transaction A starts
2. transaction B starts
3. tran A makes a change
4. tran B makes a change
5. tran A commits
6. all changes get written to disk (this can happen even without fsync,
for example tran C might do a full table scan which fills the buffer cache
before B commits)
7. the system crashes
When the system comes back up, we need to do a rollback on
transaction B since it did not commit and we need the "pre-page"
to know how to undo the change for B that got saved in step 6 above.
At least this is what happens in most DBMSs...
Brian Beuning
Bruce Momjian wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Is it because we take a snapshot of the page before we write it in case
we only write part of the page?
AFAIR, the partial-page-write problem is the entire reason for doing it.
If we could be certain that writes to datafile pages were atomic, we'd
not need this. Of course we can't be certain of that. But I'm wondering
if there isn't a cheaper solution.
I have added these TODO items to summarize this discussion:
* Reduce number of pre-page WAL writes; they exist only to guard against
  partial page writes
* Turn off pre-page writes if fsync is disabled (?)
If "pre-page WAL write" means the value of the page before the current
changes, then there is generally another reason for writing it out.Imagine this sequence of events:
1. transaction A starts
2. transaction B starts
3. tran A makes a change
4. tran B makes a change
5. tran A commits
6. all changes get written to disk (this can happen even without fsync,
for example tran C might do a full table scan which fills the buffer cache
before B commits)
7. the system crashesWhen the system comes back up, we need to do a rollback on
transaction B since it did not commit and we need the "pre-page"
to know how to undo the change for B that got saved in step 6 above.At least this is what happens in most DBMSs...
Because we have a non-overwriting storage manager, I don't think this
issue applies to us.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026