[WIP] Performance Improvement by reducing WAL for Update Operation

Started by Amit Kapila over 13 years ago, 25 messages, hackers
#1 Amit Kapila
amit.kapila16@gmail.com

Problem statement:

-----------------------------------

Reducing the WAL size of update operations to improve performance.

Advantages:

---------------------
1. Observed increase in performance with pgbench when the server is running with synchronous_commit = off:
a. with pgbench (tpc_b) - 13%
b. with modified pgbench (such that the modified columns are smaller than the whole row) - 83%

2. WAL size is reduced

Design/Implementation:

------------------------------

Currently the change is done only for fixed-length columns of simple tables, and the tuple must not contain NULLs.

This is a proof of concept; the design and implementation need to be changed based on the final design required for handling other scenarios.

Update operation:
-----------------------------
1. Check whether the table is a simple table (no TOAST, no BEFORE UPDATE triggers).
2. Works only for tuples that contain no NULLs.
3. Identify the modified columns from the target entry.
4. Based on the modified column list, check whether any variable-length columns are modified; if so, this optimization is not applied.
5. Identify the offset and length of each modified column and store them as an optimized WAL tuple in the following format.
Note: The WAL update header is modified to denote whether the WAL update optimization was applied.
WAL update header + tuple header (no change from previous format) +
[offset (2 bytes)] [length (2 bytes)] [changed data value]
[offset (2 bytes)] [length (2 bytes)] [changed data value]
....
....
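
As an illustration of the record format in step 5, here is a minimal Python sketch that packs one [offset][length][value] record per changed column. It is a model of the layout only, not the patch's C code: the function name `encode_changes`, the little-endian byte order, and the idea that offsets are relative to the start of the tuple data area are all assumptions made for the example.

```python
import struct

# Assumed layout of one record header: 2-byte offset, 2-byte length.
# Little-endian is an assumption; the thread does not specify byte order.
RECORD_HEADER = struct.Struct("<HH")


def encode_changes(new_data, changed):
    """Pack [offset][length][value] records for the changed columns.

    new_data -- bytes of the new tuple's data area
    changed  -- list of (offset, length) pairs, sorted by offset and
                non-overlapping, one pair per modified fixed-length column
    """
    out = bytearray()
    for offset, length in changed:
        out += RECORD_HEADER.pack(offset, length)   # 4 bytes of overhead
        out += new_data[offset:offset + length]     # the changed value itself
    return bytes(out)
```

Each record costs 4 bytes of overhead on top of the changed value, so the format only pays off when the changed columns are small relative to the whole tuple.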

Recovery:

----------------
The following steps apply only when the tuple is optimized.

6. Forming the new tuple requires the old tuple (even if the old tuple itself needs no modification).
7. Form the new tuple based on the format specified in step 5.
8. Once the new tuple is formed, follow the existing behavior.

Frame the new tuple from old tuple and WAL record:

1. The length of the data that needs to be copied from the old tuple is calculated as
the difference between the offset present in the WAL record and the old tuple offset.
(The first time, the old tuple offset is zero.)
2. Once the old tuple data is copied, increase the old tuple offset by the
copied length.
3. Get the length and value of the modified column from the WAL record and copy it into the new tuple.
4. Increase the old tuple offset by the modified column length.
5. Repeat this procedure until the end of the WAL record is reached.
6. Copy any remaining old tuple data.
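
The six steps above can be sketched in Python as follows. This is a model of the replay loop under assumptions, not the patch itself: `apply_changes` is a hypothetical name, the records are assumed to be sorted by offset, and the 2-byte fields are assumed to be little-endian.

```python
import struct

# Assumed record header: 2-byte offset, 2-byte length (little-endian assumed).
RECORD_HEADER = struct.Struct("<HH")


def apply_changes(old_data, wal_payload):
    """Rebuild the new tuple data from the old tuple and the WAL records."""
    new_data = bytearray()
    old_off = 0                      # step 1: old tuple offset starts at zero
    pos = 0
    while pos < len(wal_payload):    # step 5: loop until the record ends
        offset, length = RECORD_HEADER.unpack_from(wal_payload, pos)
        pos += RECORD_HEADER.size
        # steps 1-2: copy the unchanged bytes that precede this column
        new_data += old_data[old_off:offset]
        old_off = offset
        # step 3: copy the changed value from the WAL record
        new_data += wal_payload[pos:pos + length]
        pos += length
        # step 4: skip the bytes that were replaced in the old tuple
        old_off += length
    new_data += old_data[old_off:]   # step 6: copy whatever is left
    return bytes(new_data)
```

Because only fixed-length columns are handled, the new tuple always has the same length as the old one; that property is what breaks once variable-length columns or NULLs are involved.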

Test results:

----------------------
1. The pgbench test was run for 10 minutes.

2. The pgbench result for tpc-b is attached to this mail as pgbench_org.

3. The modified pgbench (such that the modified columns are smaller than the whole row) result for tpc-b is attached to this mail as pgbench_1800_300.

Modified pgbench code:

---------------------------------------
1. The schema of the tables is modified by adding some extra fields to increase the record size to 1800.
2. The tpc_b benchmark suite is changed to do only update operations.
3. The update operation is changed to update 3 columns with 300 bytes out of the total size of 1800 bytes.
4. NULL value insertions are removed from table initialization.
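
For a rough sense of scale, the following back-of-the-envelope calculation compares the tuple data logged with and without the optimization for this modified benchmark. It assumes the 300 bytes are the combined size of the 3 updated columns and it ignores the WAL record header and tuple header, which are logged either way, so it is not a measurement of actual WAL volume.

```python
# Figures taken from the benchmark description above.
TUPLE_DATA_BYTES = 1800   # total row size
CHANGED_COLUMNS = 3       # columns touched by each update
CHANGED_BYTES = 300       # assumed: combined size of the 3 changed columns

# Per-record overhead of the proposed format: 2-byte offset + 2-byte length.
RECORD_OVERHEAD = 2 + 2

delta_bytes = CHANGED_COLUMNS * RECORD_OVERHEAD + CHANGED_BYTES
saving = 1 - delta_bytes / TUPLE_DATA_BYTES

print(f"tuple data logged: {delta_bytes} bytes instead of {TUPLE_DATA_BYTES}")
print(f"reduction in logged tuple data: {saving:.1%}")
```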

I am working on a solution to handle other scenarios such as variable-length columns, tuples containing NULLs, and BEFORE triggers.

Please provide suggestions/objections.

With Regards,
Amit Kapila.

Attachments:

wal_update_changes.patch (text/plain; +415 -49)
pgbench_modified.c (text/plain)
pgbench_org.htm (text/html)
pgbench_1800_300.htm (text/html)
#2 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Amit Kapila (#1)
Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On 03.08.2012 14:46, Amit kapila wrote:

Currently the change is done only for fixed-length columns of simple tables, and the tuple must not contain NULLs.

This is a proof of concept; the design and implementation need to be changed based on the final design required for handling other scenarios.

Update operation:
-----------------------------
1. Check whether the table is a simple table (no TOAST, no BEFORE UPDATE triggers).
2. Works only for tuples that contain no NULLs.
3. Identify the modified columns from the target entry.
4. Based on the modified column list, check whether any variable-length columns are modified; if so, this optimization is not applied.
5. Identify the offset and length of each modified column and store them as an optimized WAL tuple in the following format.
Note: The WAL update header is modified to denote whether the WAL update optimization was applied.
WAL update header + tuple header (no change from previous format) +
[offset (2 bytes)] [length (2 bytes)] [changed data value]
[offset (2 bytes)] [length (2 bytes)] [changed data value]
....
....

The performance will need to be re-verified after you fix these
limitations. Those limitations need to be fixed before this can be applied.

It would be nice to use some well-known binary delta algorithm for this,
rather than invent our own. OTOH, we have more knowledge of the
attribute boundaries, so a custom algorithm might work better. In any
case, I'd like to see the code to do the delta encoding/decoding to be
put into separate functions, outside of heapam.c. It would be good for
readability, and we might want to reuse this in other places too.
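
To make the point about attribute boundaries concrete, here is a small Python sketch of an encode/decode pair kept as standalone functions. It assumes fixed-length attributes whose lengths are known from the tuple descriptor, and it ignores alignment padding and NULL bitmaps; the names `diff_by_attribute` and `apply_attribute_diff` are invented for the example and do not come from the patch.

```python
def diff_by_attribute(old, new, attr_lengths):
    """Return (offset, length, value) for every attribute whose bytes differ.

    attr_lengths -- byte length of each fixed-length attribute, in order.
    """
    if not (len(old) == len(new) == sum(attr_lengths)):
        raise ValueError("tuples must match the attribute layout")
    changes = []
    offset = 0
    for length in attr_lengths:
        # Compare whole attributes, so a change never splits a column.
        if old[offset:offset + length] != new[offset:offset + length]:
            changes.append((offset, length, new[offset:offset + length]))
        offset += length
    return changes


def apply_attribute_diff(old, changes):
    """Rebuild the new tuple data from the old data plus the changes."""
    buf = bytearray(old)
    for offset, length, value in changes:
        buf[offset:offset + length] = value
    return bytes(buf)
```

A generic binary delta algorithm would not need `attr_lengths` at all, which is the trade-off being weighed here: less knowledge required, but possibly a less compact result.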

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#3 Amit Kapila
amit.kapila16@gmail.com
In reply to: Heikki Linnakangas (#2)
Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From: Heikki Linnakangas [mailto:heikki.linnakangas@enterprisedb.com]
Sent: Saturday, August 04, 2012 1:33 AM
On 03.08.2012 14:46, Amit kapila wrote:

Currently the change is done only for fixed-length columns of simple tables, and the tuple must not contain NULLs.

This is a proof of concept; the design and implementation need to be changed based on the final design required for handling other scenarios.

Update operation:
-----------------------------
1. Check whether the table is a simple table (no TOAST, no BEFORE UPDATE triggers).
2. Works only for tuples that contain no NULLs.
3. Identify the modified columns from the target entry.
4. Based on the modified column list, check whether any variable-length columns are modified; if so, this optimization is not applied.
5. Identify the offset and length of each modified column and store them as an optimized WAL tuple in the following format.
Note: The WAL update header is modified to denote whether the WAL update optimization was applied.
WAL update header + tuple header (no change from previous format) +
[offset (2 bytes)] [length (2 bytes)] [changed data value]
[offset (2 bytes)] [length (2 bytes)] [changed data value]
....
....

The performance will need to be re-verified after you fix these
limitations. Those limitations need to be fixed before this can be applied.

Yes, I agree that the solution should fix these limitations and that the performance
numbers need to be re-verified.
Currently in my mind the work to be done is as follows:

1. A solution that can handle variable-length columns and NULLs.
2. Handling of BEFORE triggers.
3. Can the solution for fixed-length columns be the same as for variable-length
columns and NULLs?
4. Make the final code patch which addresses all of the above.

Please suggest if there are more things that need to be handled.

On the 3rd point: currently the solution for fixed-length columns cannot
handle the case of variable-length columns and NULLs. The reason is that
fixed-length columns need no diff mechanism between the old and new
tuple, whereas the other cases do.
For fixed-length columns, if we just note the OFFSET, LENGTH, VALUE of the
changed columns of the new tuple in WAL, that is sufficient to replay the
WAL. However, to handle the other cases we need to use a diff mechanism.

Can we do something like this: if the changed columns are fixed length and don't
contain NULLs, store the [OFFSET, LENGTH, VALUE] format in WAL, and for
other cases store the diff format?

This has the advantage that updates containing only fixed-length columns
don't have to pay the penalty of doing a diff between the new and old tuple. Also, we
can do the whole work in 2 parts, one for fixed-length columns and a second to
handle other cases.
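
To illustrate why variable-length columns need a real diff rather than in-place [OFFSET, LENGTH, VALUE] patches, here is a Python sketch of one very simple diff mechanism: keep the common prefix and suffix of the old and new tuple and log only the middle. This is a stand-in chosen for brevity, not the mechanism proposed in this mail, and the function names are invented for the example.

```python
def encode_prefix_suffix(old, new):
    """Return (prefix_len, suffix_len, middle) describing new relative to old."""
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    # The suffix must not overlap the prefix in either tuple.
    max_suffix = limit - prefix
    suffix = 0
    while suffix < max_suffix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    middle = new[prefix:len(new) - suffix]
    return prefix, suffix, middle


def decode_prefix_suffix(old, prefix, suffix, middle):
    """Rebuild the new tuple; works even when old and new differ in length."""
    return old[:prefix] + middle + old[len(old) - suffix:]
```

Unlike the fixed-length format, this tolerates a column growing or shrinking, because the bytes after the change are addressed from the end of the tuple rather than by absolute offset.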

It would be nice to use some well-known binary delta algorithm for this,
rather than invent our own. OTOH, we have more knowledge of the
attribute boundaries, so a custom algorithm might work better.

I shall work on this and post after initial work.

In any case, I'd like to see the code to do the delta encoding/decoding to be
put into separate functions, outside of heapam.c. It would be good for
readability, and we might want to reuse this in other places too.

Agreed. I shall take care of doing it in suggested way.

With Regards,
Amit Kapila.

#4 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#2)
Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On 04.08.2012 11:01, Amit Kapila wrote:

Missed one point which needs to be handled is pg_upgrade

I don't think there's anything to do for pg_upgrade. This doesn't change
the on-disk data format, just the WAL format, and pg_upgrade isn't
sensitive to WAL format changes.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#5 Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#4)
Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On Sat, Aug 4, 2012 at 05:21:06PM +0300, Heikki Linnakangas wrote:

On 04.08.2012 11:01, Amit Kapila wrote:

Missed one point which needs to be handled is pg_upgrade

I don't think there's anything to do for pg_upgrade. This doesn't
change the on-disk data format, just the WAL format, and pg_upgrade
isn't sensitive to WAL format changes.

Correct.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#6 Amit Kapila
amit.kapila16@gmail.com
In reply to: Bruce Momjian (#5)
Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From: Bruce Momjian [mailto:bruce@momjian.us]
Sent: Saturday, August 04, 2012 8:06 PM
On Sat, Aug 4, 2012 at 05:21:06PM +0300, Heikki Linnakangas wrote:
On 04.08.2012 11:01, Amit Kapila wrote:

Missed one point which needs to be handled is pg_upgrade

I don't think there's anything to do for pg_upgrade. This doesn't
change the on-disk data format, just the WAL format, and pg_upgrade
isn't sensitive to WAL format changes.

Correct.

Thanks Bruce and Heikki for this information.

I need your feedback on the below design point, as it will make my further
work on this performance issue more clear.
Also let me know if the explanation below is not clear, I shall try to use
some examples to explain my point.

Currently the solution for fixed-length columns cannot handle the case of
variable-length columns and NULLs. The reason is that fixed-length columns
need no diff mechanism between the old and new tuple, whereas the
other cases do.
For fixed-length columns, if we just note the OFFSET, LENGTH, VALUE of the
changed columns of the new tuple in WAL, that is sufficient to replay the
WAL. However, to handle the other cases we need to use a diff mechanism.

Can we do something like this: if the changed columns are fixed length and don't
contain NULLs, store the [OFFSET, LENGTH, VALUE] format in WAL, and for
other cases store the diff format?

This has the advantage that updates containing only fixed-length columns
don't have to pay the penalty of doing a diff between the new and old tuple. Also, we
can do the whole work in 2 parts, one for fixed-length columns and a second to
handle other cases.

With Regards,
Amit Kapila.

#7 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#4)
Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On 06.08.2012 06:10, Amit Kapila wrote:

Currently the solution for fixed-length columns cannot handle the case of
variable-length columns and NULLs. The reason is that fixed-length columns
need no diff mechanism between the old and new tuple, whereas the
other cases do.
For fixed-length columns, if we just note the OFFSET, LENGTH, VALUE of the
changed columns of the new tuple in WAL, that is sufficient to replay the
WAL. However, to handle the other cases we need to use a diff mechanism.

Can we do something like this: if the changed columns are fixed length and don't
contain NULLs, store the [OFFSET, LENGTH, VALUE] format in WAL, and for
other cases store the diff format?

This has the advantage that updates containing only fixed-length columns
don't have to pay the penalty of doing a diff between the new and old tuple. Also, we
can do the whole work in 2 parts, one for fixed-length columns and a second to
handle other cases.

Let's keep it simple and use the same diff format for all tuples, at
least for now. If it turns out that you can indeed get even more gain
for fixed length tuples by something like that, then let's do that later
as a separate patch.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#8 Amit Kapila
amit.kapila16@gmail.com
In reply to: Heikki Linnakangas (#7)
Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From: Heikki Linnakangas [mailto:heikki.linnakangas@enterprisedb.com]
Sent: Monday, August 06, 2012 2:32 PM
To: Amit Kapila
Cc: 'Bruce Momjian'; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [WIP] Performance Improvement by reducing WAL for
Update Operation
On 06.08.2012 06:10, Amit Kapila wrote:

Currently the solution for fixed-length columns cannot handle the case of
variable-length columns and NULLs. The reason is that fixed-length columns
need no diff mechanism between the old and new tuple, whereas the
other cases do.
For fixed-length columns, if we just note the OFFSET, LENGTH, VALUE of the
changed columns of the new tuple in WAL, that is sufficient to replay the
WAL. However, to handle the other cases we need to use a diff mechanism.

Can we do something like this: if the changed columns are fixed length and don't
contain NULLs, store the [OFFSET, LENGTH, VALUE] format in WAL, and for
other cases store the diff format?

This has the advantage that updates containing only fixed-length columns
don't have to pay the penalty of doing a diff between the new and old tuple. Also, we
can do the whole work in 2 parts, one for fixed-length columns and a second to
handle other cases.

Let's keep it simple and use the same diff format for all tuples, at
least for now. If it turns out that you can indeed get even more gain
for fixed length tuples by something like that, then let's do that later
as a separate patch.

Okay, I shall first try to design and implement the same format for all
tuples and discuss the results with the community.

With Regards,
Amit Kapila.

#9 Simon Riggs
simon@2ndQuadrant.com
In reply to: Amit Kapila (#1)
Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On 3 August 2012 12:46, Amit kapila <amit.kapila@huawei.com> wrote:

Frame the new tuple from old tuple and WAL record:

Sounds good.

I'd suggest we do this only when the saving is large enough for
benefit, rather than do this every time.

You don't mention whether or not the old and the new tuple are on the
same data block.

Personally, I think it will be important to ensure the above,
otherwise recovery will require much additional code for that case.
And that code will be prone to race conditions and performance issues.

Please also bear in mind that Andres will be looking to include the PK
columns in every WAL record for BDR. That could be an option, but I
doubt there is much value in excluding PK columns. I think I'd want
them to be there for debugging purposes so we can prove this code is
correct in production, since otherwise this could be a source of data
loss bugs.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#10 Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#9)
Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From: Simon Riggs [mailto:simon@2ndQuadrant.com]
Sent: Thursday, August 09, 2012 12:36 PM
On 3 August 2012 12:46, Amit kapila <amit.kapila@huawei.com> wrote:

Frame the new tuple from old tuple and WAL record:

Sounds good.

Thanks.

I'd suggest we do this only when the saving is large enough for
benefit, rather than do this every time.

Do you mean applying this only when the length of the updated values of the tuple is less
than some threshold (1/3, 2/3, etc.) of the total length?

You don't mention whether or not the old and the new tuple are on the
same data block.

WAL reduction is done even when the old and new tuples are on different
data blocks.

Personally, I think it will be important to ensure the above,
otherwise recovery will require much additional code for that case.

Recovery already handles the case where the old and new tuples are on
different pages, in that it has to read the old page to get the old tuple.

The modifications need to handle the following cases:

a. When there is a backup block and the old and new tuples are on different pages.
Currently it doesn't read the old page;
however, the new implementation needs to read the old page in this case
as well.

b. When the changes are already applied on the page [line: if (XLByteLE(lsn,
PageGetLSN(page))); function: heap_xlog_update].
Currently it doesn't read the old page;
however, the new implementation needs to read the old page in this case
as well.

And that code will be prone to race conditions and performance issues.

Do you mean performance issues because we may now need to read the old page in
some cases where earlier it was not read?
If yes, then as I mentioned above, I think the above 2
cases are not very common.
Moreover, the benefit to update operations on a running server is good enough,
as it reduces the WAL volume.
If you mean something other than the above, please let me know.

Please also bear in mind that Andres will be looking to include the PK
columns in every WAL record for BDR. That could be an option, but I
doubt there is much value in excluding PK columns.

Agreed. However, once the implementation by Andres is done I can merge both
codes and take the performance data again, based on which we can make a decision.

With Regards,
Amit Kapila.

#11 Simon Riggs
simon@2ndQuadrant.com
In reply to: Amit Kapila (#1)
Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On 9 August 2012 09:49, Amit Kapila <amit.kapila@huawei.com> wrote:

I'd suggest we do this only when the saving is large enough for
benefit, rather than do this every time.

Do you mean to say that when length of updated values of tuple is less
than some threshold(1/3 or 2/3, etc..) value of
total length?

Some heuristic, yes, similar to TOAST's minimum threshold. To attempt
removal of rows in all cases would not be worth it, so we need a fast-path
way of saying let's just take all of the columns.
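
A heuristic of this kind could look like the Python sketch below. The cut-off values are placeholders invented for the example, not proposals from this thread, and `should_use_delta` is a hypothetical name; the 4-byte per-record overhead matches the 2-byte offset plus 2-byte length format described earlier.

```python
# Per-record overhead of the proposed format: 2-byte offset + 2-byte length.
RECORD_OVERHEAD = 4


def should_use_delta(tuple_len, changed_lengths,
                     max_fraction=0.5, min_tuple_len=64):
    """Decide whether delta-encoding an update is worth the effort.

    max_fraction and min_tuple_len are hypothetical tuning knobs: small
    tuples take the fast path and are always logged in full, and larger
    tuples are delta-encoded only if the delta is small enough.
    """
    if tuple_len < min_tuple_len:
        return False                      # fast path: log the whole tuple
    delta_len = sum(length + RECORD_OVERHEAD for length in changed_lengths)
    return delta_len <= tuple_len * max_fraction
```

With these placeholder values, an 1800-byte tuple with three 100-byte changes qualifies (312 bytes against a 900-byte limit), while one with a single 1000-byte change does not.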

You don't mention whether or not the old and the new tuple are on the
same data block.

WAL reduction is done for the case even when old and new are on different
data blocks as well.

That makes me feel nervous. I doubt the marginal gain is worth it.
Most updates don't cross blocks.

Please also bear in mind that Andres will be looking to include the PK
columns in every WAL record for BDR. That could be an option, but I
doubt there is much value in excluding PK columns.

Agreed. However once the implementation by Andres is done I can merge both
codes and
take the performance data again, based on which we can take decision.

It won't happen like that because there won't be a single point where
Andres is done. If you agree, then it's worth doing it that way to
begin with, rather than requiring us to revisit the same section of
code twice.

One huge point that needs to be thought through is how we prove this
code actually works on WAL/recovery side. A normal regression test
won't prove that and we don't have a framework in place for that.

If you think about what you'll need to do to prove you haven't made
some fatal corruption of WAL, it's going to look a lot like logical
replication tests. Worst case here is that mistakes on this patch will
show up as Andres' mistakes. So there is a stronger connection to
Andres' work than it first appears.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#12 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#11)
Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On 09.08.2012 12:18, Simon Riggs wrote:

On 9 August 2012 09:49, Amit Kapila <amit.kapila@huawei.com> wrote:

WAL reduction is done for the case even when old and new are on different
data blocks as well.

That makes me feel nervous. I doubt the marginal gain is worth it.
Most updates don't cross blocks.

That was my first instinctive reaction too. But if the mechanism works
just as well for cross-page updates, seems a bit strange to not use it.

One argument would be that if for some reason the old block is corrupt
or lost, you would not be able to recover the new version of the tuple
from the WAL alone. At the moment, it's nice that the WAL record
contains all the information required to reconstruct the new tuple,
regardless of the old data block contents. But then again, full-page
writes cover that too. There will be a full-page image of the old block
in the WAL anyway.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#13 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#12)
Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On 9 August 2012 11:30, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 09.08.2012 12:18, Simon Riggs wrote:

On 9 August 2012 09:49, Amit Kapila <amit.kapila@huawei.com> wrote:

WAL reduction is done for the case even when old and new are on different
data blocks as well.

That makes me feel nervous. I doubt the marginal gain is worth it.
Most updates don't cross blocks.

That was my first instinctive reaction too. But if the mechanism works just
as well for cross-page updates, seems a bit strange to not use it.

One argument would be that if for some reason the old block is corrupt or
lost, you would not be able to recover the new version of the tuple from the
WAL alone. At the moment, it's nice that the WAL record contains all the
information required to reconstruct the new tuple, regardless of the old
data block contents.

Exactly. If we lose the first block in a checkpoint, we could lose all
updates to rows in that page and all other pages linked to it over a
whole checkpoint duration. Basically, page corruption will propagate
from block to block if we do this.

Given the marginal gain because of a low percentage of cross-block
updates, I'm not keen. Low percentage because HOT tries hard to keep
things on same block - even for non-HOT updates (which is the case,
even though it sounds weird).

But then again, full-page writes cover that too. There
will be a full-page image of the old block in the WAL anyway.

Right, but we're planning to remove that, so it's not a safe assumption
to use when building new code.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#14 Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#11)
Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Simon Riggs
Sent: Thursday, August 09, 2012 2:49 PM
On 9 August 2012 09:49, Amit Kapila <amit.kapila@huawei.com> wrote:

I'd suggest we do this only when the saving is large enough for
benefit, rather than do this every time.

Do you mean to say that when length of updated values of tuple is less
than some threshold(1/3 or 2/3, etc..) value of
total length?

Some heuristic, yes, similar to TOAST's minimum threshold. To attempt
removal of rows in all cases would not be worth it, so we need a fast
path way of saying lets just take all of the columns.

Yes, it has to be done. Currently I have 2 ideas to take care of this:
a. Based on the number of updated columns
b. Based on the length of the updated values
If you have any other idea, or favor one of the above, let me
know your opinion.

You don't mention whether or not the old and the new tuple are on the
same data block.

WAL reduction is done for the case even when old and new are on different
data blocks as well.

That makes me feel nervous. I doubt the marginal gain is worth it.
Most updates don't cross blocks.

How can it be proved whether the gain from handling this case is marginal or
substantial?

One way is to test after modification.
I have modified the pgbench tpc_b case:
1. The schema is such that it contains rows of length 1800.
2. tpc_b only has updates.
3. The length of the updated column values is 300.
4. All tables have 100% fill factor.
5. Vacuum is OFF.

In such a run I expect many of the updates to cross blocks, but I am not
sure and have not verified it in any way.
The above run has given a good performance improvement.

Please also bear in mind that Andres will be looking to include the PK
columns in every WAL record for BDR. That could be an option, but I
doubt there is much value in excluding PK columns.

Agreed. However once the implementation by Andres is done I can merge both
codes and take the performance data again, based on which we can take decision.

It won't happen like that because there won't be a single point where
Andres is done. If you agree, then its worth doing it that way to
begin with, rather than requiring us to revisit the same section of
code twice.

This optimization is meant to reduce the amount of WAL, and adding
anything extra will definitely have some impact.
However, if there is no better way than including the PK in WAL, then I
don't have any problem.

One huge point that needs to be thought through is how we prove this
code actually works on WAL/recovery side. A normal regression test
won't prove that and we don't have a framework in place for that.

My initial idea to validate recovery:
1. Manual test:
a. Generate enough scenarios for the update operation.
b. For each scenario, make sure replay happens properly.
2. Community review.

With Regards,
Amit Kapila.

#15 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#13)
Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On 09.08.2012 14:11, Simon Riggs wrote:

Given the marginal gain because of a low percentage of cross-block
updates, I'm not keen. Low percentage because HOT tries hard to keep
things on same block - even for non-HOT updates (which is the case,
even though it sounds weird).

That depends entirely on the workload. If you do a bulk update that
updates every row on the table, most are going to be cross-block
updates, and the WAL size does matter.

But then again, full-page writes cover that too. There
will be a full-page image of the old block in the WAL anyway.

Right, but we're planning to remove that, so it's not a safe assumption
to use when building new code.

I don't think we're going to get rid of full-page images any time soon.
I guess you could easily check if full-page writes are enabled, though,
and only do it for cross-page updates if it is.
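
As a sketch of that check, the decision could be modelled as below. `may_delta_encode` is a hypothetical helper, and the two booleans stand in for whatever the real code would consult (for example the full_page_writes setting and the block numbers of the old and new tuples).

```python
def may_delta_encode(same_block, full_page_writes):
    """Allow the delta format for same-block updates unconditionally, and
    for cross-block updates only while full-page writes are enabled."""
    return same_block or full_page_writes
```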

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#16 Simon Riggs
simon@2ndQuadrant.com
In reply to: Amit Kapila (#1)
Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On 9 August 2012 12:17, Amit Kapila <amit.kapila@huawei.com> wrote:

This optimization is to reduce the amount of WAL and definitely adding
anything extra will have some impact.

Of course. The question is "How much impact?". Each tweak has
progressively less and less gain. This isn't a binary choice.

Squeezing the last ounce of performance at the expense of all other
concerns is not a sensible goal, IMHO, nor do we attempt that
elsewhere.

Given we're making no attempt to remove full page writes, which is
clearly the biggest source of WAL volume currently, micro optimisation
of other factors seems unwarranted at this stage.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#17 Amit Kapila
amit.kapila16@gmail.com
In reply to: Heikki Linnakangas (#15)
Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From: Heikki Linnakangas [mailto:heikki.linnakangas@enterprisedb.com]
Sent: Thursday, August 09, 2012 4:59 PM
On 09.08.2012 14:11, Simon Riggs wrote:

But then again, full-page writes cover that too. There
will be a full-page image of the old block in the WAL anyway.

Right, but we're planning to remove that, so it's not a safe assumption
to use when building new code.

I don't think we're going to get rid of full-page images any time soon.
I guess you could easily check if full-page writes are enabled, though,
and only do it for cross-page updates if it is.

As I understand it, you are talking about corruption due to
partial page writes, which can be handled by the full-page image in WAL. Correct
me if I misunderstood.
Based on that, even if full-page images are removed, corrupt-page handling will
still be covered by double-buffer writes (an alternative solution to full-page
writes for some of the paths).

With Regards,
Amit Kapila.

#18 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Amit Kapila (#1)
Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On 09.08.2012 15:56, Amit Kapila wrote:

From: Heikki Linnakangas [mailto:heikki.linnakangas@enterprisedb.com]
Sent: Thursday, August 09, 2012 4:59 PM
On 09.08.2012 14:11, Simon Riggs wrote:

But then again, full-page writes cover that too. There
will be a full-page image of the old block in the WAL anyway.

Right, but we're planning to remove that, so it's not a safe assumption
to use when building new code.

I don't think we're going to get rid of full-page images any time soon.
I guess you could easily check if full-page writes are enabled, though,
and only do it for cross-page updates if it is.

According to my understanding you are talking about corruption due to
partial page writes which can be handled by full-page image of WAL. Correct
me if I misunderstood.

I meant corruption caused by anything, like disk failure, bugs, cosmic
rays, etc. The point is that currently the WAL record contains all the
information required to reconstruct the new tuple. With a diff method,
that's no longer the case, so if the old tuple gets corrupt for whatever
reason, that error will be propagated to the new tuple.
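
The difference can be shown with a toy Python model. Both replay functions are simplifications invented for the example: one models today's record, which carries the whole new tuple, and the other models a delta record that patches a copy of the old tuple.

```python
def replay_full(old_on_disk, wal_new_tuple):
    """Full-tuple record: the old tuple's contents are irrelevant."""
    return wal_new_tuple


def replay_delta(old_on_disk, offset, value):
    """Delta record: every byte outside the changed range comes from the old tuple."""
    buf = bytearray(old_on_disk)
    buf[offset:offset + len(value)] = value
    return bytes(buf)


intended_new = b"AAAAxxxxCCCC"
corrupt_old = b"A?AABBBBCCCC"    # one byte of the old tuple was damaged

full_result = replay_full(corrupt_old, intended_new)
delta_result = replay_delta(corrupt_old, 4, b"xxxx")
```

Here `full_result` equals the intended tuple, while `delta_result` carries the damaged byte forward into the new tuple.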

It's not an issue as long as everything works correctly, but some
redundancy is nice when you're trying to resurrect a corrupt database.
That's what we're talking about here. That said, I don't think it's a
big deal for this patch, at least not as long as full-page writes are
enabled.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#19 Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#16)
Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From: Simon Riggs [mailto:simon@2ndQuadrant.com]
Sent: Thursday, August 09, 2012 5:29 PM
On 9 August 2012 12:17, Amit Kapila <amit.kapila@huawei.com> wrote:

This optimization is to reduce the amount of WAL and definitely adding
anything extra will have some impact.

Of course. The question is "How much impact?". Each tweak has
progressively less and less gain. This isn't a binary choice.

Squeezing the last ounce of performance at the expense of all other
concerns is not a sensible goal, IMHO, nor do we attempt that
elsewhere.

Given we're making no attempt to remove full page writes, which is
clearly the biggest source of WAL volume currently, micro optimisation
of other factors seems unwarranted at this stage.

What I am pointing out is that WAL reduction is about update operation
performance, and full-page writes have no direct correlation with update
operations except in the case of the first update of a page after a checkpoint.

With Regards,
Amit Kapila.

#20 Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#18)
Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On Thu, Aug 9, 2012 at 9:09 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

I meant corruption caused by anything, like disk failure, bugs, cosmic rays,
etc. The point is that currently the WAL record contains all the information
required to reconstruct the new tuple. With a diff method, that's no longer
the case, so if the old tuple gets corrupt for whatever reason, that error
will be propagated to the new tuple.

It's not an issue as long as everything works correctly, but some redundancy
is nice when you're trying to resurrect a corrupt database. That's what
we're talking about here. That said, I don't think it's a big deal for this
patch, at least not as long as full-page writes are enabled.

So suppose that the following sequence of events occurs:

1. Tuple A on page 1 is updated. The new version, tuple B, is placed on page 2.
2. The table is vacuumed, removing tuple A.
3. Page 1 is written durably to disk.
4. Crash.

If reconstructing tuple B requires possession of tuple A, it seems
that we are now screwed.

No?
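
The sequence can be modelled in a few lines of Python. This is a toy model of the scenario exactly as stated, with pages as dictionaries and an invented `replay_delta_update` function; it says nothing about how actual PostgreSQL replay orders or skips records.

```python
class OldTupleMissing(Exception):
    """Raised when replay needs an old tuple that is no longer on its page."""


def replay_delta_update(disk_pages, record):
    """Rebuild the new tuple from the old tuple on disk plus the delta."""
    old = disk_pages[record["old_page"]].get(record["old_item"])
    if old is None:
        raise OldTupleMissing(record["old_item"])
    buf = bytearray(old)
    start = record["offset"]
    buf[start:start + len(record["value"])] = record["value"]
    disk_pages[record["new_page"]][record["new_item"]] = bytes(buf)


# On-disk state after steps 1-4: page 1 was flushed after vacuum removed
# tuple A, and page 2 (which held tuple B) never reached disk.
disk_after_crash = {1: {}, 2: {}}
update_record = {"old_page": 1, "old_item": "A",
                 "new_page": 2, "new_item": "B",
                 "offset": 4, "value": b"xxxx"}
```

Calling `replay_delta_update(disk_after_crash, update_record)` raises `OldTupleMissing` in this model, which is the failure being asked about.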

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#21 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#20)
#22 Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#21)
#23 Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#22)
#24 Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#1)
#25 Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#24)