What is the relationship between checkpoint and wal

Started by 高健over 12 years ago9 messagesgeneral
Jump to latest
#1高健
luckyjackgao@gmail.com

Hello:

Sorry for disturbing.
I have one question : Will checkpoint cause wal written happen?

I found the following info at:
http://www.postgresql.org/docs/9.2/static/wal-configuration.html

...
Checkpoints are fairly expensive, first because they require writing out
all currently dirty buffers, and second because they result in extra
subsequent WAL traffic as discussed above.
...

What confused me is that: (checkpoint)result in extra subsequent WAL
traffic as discussed above...

I haven't found any more information can describe it in the "above" of that
page.

What I understand now is:

In Oracle, when checkpoint occur, it will flush data back to datafile.
At that time,
in oracle, even fulsh back to datafile will also cause log writer writing
to redo segment(rollback segment).

But for Postgreql,
I haven't found any information says that:
When checkpoint occurs, before flush to datafile, Postgresql must first
activate some logwriter to write to WAL, And then activate bgwriter or
backend process to write to data file.

Can anybody please give some url link which can say:
That Checkpoint will result in extra subsequent WAL traffic ?

I need some evidence to show to some guys in my team and to my boss.

Best Regards

#2Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: 高健 (#1)
Re: What is the relationship between checkpoint and wal

高健 escribió:

...
Checkpoints are fairly expensive, first because they require writing out
all currently dirty buffers, and second because they result in extra
subsequent WAL traffic as discussed above.
...

What confused me is that: (checkpoint)result in extra subsequent WAL
traffic as discussed above...

I haven't found any more information can describe it in the "above" of that
page.

It means that the first change following the checkpoint that affects any
particular page will require a full page image of that page to be
written to WAL. See the discussion in the preceding paragraph about
"full_page_writes". It's not the checkpoint itself that writes this
extra WAL, but the rest of the system.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3高健
luckyjackgao@gmail.com
In reply to: Alvaro Herrera (#2)
Re: What is the relationship between checkpoint and wal

Hi :

Thanks to Alvaro! Sorry for replying lately.

I have understood a little about it.

But the description of full_page_write made me even confused. Sorry that
maybe I go to another problem:

It is said:
http://www.postgresql.org/docs/9.2/static/runtime-config-wal.html#GUC-FULL-PAGE-WRITES
----------
When this parameter is on, the PostgreSQL server writes the entire content
of each disk page to WAL during the first modification of that page after a
checkpoint. This is needed because a page write that is in process during
an operating system crash might be only partially completed, leading to an
on-disk page that contains a mix of old and new data.
-------

Let me imagine that:
On a disk page, there are following data:

id=1 val=1 with transaction id of 1001
id=2 val=2 with transaction id of 1002
id=3 val=3 with transaction id of 1003

If I start DB,
And begin with transaction id of 2002 deal with data of id=2 ,making val to
20
Then with trsansaction id of 2003 deal with data of id=3,making val to 30

If With full_page_write =off,
When my checkpoint occur, it succeed with transaction 2002 but failed with
2003 because of crash.
Then disk page will be of:

id=1 val=1 with transaction id of 1001------maybe this is the very old data
id=2 val=20 with transaction id of 2002------This is now new data
id=3 val=3 with transaction id of 1003------This is old data.

When DB restart from crash,
I think that there are wal data of transaction id 2002 and 2003 beause
that wal written to wal_buffer is before data written to shared_buffer.
So if Online wal log file is ok, there will be no data lost, and
roll-forward and roll-back can happen.
If some online wal log file is dmaged during crash:
There might be some data lost,but if we have archive log, we can restore
back due to archive wal log's latest transaction id.

If With full_page_write =on,
When my checkpoint occur, it succeed with transaction 2002,then no matter
transaction 2003's data written is ok or failed during checkpoint, the
transaction 2003's data will be written to wal file.
Then disk page will be of:

id=1 val=1 with transaction id of 1001------maybe this is the very old data
id=2 val=2 with transaction id of 2002------This is now new data
id=3 val=3 with transaction id of 1003------This is old data.

But in WAL there are following image:

id=1 val=1 with transaction id of 1001------maybe this is the very old data
id=2 val=20 with transaction id of 2002------This is now new data
id=3 val=30 with transaction id of 2003------This is old data.

I think that maybe it is because of this that data lost is less than
full_page_writes = off.
But why "writes the entire content of each disk page to WAL "?

the id=1 val=1 data is "very old", and even not in read into memory.
Why it should be from disk-->memory-->wal by wal writer?

maybe I have many mis-understanding about it. Thanks for replying!

Best Regards

2013/8/23 Alvaro Herrera <alvherre@2ndquadrant.com>

Show quoted text

高健 escribió:

...
Checkpoints are fairly expensive, first because they require writing out
all currently dirty buffers, and second because they result in extra
subsequent WAL traffic as discussed above.
...

What confused me is that: (checkpoint)result in extra subsequent WAL
traffic as discussed above...

I haven't found any more information can describe it in the "above" of

that

page.

It means that the first change following the checkpoint that affects any
particular page will require a full page image of that page to be
written to WAL. See the discussion in the preceding paragraph about
"full_page_writes". It's not the checkpoint itself that writes this
extra WAL, but the rest of the system.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#4Luca Ferrari
fluca1978@infinito.it
In reply to: 高健 (#3)
Re: What is the relationship between checkpoint and wal

On Mon, Aug 26, 2013 at 4:57 AM, 高健 <luckyjackgao@gmail.com> wrote:

But why "writes the entire content of each disk page to WAL "?

The documentation states that: "The row-level change data normally
stored in WAL will not be enough to completely restore such a page
during post-crash recovery.". I guess that a mixed page (i.e., a page
that contains old and new data) cannot be safely recovered with
deltas, so you need to have a "clean" page image to which start
recovery.

Someone with more internal knowledge can comment on this.

Luca

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#5Florian Weimer
fweimer@redhat.com
In reply to: Luca Ferrari (#4)
Re: What is the relationship between checkpoint and wal

On 08/26/2013 11:37 AM, Luca Ferrari wrote:

On Mon, Aug 26, 2013 at 4:57 AM, 高健 <luckyjackgao@gmail.com> wrote:

But why "writes the entire content of each disk page to WAL "?

The documentation states that: "The row-level change data normally
stored in WAL will not be enough to completely restore such a page
during post-crash recovery.". I guess that a mixed page (i.e., a page
that contains old and new data) cannot be safely recovered with
deltas, so you need to have a "clean" page image to which start
recovery.

Correct, the full-page image is needed for restoring the known state of
a page.

It also speeds up recovery because you can just fire off writes to the
pages under recovery, followed by incremental updates to the cached
copy. Without full page writes, you'd have to read the current version
of the page from the disk first, often resulting in somewhat random read
activity during recovery.

--
Florian Weimer / Red Hat Product Security Team

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#6Jeff Janes
jeff.janes@gmail.com
In reply to: 高健 (#3)
Re: What is the relationship between checkpoint and wal

On Sun, Aug 25, 2013 at 7:57 PM, 高健 <luckyjackgao@gmail.com> wrote:

Hi :

Thanks to Alvaro! Sorry for replying lately.

I have understood a little about it.

But the description of full_page_write made me even confused. Sorry that
maybe I go to another problem:

It is said:
http://www.postgresql.org/docs/9.2/static/runtime-config-wal.html#GUC-FULL-PAGE-WRITES
----------
When this parameter is on, the PostgreSQL server writes the entire content
of each disk page to WAL during the first modification of that page after a
checkpoint. This is needed because a page write that is in process during an
operating system crash might be only partially completed, leading to an
on-disk page that contains a mix of old and new data.
-------

Let me imagine that:
On a disk page, there are following data:

id=1 val=1 with transaction id of 1001
id=2 val=2 with transaction id of 1002
id=3 val=3 with transaction id of 1003

If I start DB,
And begin with transaction id of 2002 deal with data of id=2 ,making val to
20
Then with trsansaction id of 2003 deal with data of id=3,making val to 30

If With full_page_write =off,
When my checkpoint occur, it succeed with transaction 2002 but failed with
2003 because of crash.

A checkpoint either succeeds or fails. It cannot succeed with some
transactions and fail with others.

Then disk page will be of:

id=1 val=1 with transaction id of 1001------maybe this is the very old data
id=2 val=20 with transaction id of 2002------This is now new data
id=3 val=3 with transaction id of 1003------This is old data.

Postgres does not do in-place updates, it marks the old row as
obsolete and creates a new one.

id=1 val= 1 with transaction id of 1001
id=2 val= 2 with transaction id of 1002 xmax of 2002
id=3 val= 3 with transaction id of 1003 xmax of 2003
id=2 val=20 with transaction id of 2002
id=3 val=30 with transaction id of 2003

Of course the whole point of a torn page write is that you don't how
much got written, so you don't know what is actually on the disk.

When DB restart from crash,
I think that there are wal data of transaction id 2002 and 2003 beause
that wal written to wal_buffer is before data written to shared_buffer.

The wal is written out of wal_buffer and flushed, before the
corresponding block is written out of shared_buffer.

So if Online wal log file is ok, there will be no data lost, and
roll-forward and roll-back can happen.
If some online wal log file is dmaged during crash:
There might be some data lost,but if we have archive log, we can restore
back due to archive wal log's latest transaction id.

Most WAL records would have no problem being applied to a block that
is an otherwise uncorrupted mix of old and new.

But some WAL records have instructions that amount to "grab 134 bytes
from offset 7134 in the block and move them to offset 1623". If the
block is an unknown mix of old and new data, that can't be carried out
safely.

Cheers,

Jeff

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#7高健
luckyjackgao@gmail.com
In reply to: Jeff Janes (#6)
Re: What is the relationship between checkpoint and wal

Hello

Thanks for replying.

It is really a complicated concept.
So I think that in a mission critical environment , it is not a good choice
to turn full_page_writes on.

Best Regards

2013/8/27 Jeff Janes <jeff.janes@gmail.com>

Show quoted text

On Sun, Aug 25, 2013 at 7:57 PM, 高健 <luckyjackgao@gmail.com> wrote:

Hi :

Thanks to Alvaro! Sorry for replying lately.

I have understood a little about it.

But the description of full_page_write made me even confused. Sorry that
maybe I go to another problem:

It is said:

http://www.postgresql.org/docs/9.2/static/runtime-config-wal.html#GUC-FULL-PAGE-WRITES

----------
When this parameter is on, the PostgreSQL server writes the entire

content

of each disk page to WAL during the first modification of that page

after a

checkpoint. This is needed because a page write that is in process

during an

operating system crash might be only partially completed, leading to an
on-disk page that contains a mix of old and new data.
-------

Let me imagine that:
On a disk page, there are following data:

id=1 val=1 with transaction id of 1001
id=2 val=2 with transaction id of 1002
id=3 val=3 with transaction id of 1003

If I start DB,
And begin with transaction id of 2002 deal with data of id=2 ,making val

to

20
Then with trsansaction id of 2003 deal with data of id=3,making val to 30

If With full_page_write =off,
When my checkpoint occur, it succeed with transaction 2002 but failed

with

2003 because of crash.

A checkpoint either succeeds or fails. It cannot succeed with some
transactions and fail with others.

Then disk page will be of:

id=1 val=1 with transaction id of 1001------maybe this is the very old

data

id=2 val=20 with transaction id of 2002------This is now new data
id=3 val=3 with transaction id of 1003------This is old data.

Postgres does not do in-place updates, it marks the old row as
obsolete and creates a new one.

id=1 val= 1 with transaction id of 1001
id=2 val= 2 with transaction id of 1002 xmax of 2002
id=3 val= 3 with transaction id of 1003 xmax of 2003
id=2 val=20 with transaction id of 2002
id=3 val=30 with transaction id of 2003

Of course the whole point of a torn page write is that you don't how
much got written, so you don't know what is actually on the disk.

When DB restart from crash,
I think that there are wal data of transaction id 2002 and 2003 beause
that wal written to wal_buffer is before data written to shared_buffer.

The wal is written out of wal_buffer and flushed, before the
corresponding block is written out of shared_buffer.

So if Online wal log file is ok, there will be no data lost, and
roll-forward and roll-back can happen.
If some online wal log file is dmaged during crash:
There might be some data lost,but if we have archive log, we can restore
back due to archive wal log's latest transaction id.

Most WAL records would have no problem being applied to a block that
is an otherwise uncorrupted mix of old and new.

But some WAL records have instructions that amount to "grab 134 bytes
from offset 7134 in the block and move them to offset 1623". If the
block is an unknown mix of old and new data, that can't be carried out
safely.

Cheers,

Jeff

#8Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: 高健 (#7)
Re: What is the relationship between checkpoint and wal

高健 <luckyjackgao@gmail.com> wrote:

So I think that in a mission critical environment, it is not a
good choice to turn full_page_writes on.

If full_page_writes is off, your database can be corrupted in the
event of a crash of the hardware, OS, or VM (for example a power
failure).  The only exception is if your environment somehow
guarantees that in such failures it is not possible to write part
of a 4KB write request without the entire 4KB being written.  Such
corruption may be hidden and result in inconsistent or incorrect
results, without generating an error; so you would be well-advised
to restore from backup if there is such a crash.

full_page_writes = on is required for protection of database
integrity in most environments.

In the mission critical environments I've worked with, it has
generally been desirable to preserve database integrity and to be
able to recover from an OS crash faster than can be done from
backup.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#9高健
luckyjackgao@gmail.com
In reply to: Kevin Grittner (#8)
Re: What is the relationship between checkpoint and wal

Hello:

Thank you all.

I have understood this.

Best Regards

2013/8/31 Kevin Grittner <kgrittn@ymail.com>

Show quoted text

高健 <luckyjackgao@gmail.com> wrote:

So I think that in a mission critical environment, it is not a
good choice to turn full_page_writes on.

If full_page_writes is off, your database can be corrupted in the
event of a crash of the hardware, OS, or VM (for example a power
failure). The only exception is if your environment somehow
guarantees that in such failures it is not possible to write part
of a 4KB write request without the entire 4KB being written. Such
corruption may be hidden and result in inconsistent or incorrect
results, without generating an error; so you would be well-advised
to restore from backup if there is such a crash.

full_page_writes = on is required for protection of database
integrity in most environments.

In the mission critical environments I've worked with, it has
generally been desirable to preserve database integrity and to be
able to recover from an OS crash faster than can be done from
backup.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company