Re: New CRC algorithm: Slicing by 8
Resending... The previous 3 attempts haven't succeeded, probably because of
size restrictions; so sending in two parts.
On 10/18/06, Gurjeet Singh <singh.gurjeet@gmail.com> wrote:
Hi all,
Please refer to the following link for a new algorithm to calculate
CRC, developed by Intel:
http://www.intel.com/technology/magazine/communications/slicing-by-8-0306.htm
Please find attached the patch (pg_crc_sb8_4.diff.gz) that implements
this algorithm (originally obtained from http://sourceforge.net/projects/slicing-by-8).

I tested it using pgbench, and found an average improvement of about
15 tps (291 vs. 307 tps) when running with a scaling factor of 10, 10 clients,
and 1000 transactions per client (I don't remember these pgbench params
exactly; it's been quite a while now).

The author of the algorithm says that it works (theoretically) three times
faster than the conventional CRC calculation methods, but the tps
improvements didn't reflect that; so, to demonstrate the speed, I extracted
the source files (both PG and SB8) and made a separate project (test.tar.gz).
The resulting binary's command format is:
<snip>
<prog_name> <algo_name> [ <scaling_factor> [ <block_size> [ <list_size> ] ] ]

algo_name     : PG or SB8 or BOTH (just the first char will also do!)
scaling_factor: the multiple to use when creating the list of buffer blocks.
block_size    : buffer size (in KBs); default is 4.
list_size     : number of data blocks to create for the test; default is 100.

So the command: a.out p 3 1024 300
will test PG's algo over 900 blocks of 1 MB each.
</snip>
The test.sh script is a wrapper around this binary, to test each of PG
and SB8, and both algos simultaneously, three times each. I performed two
tests using this script to prove SB8's effectiveness on large and small
data blocks:

sh test.sh 1 1024 800   -- 800 blocks of 1 MB each; effective data size 800 MB
sh test.sh 800 1 1024   -- 800 * 1024 blocks of 1 KB each; effective data size 800 MB

The new algo shone in both test configurations. It performs 3 to 4
times better than the regular CRC algo in both types of tests. The results are
in the files results_1_1024_800.out and results_800_1_1024.out, respectively.

To test it yourself, extract all the files from test.tar.gz, and issue
the following commands (assuming bash):
sh make.sh
sh test.sh 1 1024 800
sh test.sh 800 1 1024

(Note: you may want to reduce the effective data size on a Windows box.
I used 500 MB on Windows+MinGW. I tested with these params since I have only
1 GB of RAM, and increasing the data size beyond these numbers caused the
memory contents to spill over into swap space/pagefile, and that disk
I/O severely distorts the results.)

The tests were performed on Fedora Core 5 and on MinGW on WinXP Professional.

If possible, people should test it on different platforms, so as to
ensure that it doesn't perform any worse than the older implementation on any
supported platform (please post the results if you do test it). Also,
recommendations for a better benchmark (than pgbench) are welcome, so that we
can show the improvement when it is used as part of PG.

Best regards,
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | yahoo }.com
--
gurjeet@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | yahoo }.com
Resending... The previous 3 attempts haven't succeeded, probably because
of size restrictions; so sending in two parts.
This is part 2; unfortunately, the size of the gzipped file is still a
bit big (26 KB), so I have uploaded it to
http://www.geocities.com/gurjeet79/crc_sb8_test.tar.gz
Please download and test it from there.
Can someone let us all know what is the limit on the attachments sent to
the -hackers list?
Best regards,
--
gurjeet@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | yahoo }.com
Gurjeet Singh wrote:
Can someone let us all know what is the limit on the attachments
sent to the -hackers list?
Generally it's probably best not to send attachments to -hackers at all.
If you have a patch, send it to -hackers. Or you can publish it
somewhere on the web and post a link to it (this is one of the things I
think a developers' wiki is actually useful for).
cheers
andrew
Andrew Dunstan wrote:
If you have a patch, send it to -hackers.
I mean -patches of course. Sorry.
cheers
andrew
"Gurjeet Singh" <singh.gurjeet@gmail.com> writes:
Please refer to the following link for a new algorithm to calculate
CRC, developed by Intel:
http://www.intel.com/technology/magazine/communications/slicing-by-8-0306.htm
I can't help wondering what the patent status of this algorithm is.
(This from a guy who still remembers very well how Unisys published the
LZW algorithm in a manner that made it look like it was free for anyone
to use...)
regards, tom lane
Tom Lane wrote:
"Gurjeet Singh" <singh.gurjeet@gmail.com> writes:
Please refer to the following link for a new algorithm to calculate
CRC, developed by Intel:
http://www.intel.com/technology/magazine/communications/slicing-by-8-0306.htm
I can't help wondering what the patent status of this algorithm is.
(This from a guy who still remembers very well how Unisys published the
LZW algorithm in a manner that made it look like it was free for anyone
to use...)
This article
http://www.intel.com/technology/magazine/communications/slicing-by-8-0306.htm
seems to imply it's provided "free". It doesn't say anything about
being free of patents though.
The Sourceforge project referred to in the article (but for which no
link is given) seems to be this one:
http://sourceforge.net/projects/slicing-by-8
The "files" section contains a zip archive, inside which there are three
source files all of which state that the package, which is (c) Intel
Corp. 2004-2006, is released under the BSD license, with this URL:
http://www.opensource.org/licenses/bsd-license.html
Again, no mention of patents anywhere.
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On 10/20/06, Alvaro Herrera <alvherre@commandprompt.com> wrote:
It doesn't say anything about being free of patents though.

The Sourceforge project referred to in the article (but for which no
link is given) seems to be this one:
Yes. I had already mentioned that link in my posts.
The "files" section contains a zip archive, inside which there are three
source files all of which state that the package, which is (c) Intel
Corp. 2004-2006, is released under the BSD license, with this URL:
http://www.opensource.org/licenses/bsd-license.html
Again, no mention of patents anywhere.
I'll try to get in touch with the author and get the clarification.
Best regards,
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | yahoo }.com
Gurjeet Singh wrote:
On 10/20/06, Alvaro Herrera <alvherre@commandprompt.com> wrote:
It doesn't say anything about being free of patents though.

The Sourceforge project referred to in the article (but for which no
link is given) seems to be this one:

Yes. I had already mentioned that link in my posts.
The "files" section contains a zip archive, inside which there are three
source files all of which state that the package, which is (c) Intel
Corp. 2004-2006, is released under the BSD license, with this URL:
http://www.opensource.org/licenses/bsd-license.html
Again, no mention of patents anywhere.
I'll try to get in touch with the author and get the clarification.
Yep, the license and patent status seem to be separate issues.
However, I am not sure getting a clarification from the author even
helps us legally. Also, why are we more critical of an Intel-provided
idea than any other idea we get from the community? The scary answer
is that in many ways they are the same, and have the same risks.
So unless we hear about a problem, I think we should use the code.
--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
However, I am not sure getting a clarification from the author even
helps us legally. Also, why are we more critical of an Intel-provided
idea than any other idea we get from the community?
Bitter experience with other companies.
So unless we hear about a problem, I think we should use the code.
It hasn't even been tested. One thing I'd want to know about is the
performance effect on non-Intel machines.
regards, tom lane
On 10/21/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
[snip]
It hasn't even been tested. One thing I'd want to know about is the
performance effect on non-Intel machines.
On Opteron 265 his test code shows SB8 (the intel alg) is 2.48x faster
for checksum and 1.95x faster for verify for the 800 * 1024 blocks of
1 KB each workload. For 100000 blocks of 8k I got similar results as
well.
It looks like the new code may have a larger cache footprint, so
actual performance may differ from the microbenchmark.
Hi all,
Michael Kounavis has given the green signal (please refer to the forwarded
message).
Best regards,
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | yahoo }.com
---------- Forwarded message ----------
From: Kounavis, Michael E <michael.e.kounavis@intel.com>
Date: Oct 20, 2006 10:43 PM
Subject: RE: CRC algorithm, Slicing By 8; concerns about patents
infringements
To: Gurjeet Singh <singh.gurjeet@gmail.com>, Michael Kounavis <
mekounav@users.sourceforge.net>
Hi,
Thank you for your interest on Slicing-by-8. No. Intel decided not to file a
patent on this algorithm but to make it freely available under BSD license.
You should have no problem using it.
MIke
------------------------------
*From:* Gurjeet Singh [mailto:singh.gurjeet@gmail.com]
*Sent:* Friday, October 20, 2006 5:50 AM
*To:* Michael Kounavis
*Subject:* CRC algorithm, Slicing By 8; concerns about patents infringements
Hi Michael,
Please refer the following post, and the conversation that followed:
http://archives.postgresql.org/pgsql-hackers/2006-10/msg01015.php
As is evident from that post, I have used the source code provided by
you on sf.net, and modified it to suit PostgreSQL.
Now the community is concerned that we may be infringing patents. We
all understand that you (on behalf of Intel) have released the
algorithm, and the code, under BSD license, which allows anybody to use it
commercially, but we need assurance that we will not be infringing any
patents owned by Intel by including your algorithm in PostgreSQL, because
PostgreSQL is used/sold by many companies for profit.
Best regards,
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | yahoo }.com
On 10/22/06, Gregory Maxwell <gmaxwell@gmail.com> wrote:
On Opteron 265 his test code shows SB8 (the intel alg) is 2.48x faster
for checksum and 1.95x faster for verify for the 800 * 1024 blocks of
1 KB each workload. For 100000 blocks of 8k I got similar results as
well.
I think you meant
800*1024 blocks of 1 >MB< each workload.
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | yahoo }.com
On Sun, Oct 22, 2006 at 08:10:56PM +0530, Gurjeet Singh wrote:
Hi all,
Michael Kounavis has given the green signal (please refer to the forwarded
message).
I don't think that helps. The publishing date of this article was March
2006. If this is really the first time this algorithm was published, that
means that anyone else (or even the author) has the option of patenting
this algorithm sometime before March 2007 and still claiming they
invented it first.
And realistically we would wait at least a year or three after that,
because you don't get to see patents as they're applied for.
Maybe March 2010 we can look into it...
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.
On 10/22/06, Martijn van Oosterhout <kleptog@svana.org> wrote:
On Sun, Oct 22, 2006 at 08:10:56PM +0530, Gurjeet Singh wrote:
Hi all,
Michael Kounavis has given the green signal (please refer to the forwarded
message).

I don't think that helps. The publishing date of this article was March
2006. If this is really the first time this algorithm was published, that
means that anyone else (or even the author) has the option of patenting
this algorithm sometime before March 2007 and still claiming they
invented it first.

And realistically we would wait at least a year or three after that,
because you don't get to see patents as they're applied for.

Maybe March 2010 we can look into it...
Considering the author has OK'd it, and given how easy it
is to replace the algorithm, I don't see the reason for
such carefulness?
--
marko
"Gurjeet Singh" <singh.gurjeet@gmail.com> writes:
If possible, people should test it on different platforms, so as to
ensure that it doesn't perform any worse than the older implementation on any
supported platform (please post the results, if you do test it).
I didn't particularly trust the timing calculations in your benchmark
program, so I made up my own low-tech test version (attached).
I get the following timings for 1 million iterations of
INIT_CRC32/COMP_CRC32/FIN_CRC32 on different buffer sizes,
using "gcc -O2" on some machines laying about the house:
Std CRC Slice8 CRC
HPPA (HPUX 10.20)
8192 bytes 44.897212 35.191499
1024 bytes 5.639081 4.669850
64 bytes 0.377416 0.613195
PPC (Mac G4, Darwin 10.4.8)
8192 bytes 12.663135 12.158293
1024 bytes 1.579940 1.614967
64 bytes 0.104310 0.229401
Intel Xeon EM64T (Fedora Core 5)
8192 bytes 4.420879 7.633120
1024 bytes 0.571794 0.819372
64 bytes 0.047354 0.071906
Intel Pentium 4 (Fedora Core 5)
8192 bytes 6.942324 28.848572 (yes, really)
1024 bytes 0.905259 3.625360
64 bytes 0.068065 0.260224
It's worth pointing out that this test program is biased in favor of the
slice8 code about as far as it could possibly be: after the first
iteration, the remaining 999999 will find the cache as hot as possible.
Furthermore, the test doesn't exercise misaligned or odd-length cases,
which would incur extra start/stop overhead for slice8.
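For illustration, the shape of the loop being timed is roughly the following
sketch. This is not the attached testcrc.c: the CRC here is a generic
byte-at-a-time table-driven stand-in for the INIT_CRC32/COMP_CRC32/FIN_CRC32
macros, and the build line is an assumption (something like gcc -O2, adding
-lrt on older glibc for clock_gettime).

/*
 * Sketch of a CRC timing harness: NTESTS passes of init/compute/finish
 * over one fixed buffer.  Swap crc32_buf() for the pg_crc.h macros (and
 * for the slice-by-8 variant) to reproduce the comparison above.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

#define NTESTS  1000000
#define BUFSIZE 8192

static uint32_t crc_table[256];

static void
init_table(void)
{
    uint32_t i, c;
    int k;

    for (i = 0; i < 256; i++)
    {
        c = i;
        for (k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320 : (c >> 1);  /* reflected CRC-32 polynomial */
        crc_table[i] = c;
    }
}

static uint32_t
crc32_buf(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFF;                                /* INIT */

    while (len-- > 0)
        crc = crc_table[(crc ^ *buf++) & 0xFF] ^ (crc >> 8);  /* COMP, one byte at a time */
    return crc ^ 0xFFFFFFFF;                                  /* FIN */
}

int
main(void)
{
    static unsigned char buf[BUFSIZE];
    struct timespec t0, t1;
    uint32_t mycrc = 0;
    int i, j;

    init_table();
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = (unsigned char) rand();

    clock_gettime(CLOCK_MONOTONIC, &t0);
    /*
     * Every iteration computes the same CRC over the same (cache-hot)
     * buffer, so this is a best case for slice8 -- and a sufficiently
     * aggressive optimizer may collapse the loop entirely, which is
     * exactly what turns up later in the thread with icc.
     */
    for (j = 0; j < NTESTS; j++)
        mycrc = crc32_buf(buf, BUFSIZE);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("%d x %d bytes: %.6f s (crc %08x)\n", NTESTS, BUFSIZE,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9,
           (unsigned) mycrc);
    return 0;
}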
These numbers are um, not impressive. Considering that a large fraction
of our WAL records are pretty short, the fact that slice8 consistently
loses at short buffer lengths is especially discouraging. Much of that
ground could be made up perhaps with tenser coding of the initialization
and finalization code, but it'd still not be worth taking any legal risk
for AFAICS.
regards, tom lane
Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
However, I am not sure getting a clarification from the author even
helps us legally. Also, why are we more critical of an Intel-provided
idea than any other idea we get from the community?

Bitter experience with other companies.
The problem is we have lots of companies involved, and I bet some we
don't even know about (e.g. yahoo/gmail addresses), and with
contributors who don't know that their employment agreement says
anything they do while employed is company property. And we have Unisys
involved now too. How worrying is that? :-(
So unless we hear about a problem, I think we should use the code.
It hasn't even been tested. One thing I'd want to know about is the
performance effect on non-Intel machines.
Sure.
--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Also, why are we more critical of an Intel-provided
idea than any other idea we get from the community?

Bitter experience with other companies.
The problem is we have lots of companies involved, and I bet some we
don't even know about (e.g. yahoo/gmail addresses),
It's not so much that I don't trust Intel as that a CRC algorithm is
exactly the sort of nice little self-contained thing that people love
to try to patent these days. What I am really afraid of is that someone
else has already invented this same method (or something close enough
to it) and filed for a patent that Intel doesn't know about either.
I'd be wondering about that no matter where the code had come from.
Given the numbers I posted earlier today, the proposal is dead in the
water anyway, quite aside from any legal considerations.
regards, tom lane
Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Also, why are we more critical of an Intel-provided
idea than any other idea we get from the community?

Bitter experience with other companies.

The problem is we have lots of companies involved, and I bet some we
don't even know about (e.g. yahoo/gmail addresses),

It's not so much that I don't trust Intel as that a CRC algorithm is
exactly the sort of nice little self-contained thing that people love
to try to patent these days. What I am really afraid of is that someone
else has already invented this same method (or something close enough
to it) and filed for a patent that Intel doesn't know about either.
I'd be wondering about that no matter where the code had come from.

Given the numbers I posted earlier today, the proposal is dead in the
water anyway, quite aside from any legal considerations.
Agreed. I just wanted to point out we have other sharks in the water. :-(
--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
On Sun, Oct 22, 2006 at 06:06:13PM -0400, Tom Lane wrote:
Intel Xeon EM64T (Fedora Core 5)
8192 bytes 4.420879 7.633120
1024 bytes 0.571794 0.819372
64 bytes 0.047354 0.071906

Intel Pentium 4 (Fedora Core 5)
8192 bytes 6.942324 28.848572 (yes, really)
1024 bytes 0.905259 3.625360
64 bytes 0.068065 0.260224
AMDX2 3800+ (Fedora Core 5)
STD CRC SLICE8 CRC
8192 bytes 8.576635 7.170038
1024 bytes 1.504361 1.402446
64 bytes 0.154459 0.144209
Odd that the AMD shows opposite of the two Intel numbers above, and
that it was an "Intel engineer" who wrote it. My first speculation is
that you did your Intel numbers backwards. My second speculation is
that you already thought of that and confirmed before posting. :-)
So yeah - not too impressive...
Cheers,
mark
--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada
One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...
mark@mark.mielke.cc writes:
... My first speculation is
that you did your Intel numbers backwards. My second speculation is
that you already thought of that and confirmed before posting. :-)
Yah, I checked. Several times... but if anyone else wants to repeat
the experiment, please do. Or look for bugs in either my test case
or Gurjeet's.
regards, tom lane
Tom Lane wrote:
Yah, I checked. Several times... but if anyone else wants to repeat
the experiment, please do. Or look for bugs in either my test case
or Gurjeet's.
FWIW - FreeBSD and Linux results using Tom's test program on almost
identical hardware [1]:
Std crc Slice-8 crc
Intel P-III 1.26Ghz (FreeBSD 6.2)
8192 bytes 12.975314 14.503810
1024 bytes 1.633557 1.852322
64 bytes 0.111580 0.206975
Intel P-III 1.26Ghz (Gentoo 2006.1)
8192 bytes 12.967997 28.363876
1024 bytes 1.632317 3.626230
64 bytes 0.111513 0.326557
Interesting that the slice-8 algorithm seems to work noticeably better
on FreeBSD than Linux - but still not as well as the standard one (for
these tests anyway)...
Cheers
Mark
[1]: Both boxes have identical mobos, memory and CPUs (same sspec nos).
Mark Kirkwood <markir@paradise.net.nz> writes:
Interesting that the slice-8 algorithm seems to work noticeably better
on FreeBSD than Linux -
Are you running similar gcc versions on both? I realize I forgot to
document what I was using:
HPPA: gcc version 2.95.3 20010315 (release)
PPC: gcc version 4.0.1 (Apple Computer, Inc. build 5363)
both Intels: gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)
Interesting that it's only the oldest-line tech that's showing a win
for me ...
regards, tom lane
On Mon, 23 Oct 2006 07:35:08 +0200, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I realize I forgot to document what I was using:
HPPA: gcc version 2.95.3 20010315 (release)
PPC: gcc version 4.0.1 (Apple Computer, Inc. build 5363)
both Intels: gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)

Interesting that it's only the oldest-line tech that's showing a win
for me ...
Does anyone have an Intel compiler available? It would be interesting to
see results on that compared to gcc (given that it is an Intel algorithm
;).
\Anders
Tom Lane wrote:
Mark Kirkwood <markir@paradise.net.nz> writes:
Interesting that the slice-8 algorithm seems to work noticeably better
on FreeBSD than Linux -

Are you running similar gcc versions on both? I realize I forgot to
document what I was using:

HPPA: gcc version 2.95.3 20010315 (release)
PPC: gcc version 4.0.1 (Apple Computer, Inc. build 5363)
both Intels: gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)

Interesting that it's only the oldest-line tech that's showing a win
for me ...
Ah - good point, FreeBSD is using an older compiler:
FreeBSD: gcc (GCC) 3.4.6 [FreeBSD] 20060305
Linux: gcc (GCC) 4.1.1 (Gentoo 4.1.1)
Hmm - there is a FreeBSD port for 4.1.2, I might set that off to build
itself and see if compiling with it changes the results any....
On Sun, 2006-10-22 at 17:18 +0200, Martijn van Oosterhout wrote:
On Sun, Oct 22, 2006 at 08:10:56PM +0530, Gurjeet Singh wrote:
Hi all,
Michael Kounavis has given the green signal (please refer to the forwarded
message).

I don't think that helps. The publishing date of this article was March
2006. If this is really the first time this algorithm was published, that
means that anyone else (or even the author) has the option of patenting
this algorithm sometime before March 2007 and still claiming they
invented it first.
Slice-By-8 was first mentioned here:
M. E. Kounavis and F. L. Berry, "A Systematic Approach to Building High
Performance Software-based CRC Generators", Proceedings, Tenth IEEE
International Symposium on Computers and Communications (ISCC 2005), La
Manga Del Mar Menor, Spain, June, 2005.
so presumably the opportunity to patent has already passed, given what
you say.
And realistically we would wait at least a year or three after that,
because you don't get to see patents as they're applied for.

Maybe March 2010 we can look into it...
I think that's a tad overcooked. We have to balance caution with
boldness. Simply copying stuff from research can be dangerous, but I
think this is about as safe as it gets.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote:
Are you running similar gcc versions on both? I realize I forgot to
document what I was using:

HPPA: gcc version 2.95.3 20010315 (release)
PPC: gcc version 4.0.1 (Apple Computer, Inc. build 5363)
both Intels: gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)

Interesting that it's only the oldest-line tech that's showing a win
for me ...
It would be interesting to see if the Intel compiler works better for
this algorithm. Anyone have it installed?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Mon, 23 Oct 2006, Mark Kirkwood wrote:
Tom Lane wrote:
Yah, I checked. Several times... but if anyone else wants to repeat
the experiment, please do. Or look for bugs in either my test case
or Gurjeet's.
Just for fun, I tried it out with both GCC and with Intel's C compiler
with some aggressive platform-specific flags on my 2.8Ghz Xeon running
Gentoo.
Std crc Slice-8 crc
Intel P4 Xeon 2.8Ghz (Gentoo, gcc-3.4.5, -O2)
8192 bytes 4.697572 9.806341
1024 bytes 0.597429 1.181828
64 bytes 0.046636 0.086984
Intel P4 Xeon 2.8Ghz (Gentoo, icc-9.0.032, -O2 -xN -ipo -parallel)
8192 bytes 0.000004 0.001085
1024 bytes 0.000004 0.001292
64 bytes 0.000003 0.001078
So at this point I realize that intel's compiler is optimizing the loop
away, at least for the std crc and probably for both. So I make mycrc an
array of 2, and subscript mycrc[j&1] in the loop.
Std crc Slice-8 crc
Intel P4 Xeon 2.8Ghz (Gentoo, gcc-3.4.5, -O2)
8192 bytes 51.397146 9.523182
1024 bytes 6.430986 1.229043
64 bytes 0.400062 0.128579
Intel P4 Xeon 2.8Ghz (Gentoo, icc-9.0.032, -O2 -xN -ipo -parallel)
8192 bytes 29.881708 0.001432
1024 bytes 3.750313 0.001432
64 bytes 0.238583 0.001431
So it looks like something fishy is still going on with the slice-8 with
the intel compiler.
I have attached my changed testcrc.c file.
FWIW - FreeBSD and Linux results using Tom's test program on almost identical
hardware [1]:

Std crc Slice-8 crc
Intel P-III 1.26Ghz (FreeBSD 6.2)
8192 bytes 12.975314 14.503810
1024 bytes 1.633557 1.852322
64 bytes 0.111580 0.206975

Intel P-III 1.26Ghz (Gentoo 2006.1)
8192 bytes 12.967997 28.363876
1024 bytes 1.632317 3.626230
64 bytes 0.111513 0.326557

Interesting that the slice-8 algorithm seems to work noticeably better on
FreeBSD than Linux - but still not as well as the standard one (for these
tests anyway)...

Cheers
Mark
[1] Both boxes have identical mobos, memory and CPUs (same sspec nos).
--
You can tune a piano, but you can't tuna fish.
"Simon Riggs" <simon@2ndquadrant.com> writes:
Slice-By-8 was first mentioned here:
Are you sure?
US patent 7,047,479 filed in 2002 sounds like it may be relevant:
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Mark Kirkwood wrote:
Tom Lane wrote:
Are you running similar gcc versions on both? I realize I forgot to
document what I was using:
Ah - good point, FreeBSD is using an older compiler:
FreeBSD: gcc (GCC) 3.4.6 [FreeBSD] 20060305
Linux: gcc (GCC) 4.1.1 (Gentoo 4.1.1)

Hmm - there is a FreeBSD port for 4.1.2, I might set that off to build
itself and see if compiling with it changes the results any....
Here are the results after building gcc 4.1.2 (repeating results for gcc
3.4.6 for comparison). I suspect that performance is probably impacted
because gcc 4.1.2 (and also the rest of the tool-chain) is built with
gcc 3.4.6 - but it certainly suggests that the newer gcc versions don't
like the slice-8 algorithm for some reason.
Std crc Slice-8 crc
Intel P-III 1.26Ghz (FreeBSD 6.2 gcc 3.4.6)
8192 bytes 12.975314 14.503810
1024 bytes 1.633557 1.852322
64 bytes 0.111580 0.206975
Intel P-III 1.26Ghz (FreeBSD 6.2 gcc 4.1.2)
8192 bytes 19.516974 29.457739
1024 bytes 2.448570 3.742106
64 bytes 0.112644 0.335292
On Mon, 2006-10-23 at 05:22 -0400, Gregory Stark wrote:
"Simon Riggs" <simon@2ndquadrant.com> writes:
Slice-By-8 was first mentioned here:
Are you sure?
US patent 7,047,479 filed in 2002 sounds like it may be relevant:
Yes, I'm sure.
That patent is titled "Parallel CRC formulation" and the abstract
describes the use of two threads to calculate different parts of the
CRC. It's designed for parallel circuit designs in hardware.
That doesn't resemble SB8 at all - there's still only one thread. The
cleverness of SB8 is to calculate more bytes simultaneously *without*
requiring a corresponding increase in the size of the lookup table.
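To make the mechanism concrete, a rough sketch of the slice-by-8 inner loop
is below. This is an illustration only -- not Intel's code and not the patch
-- using the standard reflected CRC-32 polynomial; the real implementation's
table layout and byte-order handling may differ. The point is that eight
small 256-entry tables (8 KB in total) let each iteration consume eight input
bytes with eight independent lookups, instead of needing one impossibly large
table indexed by several bytes at once.

#include <stdint.h>
#include <stddef.h>

static uint32_t sb8_table[8][256];

static void
init_sb8_tables(void)
{
    uint32_t i, c;
    int k, t;

    /* table 0 is the ordinary byte-at-a-time table (reflected CRC-32) */
    for (i = 0; i < 256; i++)
    {
        c = i;
        for (k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320 : (c >> 1);
        sb8_table[0][i] = c;
    }
    /* table t advances the CRC past t additional zero bytes */
    for (t = 1; t < 8; t++)
        for (i = 0; i < 256; i++)
            sb8_table[t][i] = (sb8_table[t - 1][i] >> 8) ^
                              sb8_table[0][sb8_table[t - 1][i] & 0xFF];
}

/* caller starts with crc = 0xFFFFFFFF and XORs with 0xFFFFFFFF at the end */
static uint32_t
crc32_slice8(uint32_t crc, const unsigned char *p, size_t len)
{
    while (len >= 8)
    {
        /* fold the running CRC into the first four input bytes */
        uint32_t lo = crc ^ (p[0] | (p[1] << 8) | (p[2] << 16) | ((uint32_t) p[3] << 24));
        uint32_t hi = p[4] | (p[5] << 8) | (p[6] << 16) | ((uint32_t) p[7] << 24);

        /* eight independent lookups replace eight dependent byte-at-a-time steps */
        crc = sb8_table[7][lo & 0xFF] ^ sb8_table[6][(lo >> 8) & 0xFF] ^
              sb8_table[5][(lo >> 16) & 0xFF] ^ sb8_table[4][lo >> 24] ^
              sb8_table[3][hi & 0xFF] ^ sb8_table[2][(hi >> 8) & 0xFF] ^
              sb8_table[1][(hi >> 16) & 0xFF] ^ sb8_table[0][hi >> 24];
        p += 8;
        len -= 8;
    }
    /* 0-7 leftover bytes: fall back to the plain one-byte-at-a-time loop */
    while (len-- > 0)
        crc = sb8_table[0][(crc ^ *p++) & 0xFF] ^ (crc >> 8);
    return crc;
}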
We have the original author's word that no patent has been filed. Even
if we disbelieve that, which I have no reason to do, that alone would be
sufficient to make a counter-claim for any damages claimed by any
hypothetical patent holder in the future.
What is the difference between this case and other patches?
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
"MK" == Mark Kirkwood <markir@paradise.net.nz> writes:
MK> Here are the results after building gcc 4.1.2 (repeating results
MK> for gcc 3.4.6 for comparison). I suspect that performance is
MK> probably impacted because gcc 4.1.2 (and also the rest of the
MK> tool-chain) is built with gcc 3.4.6 - but it certainly suggests
MK> that the newer gcc versions don't like the slice-8 algorithm for
MK> some reason.
They don't seem to like the old CRC algorithms either. It is quite
strange, such dramatic performance regressions from 3.x to 4.x are
rare.
/Benny
Jeremy Drake <pgsql@jdrake.com> writes:
So at this point I realize that intel's compiler is optimizing the loop
away, at least for the std crc and probably for both. So I make mycrc an
array of 2, and subscript mycrc[j&1] in the loop.
That's not a good workaround, because making mycrc expensive to access
means your inner loop timing isn't credible at all. Instead try making the
buffer array nonlocal --- malloc it, perhaps.
regards, tom lane
On Sun, 2006-10-22 at 18:06 -0400, Tom Lane wrote:
These numbers are um, not impressive. Considering that a large fraction
of our WAL records are pretty short, the fact that slice8 consistently
loses at short buffer lengths is especially discouraging. Much of that
ground could be made up perhaps with tenser coding of the initialization
and finalization code, but it'd still not be worth taking any legal risk
for AFAICS.
It doesn't look good for SB8, does it? Nor for gcc4.1 either.
Presumably Intel themselves will have some come-back, but I'm not sure
what they'll say to so many conclusive tests.
Instead, I'd like to include a parameter to turn off CRC altogether, for
heavily CPU bound operations and the WAL drive on trustworthy hardware.
wal_checksum = off
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes:
Instead, I'd like to include a parameter to turn off CRC altogether, for
heavily CPU bound operations and the WAL drive on trustworthy hardware.
No can do --- we rely on the checksums to be able to tell when we've hit
the end of WAL during replay. You may as well propose not writing WAL
at all (and no, I don't think it'd pass).
regards, tom lane
On 10/23/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
It's not so much that I don't trust Intel as that a CRC algorithm is
exactly the sort of nice little self-contained thing that people love
to try to patent these days. What I am really afraid of is that someone
else has already invented this same method (or something close enough
to it) and filed for a patent that Intel doesn't know about either.
I'd be wondering about that no matter where the code had come from.

Given the numbers I posted earlier today, the proposal is dead in the
water anyway, quite aside from any legal considerations.
The horror, the horror. I wonder if changing to Slicing by 8 would be
self-contained enough that people from the software-patent-free world
would be able to just patch their distribution if they wish.
Regards,
Dawid
Bruce, Tom, All:
Given the numbers I posted earlier today, the proposal is dead in the
water anyway, quite aside from any legal considerations.

Agreed. I just wanted to point out we have other sharks in the water.
*IF* Slice-by-8 turned out to be a winner, I could get the legal issues
looked into and probably resolved. But we should only do that if it
turns out we really want to use it.
--
--Josh
Josh Berkus
PostgreSQL @ Sun
San Francisco
On Mon, 2006-10-23 at 13:52 -0400, Tom Lane wrote:
"Simon Riggs" <simon@2ndquadrant.com> writes:
Instead, I'd like to include a parameter to turn off CRC altogether, for
heavily CPU bound operations and the WAL drive on trustworthy hardware.

No can do --- we rely on the checksums to be able to tell when we've hit
the end of WAL during replay.
No we don't: Zero length records are the trigger for EOF.
Sure, a CRC failure will cause replay to end, but only if you calculate
it and compare it to what's on disk - which is the bit I would turn off.
We don't rely on that, however, so it is avoidable.
In most cases, it would be foolish to avoid: but there are cases where
the data is CRC checked by the hardware/system already, so I'd like to
make an option to turn this off, defaulting to on, for safety.
You may as well propose not writing WAL
at all (and no, I don't think it'd pass).
That would undo all of my efforts, so no I wouldn't consider that.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
In most cases, it would be foolish to avoid: but there are cases where
the data is CRC checked by the hardware/system already, so I'd like to
make an option to turn this off, defaulting to on, for safety.
How would we know? What are those cases?
Sounds like a foot gun to me.
Sincerely,
Joshua D. Drake
You may as well propose not writing WAL
at all (and no, I don't think it'd pass).

That would undo all of my efforts, so no I wouldn't consider that.
--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
"Simon Riggs" <simon@2ndquadrant.com> writes:
On Mon, 2006-10-23 at 13:52 -0400, Tom Lane wrote:
No can do --- we rely on the checksums to be able to tell when we've hit
the end of WAL during replay.
No we don't: Zero length records are the trigger for EOF.
Only if the file happens to be all-zero already, which is not the normal
operating state (see WAL-file recycling). Otherwise you have to be able
to detect an invalid record.
There are actually three checks used to detect end of WAL: zero record
length, invalid checksum, and incorrect back-pointer. Zero length is
the first and cleanest-looking test, but AFAICS we have to have both of
the others to avoid obvious failure modes.
regards, tom lane
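Sketched very loosely (invented stand-in names, not the actual code in
xlog.c), the combination of those three checks amounts to something like
this:

#include <stdbool.h>
#include <stdint.h>

typedef struct
{
    uint64_t xl_prev;   /* where the previous record was (back-pointer) */
    uint32_t xl_len;    /* length of this record's payload */
    uint32_t xl_crc;    /* CRC stored with the record */
} WalRecordStub;

static bool
looks_like_valid_record(const WalRecordStub *rec,
                        uint64_t expected_prev,
                        uint32_t computed_crc)
{
    if (rec->xl_len == 0)
        return false;   /* zero length: treat as end of WAL */
    if (rec->xl_prev != expected_prev)
        return false;   /* back-pointer doesn't point at the record we just read */
    if (rec->xl_crc != computed_crc)
        return false;   /* checksum over the record contents fails */
    return true;        /* otherwise replay it and keep going */
}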
On Mon, 23 Oct 2006, Tom Lane wrote:
Jeremy Drake <pgsql@jdrake.com> writes:
So at this point I realize that intel's compiler is optimizing the loop
away, at least for the std crc and probably for both. So I make mycrc an
array of 2, and subscript mycrc[j&1] in the loop.

That's not a good workaround, because making mycrc expensive to access
means your inner loop timing isn't credible at all. Instead try making the
buffer array nonlocal --- malloc it, perhaps.
That did not make any difference. The way I see it, the only way to
convince the compiler it really needs to do this loop more than once is to
make it think it is not overwriting the same variable every time. The
subscript was the cheapest way I could think of to do that. Any other
suggestions on how to do this are welcome.
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend
--
I like being single. I'm always there when I need me.
-- Art Leo
Jeremy Drake <pgsql@jdrake.com> writes:
On Mon, 23 Oct 2006, Tom Lane wrote:
That's not a good workaround, because making mycrc expensive to access
means your inner loop timing isn't credible at all. Instead try making the
buffer array nonlocal --- malloc it, perhaps.
That did not make any difference. The way I see it, the only way to
convince the compiler it really needs to do this loop more than once is to
make it think it is not overwriting the same variable every time. The
subscript was the cheapest way I could think of to do that. Any other
suggestions on how to do this are welcome.
Hmm. Maybe store the CRCs into a global array somewhere?
uint32 results[NTESTS];
for ...
{
INIT/COMP/FIN_CRC32...
results[j] = mycrc;
}
This still adds a bit of overhead to the outer loop, but not much.
Another possibility is to put the INIT/COMP/FIN_CRC32 into an external
subroutine, thereby adding a call/return to the outer loop overhead.
regards, tom lane
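As a concrete rendering of that suggestion (following the naming of the
generic harness sketch earlier in the thread, not the real testcrc.c), the
timed loop becomes:

#include <stdint.h>
#include <stddef.h>

#define NTESTS 1000000

/* global, so the stores cannot be proven dead and elided */
uint32_t results[NTESTS];

void
run_crc_test(uint32_t (*crcfn) (const unsigned char *, size_t),
             const unsigned char *buf, size_t len)
{
    int j;

    for (j = 0; j < NTESTS; j++)
        results[j] = crcfn(buf, len);       /* e.g. crcfn = crc32_buf */

    /* print or otherwise use results[NTESTS - 1] afterwards to keep it live */
}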
On Mon, 23 Oct 2006, Tom Lane wrote:
Hmm. Maybe store the CRCs into a global array somewhere?
uint32 results[NTESTS];
for ...
{
INIT/COMP/FIN_CRC32...
results[j] = mycrc;
}

This still adds a bit of overhead to the outer loop, but not much.
That seems to have worked.
Std crc Slice-8 crc
Intel P4 Xeon 2.8Ghz (Gentoo, gcc-3.4.5, -O2)
8192 bytes 26.765317 10.511143
1024 bytes 3.357843 1.280890
64 bytes 0.223213 0.103767
Intel P4 Xeon 2.8Ghz (Gentoo, icc-9.0.032, -O2 -xN -ipo -parallel)
8192 bytes 29.495836 0.007107
1024 bytes 3.708665 0.012183
64 bytes 0.242579 0.008700
So the gcc times are reasonable, but the icc times for the slice-by-8 are
still too fast to be believed. I will have to take a look at the
generated assembly later and see what gives.
My changed testcrc.c is attached, again.
--
"I'd love to go out with you, but I did my own thing and now I've got
to undo it."
Benny Amorsen wrote:
"MK" == Mark Kirkwood <markir@paradise.net.nz> writes:
MK> Here are the results after building gcc 4.1.2 (repeating results
MK> for gcc 3.4.6 for comparison). I suspect that performance is
MK> probably impacted because gcc 4.1.2 (and also the rest of the
MK> tool-chain) is built with gcc 3.4.6 - but it certainly suggests
MK> that the newer gcc versions don't like the slice-8 algorithm for
MK> some reason.

They don't seem to like the old CRC algorithms either. It is quite
strange, such dramatic performance regressions from 3.x to 4.x are
rare.
Right - I think the regression is caused by libc and kernel being built
with gcc 3.4.6 and the test program being built with gcc 4.1.2.
Rebuilding *everything* with 4.1.2 (which I'm not sure is possible for
FreeBSD at the moment) would probably get us back to numbers that looked
more like my Gentoo ones [1].
Cheers
Mark
[1]: Note that the upgrade process for switching Gentoo from gcc 3.4 to
4.1 involves precisely this - build 4.1, then rebuild everything using
4.1 (including 4.1 itself!)
Mark Kirkwood <markir@paradise.net.nz> writes:
Right - I think the regression is caused by libc and kernel being built
with gcc 3.4.6 and the test program being built with gcc 4.1.2.
Why do you think that? The performance of the CRC loop shouldn't depend
at all on either libc or the kernel, because they're not invoked inside
the loop.
regards, tom lane
On Mon, Oct 23, 2006 at 05:23:27PM -0400, Tom Lane wrote:
Mark Kirkwood <markir@paradise.net.nz> writes:
Right - I think the regression is caused by libc and kernel being built
with gcc 3.4.6 and the test program being built with gcc 4.1.2.

Why do you think that? The performance of the CRC loop shouldn't depend
at all on either libc or the kernel, because they're not invoked inside
the loop.
I can believe that not re-building GCC 4.1.x with the 4.1.x compiler
could result in it not taking full advantage of new features and functions.
Ken
Tom Lane wrote:
Mark Kirkwood <markir@paradise.net.nz> writes:
Right - I think the regression is caused by libc and kernel being built
with gcc 3.4.6 and the test program being built with gcc 4.1.2.

Why do you think that? The performance of the CRC loop shouldn't depend
at all on either libc or the kernel, because they're not invoked inside
the loop.
Just come to the same conclusion myself... I've got gcc 3.4.6 on Gentoo,
so tried using that, and observed no regression at all there for the
std case - which pretty much rules out any issues connected with "what
did I build kernel + libc with vs what compiler am I using now":
Intel P-III 1.26Ghz (Gentoo 2006.1)
Std crc Slice-8 crc
8192 bytes 12.967997 28.363876 (gcc 4.1.1)
8192 bytes 12.956765 13.978495 (gcc 3.4.6)
So Gentoo using gcc 3.4.6 looks like FreeBSD using 3.4.6, so the std vs
slice-8 performance seems to be all about compiler version.
Now the regression on FreeBSD for the std case might be (as Kenneth
pointed out) due to 4.1.1 being built by 3.4.6.... but I guess it is
just a nit at this point.
Cheers
Mark
On Mon, 2006-10-23 at 15:12 -0400, Tom Lane wrote:
"Simon Riggs" <simon@2ndquadrant.com> writes:
On Mon, 2006-10-23 at 13:52 -0400, Tom Lane wrote:
No can do --- we rely on the checksums to be able to tell when we've hit
the end of WAL during replay.

No we don't: Zero length records are the trigger for EOF.
Only if the file happens to be all-zero already, which is not the normal
operating state (see WAL-file recycling). Otherwise you have to be able
to detect an invalid record.
OK.
There are actually three checks used to detect end of WAL: zero record
length, invalid checksum, and incorrect back-pointer. Zero length is
the first and cleanest-looking test, but AFAICS we have to have both of
the others to avoid obvious failure modes.
Well, I understand the need for the zero-length test and the incorrect
back pointer (which also incidentally tests that the current record was
not left over from previous use of xlog file).
The checksum protects from torn pages and disk errors. If you have
full_page_writes set then you already believe yourself safe from torn
pages and your disks could also already be CRC-checking the data. So you
don't *need* the checksum in those cases. If we really think we need it
we could put the xlprev pointer as the *last* field on the xlrec, just
to make doubly sure - having a trailer as well as a header.
CRC-checked disks are actually the industry norm and have been for
around 5 years. ANSI SCSI Parallel Interface 3 (SPI-3) (UltraSCSI 160)
defines the use of CRC and this is available as standard from all key
manufacturers. (CRC-32 is the required level). So if you are using
modern SCSI, SATA or SAS technologies you'll be just fine. (Those checks
are a *requirement* of the underlying technology because of the error
rates when the bus speeds are so high.)
I checked this out in a conversation with the fine people at Seagate,
who also publish a variety of technical manuals/details on this, e.g.
http://www.maxtor.com/_files/maxtor/en_us/documentation/white_papers_technical/wp_ultra320.pdf
...You'll see that CRC is a Mandatory Feature of ANSI SPI-4 now.
http://www.maxtor.com/_files/maxtor/en_us/documentation/white_papers_technical/sas_link_layer.pdf
So, I'd like the *option* to turn our CRC off, please, somehow.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes:
On Mon, 2006-10-23 at 15:12 -0400, Tom Lane wrote:
There are actually three checks used to detect end of WAL: zero record
length, invalid checksum, and incorrect back-pointer. Zero length is
the first and cleanest-looking test, but AFAICS we have to have both of
the others to avoid obvious failure modes.
The checksum protects from torn pages and disk errors. If you have
full_page_writes set then you already believe yourself safe from torn
pages and your disks could also already be CRC-checking the data.
No, because unlike tuples, WAL records can and do cross page boundaries.
Unless you're prepared to posit that your system never crashes at all,
you have to be able to detect the case where you've got a good front
half of a WAL record and non-matching data in the next page. The two
halves could be written out in widely separated write()s, indeed might
never have been simultaneously resident in the WAL buffers at all.
CRC-checked disks are actually the industry norm and have been for
around 5 years.
Huh? Disks have ALWAYS had CRCs, and this is in any case utterly
irrelevant to the database-crash risk.
regards, tom lane
On Tue, 2006-10-24 at 09:37 -0400, Tom Lane wrote:
"Simon Riggs" <simon@2ndquadrant.com> writes:
On Mon, 2006-10-23 at 15:12 -0400, Tom Lane wrote:
There are actually three checks used to detect end of WAL: zero record
length, invalid checksum, and incorrect back-pointer. Zero length is
the first and cleanest-looking test, but AFAICS we have to have both of
the others to avoid obvious failure modes.

The checksum protects from torn pages and disk errors. If you have
full_page_writes set then you already believe yourself safe from torn
pages and your disks could also already be CRC-checking the data.

No, because unlike tuples, WAL records can and do cross page boundaries.
But not that often, with full_page_writes = off. So we could get away
with just CRC checking the page-spanning ones and mark the records to
show whether they have been CRC checked or not and need to be rechecked
at recovery time. That would reduce the CRC overhead to about 1-5% of
what it is now (as an option).
Just a thought: Would there be benefit in not allowing page-spanning WAL
records, when they are small? That would tend to reduce the number of
WAL writes, even if it did cause some space wastage on disk. That would
reduce the number of same block re-writes and might improve the
sequential behaviour of WAL access. We never needed to think about this
while full_page_writes=on was the only option.
CRC-checked disks are actually the industry norm and have been for
around 5 years.

Huh? Disks have ALWAYS had CRCs, and this is in any case utterly
irrelevant to the database-crash risk.
According to the man from Seagate...
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
On Tue, Oct 24, 2006 at 02:52:36PM +0100, Simon Riggs wrote:
On Tue, 2006-10-24 at 09:37 -0400, Tom Lane wrote:
"Simon Riggs" <simon@2ndquadrant.com> writes:
On Mon, 2006-10-23 at 15:12 -0400, Tom Lane wrote:
There are actually three checks used to detect end of WAL: zero record
length, invalid checksum, and incorrect back-pointer. Zero length is
the first and cleanest-looking test, but AFAICS we have to have both of
the others to avoid obvious failure modes.

The checksum protects from torn pages and disk errors. If you have
full_page_writes set then you already believe yourself safe from torn
pages and your disks could also already be CRC-checking the data.

No, because unlike tuples, WAL records can and do cross page boundaries.
But not that often, with full_page_writes = off. So we could get away
with just CRC checking the page-spanning ones and mark the records to
show whether they have been CRC checked or not and need to be rechecked
at recovery time. That would reduce the CRC overhead to about 1-5% of
what it is now (as an option).
WAL pages 8 Kbytes, and disk pages 512 bytes, correct? I don't see a
guarantee in here that the 8 Kbytes worth of data will be written as
sequential writes, nor that the 8 Kbytes of data will necessarily
finish.
If the operating system uses 8 Kbyte pages, or the RAID system uses 8
Kbytes or larger chunks, and they guarantee sequential writes, perhaps
it is ok. Still, if the power goes out after writing the first 512
bytes, 2048 bytes, or 4096 bytes, then what? With RAID involved it
might get better or worse, depending on the RAID configuration.
I'm almost wondering whether the three numbers are enough. I'm too busy
to sketch it all down and predict failure points... :-)
Just a thought: Would there be benefit in not allowing page-spanning WAL
records, when they are small? That would tend to reduce the number of
WAL writes, even if it did cause some space wastage on disk. That would
reduce the number of same block re-writes and might improve the
sequential behaviour of WAL access. We never needed to think about this
while full_page_writes=on was the only option.
Probably. Might not be much though.
CRC-checked disks are actually the industry norm and have been for
around 5 years.

Huh? Disks have ALWAYS had CRCs, and this is in any case utterly
irrelevant to the database-crash risk.

According to the man from Seagate...
Once upon a time when bits were stored the size of smarties (exaggeration)....
Back in those days (before my time), they liked to use things like parity.
I didn't read the page, but perhaps there is some confusion between
CRC and error correction codes. Obviously, technologies were
introduced over time. I don't remember ever having a hard disk that
didn't have some sort of error detection. The 5 year claim seems
decades too short unless they are talking about a newer technology.
Even the old 5.25" DOS floppies seemed to be able to detect errors
rather than return invalid corrupted bits.
Cheers,
mark
--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada
One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...
mark@mark.mielke.cc writes:
The 5 year claim seems
decades too short unless they are talking about a newer technology.
I think what Simon is on about is CRCs being routinely used on the cable
between the disk drive and the CPU. When I was involved in this stuff
you usually only got a parity bit on each byte transferred. CRCs for
all disk command/result messages would definitely help close a risk area
--- but that's only one risk of many.
regards, tom lane
On Tue, 2006-10-24 at 10:18 -0400, mark@mark.mielke.cc wrote:
On Tue, Oct 24, 2006 at 02:52:36PM +0100, Simon Riggs wrote:
On Tue, 2006-10-24 at 09:37 -0400, Tom Lane wrote:
No, because unlike tuples, WAL records can and do cross page boundaries.
But not that often, with full_page_writes = off. So we could get away
with just CRC checking the page-spanning ones and mark the records to
show whether they have been CRC checked or not and need to be rechecked
at recovery time. That would reduce the CRC overhead to about 1-5% of
what it is now (as an option).

WAL pages 8 Kbytes, and disk pages 512 bytes, correct? I don't see a
guarantee in here that the 8 Kbytes worth of data will be written as
sequential writes, nor that the 8 Kbytes of data will necessarily
finish.

If the operating system uses 8 Kbyte pages, or the RAID system uses 8
Kbytes or larger chunks, and they guarantee sequential writes, perhaps
it is ok. Still, if the power goes out after writing the first 512
bytes, 2048 bytes, or 4096 bytes, then what? With RAID involved it
might get better or worse, depending on the RAID configuration.
That is the torn-page problem. If your system doesn't already protect
you against this you have no business turning off full_page_writes,
which was one of my starting assumptions.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
On Tue, Oct 24, 2006 at 05:05:58PM +0100, Simon Riggs wrote:
On Tue, 2006-10-24 at 10:18 -0400, mark@mark.mielke.cc wrote:
On Tue, Oct 24, 2006 at 02:52:36PM +0100, Simon Riggs wrote:
On Tue, 2006-10-24 at 09:37 -0400, Tom Lane wrote:
No, because unlike tuples, WAL records can and do cross page boundaries.
But not that often, with full_page_writes = off. So we could get away
with just CRC checking the page-spanning ones and mark the records to
show whether they have been CRC checked or not and need to be rechecked
at recovery time. That would reduce the CRC overhead to about 1-5% of
what it is now (as an option).

WAL pages 8 Kbytes, and disk pages 512 bytes, correct? I don't see a
guarantee in here that the 8 Kbytes worth of data will be written as
sequential writes, nor that the 8 Kbytes of data will necessarily
finish.

If the operating system uses 8 Kbyte pages, or the RAID system uses 8
Kbytes or larger chunks, and they guarantee sequential writes, perhaps
it is ok. Still, if the power goes out after writing the first 512
bytes, 2048 bytes, or 4096 bytes, then what? With RAID involved it
might get better or worse, depending on the RAID configuration.

That is the torn-page problem. If your system doesn't already protect
you against this you have no business turning off full_page_writes,
which was one of my starting assumptions.
I wasn't aware that a system could protect against this. :-)
I write 8 Kbytes - how can I guarantee that the underlying disk writes
all 8 Kbytes before it loses power? And why isn't the CRC a valid means
of dealing with this? :-)
If I'm wrong on one of these assumptions, I'm open to being educated.
My opinion as of a few seconds ago, is that a write to a single disk
sector is safe, but that a write that extends across several sectors
is not.
Cheers,
mark
--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada
One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...
On Tue, 2006-10-24 at 12:47 -0400, mark@mark.mielke.cc wrote:
On Tue, Oct 24, 2006 at 05:05:58PM +0100, Simon Riggs wrote:
On Tue, 2006-10-24 at 10:18 -0400, mark@mark.mielke.cc wrote:
On Tue, Oct 24, 2006 at 02:52:36PM +0100, Simon Riggs wrote:
On Tue, 2006-10-24 at 09:37 -0400, Tom Lane wrote:
No, because unlike tuples, WAL records can and do cross page boundaries.
But not that often, with full_page_writes = off. So we could get away
with just CRC checking the page-spanning ones and mark the records to
show whether they have been CRC checked or not and need to be rechecked
at recovery time. That would reduce the CRC overhead to about 1-5% of
what it is now (as an option).

WAL pages 8 Kbytes, and disk pages 512 bytes, correct? I don't see a
guarantee in here that the 8 Kbytes worth of data will be written as
sequential writes, nor that the 8 Kbytes of data will necessarily
finish.

If the operating system uses 8 Kbyte pages, or the RAID system uses 8
Kbytes or larger chunks, and they guarantee sequential writes, perhaps
it is ok. Still, if the power goes out after writing the first 512
bytes, 2048 bytes, or 4096 bytes, then what? With RAID involved it
might get better or worse, depending on the RAID configuration.

That is the torn-page problem. If your system doesn't already protect
you against this you have no business turning off full_page_writes,
which was one of my starting assumptions.I wasn't aware that a system could protect against this. :-)
I write 8 Kbytes - how can I guarantee that the underlying disk writes
all 8 Kbytes before it loses power? And why isn't the CRC a valid means
of dealing with this? :-)I'm on wrong on one of these assumptions, I'm open to being educated.
My opinion as of a few seconds ago, is that a write to a single disk
sector is safe, but that a write that extends across several sectors
is not.
http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html
full_page_writes = off
I'm very happy to learn more myself...
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
On Tue, 2006-10-24 at 14:52 +0100, Simon Riggs wrote:
On Tue, 2006-10-24 at 09:37 -0400, Tom Lane wrote:
"Simon Riggs" <simon@2ndquadrant.com> writes:
On Mon, 2006-10-23 at 15:12 -0400, Tom Lane wrote:
There are actually three checks used to detect end of WAL: zero record
length, invalid checksum, and incorrect back-pointer. Zero length is
the first and cleanest-looking test, but AFAICS we have to have both of
the others to avoid obvious failure modes.
The checksum protects from torn pages and disk errors. If you have
full_page_writes set then you already believe yourself safe from torn
pages and your disks could also already be CRC-checking the data.
No, because unlike tuples, WAL records can and do cross page boundaries.
But not that often, with full_page_writes = off. So we could get away
with just CRC checking the page-spanning ones and mark the records to
show whether they have been CRC checked or not and need to be rechecked
at recovery time. That would reduce the CRC overhead to about 1-5% of
what it is now (as an option).
Looking further, I see that the xlog page header already contains
xlp_pageaddr, which is an XLogRecPtr. So an xlrec that tried to span
multiple pages yet failed in between would easily show up as a failure
in ValidXLOGHeader(), even before the CRC check. [The XLogRecPtr
contains both the offset within the file and a unique identification of
the WAL file, so any data left over from previous uses of that data file
will be easily recognised as such].
So we don't even need to CRC check the page-spanning ones either.
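A minimal sketch of the kind of page-address check being described, using
simplified stand-in structs rather than the real XLogPageHeaderData and
ValidXLOGHeader() definitions (all names below are illustrative only):
<snip>
#include <stdint.h>
#include <stdbool.h>

/*
 * Simplified stand-ins for the real WAL structures: in the actual code,
 * xlp_pageaddr in XLogPageHeaderData identifies which WAL position this
 * 8 kB page belongs to.  Names and layout here are invented.
 */
typedef struct
{
    uint32_t xlogid;    /* log file # */
    uint32_t xrecoff;   /* byte offset within log file */
} FakeXLogRecPtr;

typedef struct
{
    uint16_t       xlp_magic;
    uint16_t       xlp_info;
    FakeXLogRecPtr xlp_pageaddr;   /* WAL address of this page */
} FakeXLogPageHeader;

#define FAKE_XLP_MAGIC 0xD05E

/*
 * If a record spans onto the next page but that page was never written
 * (torn write or leftover old data), the stored page address will not
 * match the address we expect to be reading, so replay can stop before
 * any CRC check is needed.
 */
static bool
page_header_is_valid(const FakeXLogPageHeader *hdr, FakeXLogRecPtr expected)
{
    if (hdr->xlp_magic != FAKE_XLP_MAGIC)
        return false;
    if (hdr->xlp_pageaddr.xlogid != expected.xlogid ||
        hdr->xlp_pageaddr.xrecoff != expected.xrecoff)
        return false;           /* stale or torn page: stop reading WAL */
    return true;
}
</snip>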
So it all comes down to: do you trust your hardware? I accept that some
hardware/OS combinations will give you high risk, others will reduce it
considerably. All I'm looking to do is to pass on the savings for people
that are confident they have invested wisely in hardware.
Does anybody have a reasonable objection to introducing an option for
people that are comfortable they are making the correct choice?
wal_checksum = off (or other suggested naming...)
I don't want to take blind risks, so shoot me down, please, if I err.
I'm happy either way: either we speed up, or we're safer not to.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
mark@mark.mielke.cc wrote:
If I'm wrong on one of these assumptions, I'm open to being educated.
My opinion as of a few seconds ago, is that a write to a single disk
sector is safe, but that a write that extends across several sectors
is not.
Unless it's fsync'ed, which is what we do at CHECKPOINT. Keep in mind
that we save full page images on WAL the first time we touch the page
after a checkpoint. This means that if a partial write occurred, we will
restore it from WAL.
So it's not safe in general, but it is safe in Postgres.
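A rough sketch of the rule being described here; the names and types below
are invented for illustration and are not the actual PostgreSQL code:
<snip>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* log sequence number, illustrative only */

/*
 * First modification of a data page after a checkpoint: log the whole 8 kB
 * image to WAL.  Later changes to the same page can log just a delta,
 * because recovery will first restore the intact image and then replay
 * the deltas on top of it -- which is what repairs a torn data-page write.
 */
static bool
need_full_page_image(Lsn page_lsn, Lsn redo_ptr, bool full_page_writes)
{
    /*
     * page_lsn: LSN of the last WAL record that touched this page.
     * redo_ptr: start of the current checkpoint's redo interval.
     * If the page has not been touched since the checkpoint began, a torn
     * write of it could not yet be repaired from WAL, so log a full image.
     */
    return full_page_writes && page_lsn <= redo_ptr;
}
</snip>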
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On 10/24/06, mark@mark.mielke.cc <mark@mark.mielke.cc> wrote:
I wasn't aware that a system could protect against this. :-)
I write 8 Kbytes - how can I guarantee that the underlying disk writes
all 8 Kbytes before it loses power? And why isn't the CRC a valid means
of dealing with this? :-)
[snip]
A file system with an appropriate transaction method could do this.
In *theory* reiser4 write()s are atomic. No one has verified, however,
that there is no torn page risk introduced in some other part of the
kernel.
I'm not aware of any other system which can guarantee the atomicity
of 8k writes.
"Gregory Maxwell" <gmaxwell@gmail.com> writes:
I'm not aware of any other system which can guarantee the atomicity
of 8k writes.
The reasoning for supporting full_page_writes = off is that if you have
a stable kernel and suitable backup power, battery backed write cache,
etc, your risk of a partially completed write() may be low enough to
be acceptable. Obviously there are no 100.00% guarantees, but that's
what you keep backups for ...
Simon is essentially arguing that if we are willing to assume no
incomplete write() we may as well assume it for WAL too. This seems
to me to be raising the risk significantly, but I admit that I can't
put my finger on why exactly.
One point not directly related to crash safety is whether CRC checking
isn't still going to be a good idea when PITR is enabled. Archived WAL
files are going to have been through a great deal more transferring than
ones merely being used to recover from a crash.
regards, tom lane
Sorry for getting into the conversation so late... It was a long weekend in
India.
On 10/23/06, Tom Lane <tgl@sss.pgh.pa.us > wrote:
I didn't particularly trust the timing calculations in your benchmark
program,
Any particular reason? (What did you doubt in it, and why?)
I designed the program to be flexible: it can test different sized blocks
(to vary the cost of each INIT/COMP/FIN sequence) and different sized lists
of data (to control the number of iterations). Please share your wisdom.
When I first saw your results, I had a strong feeling that function-call
overhead was going against SB8. And then, Jeremy's trials, and subsequent
success, on disabling loop optimizations also pointed to this possibility.
So, I have taken your tests and converted the SB8 function calls into
macros. And the results are (please note that crc = 0 is explained later):
std_8192_noprintcrc.out
crc = 0, bufsize = 8192, loops = 1000000, elapsed = 8.471994
sb8_8192_noprintcrc.out
crc = 0, bufsize = 8192, loops = 1000000, elapsed = 0.000006
std_8192_printcrc.out
crc = 8228BB0E, bufsize = 8192, loops = 1000000, elapsed = 32.490704
sb8_8192_printcrc.out
crc = 7E67A22A, bufsize = 8192, loops = 1000000, elapsed = 22.349156
std_64_noprintcrc.out
crc = 0, bufsize = 64, loops = 1000000, elapsed = 0.151354
sb8_64_noprintcrc.out
crc = 0, bufsize = 64, loops = 1000000, elapsed = 0.000005
std_64_printcrc.out
crc = 9C9FBE2E, bufsize = 64, loops = 1000000, elapsed = 0.559315
sb8_64_printcrc.out
crc = F70BC6AE, bufsize = 64, loops = 1000000, elapsed = 0.357382
The result names are in the format:
<algo_type>_<test_size>_<was_mycrc_referenced_in_printf>.out
crc = 0 in the result means that the mycrc variable was not referenced
anywhere after the for-loop. As can be seen, if mycrc is not referenced in
the printf, that is, its usage is limited to just inside the 'for' loop,
then GCC (4.1) seems to be optimizing the loop heavily. In the case of SB8,
if mycrc is not referenced later, GCC seems to have removed the loop
entirely!
The only difference between the <x>_noprintcrc and the <x>_printcrc
tests was that in the printf() call, the first parameter after the format
string was either a zero or the mycrc variable, respectively.
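The effect described above is easy to reproduce with any optimizing
compiler: if nothing observable depends on the loop's result, the whole loop
is dead code. A self-contained illustration (crc_step() below is a stand-in
mixing function, not the PG or SB8 CRC code):
<snip>
#include <stdio.h>
#include <stdint.h>

/* Stand-in for one CRC update step; the real test used the PG/SB8 macros. */
static inline uint32_t
crc_step(uint32_t crc, uint8_t byte)
{
    return (crc >> 8) ^ ((crc ^ byte) * 0x01000193u);
}

int
main(void)
{
    static uint8_t buf[8192];   /* zero-initialized test buffer */
    uint32_t mycrc = 0;

    for (long loop = 0; loop < 1000000; loop++)
        for (size_t i = 0; i < sizeof(buf); i++)
            mycrc = crc_step(mycrc, buf[i]);

    /*
     * If this printf passes a literal 0 instead of mycrc, nothing observable
     * depends on the loop, and an optimizing compiler (e.g. gcc -O2) may
     * delete it entirely -- which is why the "noprintcrc" runs above report
     * near-zero elapsed times.  Printing mycrc keeps the work alive.
     */
    printf("crc = %08X\n", (unsigned) mycrc);
    return 0;
}
</snip>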
I am highly apprehensive that I might have made some mistake while
converting the function calls to macros, though I have not been able to
prove it thus far. Please check its validity as compared to the
function-call version.
If there's no mistake, then I think SB8 is back in the performance game
now. These results were obtained with gcc 4.1 on FC5 running on Intel
Pentium M 1.86 GHz, with the OS started and running in runlevel 3.
Please dump the .c and .h files from the attachment on top of Tom's
package, and test it as earlier.
Best regards,
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | yahoo }.com
Attachments:
Tom Lane <tgl@sss.pgh.pa.us> writes:
Simon is essentially arguing that if we are willing to assume no
incomplete write() we may as well assume it for WAL too. This seems
to me to be raising the risk significantly, but I admit that I can't
put my finger on why exactly.
Actually I think we can deal with torn pages in the WAL more easily than in
database files anyway. In database files we need to get the entire page back
correctly one way or the other, so we need full_page_writes in order to deal
with them properly.
In the WAL we just need to be able to detect torn pages and stop reading WAL
at that point. That's easier and doesn't really need a CRC. We could just
adopt the Sybase strategy of storing a unique id number every 512 bytes
throughout the WAL page. If those numbers don't match then we have a torn
page; the system crashed at that point and we should stop reading WAL pages.
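A rough sketch of that per-sector stamping idea; the constants and layout
below are invented for illustration and not taken from Sybase or PostgreSQL:
<snip>
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define WAL_PAGE_SIZE    8192
#define SECTOR_SIZE      512
#define SECTORS_PER_PAGE (WAL_PAGE_SIZE / SECTOR_SIZE)

/*
 * Before writing an 8 kB WAL page, stamp the same id into the last 4 bytes
 * of every 512-byte sector.  (A real implementation would reserve that
 * space in the record format; here we simply overwrite it.)
 */
static void
stamp_page(uint8_t *page, uint32_t page_id)
{
    for (int s = 0; s < SECTORS_PER_PAGE; s++)
        memcpy(page + (s + 1) * SECTOR_SIZE - sizeof(page_id),
               &page_id, sizeof(page_id));
}

/*
 * After a crash, a page whose sectors carry different ids was only
 * partially written: stop reading WAL at that point.
 */
static bool
page_is_torn(const uint8_t *page)
{
    uint32_t first, other;

    memcpy(&first, page + SECTOR_SIZE - sizeof(first), sizeof(first));
    for (int s = 1; s < SECTORS_PER_PAGE; s++)
    {
        memcpy(&other, page + (s + 1) * SECTOR_SIZE - sizeof(other),
               sizeof(other));
        if (other != first)
            return true;        /* mismatched sector id: torn page */
    }
    return false;
}
</snip>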
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
On Tue, 2006-10-24 at 14:07 -0400, Tom Lane wrote:
"Gregory Maxwell" <gmaxwell@gmail.com> writes:
I'm not aware of any other system which can guarantee the atomicity
of 8k writes.
The reasoning for supporting full_page_writes = off is that if you have
a stable kernel and suitable backup power, battery backed write cache,
etc, your risk of a partially completed write() may be low enough to
be acceptable. Obviously there are no 100.00% guarantees, but that's
what you keep backups for ...
Simon is essentially arguing that if we are willing to assume no
incomplete write() we may as well assume it for WAL too. This seems
to me to be raising the risk significantly, but I admit that I can't
put my finger on why exactly.
I agree about the significant additional risk, hence the additional
parameter.
I'll do some internal testing to see what the risk-reward is. If that
seems worthwhile, then I'll post the patch for general testing/comment.
(Incidentally, having GUCs that depend on other GUCs is bad news since
they are set alphabetically. I'd want to only allow wal_checksum=off iff
full_page_writes=off, which will work, but only because W comes after F
and for no other reason. Generic solution for dependent GUCs would be
great...)
One point not directly related to crash safety is whether CRC checking
isn't still going to be a good idea when PITR is enabled. Archived WAL
files are going to have been through a great deal more transferring than
ones merely being used to recover from a crash.
Agreed. Both disks and tapes/other mechanisms must be known CRC-safe
before this idea would be worth using in production. Many enterprises do
already think they have bomb-proof kit, so we may as well support them
in that belief.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
"Gurjeet Singh" <singh.gurjeet@gmail.com> writes:
On 10/23/06, Tom Lane <tgl@sss.pgh.pa.us > wrote:
I didn't particularly trust the timing calculations in your benchmark
program,
Any particular reason? (why and what did you doubt in it?).
Well, the specific thing that set off my bogometer was
#define TV_DIFF_MILLI(tv1, tv2) ((tv2.tv_sec*1000+((tv2.tv_usec)/1000))-(tv1.tv_sec*1000+((tv1.tv_usec)/1000)))
which is going to have overflow problems on any platform where tv_sec
isn't a 64-bit type (which is still all of 'em AFAIK). But more
generally, your test is timing a CRC across 100 4Kb segments, which
isn't representative of PG's usage of CRCs. I don't think there are
any XLogWrite calls that have more than about 5 segments, and in most
cases those segments contain a few dozen bytes not a few K. So you
need to be looking at much shorter loop runs.
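For reference, the overflow comes from computing tv_sec*1000 in (commonly
32-bit) time_t arithmetic; a sketch of an overflow-safe variant that widens
to 64 bits first (the function name is illustrative, not from the patch):
<snip>
#include <stdint.h>
#include <sys/time.h>

/*
 * Difference between two struct timevals in milliseconds, computed in
 * 64-bit arithmetic so that the seconds * 1000 term cannot overflow a
 * 32-bit time_t.
 */
static int64_t
tv_diff_milli(const struct timeval *tv1, const struct timeval *tv2)
{
    return ((int64_t) tv2->tv_sec - (int64_t) tv1->tv_sec) * 1000 +
           ((int64_t) tv2->tv_usec - (int64_t) tv1->tv_usec) / 1000;
}
</snip>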
The test case I proposed uses timing code that I trusted (borrowed from
long-established coding in postgres.c), and tests loop lengths that are
somewhat representative for PG, but it is still biased in favor of slice8
because it repeats the CRC calculations consecutively without any other
activity --- presumably this fact creates a bias for a method that needs
more L2 cache space over one that doesn't need so much. I'd have tried
harder to make an unbiased test case if this version had showed slice8 as
competitive, but so far it seems that on a majority of current CPUs and
compilers it's not competitive.
regards, tom lane
On Tue, 2006-10-24 at 15:42 -0400, Gregory Stark wrote:
Tom Lane <tgl@sss.pgh.pa.us> writes:
Simon is essentially arguing that if we are willing to assume no
incomplete write() we may as well assume it for WAL too. This seems
to me to be raising the risk significantly, but I admit that I can't
put my finger on why exactly.
Actually I think we can deal with torn pages in the WAL more easily than in
database files anyway. In database files we need to get the entire page back
correctly one way or the other, so we need full_page_writes in order to deal
with them properly.
In the WAL we just need to be able to detect torn pages and stop reading WAL
at that point. That's easier and doesn't really need a CRC. We could just
adopt the Sybase strategy of storing a unique id number every 512 bytes
throughout the WAL page. If those numbers don't match then we have a torn
page; the system crashed at that point and we should stop reading WAL pages.
Right, but I'm only suggesting using this if you're safe from torn pages
anyhow. I'm not suggesting anyone does wal_checksum = off when
full_page_writes = on.
I do think you're right: we're much less exposed by a torn page in WAL
than we are for the stable database.
I've looked into this in more depth following your suggestion: I think
it seems straightforward to move the xl_prev field from being a header
to a trailer. That way when we do the test on the back pointer we will
be assured that there is no torn page affecting the remainder of the
xlrec. That would make it safer with wal_checksum = off.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes:
I've looked into this in more depth following your suggestion: I think
it seems straightforward to move the xl_prev field from being a header
to a trailer. That way when we do the test on the back pointer we will
be assured that there is no torn page affecting the remainder of the
xlrec. That would make it safer with wal_checksum = off.
Hm. I think in practice this may actually help reduce the exposure to torn
pages. However in theory there's no particular reason to think the blocks will
be written out in physical order.
The kernel may sync its buffers in some order dictated by its in-memory data
structure and may end up coming across the second half of the 8kb page before
the first half. It may even lie earlier on disk than the first half if the
filesystem started a new extent at that point.
If they were 4kb pages there would be fewer ways it could be written out of
order, but even then the hard drive could find a bad block and remap it. I'm
not sure what level of granularity drives remap at, it may be less than 4kb.
To eliminate the need for the CRC in the WAL for everyone and still be safe
from torn pages I think you have to have something like xl_prev repeated every
512b throughout the page.
But if this is only an option for systems that don't expect to suffer from
torn pages then sure, putting it in a footer seems like a good way to reduce
the exposure somewhat. Putting it in both a header *and* a footer might be
even better.
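A rough sketch of that header-plus-trailer idea: store a copy of the
back-pointer both before and after the record body, and refuse to replay a
record whose two copies disagree. The record layout below is invented for
illustration and is nothing like the real XLogRecord:
<snip>
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

/* Illustrative record layout only. */
typedef struct
{
    uint64_t xl_prev;    /* position of previous record (header copy) */
    uint32_t xl_len;     /* length of the record body that follows */
} DemoRecHeader;

/*
 * On-disk form for this sketch:
 *   [DemoRecHeader][body bytes][uint64_t trailer copy of xl_prev]
 * If a torn write lost the tail of the record, the trailer copy is either
 * missing or fails to match the header copy, so replay stops here.
 */
static bool
record_is_complete(const uint8_t *rec, size_t bytes_available,
                   uint64_t expected_prev)
{
    DemoRecHeader hdr;
    uint64_t trailer;
    size_t total;

    if (bytes_available < sizeof(hdr))
        return false;
    memcpy(&hdr, rec, sizeof(hdr));

    if (hdr.xl_prev != expected_prev)
        return false;                       /* header back-pointer wrong */

    total = sizeof(hdr) + hdr.xl_len + sizeof(trailer);
    if (bytes_available < total)
        return false;                       /* record runs past valid data */

    memcpy(&trailer, rec + sizeof(hdr) + hdr.xl_len, sizeof(trailer));
    return trailer == expected_prev;        /* trailer copy must agree */
}
</snip>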
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
In the WAL we just need to be able to detect torn pages and stop
reading WAL at that point. That's easier and doesn't really need a
CRC. We could just adopt the Sybase strategy of storing a unique id
number every 512 bytes throughout the WAL page. If those numbers don't
match then we have a torn page; the system crashed at that point and
we should stop reading WAL pages.
I've looked into this in more depth following your
suggestion: I think it seems straightforward to move the
xl_prev field from being a header to a trailer. That way when
we do the test on the back pointer we will be assured that
there is no torn page effecting the remainder of the xlrec.
That would make it safer with wal_checksum = off.
I do not think we can assume any order of when a block is written to
disk.
I think all this can only be used on an OS and hardware that can guarantee
that what is written by one IO call (typically 8k) from pg is safe.
Those combinations do exist, so I think we want the switch.
Putting xl_prev to the end helps only for direct IO WAL sync methods,
else we would need it on every page.
Andreas
On Fri, 2006-10-27 at 10:54 +0200, Zeugswetter Andreas ADI SD wrote:
In the WAL we just need to be able to detect torn pages and stop
reading WAL at that point. That's easier and doesn't really need a
CRC. We could just adopt the Sybase strategy of storing a unique id
number every 512 bytes throughout the WAL page. If those numbers don't
match then we have a torn page; the system crashed at that point and
we should stop reading WAL pages.
I've looked into this in more depth following your
suggestion: I think it seems straightforward to move the
xl_prev field from being a header to a trailer. That way when
we do the test on the back pointer we will be assured that
there is no torn page effecting the remainder of the xlrec.
That would make it safer with wal_checksum = off.
I do not think we can assume any order of when a block is written to
disk.
I think all this can only be used on an OS and hardware that can guarantee
that what is written by one IO call (typically 8k) from pg is safe.
Those combinations do exist, so I think we want the switch.
OK, good.
... but from the various comments I'll not bother moving xl_prev to be a
trailer; I'll just leave that where it is now.
Putting xl_prev to the end helps only for direct IO WAL sync methods,
else we would need it on every page.
[There is already an XLogRecPtr on each 8k page.]
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
On Fri, Oct 27, 2006 at 10:11:08AM +0100, Simon Riggs wrote:
Putting xl_prev to the end helps only for direct IO WAL sync methods,
else we would need it on every page.
[There is already an XLogRecPtr on each 8k page.]
Given that hardware sector size is still 512 bytes, should there be a
way of detecting a missing 512 byte block in the middle of an 8K block?
The idea of simply writing a serial counter every 512 bytes seems to
be a good way to handle that...
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.
Putting xl_prev to the end helps only for direct IO WAL sync
methods, else we would need it on every page.
[There is already an XLogRecPtr on each 8k page.]
Given that hardware sector size is still 512 bytes, should
there be a way of detecting a missing 512 byte block in the
middle of an 8K block.
The idea of simply writing a serial counter every 512 bytes
seems to be a good way to handle that...
No, we have the CRC for that. You are not supposed to turn it off
if there is any chance that an 8k block might not be written whole.
Andreas