Cost of XLogInsert CRC calculations
I was profiling a case involving UPDATEs into a table with too many
indexes (brought to mind by mysql's sql-bench, about which more later)
and got this rather surprising result for routines costing more than
1% of the total runtime:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
64.03 86.20 86.20 133608 0.00 0.00 XLogInsert
3.50 90.91 4.71 2484787 0.00 0.00 _bt_compare
2.92 94.84 3.93 839893 0.00 0.00 hash_search
2.77 98.57 3.73 1875815 0.00 0.00 LWLockAcquire
1.89 101.12 2.55 1887972 0.00 0.00 LWLockRelease
1.27 102.83 1.71 125234 0.00 0.00 _bt_getroot
1.01 104.19 1.36 403342 0.00 0.00 PinBuffer
1.00 105.54 1.35 840002 0.00 0.00 hash_any
I suppose that the bulk of the CPU cycles being attributed to XLogInsert
are going into the inlined CRC calculations. Maybe we need to think
twice about the cost/benefit ratio of using 64-bit CRCs to protect xlog
records that are often only a few dozen bytes.
regards, tom lane
On Sun, 6 Mar 2005, Tom Lane wrote:
I suppose that the bulk of the CPU cycles being attributed to XLogInsert
are going into the inlined CRC calculations. Maybe we need to think
twice about the cost/benefit ratio of using 64-bit CRCs to protect xlog
records that are often only a few dozen bytes.
Isn't the CRC quite important on recovery to recognize where the last
valid log record is?
Is there any better implementations of CRC-64? Would using a different
polynomial help?
Would it help to do the CRC calculation in a more wholesale fashion in
XLogWrite?
How about switching to CRC-32 or even CRC-16? I searched the archives for
the reason CRC-64 was chosen in the first place. It seems that the
difference in computation time was not considered to be significant, and
there was 8 bytes available in the record header anyway.
Just some thoughts...
- Heikki
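[Heikki's question about CRC widths and polynomials can be explored directly with a width-agnostic bitwise CRC. The sketch below is generic illustration, not PostgreSQL's implementation; the function name and parameters are invented for the example. Swapping the polynomial/init/xorout parameters switches between CRC-16 and CRC-32:]

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Generic reflected (LSB-first) bitwise CRC.  Slow, but polynomial-
 * agnostic, so CRC-16 and CRC-32 candidates can be compared with the
 * same code.  Production code would use a lookup table instead.
 */
static uint32_t
crc_bitwise(uint32_t poly, uint32_t init, uint32_t xorout,
            const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = init;

    while (len-- > 0)
    {
        crc ^= *p++;                    /* fold in the next input byte */
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? poly ^ (crc >> 1) : (crc >> 1);
    }
    return crc ^ xorout;
}
```

[For example, crc_bitwise(0xEDB88320, 0xFFFFFFFF, 0xFFFFFFFF, buf, len) gives the common zlib-style CRC-32, while crc_bitwise(0xA001, 0, 0, buf, len) gives the CRC-16/ARC variant.]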
On Sun, 2005-03-06 at 00:17 -0500, Tom Lane wrote:
I was profiling a case involving UPDATEs into a table with too many
indexes (brought to mind by mysql's sql-bench, about which more later)
and got this rather surprising result for routines costing more than
1% of the total runtime:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
64.03 86.20 86.20 133608 0.00 0.00 XLogInsert
3.50 90.91 4.71 2484787 0.00 0.00 _bt_compare
2.92 94.84 3.93 839893 0.00 0.00 hash_search
2.77 98.57 3.73 1875815 0.00 0.00 LWLockAcquire
1.89 101.12 2.55 1887972 0.00 0.00 LWLockRelease
1.27 102.83 1.71 125234 0.00 0.00 _bt_getroot
1.01 104.19 1.36 403342 0.00 0.00 PinBuffer
1.00 105.54 1.35 840002 0.00 0.00 hash_any
I suppose that the bulk of the CPU cycles being attributed to XLogInsert
are going into the inlined CRC calculations. Maybe we need to think
twice about the cost/benefit ratio of using 64-bit CRCs to protect xlog
records that are often only a few dozen bytes.
Yes, in recent performance tests sponsored by Unisys, this result was
also very clear. In those tests we used Intel VTune to identify the
precise lines of code soaking up the cycles...it was the CRC checks.
More results should be available from the Unisys testing within a few
days.
I had assumed that the majority of the cost of CRC checking was a
result of the need to log complete blocks, rather than of the fairly
small xlog records themselves?
Best Regards, Simon Riggs
Hi Tom,
I was profiling a case involving UPDATEs into a table with too many
indexes (brought to mind by mysql's sql-bench, about which more later) and
got this rather surprising result for routines costing more than 1% of the
total runtime:
(cut)
I suppose that the bulk of the CPU cycles being attributed to XLogInsert
are going into the inlined CRC calculations. Maybe we need to think twice
about the cost/benefit ratio of using 64-bit CRCs to protect xlog records
that are often only a few dozen bytes.
Wow, a 64-bit CRC does seem excessive, especially when going back to Zmodem
days where a 50-100k file seemed to be easily protected by a 32-bit CRC. I'm
sure there are some error rates somewhere dependent upon the polynomial and
the types of error detected.... Try the following link towards the bottom:
http://www.ee.unb.ca/tervo/ee4253/crc.htm for some theory on detection rates
vs. CRC size.
Kind regards,
Mark.
------------------------
WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth
PL6 8BT
T: +44 (0)1752 791021
F: +44 (0)1752 791023
W: http://www.webbased.co.uk
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
Wow, a 64-bit CRC does seem excessive, especially when going back to Zmodem
days where a 50-100k file seemed to be easily protected by a 32-bit CRC. I'm
sure there are some error rates somewhere dependent upon the polynomial and
the types of error detected.... Try the following link towards the bottom:
http://www.ee.unb.ca/tervo/ee4253/crc.htm for some theory on detection rates
vs. CRC size.
When the CRC size was decided, I recall someone arguing that it would
really make a difference to have 1-in-2^64 chance of failure rather than
1-in-2^32. I was dubious about this at the time, but didn't have any
evidence showing that we shouldn't go for 64. I suppose we ought to try
the same example with a 32-bit CRC and see how much it helps.
regards, tom lane
Tom Lane wrote:
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
Wow, a 64-bit CRC does seem excessive, especially when going back to Zmodem
days where a 50-100k file seemed to be easily protected by a 32-bit CRC. I'm
sure there are some error rates somewhere dependent upon the polynomial and
the types of error detected.... Try the following link towards the bottom:
http://www.ee.unb.ca/tervo/ee4253/crc.htm for some theory on detection rates
vs. CRC size.
When the CRC size was decided, I recall someone arguing that it would
really make a difference to have 1-in-2^64 chance of failure rather than
1-in-2^32. I was dubious about this at the time, but didn't have any
evidence showing that we shouldn't go for 64. I suppose we ought to try
the same example with a 32-bit CRC and see how much it helps.
Continuing this logic, why not 16-bit then?
Regards
Gaetano Mendola
On Mon, 2005-03-07 at 09:39 -0500, Tom Lane wrote:
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
Wow, a 64-bit CRC does seem excessive, especially when going back to Zmodem
days where a 50-100k file seemed to be easily protected by a 32-bit CRC. I'm
sure there are some error rates somewhere dependent upon the polynomial and
the types of error detected.... Try the following link towards the bottom:
http://www.ee.unb.ca/tervo/ee4253/crc.htm for some theory on detection rates
vs. CRC size.
When the CRC size was decided, I recall someone arguing that it would
really make a difference to have 1-in-2^64 chance of failure rather than
1-in-2^32. I was dubious about this at the time, but didn't have any
evidence showing that we shouldn't go for 64. I suppose we ought to try
the same example with a 32-bit CRC and see how much it helps.
I think some of the additional run-time may be coming from processor
stalls associated with some of the constants used in the CRC checks.
I'll come back with more info on that later.
Well, we're using the CRC in 3 separate places...
(1) for xlog records
(2) for complete blocks copied to xlog
(3) for control files
For (1), records are so short that probably CRC16 would be sufficient
without increasing the error rate noticeably.
I think I'd like to keep (3) at CRC64...it's just too important. Plus
that's slightly less code to change.
My money is on (2) being the source of most of that run-time anyway,
since when we enclose a whole block it takes a lot longer to CRC64 all
BLCKSZ bytes than it would do to CRC a single record in (1). But of
course, longer stretches of data need better error detection rates.
If Ethernet is using CRC32, it seems somewhat strange to use anything
higher than that, seeing as we're very likely to be sending xlog files
across the net anyway. Packet size is mostly comparable to BLCKSZ, isn't
it?
So, yes CRC32 seems more reasonable.
One of the things I was thinking about was whether we could use up those
cycles more effectively. If we were to include a compression routine
before we calculated the CRC that would
- reduce the size of the blocks to be written, hence reduce size of xlog
- reduce the following CRC calculation
I was thinking about using a simple run-length encoding to massively
shrink half-empty blocks with lots of zero padding, but we've already
got code to LZW the data down also.
Best Regards, Simon Riggs
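[Simon's run-length idea is easy to prototype. The (count, byte) pair format below is hypothetical, chosen only to show how cheaply a zero-padded half-empty block shrinks; it is not anything from the PostgreSQL source:]

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Toy run-length encoder: each run of identical bytes (capped at 255)
 * becomes a (count, byte) pair.  Worst case the output is 2x the
 * input, but a mostly-zero block collapses dramatically.  dst must
 * have room for 2 * srclen bytes.
 */
static size_t
rle_encode(const uint8_t *src, size_t srclen, uint8_t *dst)
{
    size_t out = 0;

    for (size_t i = 0; i < srclen; )
    {
        uint8_t byte = src[i];
        size_t  run = 1;

        while (i + run < srclen && src[i + run] == byte && run < 255)
            run++;
        dst[out++] = (uint8_t) run;
        dst[out++] = byte;
        i += run;
    }
    return out;
}
```

[With this format, 4 kB of zero padding in a half-empty block collapses to about 17 (count, byte) pairs, so the subsequent CRC runs over far fewer bytes.]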
Simon Riggs <simon@2ndquadrant.com> writes:
Well, we're using the CRC in 3 separate places...
(1) for xlog records
(2) for complete blocks copied to xlog
(3) for control files
For (1), records are so short that probably CRC16 would be sufficient
without increasing the error rate noticeably.
I think I'd like to keep (3) at CRC64...its just too important. Plus
thats slightly less code to change.
The control files are so short that CRC16 would be plenty.
My money is on (2) being the source of most of that run-time anyway,
Undoubtedly, so there's not going to be much win from micro-optimization
by having several different CRC functions. I would go for CRC32 across
the board, myself.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
Simon Riggs <simon@2ndquadrant.com> writes:
For (1), records are so short that probably CRC16 would be sufficient
without increasing the error rate noticeably.
The control files are so short that CRC16 would be plenty.
It's not really the size of the data that governs the reasonable size of the
CRC. It's the probability of there being errors. For that you have to think
about what possible causes of errors you're concerned with.
AFAIK there's no CRC check at all in tables or indexes. So it doesn't seem
that bad hardware, such as a bad drive or RAM, is what you're concerned
with here.
From the other post it sounded like the threat was the possibility of an
interruption arriving after the transaction commit log entry is written but
before the fsync has written the rest of the log.
In that case it seems the probability of it occurring is independent of the
size of the log entry. Is a 1 in 64k chance of a power failure resulting in
data corruption acceptable? A 1 in 4 billion chance?
You could eliminate the hole completely by using two fsyncs: fsync the
transaction information once; then, once that completes, write and fsync
the transaction commit log entry.
--
greg
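[Greg's two-fsync ordering can be sketched as follows; the function name and record layout are hypothetical, and this shows only the write/flush ordering, not PostgreSQL code:]

```c
#include <stddef.h>
#include <unistd.h>
#include <fcntl.h>

/*
 * Sketch of the two-fsync commit protocol: flush the transaction's
 * log data first, and only after that succeeds append and flush the
 * commit record.  A crash can then never leave a commit record that
 * points at unflushed data.  Returns 0 on success, -1 on error.
 */
static int
commit_with_two_fsyncs(int xlog_fd,
                       const void *xact_data, size_t xact_len,
                       const void *commit_rec, size_t commit_len)
{
    /* Step 1: make the transaction's own log entries durable. */
    if (write(xlog_fd, xact_data, xact_len) != (ssize_t) xact_len)
        return -1;
    if (fsync(xlog_fd) != 0)
        return -1;

    /* Step 2: only now append and flush the commit record. */
    if (write(xlog_fd, commit_rec, commit_len) != (ssize_t) commit_len)
        return -1;
    return fsync(xlog_fd);
}
```

[The obvious cost is a second synchronous flush per commit.]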
On Mon, 2005-03-07 at 20:50 -0500, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
Well, we're using the CRC in 3 separate places...
(1) for xlog records
(2) for complete blocks copied to xlog
(3) for control files
For (1), records are so short that probably CRC16 would be sufficient
without increasing the error rate noticeably.
I think I'd like to keep (3) at CRC64...it's just too important. Plus
that's slightly less code to change.
The control files are so short that CRC16 would be plenty.
My money is on (2) being the source of most of that run-time anyway,
Undoubtedly, so there's not going to be much win from micro-optimization
by having several different CRC functions.
Agreed.
I would go for CRC32 across
the board, myself.
Sold.
Best Regards, Simon Riggs
On Mon, Mar 07, 2005 at 11:53:59PM +0000, Simon Riggs wrote:
On Mon, 2005-03-07 at 09:39 -0500, Tom Lane wrote:
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
[skipped]
Well, we're using the CRC in 3 separate places...
(1) for xlog records
(2) for complete blocks copied to xlog
(3) for control files
For (1), records are so short that probably CRC16 would be sufficient
without increasing the error rate noticeably.
I think I'd like to keep (3) at CRC64...it's just too important. Plus
that's slightly less code to change.
My money is on (2) being the source of most of that run-time anyway,
since when we enclose a whole block it takes a lot longer to CRC64 all
BLCKSZ bytes than it would do to CRC a single record in (1). But of
course, longer stretches of data need better error detection rates.
Well, if there is no need for error recovery, then what about using a
simpler algorithm, like a plain checksum? Perhaps it could even be
attached to one of the required memcpy calls.
One of the things I was thinking about was whether we could use up those
cycles more effectively. If we were to include a compression routine
before we calculated the CRC that would
- reduce the size of the blocks to be written, hence reduce size of xlog
- reduce the following CRC calculation
I was thinking about using a simple run-length encoding to massively
shrink half-empty blocks with lots of zero padding, but we've already
got code to LZW the data down also.
Best Regards, Simon Riggs
Simon,
I think having a compression routine in there could make real sense.
We have done some major I/O testing involving compression for a large
customer some time ago. We have seen that compressing / decompressing on
the fly is in MOST cases much faster than uncompressed I/O (try a simple
"cat file | ..." vs. "zcat file.gz | ...") - the zcat version will be
faster on all platforms we have tried (Linux, AIX, Sun on some SAN
system, etc. ...).
Also, when building up a large database within one transaction the xlog
will eat a lot of storage - this can be quite annoying when you have to
deal with a lot of data.
Are there any technical reasons which would prevent somebody from
implementing compression?
Best regards,
Hans
--
Cybertec Geschwinde u Schoenig
Schoengrabern 134, A-2020 Hollabrunn, Austria
Tel: +43/660/816 40 77
www.cybertec.at, www.postgresql.at
On Fri, 2005-03-11 at 19:31 +0100, Hans-Jürgen Schönig wrote:
One of the things I was thinking about was whether we could use up those
cycles more effectively. If we were to include a compression routine
before we calculated the CRC that would
- reduce the size of the blocks to be written, hence reduce size of xlog
- reduce the following CRC calculation
I was thinking about using a simple run-length encoding to massively
shrink half-empty blocks with lots of zero padding, but we've already
got code to LZW the data down also.
I think having a compression routine in there could make real sense.
We have done some major I/O testing involving compression for a large
customer some time ago. We have seen that compressing / decompressing on
the fly is in MOST cases much faster than uncompressed I/O (try a simple
"cat file | ..." vs. "zcat file.gz | ...") - the zcat version will be
faster on all platforms we have tried (Linux, AIX, Sun on some SAN
system, etc. ...).
Also, when building up a large database within one transaction the xlog
will eat a lot of storage - this can be quite annoying when you have to
deal with a lot of data.
Agreed.
Are there any technical reasons which would prevent somebody from
implementing compression?
Not currently, that I'm aware of. But the way additional blocks are
added to xlog records would need to be changed to allow for variable
length output.
It's on my list...
Best Regards, Simon Riggs
-----Original Message-----
From: Mark Cave-Ayland [mailto:m.cave-ayland@webbased.co.uk]
Sent: 07 March 2005 11:04
To: 'tgl@sss.pgh.pa.us'
Cc: 'pgsql-hackers@postgreSQL.org'
Subject: Re: Cost of XLogInsert CRC calculations
(cut)
I suppose that the bulk of the CPU cycles being attributed to
XLogInsert are going into the inlined CRC calculations. Maybe we
need to think twice about the cost/benefit ratio of using 64-bit
CRCs to protect xlog records that are often only a few dozen bytes.
Wow, a 64-bit CRC does seem excessive, especially when going
back to Zmodem days where a 50-100k file seemed to be easily
protected by a 32-bit CRC. I'm sure there are some error
rates somewhere dependent upon the polynomial and the types
of error detected.... Try the following link towards the
bottom: http://www.ee.unb.ca/tervo/ee4253/crc.htm for some
theory on detection rates vs. CRC size.
Hi Tom/Simon,
I was just researching some articles on compression (zlib) and I saw mention
of the Adler-32 algorithm which is supposed to be slightly less accurate
than an equivalent CRC calculation but significantly faster to compute. I
haven't located a good paper comparing the error rates of the two different
checksums, however extending the Adler-32 algorithm up to 64 bits may be a
way of increasing the speed at the expense of a slight loss in the accuracy
of error detection if we decided that we want to stay at 64 bits as opposed
to 32.
The following seems to indicate that Adler-32 is at least twice as fast as
optimised CRC32: http://www.winimage.com/misc/readfile_test.htm.
Kind regards,
Mark.
------------------------
WebBased Ltd
17 Research Way
Plymouth
PL6 8BT
T: +44 (0)1752 791021
F: +44 (0)1752 791023
W: http://www.webbased.co.uk
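[For reference, Adler-32 as defined in RFC 1950 is just two running sums modulo 65521, which is why it beats table-driven CRC on speed. A minimal, untuned sketch:]

```c
#include <stdint.h>
#include <stddef.h>

#define ADLER_MOD 65521u        /* largest prime below 2^16 */

/*
 * Minimal Adler-32 (RFC 1950).  zlib's real implementation defers
 * the expensive modulo across thousands of bytes; this version
 * reduces after every byte for clarity.
 */
static uint32_t
adler32_buf(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t a = 1;             /* sum of input bytes, plus one */
    uint32_t b = 0;             /* sum of the running a values */

    while (len-- > 0)
    {
        a = (a + *p++) % ADLER_MOD;
        b = (b + a) % ADLER_MOD;
    }
    return (b << 16) | a;
}
```

[Note how, for a short input, the high half (b) is built from only a handful of partial sums, so short records exercise few of the output bits.]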
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
I was just researching some articles on compression (zlib) and I saw mention
of the Adler-32 algorithm which is supposed to be slightly less accurate
than an equivalent CRC calculation but significantly faster to compute. I
haven't located a good paper comparing the error rates of the two different
checksums,
... probably because there isn't one. With all due respect to the Zip
guys, I doubt anyone has done anywhere near the analysis on Adler-32
that has been done on CRCs. I'd much prefer to stick with true CRC
and drop it to 32 bits than go with a less-tested algorithm. Throwing
more bits at the problem doesn't necessarily create a safer checksum.
regards, tom lane
Tom Lane wrote:
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
I was just researching some articles on compression (zlib) and I saw mention
of the Adler-32 algorithm which is supposed to be slightly less accurate
than an equivalent CRC calculation but significantly faster to compute. I
haven't located a good paper comparing the error rates of the two different
checksums,
... probably because there isn't one. With all due respect to the Zip
guys, I doubt anyone has done anywhere near the analysis on Adler-32
that has been done on CRCs. I'd much prefer to stick with true CRC
and drop it to 32 bits than go with a less-tested algorithm. Throwing
more bits at the problem doesn't necessarily create a safer checksum.
Agreed. 64-bit was overkill when we added it, and it is now shown to be
a performance problem.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Tue, 2005-05-10 at 10:34 -0400, Bruce Momjian wrote:
Tom Lane wrote:
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
I was just researching some articles on compression (zlib) and I saw mention
of the Adler-32 algorithm which is supposed to be slightly less accurate
than an equivalent CRC calculation but significantly faster to compute. I
haven't located a good paper comparing the error rates of the two different
checksums,
... probably because there isn't one. With all due respect to the Zip
guys, I doubt anyone has done anywhere near the analysis on Adler-32
that has been done on CRCs. I'd much prefer to stick with true CRC
and drop it to 32 bits than go with a less-tested algorithm. Throwing
more bits at the problem doesn't necessarily create a safer checksum.
Agreed. 64-bit was overkill when we added it, and it is now shown to be
a performance problem.
Hold on... Tom has shown that there is a performance problem with the
existing CRC calculation. Thanks to Mark for staying on top of that with
some good ideas.
The cause of the performance problem has been attributed to it being a
64-bit rather than 32-bit calculation. That is certainly part of it, but
I have seen evidence that there is an Intel processor stall associated
with the use of a single byte constant somewhere in the algorithm. So
I'm unclear as to what extent the poor performance is attributable to
either issue.
That's where my experience stops so I have highlighted that for somebody
with more hardware specific assembler experience to have a look at the
algorithm. Fixing that, if possible, could greatly improve the
performance whether or not we drop from 64 to 32 bits. My hope for
outside assistance on that looks like it is not now forthcoming.
My guess would be that the algorithm's use of the byte-by-byte
calculation together with a bitmask of &FF is responsible. Perhaps
varying the length of the bitmask to either &000000FF or longer may
help? (see src/include/xlog_internal.h)
Best Regards, Simon Riggs
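[For concreteness, here is the generic shape of a byte-at-a-time, table-driven CRC-32 loop; this is a sketch, not the actual code in src/include/xlog_internal.h. The single-byte "& 0xFF" mask Simon mentions appears in the table-index expression:]

```c
#include <stdint.h>
#include <stddef.h>

static uint32_t crc_table[256];

/* Build the lookup table for the reflected 0xEDB88320 polynomial. */
static void
crc32_init(void)
{
    for (uint32_t i = 0; i < 256; i++)
    {
        uint32_t c = i;

        for (int k = 0; k < 8; k++)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : (c >> 1);
        crc_table[i] = c;
    }
}

static uint32_t
crc32_buf(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    /* One table lookup per input byte; note the single-byte mask. */
    while (len-- > 0)
        crc = crc_table[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
    return crc ^ 0xFFFFFFFFu;
}
```

[If the compiler turns that masked index into a byte-register operation against the wider crc register, that could plausibly be the partial-register stall site on Intel.]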
Simon Riggs <simon@2ndquadrant.com> writes:
The cause of the performance problem has been attributed to it being a
64-bit rather than 32-bit calculation. That is certainly part of it, but
I have seen evidence that there is an Intel processor stall associated
with the use of a single byte constant somewhere in the algorithm.
That's awfully vague --- can't you give any more detail?
I have seen XLogInsert eating significant amounts of time (up to 10% of
total CPU time) on non-Intel architectures, so I think that dropping
down to 32 bits is warranted in any case. But if you are correct then
that might not fix the problem on Intel machines. We need more info.
regards, tom lane
On Tue, 2005-05-10 at 18:22 -0400, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
The cause of the performance problem has been attributed to it being a
64-bit rather than 32-bit calculation. That is certainly part of it, but
I have seen evidence that there is an Intel processor stall associated
with the use of a single byte constant somewhere in the algorithm.
That's awfully vague --- can't you give any more detail?
I have seen XLogInsert eating significant amounts of time (up to 10% of
total CPU time) on non-Intel architectures, so I think that dropping
down to 32 bits is warranted in any case. But if you are correct then
that might not fix the problem on Intel machines. We need more info.
I have seen an Intel VTune report that shows a memory stall causing high
latency associated with a single assembly instruction in the
compiled code of the CRC calculation. The instruction was manipulating a
single byte only. I couldn't tell exactly which line of PostgreSQL code
produced the assembler. This could be either a partial register stall or
a memory order buffer stall (or another?)
Here's a discussion of this
http://www.gamasutra.com/features/19991221/barad_pfv.htm
Sorry, but that's all I know. I will try to obtain the report, which is
not in my possession.
I do *not* know with any certainty what proportion of the time is lost to
the CRC calc proper on an idealised CPU versus what is lost to this
hardware-specific interaction. I don't know if non-Intel CPUs are
affected either.
Best Regards, Simon Riggs
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: 10 May 2005 23:22
To: Simon Riggs
Cc: Bruce Momjian; Mark Cave-Ayland (External);
pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Cost of XLogInsert CRC calculations
(cut)
That's awfully vague --- can't you give any more detail?
I have seen XLogInsert eating significant amounts of time (up
to 10% of total CPU time) on non-Intel architectures, so I
think that dropping down to 32 bits is warranted in any case.
But if you are correct then that might not fix the problem
on Intel machines. We need more info.
regards, tom lane
Hi Tom/Simon,
Just for the record, I found a better analysis of Adler-32 following some
links from Wikipedia. In summary, the problem with Adler-32 is that while it
is only slightly less sensitive than CRC-32, it requires roughly a 1k
"run-in" in order to attain full coverage of the bits (with respect to
sensitivity of the input). This compares to 4 bytes of "run-in" required for
CRC-32. So unless we can guarantee a minimum of 1k of data per Xlog record,
Adler-32 won't be suitable. See the following two links for more
information:
http://en.wikipedia.org/wiki/Adler-32
http://www.ietf.org/rfc/rfc3309.txt
One other consideration would be that since CRC-32 calculations for Xlog
records occur so often, perhaps the CRC-32 routines could be written in
in-line assembler, falling back to C for unsupported processors. It would be
interesting to come up with some benchmarks to see if indeed this would be
faster than the current C implementation, since as the routine is called so
often it could add up to a significant saving under higher loads.
Kind regards,
Mark.
------------------------
WebBased Ltd
17 Research Way
Plymouth
PL6 8BT
T: +44 (0)1752 791021
F: +44 (0)1752 791023
W: http://www.webbased.co.uk