WAL CPU overhead/optimization (was Master-slave visibility order)

Started by Andres Freundover 12 years ago8 messages
#1Andres Freund
andres@2ndquadrant.com

On 2013-08-30 01:10:40 +0300, Ants Aasma wrote:

On Fri, Aug 30, 2013 at 12:33 AM, Andres Freund <andres@2ndquadrant.com> wrote:

FWIW, WAL is still the major bottleneck for INSERT heavy workloads. The
per CPU overhead actually minimally increased (at least in my tests), it
just scales noticeably better than before.

Interesting. Do you have any insight what is behind the CPU overhead?
Maybe the solution is to make WAL insertion cheap enough to not
matter. That won't be easy, but neither are the alternatives.

Funnily by far the biggest thing I have seen in benchmarks is the CRC32
computation. I plan to brush up my ~3 year old CRC32 reimplementation
patch sometime, but afair you had a much better one?

I have some doubts about weakening the hash function by also using FNV
or similar here, so I'd first like to try how much of a difference a
better CRC32 implementation can make with the current XLogInsert()
implementation.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Ants Aasma
ants@cybertec.at
In reply to: Andres Freund (#1)
Re: WAL CPU overhead/optimization (was Master-slave visibility order)

On Fri, Aug 30, 2013 at 1:30 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-08-30 01:10:40 +0300, Ants Aasma wrote:

On Fri, Aug 30, 2013 at 12:33 AM, Andres Freund <andres@2ndquadrant.com> wrote:

FWIW, WAL is still the major bottleneck for INSERT heavy workloads. The
per CPU overhead actually minimally increased (at least in my tests), it
just scales noticeably better than before.

Interesting. Do you have any insight what is behind the CPU overhead?
Maybe the solution is to make WAL insertion cheap enough to not
matter. That won't be easy, but neither are the alternatives.

Funnily by far the biggest thing I have seen in benchmarks is the CRC32
computation. I plan to brush up my ~3 year old CRC32 reimplementation
patch sometime, but afair you had a much better one?

I have some doubts about weakening the hash function by also using FNV
or similar here, so I'd first like to try how much of a difference a
better CRC32 implementation can make with the current XLogInsert()
implementation.

The CRC32 implementations mostly differ by the amount of lookups that
are done in parallel. Postgresql does 1 lookup, IIRC zlib
implementation does 4, Intel has a paper that recommends going up to
8. The tradeoff is that each level requires a 4KB lookup table - for
small records the additional cache misses will probably kill any
speedup.

A quick overview of the hot cache large buffer performance of a few
interesting options:
crc32 slice-by-1: 0.148 bytes/cycle
crc32 slice-by-4: 0.392 bytes/cycle
crc32 slice-by-8: 0.654 bytes/cycle
crc32c instruction pipelined by 3: 6.8 bytes/cycle (number from Intels paper)
FNV 1 byte at a time version: 0.333 bytes/cycle
md5: 0.159 bytes/cycle
Murmur3A: 1.019 bytes/cycle
CityHash64: 4.246 bytes/cycle

CityHash64 actually looks pretty good, there no known hash quality
issues. Compared to CRC, the only weakening is that single bit errors
are not guaranteed to be 100% detected. There's also the issue that
only a 64bit implementation exists, but I'm sure this can be resolved
(if necessary, we can just use Murmur3 on 32bit).

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Andres Freund
andres@2ndquadrant.com
In reply to: Ants Aasma (#2)
Re: WAL CPU overhead/optimization (was Master-slave visibility order)

On 2013-08-30 02:53:54 +0300, Ants Aasma wrote:

On Fri, Aug 30, 2013 at 1:30 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-08-30 01:10:40 +0300, Ants Aasma wrote:

On Fri, Aug 30, 2013 at 12:33 AM, Andres Freund <andres@2ndquadrant.com> wrote:

FWIW, WAL is still the major bottleneck for INSERT heavy workloads. The
per CPU overhead actually minimally increased (at least in my tests), it
just scales noticeably better than before.

Interesting. Do you have any insight what is behind the CPU overhead?
Maybe the solution is to make WAL insertion cheap enough to not
matter. That won't be easy, but neither are the alternatives.

Funnily by far the biggest thing I have seen in benchmarks is the CRC32
computation. I plan to brush up my ~3 year old CRC32 reimplementation
patch sometime, but afair you had a much better one?

I have some doubts about weakening the hash function by also using FNV
or similar here, so I'd first like to try how much of a difference a
better CRC32 implementation can make with the current XLogInsert()
implementation.

The CRC32 implementations mostly differ by the amount of lookups that
are done in parallel. Postgresql does 1 lookup, IIRC zlib
implementation does 4, Intel has a paper that recommends going up to
8. The tradeoff is that each level requires a 4KB lookup table - for
small records the additional cache misses will probably kill any
speedup.

A quick overview of the hot cache large buffer performance of a few
interesting options:
[interesting data]

I am not sure "hot cache large buffer performance" is really the
interesting case. Most of the XLogInsert()s are pretty small in the
common workloads. I vaguely recall trying 8 and getting worse
performance on many workloads, but that might have been a problem of my
implementation.

The reason I'd like to go for a faster CRC32 implementation as a first
step is that it's easy. Easy to verify, easy to analyze, easy to
backout. I personally don't have enough interest/time in the 9.4 cycle
to purse conversion to a different algorithm (I find the idea of using
different ones on 32/64bit pretty bad), but I obviously won't stop
somebody else ;)

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Ants Aasma
ants@cybertec.at
In reply to: Andres Freund (#3)
Re: WAL CPU overhead/optimization (was Master-slave visibility order)

On Fri, Aug 30, 2013 at 3:02 AM, Andres Freund <andres@2ndquadrant.com> wrote:

I am not sure "hot cache large buffer performance" is really the
interesting case. Most of the XLogInsert()s are pretty small in the
common workloads. I vaguely recall trying 8 and getting worse
performance on many workloads, but that might have been a problem of my
implementation.

Slice-by-8 doesn't have any overhead for small buffers besides the
lookup tables, so it most likely the cache misses that were the issue.
Murmur3, CityHash and SpookyHash don't have any lookup tables and are
excellent with small keys. Especially CityHash, 64 byte hash is quoted
at 9ns.

The reason I'd like to go for a faster CRC32 implementation as a first
step is that it's easy. Easy to verify, easy to analyze, easy to
backout. I personally don't have enough interest/time in the 9.4 cycle
to purse conversion to a different algorithm (I find the idea of using
different ones on 32/64bit pretty bad), but I obviously won't stop
somebody else ;)

I might give it a shot later this cycle as I have familiarized my self
with the problem domain anyway. I understand the appeal of staying
with what we have, but this would cap the speedup at 4x and has large
caveats with the extra lookup tables. A 28x speedup might be worth the
extra effort.

Regards,
Ants Aasma

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5ktm@rice.edu
ktm@rice.edu
In reply to: Ants Aasma (#4)
Re: WAL CPU overhead/optimization (was Master-slave visibility order)

On Fri, Aug 30, 2013 at 03:22:37AM +0300, Ants Aasma wrote:

On Fri, Aug 30, 2013 at 3:02 AM, Andres Freund <andres@2ndquadrant.com> wrote:

I am not sure "hot cache large buffer performance" is really the
interesting case. Most of the XLogInsert()s are pretty small in the
common workloads. I vaguely recall trying 8 and getting worse
performance on many workloads, but that might have been a problem of my
implementation.

Slice-by-8 doesn't have any overhead for small buffers besides the
lookup tables, so it most likely the cache misses that were the issue.
Murmur3, CityHash and SpookyHash don't have any lookup tables and are
excellent with small keys. Especially CityHash, 64 byte hash is quoted
at 9ns.

The reason I'd like to go for a faster CRC32 implementation as a first
step is that it's easy. Easy to verify, easy to analyze, easy to
backout. I personally don't have enough interest/time in the 9.4 cycle
to purse conversion to a different algorithm (I find the idea of using
different ones on 32/64bit pretty bad), but I obviously won't stop
somebody else ;)

I might give it a shot later this cycle as I have familiarized my self
with the problem domain anyway. I understand the appeal of staying
with what we have, but this would cap the speedup at 4x and has large
caveats with the extra lookup tables. A 28x speedup might be worth the
extra effort.

Regards,
Ants Aasma

You may want to also check out xxhash with a BSD License and very fast
32-bit performance as well:

http://fastcompression.blogspot.com/2012/04/selecting-checksum-algorithm.html
http://code.google.com/p/xxhash/

FWIW I agree that a much faster function would be better for CPU overhead.

Regards,
Ken

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Robert Haas
robertmhaas@gmail.com
In reply to: Ants Aasma (#4)
Re: WAL CPU overhead/optimization (was Master-slave visibility order)

On Thu, Aug 29, 2013 at 8:22 PM, Ants Aasma <ants@cybertec.at> wrote:

I might give it a shot later this cycle as I have familiarized my self
with the problem domain anyway. I understand the appeal of staying
with what we have, but this would cap the speedup at 4x and has large
caveats with the extra lookup tables. A 28x speedup might be worth the
extra effort.

I agree. However, on the flip side, a bird in the hand is worth two
in the bush. A patch that just does the same thing faster is likely
to be less controversial than changing the algorithm, and does not
preclude changing the algorithm later.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Hannu Krosing
hannu@2ndQuadrant.com
In reply to: Robert Haas (#6)
Re: WAL CPU overhead/optimization (was Master-slave visibility order)

On 09/03/2013 05:27 PM, Robert Haas wrote:

On Thu, Aug 29, 2013 at 8:22 PM, Ants Aasma <ants@cybertec.at> wrote:

I might give it a shot later this cycle as I have familiarized my self
with the problem domain anyway. I understand the appeal of staying
with what we have, but this would cap the speedup at 4x and has large
caveats with the extra lookup tables. A 28x speedup might be worth the
extra effort.

I agree. However, on the flip side, a bird in the hand is worth two
in the bush. A patch that just does the same thing faster is likely
to be less controversial than changing the algorithm, and does not
preclude changing the algorithm later.

Has anybody looked at the recent i5/i7 added hrdware support for CRC32 ?

http://www.strchr.com/crc32_popcnt

From number there it is probably still slower than xxhash, but it might
be worth doing as
an interim optimisation.

Cheers

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic O�

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Andres Freund
andres@2ndquadrant.com
In reply to: Hannu Krosing (#7)
Re: WAL CPU overhead/optimization (was Master-slave visibility order)

On 2013-09-03 23:25:01 +0200, Hannu Krosing wrote:

On 09/03/2013 05:27 PM, Robert Haas wrote:

On Thu, Aug 29, 2013 at 8:22 PM, Ants Aasma <ants@cybertec.at> wrote:

I might give it a shot later this cycle as I have familiarized my self
with the problem domain anyway. I understand the appeal of staying
with what we have, but this would cap the speedup at 4x and has large
caveats with the extra lookup tables. A 28x speedup might be worth the
extra effort.

I agree. However, on the flip side, a bird in the hand is worth two
in the bush. A patch that just does the same thing faster is likely
to be less controversial than changing the algorithm, and does not
preclude changing the algorithm later.

Has anybody looked at the recent i5/i7 added hrdware support for CRC32 ?

http://www.strchr.com/crc32_popcnt

From number there it is probably still slower than xxhash, but it might
be worth doing as
an interim optimisation.

Yes. There's two problems: For one they use a different polynom
(castagnoli afair), so the results aren't compatible with ones we
compute so far. For another, you need runtime detection of whether the
instruction is available and you need some compiler specific things to
use it.
Neither is too hard to solve, but I think that amount of work is better
spent using a different hash.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers