Using POPCNT and other advanced bit manipulation instructions

Started by David Rowleyover 7 years ago49 messageshackers
Jump to latest
#1David Rowley
dgrowleyml@gmail.com

Back in 2016 [1]/messages/by-id/CAEepm=3k++Ytf2LNQCvpP6m1=gY9zZHP_cfnn47=WTsoCrLCvA@mail.gmail.com there was some discussion about using the POPCNT
instruction to improve the performance of counting the number of bits
set in a word. Improving this helps various cases, such as
bms_num_members and also things like counting the allvisible and
frozen pages in the visibility map.

Thomas Munro did some work to make this happen but didn't go as far as
adding the required run-time test to allow builds which were built on
a machine with these instructions to work on a machine without them.
We've now got other places in the code which have similar run-time
tests (for example CRC calculation), so I think we should be able to
do the same for the ABM instructions.

Thomas mentions in [1]/messages/by-id/CAEepm=3k++Ytf2LNQCvpP6m1=gY9zZHP_cfnn47=WTsoCrLCvA@mail.gmail.com, to get the GCC to use the POPCNT instruction,
we must pass -mpopcnt in the build flags. After doing a bit of
research, I found [2]https://lemire.me/blog/2016/05/23/the-surprising-cleverness-of-modern-compilers/ which mentions that some compilers have some
pattern matching code to allow the popcnt instruction to be used even
without a macro such as __builtin_popcount(). I believe I've
correctly written the run-time test to skip using the new popcnt
function, but if there's any code around that might cause the compiler
to use the popcnt instruction from pattern matching, then that might
cause problems. Remember, that's not limited to core code since
pg_config will cause extensions to be compiled with -mpopcnt too.

I've put together a very rough patch to implement using POPCNT and the
leading and trailing 0-bit instructions to improve the performance of
bms_next_member() and bms_prev_member(). The correct function should
be determined on the first call to each function by way of setting a
function pointer variable to the most suitable supported
implementation. I've not yet gone through and removed all the
number_of_ones[] arrays to replace with a pg_popcount*() call. That
seems to have mostly been done in Thomas' patch [3]/messages/by-id/CAEepm=3g1_fjJGp38QGv+38BC2HHVkzUq6s69nk3mWLgPHqC3A@mail.gmail.com, part of which
I've used for the visibilitymap.c code changes. If this patch proves
to be possible, then I'll look at including the other changes Thomas
made in his patch too.

What I'm really looking for by posting now are reasons why we can't do
this. I'm also interested in getting some testing done on older
machines, particularly machines with processors that are from before
2007, both AMD and Intel. 2007-2008 seems to be around the time both
AMD and Intel added support for POPCNT and LZCNT, going by [4]https://en.wikipedia.org/wiki/SSE4#POPCNT_and_LZCNT.

I'm also a little uncertain of my cpuid bit tests. POPCNT appears to
have use bit 5 in EAX=80000001h, but also bit 23 in EAX=1 [5]https://en.wikipedia.org/wiki/CPUID#CPUID_usage_from_high-level_languages. This
appears to be a variation between Intel and AMD. AMD always implement
either both POPCNT and LZCNT or neither. Where Intel use AMDs cpuid
bit flag just for LZCNT and have reserved their own flag for POPCNT
(they didn't implement both at once, as AMD did). I'm a bit uncertain
if AMD will set the Intel POPCNT flag or not, and if they do now, then
I'm not sure if they always did. Intel were 2nd in that race, so I
assume at least the earliest AMD processors would just set only the
AMD flag. Testing might help reveal if I have this right.

I am able to measure performance gains from the patch. In a 3.4GB
table containing a single column with just 10 statistics targets, I
got the following times after running ANALYZE on the table.

Patched:

postgres=# analyze t1;
Time: 680.833 ms
Time: 699.976 ms
Time: 695.608 ms
Time: 676.007 ms
Time: 693.487 ms
Time: 726.982 ms
Time: 677.835 ms
Time: 688.426 ms

Master:

postgres=# analyze t1;
Time: 721.837 ms
Time: 756.035 ms
Time: 734.545 ms
Time: 751.969 ms
Time: 730.140 ms
Time: 724.266 ms
Time: 713.625 ms

(+3.66% avg)

This should be down to the improved performance of
visibilitymap_count(), but it may not be entirely just from faster bit
counter as I also couldn't resist tightening up the inner-most loop.

[1]: /messages/by-id/CAEepm=3k++Ytf2LNQCvpP6m1=gY9zZHP_cfnn47=WTsoCrLCvA@mail.gmail.com
[2]: https://lemire.me/blog/2016/05/23/the-surprising-cleverness-of-modern-compilers/
[3]: /messages/by-id/CAEepm=3g1_fjJGp38QGv+38BC2HHVkzUq6s69nk3mWLgPHqC3A@mail.gmail.com
[4]: https://en.wikipedia.org/wiki/SSE4#POPCNT_and_LZCNT
[5]: https://en.wikipedia.org/wiki/CPUID#CPUID_usage_from_high-level_languages

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v1-0001-Add-basic-support-for-using-the-POPCNT-and-SSE4.2.patchapplication/octet-stream; name=v1-0001-Add-basic-support-for-using-the-POPCNT-and-SSE4.2.patchDownload+937-164
#2Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: David Rowley (#1)
Re: Using POPCNT and other advanced bit manipulation instructions

On 20/12/2018 18:53, David Rowley wrote
[...]

Patched:

postgres=# analyze t1;
Time: 680.833 ms
Time: 699.976 ms
Time: 695.608 ms
Time: 676.007 ms
Time: 693.487 ms
Time: 726.982 ms
Time: 677.835 ms
Time: 688.426 ms

Master:

postgres=# analyze t1;
Time: 721.837 ms
Time: 756.035 ms
Time: 734.545 ms
Time: 751.969 ms
Time: 730.140 ms
Time: 724.266 ms
Time: 713.625 ms

(+3.66% avg)

[...]

Looking at the normalized standard deviations, the patched results have
a higher than 5% chance of being better simply by chance.  I suspect
that you have made an improvement, but the statistics are not convincing.

I can supply detailed working if you want.

Cheers,
Gavin

#3David Rowley
dgrowleyml@gmail.com
In reply to: Gavin Flower (#2)
Re: Using POPCNT and other advanced bit manipulation instructions

On Thu, 20 Dec 2018 at 20:17, Gavin Flower
<GavinFlower@archidevsys.co.nz> wrote:

Looking at the normalized standard deviations, the patched results have
a higher than 5% chance of being better simply by chance. I suspect
that you have made an improvement, but the statistics are not convincing.

Yeah, I'd hoped that I could have gotten a better signal to noise
ratio by running the test many times, but you're right. That was on
my laptop. I've run the test again on an AWS instance and the results
seem to be a bit more stable. Same table with 1 int column and 100m
rows. statistics set to 10.

Unpatched

postgres=# analyze a;

Time: 38.248 ms
Time: 35.185 ms
Time: 35.067 ms
Time: 34.879 ms
Time: 34.816 ms
Time: 34.558 ms
Time: 34.722 ms
Time: 34.427 ms
Time: 34.214 ms
Time: 34.301 ms
Time: 35.751 ms
Time: 33.993 ms
Time: 33.880 ms
Time: 33.617 ms
Time: 33.381 ms
Time: 33.326 ms

Patched:

postgres=# analyze a;

Time: 34.421 ms
Time: 33.523 ms
Time: 33.230 ms
Time: 33.678 ms
Time: 32.987 ms
Time: 32.914 ms
Time: 33.165 ms
Time: 32.707 ms
Time: 32.645 ms
Time: 32.814 ms
Time: 32.082 ms
Time: 32.143 ms
Time: 32.310 ms
Time: 31.966 ms
Time: 31.702 ms
Time: 32.089 ms

Avg +5.72%, Median +5.29%

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#4Dmitry Dolgov
9erthalion6@gmail.com
In reply to: David Rowley (#3)
Re: Using POPCNT and other advanced bit manipulation instructions

On Thu, Dec 20, 2018 at 6:53 AM David Rowley <david.rowley@2ndquadrant.com> wrote:

Thomas mentions in [1], to get the GCC to use the POPCNT instruction,
we must pass -mpopcnt in the build flags. After doing a bit of
research, I found [2] which mentions that some compilers have some
pattern matching code to allow the popcnt instruction to be used even
without a macro such as __builtin_popcount(). I believe I've
correctly written the run-time test to skip using the new popcnt
function, but if there's any code around that might cause the compiler
to use the popcnt instruction from pattern matching, then that might
cause problems.

I've checked for Clang 6, it turns out that indeed it generates popcnt without
any macro, but only in one place for bloom_prop_bits_set. After looking at this
function it seems that it would be benefitial to actually use popcnt there too.

I am able to measure performance gains from the patch. In a 3.4GB
table containing a single column with just 10 statistics targets, I
got the following times after running ANALYZE on the table.

I've tested it too a bit, and got similar results when the patched version is
slightly faster. But then I wonder if popcnt is the best solution here, since
after some short research I found a paper [1]https://arxiv.org/pdf/1611.07612.pdf, where authors claim that:

Maybe surprisingly, we show that a vectorized approach using SIMD
instructions can be twice as fast as using the dedicated instructions on
recent Intel processors.

[1]: https://arxiv.org/pdf/1611.07612.pdf

#5José Luis Tallón
jltallon@adv-solutions.net
In reply to: David Rowley (#1)
Re: Using POPCNT and other advanced bit manipulation instructions

On 20/12/18 6:53, David Rowley wrote:

Back in 2016 [1] there was some discussion about using the POPCNT
instruction to improve the performance of counting the number of bits
set in a word. Improving this helps various cases, such as
bms_num_members and also things like counting the allvisible and
frozen pages in the visibility map.

[snip]

I've put together a very rough patch to implement using POPCNT and the
leading and trailing 0-bit instructions to improve the performance of
bms_next_member() and bms_prev_member(). The correct function should
be determined on the first call to each function by way of setting a
function pointer variable to the most suitable supported
implementation. I've not yet gone through and removed all the
number_of_ones[] arrays to replace with a pg_popcount*() call.

IMVHO: Please do not disregard potential optimization by the compiler
around those calls.. o_0  That might explain the reduced performance
improvement observed.

Not that I can see any obvious alternative to your implementation right
away ...

That
seems to have mostly been done in Thomas' patch [3], part of which
I've used for the visibilitymap.c code changes. If this patch proves
to be possible, then I'll look at including the other changes Thomas
made in his patch too.

What I'm really looking for by posting now are reasons why we can't do
this. I'm also interested in getting some testing done on older
machines, particularly machines with processors that are from before
2007, both AMD and Intel.

I can offer a 2005-vintage Opteron 2216 rev3 (bought late 2007) to test
on. Feel free to toss me some test code.

cpuinfo flags:    fpu de tsc msr pae mce cx8 apic mca cmov pat clflush
mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow
rep_good nopl extd_apicid eagerfpu pni cx16 hypervisor lahf_lm
cmp_legacy 3dnowprefetch vmmcall

2007-2008 seems to be around the time both
AMD and Intel added support for POPCNT and LZCNT, going by [4].

Thanks

#6David Rowley
dgrowleyml@gmail.com
In reply to: Dmitry Dolgov (#4)
Re: Using POPCNT and other advanced bit manipulation instructions

Thanks for looking at this.

On Thu, 20 Dec 2018 at 23:56, Dmitry Dolgov <9erthalion6@gmail.com> wrote:

I've checked for Clang 6, it turns out that indeed it generates popcnt without
any macro, but only in one place for bloom_prop_bits_set. After looking at this
function it seems that it would be benefitial to actually use popcnt there too.

Yeah, that's the pattern that's mentioned in
https://lemire.me/blog/2016/05/23/the-surprising-cleverness-of-modern-compilers/
It would need to be changed to call the popcount function. This
existing makes me a bit more worried that some extension could be
using a similar pattern and end up being compiled with -mpopcnt due to
pg_config having that CFLAG. That's all fine until the binary makes
it's way over to a machine without that instruction.

I am able to measure performance gains from the patch. In a 3.4GB
table containing a single column with just 10 statistics targets, I
got the following times after running ANALYZE on the table.

I've tested it too a bit, and got similar results when the patched version is
slightly faster. But then I wonder if popcnt is the best solution here, since
after some short research I found a paper [1], where authors claim that:

Maybe surprisingly, we show that a vectorized approach using SIMD
instructions can be twice as fast as using the dedicated instructions on
recent Intel processors.

[1]: https://arxiv.org/pdf/1611.07612.pdf

I can't imagine that using the number_of_ones[] array processing
8-bits at a time would be slower than POPCNT though.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#7David Rowley
dgrowleyml@gmail.com
In reply to: José Luis Tallón (#5)
Re: Using POPCNT and other advanced bit manipulation instructions

On Thu, 20 Dec 2018 at 23:59, Jose Luis Tallon
<jltallon@adv-solutions.net> wrote:

IMVHO: Please do not disregard potential optimization by the compiler
around those calls.. o_0 That might explain the reduced performance
improvement observed.

It was a speedup that I measured. Did you see something else?

What I'm really looking for by posting now are reasons why we can't do
this. I'm also interested in getting some testing done on older
machines, particularly machines with processors that are from before
2007, both AMD and Intel.

I can offer a 2005-vintage Opteron 2216 rev3 (bought late 2007) to test
on. Feel free to toss me some test code.

cpuinfo flags: fpu de tsc msr pae mce cx8 apic mca cmov pat clflush
mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow
rep_good nopl extd_apicid eagerfpu pni cx16 hypervisor lahf_lm
cmp_legacy 3dnowprefetch vmmcall

2007-2008 seems to be around the time both
AMD and Intel added support for POPCNT and LZCNT, going by [4].

It would be really good if you could git clone a copy of master and
patch it with the patch from earlier in the thread and see if you
encounter any issues running make check-world.

I'm a bit uncertain if passing -mpopcnt to a recent gcc would result
in the popcnt instruction being compiled in if the machine doing the
compiling had no support for that.

Likely it would be simple to test that with:

echo "int main(char **argv, int argc) { return
__builtin_popcount(argc); }" > popcnt.c && gcc popcnt.c -S -mpopcnt &&
cat popcnt.s | grep pop

I see a "popcntl" in there on my machine.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#8Dmitry Dolgov
9erthalion6@gmail.com
In reply to: David Rowley (#6)
Re: Using POPCNT and other advanced bit manipulation instructions

On Fri, Jan 4, 2019 at 1:38 PM David Rowley <david.rowley@2ndquadrant.com> wrote:

On Thu, 20 Dec 2018 at 23:56, Dmitry Dolgov <9erthalion6@gmail.com> wrote:

I've checked for Clang 6, it turns out that indeed it generates popcnt without
any macro, but only in one place for bloom_prop_bits_set. After looking at this
function it seems that it would be benefitial to actually use popcnt there too.

Yeah, that's the pattern that's mentioned in
https://lemire.me/blog/2016/05/23/the-surprising-cleverness-of-modern-compilers/
It would need to be changed to call the popcount function. This
existing makes me a bit more worried that some extension could be
using a similar pattern and end up being compiled with -mpopcnt due to
pg_config having that CFLAG. That's all fine until the binary makes
it's way over to a machine without that instruction.

It surprises me, that it's not that obvious how to disable this feature for
clang. I guess one should be able to turn it off by invoking opt manually:

clang -S -mpopcnt -emit-llvm *.c
opt -S -mattr=+popcnt <all the options without -loop-idiom> *.ll
llc -mattr=+popcnt *.optimized.ll
clang -mpopcnt *optimized.s

But for some reason this doesn't work for me (popcnt is not appearing in
the first place).

I am able to measure performance gains from the patch. In a 3.4GB
table containing a single column with just 10 statistics targets, I
got the following times after running ANALYZE on the table.

I've tested it too a bit, and got similar results when the patched version is
slightly faster. But then I wonder if popcnt is the best solution here, since
after some short research I found a paper [1], where authors claim that:

Maybe surprisingly, we show that a vectorized approach using SIMD
instructions can be twice as fast as using the dedicated instructions on
recent Intel processors.

[1]: https://arxiv.org/pdf/1611.07612.pdf

I can't imagine that using the number_of_ones[] array processing
8-bits at a time would be slower than POPCNT though.

Yeah, probably you're right. If I understand correctly even with the lookup
table in the cache the access would be a bit slower than a POPCNT instruction.

#9Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: David Rowley (#1)
Re: Using POPCNT and other advanced bit manipulation instructions

I only have cosmetic suggestions for this patch. For one thing, I think
the .c file should be in src/port and its header should be in
src/include/port/, right beside the likes of pg_bswap.h and pg_crc32c.h.
For another, I think the arrangement of all those "ifdef
HAVE_THIS_OR_THAT" in the bitutils.c file is a bit hard to read. I'd
lay them out like this:

#ifdef HAVE__BUILTIN_CTZ
int (*pg_rightmost_one32) (uint32 word) = pg_rightmost_one32_choose;
#else
int (*pg_rightmost_one32) (uint32 word) = pg_rightmost_one32_slow;
#endif

#ifdef HAVE__BUILTIN_CTZ
/*
* This gets called on the first call. It replaces the function pointer
* so that subsequent calls are routed directly to the chosen implementation.
*/
static int
pg_rightmost_one32_choose(uint32 word)
{
...

(You need declarations for the "choose" variants at the top of the file,
but that seems okay.)

Finally, the part in bitmapset.c is repetitive on the #ifdefs; I'd just
put at the top of the file something like

#if bms are 32 bits
#define pg_rightmost_one(x) pg_rightmost_one32(x)
#define pg_popcount(x) pg_popcount32(x)
#elif they are 64 bits
#define ...
#else
#error ...
#endif

This way, each place that uses the functions does not need the ifdefs.

Other than those minor changes, I think we should just get this pushed
and see what the buildfarm thinks. In the words of a famous PG hacker:
if a platform ain't in the buildfarm, we don't support it.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#9)
Re: Using POPCNT and other advanced bit manipulation instructions

On Thu, Jan 31, 2019 at 07:45:02PM -0300, Alvaro Herrera wrote:

Other than those minor changes, I think we should just get this pushed
and see what the buildfarm thinks. In the words of a famous PG hacker:
if a platform ain't in the buildfarm, we don't support it.

Moved to next CF, waiting on author. I think that this needs more
reviews.
--
Michael

#11David Rowley
dgrowleyml@gmail.com
In reply to: Alvaro Herrera (#9)
Re: Using POPCNT and other advanced bit manipulation instructions

Thanks for looking at this.

On Fri, 1 Feb 2019 at 11:45, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

I only have cosmetic suggestions for this patch. For one thing, I think
the .c file should be in src/port and its header should be in
src/include/port/, right beside the likes of pg_bswap.h and pg_crc32c.h.

I've moved the code into src/port and renamed the file to pg_bitutils.c

For another, I think the arrangement of all those "ifdef
HAVE_THIS_OR_THAT" in the bitutils.c file is a bit hard to read. I'd
lay them out like this:

I've made this change too, although when doing it I realised that I
had forgotten to include the check for CPUID. It's possible that does
not exist but POPCNT does, I guess. This has made the #ifs a bit more
complex.

Finally, the part in bitmapset.c is repetitive on the #ifdefs; I'd just
put at the top of the file something like

Yeah, agreed. Much neater that way.

Other than those minor changes, I think we should just get this pushed
and see what the buildfarm thinks. In the words of a famous PG hacker:
if a platform ain't in the buildfarm, we don't support it.

I also made a number of other changes to the patch.

1. The patch now only uses the -mpopcnt CFLAG for pg_bitutils.c. I
thought this was important so we don't expose that flag in pg_config
and possibly end up building extension with popcnt instructions, which
might not be portable to other older hardware.
2. Wrote a new pg_popcnt function that accepts an array of bytes and a
size variable. This seems useful for the bloomfilter use.

There are still various number_of_ones[] arrays around the codebase.
These exist in tsgistidx.c, _intbig_gist.c and _ltree_gist.c. It
would be nice to get rid of those too, but one of the usages in each
of those 3 files requires XORing with another bit array before
counting the bits. I thought about maybe writing a pop_count_xor()
function that accepts 2 byte arrays and a length parameter, but it
seems a bit special case, so I didn't.

Another thing I wasn't sure of was if I should just have
bms_num_members() just call pg_popcount(). It might be worth
benchmarking to see what's faster. My thinking is that pg_popcount
will inline the pg_popcount64() call so it would mean a single
function call rather than one for each bitmapword in the set.

I've compiled and run make check-world on Linux with GCC7.3 and
clang6.0. I've also tested on MSVC to ensure I didn't break windows.
It would be good to get a few more people to compile it and run the
tests.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v2-0001-Add-basic-support-for-using-the-POPCNT-and-SSE4.2.patchapplication/octet-stream; name=v2-0001-Add-basic-support-for-using-the-POPCNT-and-SSE4.2.patchDownload+914-170
#12Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: David Rowley (#11)
Re: Using POPCNT and other advanced bit manipulation instructions

On 2019-Feb-04, David Rowley wrote:

On Fri, 1 Feb 2019 at 11:45, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

I only have cosmetic suggestions for this patch. For one thing, I think
the .c file should be in src/port and its header should be in
src/include/port/, right beside the likes of pg_bswap.h and pg_crc32c.h.

I've moved the code into src/port and renamed the file to pg_bitutils.c

I've pushed this now. Let's see what the buildfarm has to say about it.

I've compiled and run make check-world on Linux with GCC7.3 and
clang6.0. I've also tested on MSVC to ensure I didn't break windows.
It would be good to get a few more people to compile it and run the
tests.

That's what the buildfarm is for ...

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#12)
Re: Using POPCNT and other advanced bit manipulation instructions

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

I've pushed this now. Let's see what the buildfarm has to say about it.

It's likely to be hard to tell, given the amount of pink from the Ryu
patch. If Andrew is not planning to clean that up PDQ, I'd suggest
reverting that patch pending having some repairs for it.

regards, tom lane

#14Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#13)
Re: Using POPCNT and other advanced bit manipulation instructions

Hi,

On February 13, 2019 8:40:14 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

I've pushed this now. Let's see what the buildfarm has to say about

it.

It's likely to be hard to tell, given the amount of pink from the Ryu
patch. If Andrew is not planning to clean that up PDQ, I'd suggest
reverting that patch pending having some repairs for it.

I'd assume that breaking bit counting would cause distinct enough damage (compile time or crashes). That's not to say that reverting ryu shouldn't be considered (although I'm not that bothered by cross version, ia64 and cygwin failures, especially because the latter two might be hard to come by outside the bf).

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#14)
Re: Using POPCNT and other advanced bit manipulation instructions

Andres Freund <andres@anarazel.de> writes:

I'd assume that breaking bit counting would cause distinct enough damage (compile time or crashes). That's not to say that reverting ryu shouldn't be considered (although I'm not that bothered by cross version, ia64 and cygwin failures, especially because the latter two might be hard to come by outside the bf).

The pink doesn't appear to be limited to non-mainstream platforms,
see eg lapwing, fulmar. However, I see Andrew just pushed some fixes,
so this argument is moot pending how much that helps.

regards, tom lane

#16Andrew Gierth
andrew@tao11.riddles.org.uk
In reply to: Andres Freund (#14)
Re: Using POPCNT and other advanced bit manipulation instructions

"Andres" == Andres Freund <andres@anarazel.de> writes:

It's likely to be hard to tell, given the amount of pink from the
Ryu patch. If Andrew is not planning to clean that up PDQ,

Besides crake (x-version), fulmar (icc) and lorikeet (cygwin), I hope
the rest of the known failures should pass on the next cycle; the
mac/ppc failures are because we redefine "bool" in a way that broke the
upstream code's c99 assumptions, and the rest are numerical instability
in ts_rank.

I'd suggest reverting that patch pending having some repairs for it.

Andres> I'd assume that breaking bit counting would cause distinct
Andres> enough damage (compile time or crashes). That's not to say that
Andres> reverting ryu shouldn't be considered (although I'm not that
Andres> bothered by cross version, ia64 and cygwin failures, especially
Andres> because the latter two might be hard to come by outside the
Andres> bf).

IA64 is working fine as far as I can see (specifically, anole is
passing); it's ICC on x86_64 that broke (fulmar).

--
Andrew (irc:RhodiumToad)

#17Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#14)
Re: Using POPCNT and other advanced bit manipulation instructions

On 2019-Feb-13, Andres Freund wrote:

Hi,

On February 13, 2019 8:40:14 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

I've pushed this now. Let's see what the buildfarm has to say about

it.

It's likely to be hard to tell, given the amount of pink from the Ryu
patch. If Andrew is not planning to clean that up PDQ, I'd suggest
reverting that patch pending having some repairs for it.

I'd assume that breaking bit counting would cause distinct enough
damage (compile time or crashes).

I was a bit surprised to find out that the assembly generated by
compiling the code in test for __builtin_foo() does not actually include
the calls being tested ... (they're only used to generate the value for
a static variable, and that gets optimized away); but then the comment
for the test does say that we're only testing that the compiler
understands the construct, so I suppose that's fine. Also, we already
do that for bswap.

This "compiler explorer" tool is nice:

https://gcc.godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(j:1,source:&#39;int+main(int+argc+,+char**+argv)%0A%7B%0A%0A++return++__builtin_popcountll((unsigned)argc)%3B%0A%0A%7D&#39;),l:&#39;5&#39;,n:&#39;0&#39;,o:&#39;C%2B%2B+source+%231&#39;,t:&#39;0&#39;)),k:45.01119572686435,l:&#39;4&#39;,m:100,n:&#39;0&#39;,o:&#39;&#39;,s:0,t:&#39;0&#39;),(g:!((h:compiler,i:(compiler:g447,filters:(b:&#39;0&#39;,commentOnly:&#39;0&#39;,directives:&#39;0&#39;,intel:&#39;0&#39;),libs:!(),options:&#39;-Wall+-O3+-msse4.2&#39;,source:1),l:&#39;5&#39;,n:&#39;0&#39;,o:&#39;x86-64+gcc+4.4.7+(Editor+%231,+Compiler+%231)&#39;,t:&#39;0&#39;)),k:54.98880427313565,l:&#39;4&#39;,m:100,n:&#39;0&#39;,o:&#39;&#39;,s:0,t:&#39;0&#39;)),l:&#39;2&#39;,n:&#39;0&#39;,o:&#39;&#39;,t:&#39;0&#39;)),version:4

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#18Andrew Gierth
andrew@tao11.riddles.org.uk
In reply to: Andrew Gierth (#16)
Re: Using POPCNT and other advanced bit manipulation instructions

"Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes:

Andrew> IA64 is working fine as far as I can see (specifically, anole
Andrew> is passing); it's ICC on x86_64 that broke (fulmar).

And I know what's wrong on fulmar now, so that'll be fixed shortly.

--
Andrew (irc:RhodiumToad)

#19Andrew Gierth
andrew@tao11.riddles.org.uk
In reply to: Alvaro Herrera (#12)
Re: Using POPCNT and other advanced bit manipulation instructions

"Alvaro" == Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Alvaro> I've pushed this now. Let's see what the buildfarm has to say
Alvaro> about it.

Lapwing's latest failure looks like yours rather than mine now? (the
previous two were mine)

--
Andrew (irc:RhodiumToad)

#20Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrew Gierth (#19)
Re: Using POPCNT and other advanced bit manipulation instructions

On 2019-Feb-13, Andrew Gierth wrote:

"Alvaro" == Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Alvaro> I've pushed this now. Let's see what the buildfarm has to say
Alvaro> about it.

Lapwing's latest failure looks like yours rather than mine now? (the
previous two were mine)

It definitely is ... plans have changed from using IndexOnly scans to
Seqscans, which is likely fallout from the visibilitymap_count() change.
Looking.

(I patched the Makefile to add -mpopcnt to all the compile lines rather
than just the frontend one, but I can't see that explaining the
failure.)

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#21Andrew Gierth
andrew@tao11.riddles.org.uk
In reply to: Alvaro Herrera (#20)
#22Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#20)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#22)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#23)
#25Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#23)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#24)
#27Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#26)
#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#27)
#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#28)
#30Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Alvaro Herrera (#22)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gavin Flower (#30)
#32Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#29)
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#32)
#34Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#32)
#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#34)
#36Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#33)
#37Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#33)
#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#36)
#39Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#38)
#40Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tom Lane (#35)
#41Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Kyotaro Horiguchi (#40)
#42Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#41)
#43Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#38)
#44Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#42)
#45Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#43)
#46Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#44)
#47Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#46)
#48Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#47)
#49Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#35)