Keep elog(ERROR) and ereport(ERROR) calls in the cold path

Started by David Rowleyalmost 6 years ago35 messageshackers
Jump to latest
#1David Rowley
dgrowleyml@gmail.com

Back in [1]/messages/by-id/CAKJS1f8yqRW3qx2CO9r4bqqvA2Vx68=3awbh8CJWTP9zXeoHMw@mail.gmail.com I experimented with a patch to coax compilers to build all
elog/ereport calls that were >= ERROR into a cold path away from the
function rasing the error. At the time, I really just wanted to test
how much of a speedup we could get by doing this and ended up just
writing up a patch that basically changed all elog(ERROR) calls from:

if (<error situation check>)
{
<do stuff>;
elog(ERROR, "...");
}

to add an unlikely() and become;

if (unlikely(<error situation check>)
{
<do stuff>;
elog(ERROR, "...");
}

Per the report in [1]/messages/by-id/CAKJS1f8yqRW3qx2CO9r4bqqvA2Vx68=3awbh8CJWTP9zXeoHMw@mail.gmail.com I did see some pretty good gains in performance
from doing this. The problem was, that at the time I couldn't figure
out a way to do this without an invasive patch that changed the code
in the thousands of elog/ereport calls.

In the attached, I've come up with a way that works. Basically I've
just added a function named errstart_cold() that is attributed with
__attribute__((cold)), which will hint to the compiler to keep
branches which call this function in a cold path. To make the
compiler properly mark just >= ERROR calls as cold, and only when the
elevel is a constant, I modified the ereport_domain macro to do:

if (__builtin_constant_p(elevel) && (elevel) >= ERROR ? \
errstart_cold(elevel, domain) : \
errstart(elevel, domain)) \

I see no reason why the compiler shouldn't always fold that constant
expression at compile-time and correctly select the correct version of
the function for the job. (I also tried using __builtin_choose_expr()
but GCC complained when the elevel was not constant, even with the
__builtin_constant_p() test in the condition)

I sampled a few .s files to inspect what code had changed. Looking at
mcxt.s, fmgr.s and xlog.s, the first two of these because I found in
[1]: /messages/by-id/CAKJS1f8yqRW3qx2CO9r4bqqvA2Vx68=3awbh8CJWTP9zXeoHMw@mail.gmail.com
because it's pretty big. As far as I could see, GCC correctly moved
all the error reporting stuff where the elevel was a constant and >=
ERROR into the cold path and left the lower-level or non-consts elevel
calls alone.

For clang, I didn't see any changes in the .s files. I suspect that
they might have a few smarts in there and see the
__builtin_unreachable() call and assume the path is cold already based
on that. That was with clang 10. Perhaps older versions are not as
smart.

Benchmarking:

For benchmarking, I've not done a huge amount to test the impacts of
this change. However, I can say that I am seeing some fairly good
improvements. There seems to be some small improvements to execution
speed using TPCH-Q1 and also some good results from a pgbench -S test.

For TPCH-Q1:

Master:
$ pgbench -n -f pg-tpch/queries/q01.sql -T 120 tpch
latency average = 5272.630 ms
latency average = 5258.610 ms
latency average = 5250.871 ms

Master + elog_ereport_attribute_cold.patch
$ pgbench -n -f pg-tpch/queries/q01.sql -T 120 tpch
latency average = 5182.761 ms
latency average = 5194.851 ms
latency average = 5183.128 ms

Which is about a 1.42% increase in performance. That's not exactly
groundbreaking, but pretty useful to have if that happens to apply
across the board for execution performance.

For pgbench -S:

My results were a bit noisier than the TPCH test, but the results I
obtained did show about a 10% increase in performance:

Master:
drowley@amd3990x:~$ pgbench -S -T 120 postgres
tps = 25245.903255 (excluding connections establishing)
tps = 26144.454208 (excluding connections establishing)
tps = 25931.850518 (excluding connections establishing)

Master + elog_ereport_attribute_cold.patch
drowley@amd3990x:~$ pgbench -S -T 120 postgres
tps = 28351.480631 (excluding connections establishing)
tps = 27763.656557 (excluding connections establishing)
tps = 28896.427929 (excluding connections establishing)

It would be useful if someone with some server-grade Intel hardware
could run a few tests on this. The above results are all from AMD
hardware.

I've attached the patch for this. I'll add it to the July 'fest.

David

[1]: /messages/by-id/CAKJS1f8yqRW3qx2CO9r4bqqvA2Vx68=3awbh8CJWTP9zXeoHMw@mail.gmail.com

Attachments:

elog_ereport_attribute_cold.patchapplication/octet-stream; name=elog_ereport_attribute_cold.patchDownload+44-0
#2Robert Haas
robertmhaas@gmail.com
In reply to: David Rowley (#1)
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

On Wed, Jun 24, 2020 at 9:51 PM David Rowley <dgrowleyml@gmail.com> wrote:

$ pgbench -n -f pg-tpch/queries/q01.sql -T 120 tpch

Which is about a 1.42% increase in performance. That's not exactly
groundbreaking, but pretty useful to have if that happens to apply
across the board for execution performance.

For pgbench -S:

My results were a bit noisier than the TPCH test, but the results I
obtained did show about a 10% increase in performance:

This is pretty cool, particularly because it affects single-client
performance. It seems like a lot of ideas people have had about
speeding up pgbench performance - including me - have improved
performance under concurrency at the cost of very slightly degrading
single-client performance. It would be nice to claw some of that back.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3Andres Freund
andres@anarazel.de
In reply to: David Rowley (#1)
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

Hi,

Thanks for picking this up again!

On 2020-06-25 13:50:37 +1200, David Rowley wrote:

In the attached, I've come up with a way that works. Basically I've
just added a function named errstart_cold() that is attributed with
__attribute__((cold)), which will hint to the compiler to keep
branches which call this function in a cold path.

I recall you trying this before? Has that gotten easier because we
evolved ereport()/elog(), or has gcc become smarter, or ...?

To make the compiler properly mark just >= ERROR calls as cold, and
only when the elevel is a constant, I modified the ereport_domain
macro to do:

if (__builtin_constant_p(elevel) && (elevel) >= ERROR ? \
errstart_cold(elevel, domain) : \
errstart(elevel, domain)) \

I think it'd be good to not just do this for ERROR, but also for <=
DEBUG1. I recall seing quite a few debug elogs that made the code worse
just by "being there".

I suspect that doing this for DEBUG* could also improve the code for
clang, because we obviously don't have __builtin_unreachable after those.

I see no reason why the compiler shouldn't always fold that constant
expression at compile-time and correctly select the correct version of
the function for the job. (I also tried using __builtin_choose_expr()
but GCC complained when the elevel was not constant, even with the
__builtin_constant_p() test in the condition)

I think it has to be constant in all paths for that...

Master:
drowley@amd3990x:~$ pgbench -S -T 120 postgres
tps = 25245.903255 (excluding connections establishing)
tps = 26144.454208 (excluding connections establishing)
tps = 25931.850518 (excluding connections establishing)

Master + elog_ereport_attribute_cold.patch
drowley@amd3990x:~$ pgbench -S -T 120 postgres
tps = 28351.480631 (excluding connections establishing)
tps = 27763.656557 (excluding connections establishing)
tps = 28896.427929 (excluding connections establishing)

That is pretty damn cool.

diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index e976201030..8076e8af24 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -219,6 +219,19 @@ err_gettext(const char *str)
#endif
}
+#if defined(HAVE_PG_ATTRIBUTE_HOT_AND_COLD) && defined(HAVE__BUILTIN_CONSTANT_P)
+/*
+ * errstart_cold
+ *		A simple wrapper around errstart, but hinted to be cold so that the
+ *		compiler is more likely to put error code in a cold area away from the
+ *		main function body.
+ */
+bool
+pg_attribute_cold errstart_cold(int elevel, const char *domain)
+{
+	return errstart(elevel, domain);
+}
+#endif

Hm. Would it make sense to have this be a static inline?

/*
* errstart --- begin an error-reporting cycle
diff --git a/src/include/c.h b/src/include/c.h
index d72b23afe4..087b8af6cb 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -178,6 +178,21 @@
#define pg_noinline
#endif
+/*
+ * Marking certain functions as "hot" or "cold" can be useful to assist the
+ * compiler in arranging the assembly code in a more efficient way.
+ * These are supported from GCC >= 4.3 and clang >= 3.2
+ */
+#if (defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 3))) || \
+	(defined(__clang__) && (__clang_major__ > 3 || (__clang_major__ == 3 && __clang_minor__ >= 2)))
+#define HAVE_PG_ATTRIBUTE_HOT_AND_COLD 1
+#define pg_attribute_hot __attribute__((hot))
+#define pg_attribute_cold __attribute__((cold))
+#else
+#define pg_attribute_hot
+#define pg_attribute_cold
+#endif

Wonder if we should start using __has_attribute() for things like this.

https://gcc.gnu.org/onlinedocs/cpp/_005f_005fhas_005fattribute.html

I.e. we could do something like
#ifndef __has_attribute
#define __has_attribute(attribute) 0
#endif

#if __has_attribute(hot)
#define pg_attribute_hot __attribute__((hot))
#else
#define pg_attribute_hot
#endif

clang added __has_attribute in 2.9 (2010), gcc added it in 5 (2014), so
I don't think we'd loose too much.

#ifdef HAVE__BUILTIN_CONSTANT_P
+#ifdef HAVE_PG_ATTRIBUTE_HOT_AND_COLD
+#define ereport_domain(elevel, domain, ...)	\
+	do { \
+		pg_prevent_errno_in_scope(); \
+		if (__builtin_constant_p(elevel) && (elevel) >= ERROR ? \
+			 errstart_cold(elevel, domain) : \
+			 errstart(elevel, domain)) \
+			__VA_ARGS__, errfinish(__FILE__, __LINE__, PG_FUNCNAME_MACRO); \
+		if (__builtin_constant_p(elevel) && (elevel) >= ERROR) \
+			pg_unreachable(); \
+	} while(0)
+#else							/* !HAVE_PG_ATTRIBUTE_HOT_AND_COLD */
#define ereport_domain(elevel, domain, ...)	\
do { \
pg_prevent_errno_in_scope(); \
@@ -129,6 +141,7 @@
if (__builtin_constant_p(elevel) && (elevel) >= ERROR) \
pg_unreachable(); \
} while(0)
+#endif							/* HAVE_PG_ATTRIBUTE_HOT_AND_COLD */
#else							/* !HAVE__BUILTIN_CONSTANT_P */
#define ereport_domain(elevel, domain, ...)	\
do { \
@@ -146,6 +159,9 @@

Could we do this without another copy? Feels like we should be able to
just do that in the existing #ifdef HAVE__BUILTIN_CONSTANT_P if we just
add errstart_cold() independent HAVE_PG_ATTRIBUTE_HOT_AND_COLD.

Greetings,

Andres Freund

#4David Rowley
dgrowleyml@gmail.com
In reply to: Andres Freund (#3)
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

On Fri, 26 Jun 2020 at 04:35, Andres Freund <andres@anarazel.de> wrote:

On 2020-06-25 13:50:37 +1200, David Rowley wrote:

In the attached, I've come up with a way that works. Basically I've
just added a function named errstart_cold() that is attributed with
__attribute__((cold)), which will hint to the compiler to keep
branches which call this function in a cold path.

I recall you trying this before? Has that gotten easier because we
evolved ereport()/elog(), or has gcc become smarter, or ...?

Yeah, I appear to have tried it and found it not to work in [1]/messages/by-id/20171030094449.ffqhvt5n623zvyja@alap3.anarazel.de. I can
only assume GCC got smarter in regards to how it marks a path as cold.
Previously it seemed not do due to the do/while(0). I'm pretty sure
back when I tested last that ditching the do while made it work, just
we can't really get rid of it.

To make the compiler properly mark just >= ERROR calls as cold, and
only when the elevel is a constant, I modified the ereport_domain
macro to do:

if (__builtin_constant_p(elevel) && (elevel) >= ERROR ? \
errstart_cold(elevel, domain) : \
errstart(elevel, domain)) \

I think it'd be good to not just do this for ERROR, but also for <=
DEBUG1. I recall seing quite a few debug elogs that made the code worse
just by "being there".

I think that case is different. We don't want to move the entire elog
path into the cold path for that. We'd only want to hint that errstart
is unlikely to return true if elevel <= DEBUG1

diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index e976201030..8076e8af24 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -219,6 +219,19 @@ err_gettext(const char *str)
#endif
}
+#if defined(HAVE_PG_ATTRIBUTE_HOT_AND_COLD) && defined(HAVE__BUILTIN_CONSTANT_P)
+/*
+ * errstart_cold
+ *           A simple wrapper around errstart, but hinted to be cold so that the
+ *           compiler is more likely to put error code in a cold area away from the
+ *           main function body.
+ */
+bool
+pg_attribute_cold errstart_cold(int elevel, const char *domain)
+{
+     return errstart(elevel, domain);
+}
+#endif

Hm. Would it make sense to have this be a static inline?

I thought about that but didn't try it to ensure it still worked ok. I
didn't think it was that important to make sure we don't get the extra
function hop for ERRORs. It seemed like a case we'd not want to really
optimise for.

/*
* errstart --- begin an error-reporting cycle
diff --git a/src/include/c.h b/src/include/c.h
index d72b23afe4..087b8af6cb 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -178,6 +178,21 @@
#define pg_noinline
#endif
+/*
+ * Marking certain functions as "hot" or "cold" can be useful to assist the
+ * compiler in arranging the assembly code in a more efficient way.
+ * These are supported from GCC >= 4.3 and clang >= 3.2
+ */
+#if (defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 3))) || \
+     (defined(__clang__) && (__clang_major__ > 3 || (__clang_major__ == 3 && __clang_minor__ >= 2)))
+#define HAVE_PG_ATTRIBUTE_HOT_AND_COLD 1
+#define pg_attribute_hot __attribute__((hot))
+#define pg_attribute_cold __attribute__((cold))
+#else
+#define pg_attribute_hot
+#define pg_attribute_cold
+#endif

Wonder if we should start using __has_attribute() for things like this.

https://gcc.gnu.org/onlinedocs/cpp/_005f_005fhas_005fattribute.html

I.e. we could do something like
#ifndef __has_attribute
#define __has_attribute(attribute) 0
#endif

#if __has_attribute(hot)
#define pg_attribute_hot __attribute__((hot))
#else
#define pg_attribute_hot
#endif

clang added __has_attribute in 2.9 (2010), gcc added it in 5 (2014), so
I don't think we'd loose too much.

Thanks for pointing that out. Seems like a good idea to me. I don't
think we'll upset too many people running GCC 4.4 to 5.0. I can't
imagine many people serious about performance will be using a
PostgreSQL version that'll be released in 2021 with a pre 2014
compiler.

#ifdef HAVE__BUILTIN_CONSTANT_P
+#ifdef HAVE_PG_ATTRIBUTE_HOT_AND_COLD
+#define ereport_domain(elevel, domain, ...)  \
+     do { \
+             pg_prevent_errno_in_scope(); \
+             if (__builtin_constant_p(elevel) && (elevel) >= ERROR ? \
+                      errstart_cold(elevel, domain) : \
+                      errstart(elevel, domain)) \
+                     __VA_ARGS__, errfinish(__FILE__, __LINE__, PG_FUNCNAME_MACRO); \
+             if (__builtin_constant_p(elevel) && (elevel) >= ERROR) \
+                     pg_unreachable(); \
+     } while(0)
+#else                                                        /* !HAVE_PG_ATTRIBUTE_HOT_AND_COLD */
#define ereport_domain(elevel, domain, ...)  \
do { \
pg_prevent_errno_in_scope(); \
@@ -129,6 +141,7 @@
if (__builtin_constant_p(elevel) && (elevel) >= ERROR) \
pg_unreachable(); \
} while(0)
+#endif                                                       /* HAVE_PG_ATTRIBUTE_HOT_AND_COLD */
#else                                                        /* !HAVE__BUILTIN_CONSTANT_P */
#define ereport_domain(elevel, domain, ...)  \
do { \
@@ -146,6 +159,9 @@

Could we do this without another copy? Feels like we should be able to
just do that in the existing #ifdef HAVE__BUILTIN_CONSTANT_P if we just
add errstart_cold() independent HAVE_PG_ATTRIBUTE_HOT_AND_COLD.

Yeah. I just did it that way so we didn't get the extra function hop
in compilers that don't support __attribute((cold)). If I can inline
errstart_cold() and have the compiler still properly determine that
it's a cold function, then it seems wise to do it that way. If not,
then I'll need to keep a separate macro.

David

[1]: /messages/by-id/20171030094449.ffqhvt5n623zvyja@alap3.anarazel.de

#5David Rowley
dgrowleyml@gmail.com
In reply to: David Rowley (#4)
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

On Fri, 26 Jun 2020 at 13:21, David Rowley <dgrowleyml@gmail.com> wrote:

On Fri, 26 Jun 2020 at 04:35, Andres Freund <andres@anarazel.de> wrote:

On 2020-06-25 13:50:37 +1200, David Rowley wrote:

In the attached, I've come up with a way that works. Basically I've
just added a function named errstart_cold() that is attributed with
__attribute__((cold)), which will hint to the compiler to keep
branches which call this function in a cold path.

I recall you trying this before? Has that gotten easier because we
evolved ereport()/elog(), or has gcc become smarter, or ...?

Yeah, I appear to have tried it and found it not to work in [1]. I can
only assume GCC got smarter in regards to how it marks a path as cold.
Previously it seemed not do due to the do/while(0). I'm pretty sure
back when I tested last that ditching the do while made it work, just
we can't really get rid of it.

To make the compiler properly mark just >= ERROR calls as cold, and
only when the elevel is a constant, I modified the ereport_domain
macro to do:

if (__builtin_constant_p(elevel) && (elevel) >= ERROR ? \
errstart_cold(elevel, domain) : \
errstart(elevel, domain)) \

I think it'd be good to not just do this for ERROR, but also for <=
DEBUG1. I recall seing quite a few debug elogs that made the code worse
just by "being there".

I think that case is different. We don't want to move the entire elog
path into the cold path for that. We'd only want to hint that errstart
is unlikely to return true if elevel <= DEBUG1

I played around with this trying to find if there was a way to make this work.

v2 patch includes the change you mentioned about using __has_attribute
(cold) and removes the additional ereport_domain macro
v3 is v2 plus an additional change to mark the branch within
ereport_domain as unlikely when elevel <= DEBUG1
v4 is v2 plus it marks the errstart call as unlikely regardless of elevel.

I tried v4 as I was having trouble as v3 was showing worse performance
than v2. v4 appears better on the AMD system, but that system is
producing noisy results (very obvious if looking at attached
amd3990x_elog_cold.png)

I ran pgbench -S T 600 -P 10 with each patch and for the AMD machine I got:

master = 27817.32167 tps
v2 = 28991.65667 tps (104.22% of master)
v3 = 28622.775 tps (102.90% of master)
v4 = 29648.91 tps (106.58% of master)

(I attribute the speedup here not being the same as my last report due
to noise. A recent bios update partially fixed the problem, but not
completely)

For the intel laptop I got:

master = 25452.38167 tps
v2 = 25473.695 tps (100.08% of master)
v3 = 25434.89333 tps (99.93% of master)
v4 = 25389.02833 tps (99.75% of master)

Looking at the assembly for the v3 patch, it does appear that the
elevel <= DEBUG1 calls were correctly moved to the cold area and that
the ones > DEBUG1 and < ERROR were left alone. However, I did only
look at xlog.s. The intel results don't look very promising, but
perhaps this is not the ideal test to show improvements with
instruction cache efficiency.

+bool
+pg_attribute_cold errstart_cold(int elevel, const char *domain)
+{
+     return errstart(elevel, domain);
+}
+#endif

Hm. Would it make sense to have this be a static inline?

I didn't find a way to make this work (using gcc-10). Inlining, of
course, makes the assembly just call errstart(). errstart_cold() is
nowhere to be seen. The __attribute(cold) does not seem to be applied
to the errstart() call where the errstart_cold() call was inlined.

I've attached a graph showing the results for the AMD and Intel runs
and also csv files of the pgbench tps output. I've also attached each
version of the patch I tested.

It would be good to see some testing done on other hardware.

David

Attachments:

amd3990x_elog_cold.pngimage/png; name=amd3990x_elog_cold.pngDownload+1-2
intel-i78565u_elog_cold.pngimage/png; name=intel-i78565u_elog_cold.pngDownload+0-1
amd3990x_elog_cold.csvapplication/vnd.ms-excel; name=amd3990x_elog_cold.csvDownload
intel_i7-8565U_elog_cold.csvapplication/vnd.ms-excel; name=intel_i7-8565U_elog_cold.csvDownload
elog_ereport_attribute_cold_v2.patchapplication/octet-stream; name=elog_ereport_attribute_cold_v2.patchDownload+39-1
elog_ereport_attribute_cold_v3.patchapplication/octet-stream; name=elog_ereport_attribute_cold_v3.patchDownload+41-1
elog_ereport_attribute_cold_v4.patchapplication/octet-stream; name=elog_ereport_attribute_cold_v4.patchDownload+39-1
#6Michael Paquier
michael@paquier.xyz
In reply to: David Rowley (#5)
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

On Mon, Jun 29, 2020 at 09:36:56PM +1200, David Rowley wrote:

I've attached a graph showing the results for the AMD and Intel runs
and also csv files of the pgbench tps output. I've also attached each
version of the patch I tested.

It would be good to see some testing done on other hardware.

Worth noting that the patch set fails to apply. I have moved this
entry to the next CF, waiting on author.
--
Michael

#7David Rowley
dgrowleyml@gmail.com
In reply to: Michael Paquier (#6)
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

On Mon, 3 Aug 2020 at 19:54, Michael Paquier <michael@paquier.xyz> wrote:

Worth noting that the patch set fails to apply. I have moved this
entry to the next CF, waiting on author.

Thanks.

NB: It's not a patch set. It's 3 different versions of the patch.
They're not all meant to apply at once. Probably the CF bot wasn't
aware of that though :-(

David

#8David Rowley
dgrowleyml@gmail.com
In reply to: David Rowley (#5)
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

On Mon, 29 Jun 2020 at 21:36, David Rowley <dgrowleyml@gmail.com> wrote:

(I attribute the speedup here not being the same as my last report due
to noise. A recent bios update partially fixed the problem, but not
completely)

I managed to fix the unstable performance on this AMD machine by
tweaking some bios power management settings.

I did some further testing with the v4 patch using both each of:

1. pgbench -S
2. pgbench -S -M prepared
3. pgbench -S -M prepared -c 64 -j 64.
4. TPC-H @ 5GB scale (per recommendation from Andres offlist)

The results of 1-3 above don't really show much of a win, which really
does contradict what I saw about 5 years ago when testing unlikely()
around elog calls in [1]/messages/by-id/CAKJS1f8yqRW3qx2CO9r4bqqvA2Vx68=3awbh8CJWTP9zXeoHMw@mail.gmail.com. The experiment I did back then did pre-date
the use of unlikely() in the source code, so I thought perhaps that
since we now have a sprinkling of unlikely() in various of the hottest
code paths that the use of those already gained most of what we were
going to gain from today's patch. To see if this was the case, I
decided to hack up a test patch which removes all those unlikely()
calls that exist in an if test above an elog/ereport ERROR and I
confirm that I *do* see a small regression in performance from doing
that. This patch only serves to confirm if the existing unlikely()
macros are already giving us most of what we'd get from today's v4
patch, and the results do seem to confirm that.

The 5GB scaled TPC-H test does show some performance gains from the v4
patch and shows an obvious regression from removing the unlikely()
calls too.

Based, mostly on the TPC-H results where performance did improve close
to 2%, I'm starting to think it would be a good idea just to go for
the v4 patch. It means that future hot elog/ereport calls should make
it into the cold path.

Currently, I'm just unsure how other CPUs will benefit from this. The
3990x I've been testing with is pretty new and has some pretty large
caches. I suspect older CPUs may see larger gains.

David

[1]: /messages/by-id/CAKJS1f8yqRW3qx2CO9r4bqqvA2Vx68=3awbh8CJWTP9zXeoHMw@mail.gmail.com

Attachments:

pgbench_c1j1_600s.pngimage/png; name=pgbench_c1j1_600s.pngDownload
pgbench_c1j1_prepared_600s.pngimage/png; name=pgbench_c1j1_prepared_600s.pngDownload
tpch_5gb.pngimage/png; name=tpch_5gb.pngDownload
pgbench_c64j64_prepared_600s.pngimage/png; name=pgbench_c64j64_prepared_600s.pngDownload
#9Peter Eisentraut
peter_e@gmx.net
In reply to: David Rowley (#8)
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

On 2020-08-05 05:00, David Rowley wrote:

The 5GB scaled TPC-H test does show some performance gains from the v4
patch and shows an obvious regression from removing the unlikely()
calls too.

Based, mostly on the TPC-H results where performance did improve close
to 2%, I'm starting to think it would be a good idea just to go for
the v4 patch. It means that future hot elog/ereport calls should make
it into the cold path.

Something based on the v4 patch makes sense.

I would add DEBUG1 back into the conditional, like

if (__builtin_constant_p(elevel) && ((elevel) >= ERROR || (elevel) <=
DEBUG1) ? \

Also, for the __has_attribute handling, I'd prefer the style that Andres
illustrated earlier, using:

#ifndef __has_attribute
#define __has_attribute(attribute) 0
#endif

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10David Rowley
dgrowleyml@gmail.com
In reply to: Peter Eisentraut (#9)
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

On Sat, 5 Sep 2020 at 08:36, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

Something based on the v4 patch makes sense.

Thanks for having a look at this.

I would add DEBUG1 back into the conditional, like

if (__builtin_constant_p(elevel) && ((elevel) >= ERROR || (elevel) <=
DEBUG1) ? \

hmm, but surely we don't want to move all code that's in the same
branch as an elog(DEBUG1) call into a cold area.

With elog(ERROR) we generally have the form:

if (<some condition we hope never to see>)
elog(ERROR, "something bad happened");

In this case, we'd quite like for the compiler to put code relating to
the elog in some cold area of the binary.

With DEBUG we often have the form:

<do normal stuff>

elog(DEBUG1, "some interesting information");

<do normal stuff>

I don't think we'd want to move the <do normal stuff> into a cold area.

The v3 patch just put an unlikely() around the errstart() call if the
level was <= DEBUG1. That just to move the code that's inside the if
(errstart(...)) in the macro into a cold area.

Also, for the __has_attribute handling, I'd prefer the style that Andres
illustrated earlier, using:

#ifndef __has_attribute
#define __has_attribute(attribute) 0
#endif

Yip, for sure. That way is much nicer.

David

#11Peter Eisentraut
peter_e@gmx.net
In reply to: David Rowley (#10)
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

On 2020-09-06 02:24, David Rowley wrote:

I would add DEBUG1 back into the conditional, like

if (__builtin_constant_p(elevel) && ((elevel) >= ERROR || (elevel) <=
DEBUG1) ? \

hmm, but surely we don't want to move all code that's in the same
branch as an elog(DEBUG1) call into a cold area.

Yeah, nevermind that.

The v3 patch just put an unlikely() around the errstart() call if the
level was <= DEBUG1. That just to move the code that's inside the if
(errstart(...)) in the macro into a cold area.

That could be useful. Depends on how much effect it has.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#12David Rowley
dgrowleyml@gmail.com
In reply to: Peter Eisentraut (#11)
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

On Fri, 11 Sep 2020 at 02:01, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 2020-09-06 02:24, David Rowley wrote:

I would add DEBUG1 back into the conditional, like

if (__builtin_constant_p(elevel) && ((elevel) >= ERROR || (elevel) <=
DEBUG1) ? \

hmm, but surely we don't want to move all code that's in the same
branch as an elog(DEBUG1) call into a cold area.

Yeah, nevermind that.

I've reattached the v4 patch since it just does the >= ERROR case.

The v3 patch just put an unlikely() around the errstart() call if the
level was <= DEBUG1. That just to move the code that's inside the if
(errstart(...)) in the macro into a cold area.

That could be useful. Depends on how much effect it has.

I wonder if it is. I'm having trouble even seeing gains from the ERROR
case and I'm considering dropping this patch due to that.

I ran another scale=5 TPCH benchmark on v4 against f859c2ffa using gcc
9.3. I'm unable to see any gains with this, however, the results were
pretty noisy. I only ran pgbench for 60 seconds per query. I'll likely
need to run that a bit longer. I'll do that tonight.

It would be good if someone else could run some tests on their own
hardware to see if they can see any gains.

David

Attachments:

elog_ereport_attribute_cold_v4.patchapplication/octet-stream; name=elog_ereport_attribute_cold_v4.patchDownload+39-1
tpch_scale5_elog_ereport_cold_v4_vs_master.odsapplication/vnd.oasis.opendocument.spreadsheet; name=tpch_scale5_elog_ereport_cold_v4_vs_master.odsDownload
#13David Rowley
dgrowleyml@gmail.com
In reply to: David Rowley (#12)
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

On Tue, 22 Sep 2020 at 19:08, David Rowley <dgrowleyml@gmail.com> wrote:

I ran another scale=5 TPCH benchmark on v4 against f859c2ffa using gcc
9.3. I'm unable to see any gains with this, however, the results were
pretty noisy. I only ran pgbench for 60 seconds per query. I'll likely
need to run that a bit longer. I'll do that tonight.

I've attached the results of a TPCH scale=5 run master (f859c2ffa) vs
master + elog_ereport_attribute_cold_v4.patch

It does not look great. The patched version seems to have done about
1.17% less work than master did.

David

Attachments:

tpch_scale5_elog_ereport_cold_v4_vs_master_10min.odsapplication/vnd.oasis.opendocument.spreadsheet; name=tpch_scale5_elog_ereport_cold_v4_vs_master_10min.odsDownload
#14Peter Eisentraut
peter_e@gmx.net
In reply to: David Rowley (#13)
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

On 2020-09-22 22:42, David Rowley wrote:

On Tue, 22 Sep 2020 at 19:08, David Rowley <dgrowleyml@gmail.com> wrote:

I ran another scale=5 TPCH benchmark on v4 against f859c2ffa using gcc
9.3. I'm unable to see any gains with this, however, the results were
pretty noisy. I only ran pgbench for 60 seconds per query. I'll likely
need to run that a bit longer. I'll do that tonight.

I've attached the results of a TPCH scale=5 run master (f859c2ffa) vs
master + elog_ereport_attribute_cold_v4.patch

It does not look great. The patched version seems to have done about
1.17% less work than master did.

I wonder how much benefit you'd get from

a) compiling with -O3 instead of -O2, or
b) compiling with profile-driven optimization

I think that would indicate a target and/or a ceiling of what we should
be expecting from hot/cold/likely/unlikely optimization techniques like
this.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#15David Rowley
dgrowleyml@gmail.com
In reply to: David Rowley (#13)
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

On Wed, 23 Sep 2020 at 08:42, David Rowley <dgrowleyml@gmail.com> wrote:

On Tue, 22 Sep 2020 at 19:08, David Rowley <dgrowleyml@gmail.com> wrote:

I ran another scale=5 TPCH benchmark on v4 against f859c2ffa using gcc
9.3. I'm unable to see any gains with this, however, the results were
pretty noisy. I only ran pgbench for 60 seconds per query. I'll likely
need to run that a bit longer. I'll do that tonight.

I've attached the results of a TPCH scale=5 run master (f859c2ffa) vs
master + elog_ereport_attribute_cold_v4.patch

It does not look great. The patched version seems to have done about
1.17% less work than master did.

I've marked this patch back as waiting for review. It would be good if
someone could run some tests on some intel hardware and see if they
can see any speedup.

David

#16Peter Eisentraut
peter_e@gmx.net
In reply to: David Rowley (#15)
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

On 2020-09-29 11:26, David Rowley wrote:

On Wed, 23 Sep 2020 at 08:42, David Rowley <dgrowleyml@gmail.com> wrote:

On Tue, 22 Sep 2020 at 19:08, David Rowley <dgrowleyml@gmail.com> wrote:

I ran another scale=5 TPCH benchmark on v4 against f859c2ffa using gcc
9.3. I'm unable to see any gains with this, however, the results were
pretty noisy. I only ran pgbench for 60 seconds per query. I'll likely
need to run that a bit longer. I'll do that tonight.

I've attached the results of a TPCH scale=5 run master (f859c2ffa) vs
master + elog_ereport_attribute_cold_v4.patch

It does not look great. The patched version seems to have done about
1.17% less work than master did.

I've marked this patch back as waiting for review. It would be good if
someone could run some tests on some intel hardware and see if they
can see any speedup.

What is the way forward here? What exactly would you like to have tested?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#17David Rowley
dgrowleyml@gmail.com
In reply to: Peter Eisentraut (#16)
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

On Tue, 3 Nov 2020 at 20:08, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 2020-09-29 11:26, David Rowley wrote:

I've marked this patch back as waiting for review. It would be good if
someone could run some tests on some intel hardware and see if they
can see any speedup.

What is the way forward here? What exactly would you like to have tested?

It would be good to see some small scale bench -S tests with and
without -M prepared.

Also, small scale TPC-H tests would be good. I really only did
testing on new AMD hardware, so some testing on intel hardware would
be good.

David

#18Peter Eisentraut
peter_e@gmx.net
In reply to: David Rowley (#17)
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

On 2020-11-03 21:53, David Rowley wrote:

On Tue, 3 Nov 2020 at 20:08, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 2020-09-29 11:26, David Rowley wrote:

I've marked this patch back as waiting for review. It would be good if
someone could run some tests on some intel hardware and see if they
can see any speedup.

What is the way forward here? What exactly would you like to have tested?

It would be good to see some small scale bench -S tests with and
without -M prepared.

Also, small scale TPC-H tests would be good. I really only did
testing on new AMD hardware, so some testing on intel hardware would
be good.

I did tests of elog_ereport_attribute_cold_v4.patch on an oldish Mac
Intel laptop with pgbench scale 1 (default), and then:

pgbench -S -T 60

master: tps = 8251.883229 (excluding connections establishing)
patched: tps = 9556.836232 (excluding connections establishing)

pgbench -S -T 60 -M prepared

master: tps = 14713.821837 (excluding connections establishing)
patched: tps = 16200.066185 (excluding connections establishing)

So from that this seems like an easy win.

I also tested on a newish Mac ARM laptop, and there the patch did not do
anything, but that was because clang does not support the cold
attribute, so that part works as well. ;-)

--
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/

#19David Rowley
dgrowleyml@gmail.com
In reply to: Peter Eisentraut (#18)
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

On Sat, 21 Nov 2020 at 03:26, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

I did tests of elog_ereport_attribute_cold_v4.patch on an oldish Mac
Intel laptop with pgbench scale 1 (default), and then:

pgbench -S -T 60

master: tps = 8251.883229 (excluding connections establishing)
patched: tps = 9556.836232 (excluding connections establishing)

pgbench -S -T 60 -M prepared

master: tps = 14713.821837 (excluding connections establishing)
patched: tps = 16200.066185 (excluding connections establishing)

So from that this seems like an easy win.

Well, that makes it look pretty good. If we can get 10-15% on some
machines without making things slower on any other machines, then that
seems like a good win to me.

Thanks for testing that.

David

#20David Rowley
dgrowleyml@gmail.com
In reply to: David Rowley (#19)
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

On Tue, 24 Nov 2020 at 09:36, David Rowley <dgrowleyml@gmail.com> wrote:

Well, that makes it look pretty good. If we can get 10-15% on some
machines without making things slower on any other machines, then that
seems like a good win to me.

Pushed.

Thank you both for reviewing this.

David

#21Greg Nancarrow
gregn4422@gmail.com
In reply to: David Rowley (#20)
#22David Rowley
dgrowleyml@gmail.com
In reply to: Greg Nancarrow (#21)
In reply to: David Rowley (#22)
#24Peter Eisentraut
peter_e@gmx.net
In reply to: Dagfinn Ilmari Mannsåker (#23)
#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#20)
#26David Rowley
dgrowleyml@gmail.com
In reply to: Peter Eisentraut (#24)
#27David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#25)
#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#27)
#29Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#28)
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#29)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#29)
#32David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#31)
#33David Rowley
dgrowleyml@gmail.com
In reply to: David Rowley (#32)
#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#32)
#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#33)