NUMA packaging and patch

Started by Kevin Grittner over 11 years ago, 15 messages
#1Kevin Grittner
kgrittn@ymail.com
1 attachment(s)

I ran into a situation where a machine with 4 NUMA memory nodes and
40 cores had performance problems due to NUMA.  The problems were
worst right after they rebooted the OS and warmed the cache by
running a script of queries to read all tables.  These were all run
on a single connection.  As it turned out, the size of the database
was just over one-quarter of the size of RAM, and with default NUMA
policies both the OS cache for the database and the PostgreSQL
shared memory allocation were placed on a single NUMA segment, so
access to the CPU package managing that segment became a
bottleneck.  On top of that, processes which happened to run on the
CPU package holding all the cached data had to satisfy their own
local allocations from more distant memory, because nothing was left
in the nearer node.
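
For anyone who wants to check whether they are in a similar
situation, the numactl tools (assuming they are installed) make the
per-node picture easy to see with something along these lines:

numactl --hardware   # node count, per-node size and free memory
numastat -m          # per-node breakdown, including file cache (FilePages)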

Through normal operations, things eventually tended to shift around
and get better (after several hours of heavy use with substandard
performance).  I ran some benchmarks and found that even in
long-running tests, spreading these allocations among the memory
segments showed about a 2% benefit in a read-only load.  The
biggest difference I saw in a long-running read-write load was
about a 20% hit for unbalanced allocations, but I only saw that
once.  I talked to someone at PGCon who managed to engineer much
worse performance hits for an unbalanced load, although the
circumstances were fairly artificial.  Still, fixing this seems
like something worth doing if further benchmarks confirm benefits
at this level.

By default, the OS cache and buffers are allocated in the memory
node with the shortest "distance" from the CPU a process is running
on.  This is determined by the "cpuset" associated with the
process which reads or writes the disk page.  Typically a NUMA
machine starts with a single cpuset with a policy specifying this
behavior.  Fixing this aspect of things seems like an issue for
packagers, although we should probably document it for those
running from their own source builds.

To set an alternate policy for PostgreSQL, you first need to find
or create the location for cpuset specification, which uses a
filesystem in a way similar to the /proc directory.  On a machine
with more than one memory node, the appropriate filesystem is
probably already mounted, although different distributions use
different filesystem names and mount locations.  I will illustrate
the process on my Ubuntu machine.  Even though it has only one
memory node (and so, this makes no difference), I have it handy at
the moment to confirm the commands as I put them into the email.

# Sysadmin must create the root cpuset if not already done.  (On a
# system with NUMA memory, this will probably already be mounted.)
# Location and options can vary by distro.

sudo mkdir /dev/cpuset
sudo mount -t cpuset none /dev/cpuset

# Sysadmin must create a cpuset for postgres and configure
# resources.  This will normally be all cores and all RAM.  This is
# where we specify that this cpuset will spread pages among its
# memory nodes.

sudo mkdir /dev/cpuset/postgres
sudo /bin/bash -c "echo 0-3 >/dev/cpuset/postgres/cpus"
sudo /bin/bash -c "echo 0 >/dev/cpuset/postgres/mems"
sudo /bin/bash -c "echo 1 >/dev/cpuset/postgres/memory_spread_page"

# Sysadmin must grant permissions to the desired setting(s).
# This could be by user or group.

sudo chown postgres /dev/cpuset/postgres/tasks

# The pid of postmaster or an ancestor process must be written to
# the tasks "file" of the cpuset.  This can be a shell from which
# pg_ctl is run, at least for bash shells.  It could also be
# written by the postmaster itself, essentially as an extra pid
# file.  Possible snippet from a service script:

echo $$ >/dev/cpuset/postgres/tasks
pg_ctl start ...
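
# To verify that the postmaster actually landed in the cpuset (a
# quick sanity check; this assumes $PGDATA points at the data
# directory):

PMPID=$(head -1 "$PGDATA/postmaster.pid")
cat /proc/$PMPID/cpuset    # should print /postgres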

Where the OS cache is larger than shared_buffers, the above is
probably more important than the attached patch, which causes the
main shared memory segment to be spread among all available memory
nodes.  This patch only compiles in the relevant code if configure
is run using the --with-libnuma option, in which case a dependency
on the numa library is created.  It is v3 to avoid confusion with
earlier versions I have shared with a few people off-list.  (The
only difference from v2 is fixing bitrot.)
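
For anyone wanting to try it, the build step is just the new
configure switch added to whatever options you normally use, for
example:

./configure --with-libnuma
make && make install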

I'll add it to the next CF.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

numa-interleave-shared-buffers-v3.diff (text/x-diff)
diff --git a/configure b/configure
index ed1ff0a..79a0dea 100755
--- a/configure
+++ b/configure
@@ -702,6 +702,7 @@ EGREP
 GREP
 with_zlib
 with_system_tzdata
+with_libnuma
 with_libxslt
 with_libxml
 XML2_CONFIG
@@ -831,6 +832,7 @@ with_uuid
 with_ossp_uuid
 with_libxml
 with_libxslt
+with_libnuma
 with_system_tzdata
 with_zlib
 with_gnu_ld
@@ -1518,6 +1520,7 @@ Optional Packages:
   --with-ossp-uuid        obsolete spelling of --with-uuid=ossp
   --with-libxml           build with XML support
   --with-libxslt          use XSLT support when building contrib/xml2
+  --with-libnuma          use libnuma for NUMA support
   --with-system-tzdata=DIR
                           use system time zone data in DIR
   --without-zlib          do not use Zlib
@@ -5822,6 +5825,39 @@ fi
 
 
 
+
+#
+# NUMA library
+#
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+  withval=$with_libnuma;
+  case $withval in
+    yes)
+      $as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+      ;;
+    no)
+      :
+      ;;
+    *)
+      as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+      ;;
+  esac
+
+else
+  with_libnuma=no
+
+fi
+
+
+
+
+
+
 #
 # tzdata
 #
@@ -8781,6 +8817,56 @@ fi
 
 fi
 
+if test "$with_libnuma" = yes ; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_set_localalloc in -lnuma" >&5
+$as_echo_n "checking for numa_set_localalloc in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_set_localalloc+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_set_localalloc ();
+int
+main ()
+{
+return numa_set_localalloc ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_lib_numa_numa_set_localalloc=yes
+else
+  ac_cv_lib_numa_numa_set_localalloc=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_set_localalloc" >&5
+$as_echo "$ac_cv_lib_numa_numa_set_localalloc" >&6; }
+if test "x$ac_cv_lib_numa_numa_set_localalloc" = xyes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBNUMA 1
+_ACEOF
+
+  LIBS="-lnuma $LIBS"
+
+else
+  as_fn_error $? "library 'numa' is required for NUMA support" "$LINENO" 5
+fi
+
+fi
+
 # for contrib/sepgsql
 if test "$with_selinux" = yes; then
   { $as_echo "$as_me:${as_lineno-$LINENO}: checking for security_compute_create_name in -lselinux" >&5
@@ -9466,6 +9552,17 @@ fi
 
 fi
 
+if test "$with_libnuma" = yes ; then
+  ac_fn_c_check_header_mongrel "$LINENO" "numa.h" "ac_cv_header_numa_h" "$ac_includes_default"
+if test "x$ac_cv_header_numa_h" = xyes; then :
+
+else
+  as_fn_error $? "header file <numa.h> is required for NUMA support" "$LINENO" 5
+fi
+
+
+fi
+
 if test "$with_ldap" = yes ; then
   if test "$PORTNAME" != "win32"; then
      for ac_header in ldap.h
diff --git a/configure.in b/configure.in
index 80df1d7..fb06737 100644
--- a/configure.in
+++ b/configure.in
@@ -761,6 +761,16 @@ PGAC_ARG_BOOL(with, libxslt, no, [use XSLT support when building contrib/xml2],
 
 AC_SUBST(with_libxslt)
 
+
+#
+# NUMA library
+#
+PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA support],
+              [AC_DEFINE([USE_LIBNUMA], 1, [Define to 1 to use NUMA features, like interleaved shared memory. (--with-libnuma)])])
+
+AC_SUBST(with_libnuma)
+
+
 #
 # tzdata
 #
@@ -969,6 +979,10 @@ if test "$with_libxslt" = yes ; then
   AC_CHECK_LIB(xslt, xsltCleanupGlobals, [], [AC_MSG_ERROR([library 'xslt' is required for XSLT support])])
 fi
 
+if test "$with_libnuma" = yes ; then
+  AC_CHECK_LIB(numa, numa_set_localalloc, [], [AC_MSG_ERROR([library 'numa' is required for NUMA support])])
+fi
+
 # for contrib/sepgsql
 if test "$with_selinux" = yes; then
   AC_CHECK_LIB(selinux, security_compute_create_name, [],
@@ -1097,6 +1111,10 @@ if test "$with_libxslt" = yes ; then
   AC_CHECK_HEADER(libxslt/xslt.h, [], [AC_MSG_ERROR([header file <libxslt/xslt.h> is required for XSLT support])])
 fi
 
+if test "$with_libnuma" = yes ; then
+  AC_CHECK_HEADER(numa.h, [], [AC_MSG_ERROR([header file <numa.h> is required for NUMA support])])
+fi
+
 if test "$with_ldap" = yes ; then
   if test "$PORTNAME" != "win32"; then
      AC_CHECK_HEADERS(ldap.h, [],
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 7430757..6d6cd10 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -27,6 +27,9 @@
 #ifdef HAVE_SYS_SHM_H
 #include <sys/shm.h>
 #endif
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#endif
 
 #include "miscadmin.h"
 #include "portability/mem.h"
@@ -536,6 +539,24 @@ PGSharedMemoryCreate(Size size, bool makePrivate, int port,
 		 */
 	}
 
+#ifdef USE_LIBNUMA
+	/*
+	 * If this is not a private segment and we are using libnuma, make the
+	 * large memory segment interleaved.
+	 */
+	if (!makePrivate && numa_available())
+	{
+		void   *start;
+
+		if (AnonymousShmem == NULL)
+			start = memAddress;
+		else
+			start = AnonymousShmem;
+
+		numa_interleave_memory(start, size, numa_all_nodes_ptr);
+	}
+#endif
+
 	/*
 	 * OK, we created a new segment.  Mark it as created by this process. The
 	 * order of assignments here is critical so that another Postgres process
#2Merlin Moncure
mmoncure@gmail.com
In reply to: Kevin Grittner (#1)
Re: NUMA packaging and patch

On Sun, Jun 8, 2014 at 5:45 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

[full quote of message #1 snipped]

Hm, your patch seems to boil down to interleave_memory(start, size,
numa_all_nodes_ptr) inside PGSharedMemoryCreate(). I've read your
email a couple of times and am a little hazy around a couple of
points, in particular: "the above is probably more important than the
attached patch". So I have a couple of questions:

*) There is a lot of advice floating around (for example here:
http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html)
to instruct operators to disable zone_reclaim. Will your changes
invalidate any of that advice?

*) is there any downside to enabling --with-libnuma if you have
support? Do you expect packagers will enable it generally? Why not
just always build it in (if configure allows it) and rely on a GUC if
there is some kind of tradeoff (and if there is one, what kinds of
things are you looking for to manage it)?

*) The bash script above, what problem does the 'alternate policy' solve?

*) What kinds of improvements (even if in very general terms) will we
see from better numa management? Are there further optimizations
possible?

merlin


#3Kevin Grittner
kgrittn@ymail.com
In reply to: Merlin Moncure (#2)
Re: NUMA packaging and patch

Merlin Moncure <mmoncure@gmail.com> wrote:

On Sun, Jun 8, 2014 at 5:45 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

Hm, your patch seems to boil down to
   interleave_memory(start, size, numa_all_nodes_ptr)
inside PGSharedMemoryCreate().

That's the functional part -- the rest is about not breaking the
builds for environments which are not NUMA-aware.

I've read your email a couple of times and am a little hazy
around a couple of points, in particular: "the above is probably
more important than the attached patch".  So I have a couple of
questions:

*) There is a lot of advice floating around (for example here:
http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html )
to instruct operators to disable zone_reclaim.  Will your changes
invalidate any of that advice?

I expect that it will make the need for that far less acute,
although it is probably still best to disable zone_reclaim (based
on the documented conditions under which disabling it makes sense).

*) is there any downside to enabling --with-libnuma if you have
support?

Not that I can see.  There are two additional system calls on
postmaster start-up.  I don't expect the time those take to be
significant.

Do you expect packagers will enable it generally?

I suspect so.

Why not just always build it in (if configure allows it) and rely
on a GUC if there is some kind of tradeoff (and if there is one,
what kinds of things are you looking for to manage it)?

If a build is done on a machine with the NUMA library, and the
executable is deployed on a machine without it, the postmaster will
get an error on the missing library.  I talked about this briefly
with Tom in Ottawa, and he thought that it would be up to packagers
to create a dependency on the library if they build PostgreSQL
using the --with-libnuma option.  The reason to require the option
is so that a build is not created which won't run on target
machines if a packager does nothing to deal with NUMA.
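
A packager can confirm the new runtime dependency easily enough; with
--with-libnuma the postgres executable links against the library
(assuming pg_config is on the PATH):

ldd $(pg_config --bindir)/postgres | grep numa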

*) The bash script above, what problem does the 'alternate
policy' solve?

By default, all OS buffers and cache are located in the memory node
closest to the process whose read or write first causes them to be
used.  For something like the cp command, that
probably makes sense.  For something like PostgreSQL it can lead to
unbalanced placement of shared resources (like pages in shared
tables and indexes).
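
A crude way to watch the default behavior (assuming the numactl tools
are installed and some large file is handy) is to pin a reader to one
core and watch that node's file cache grow:

taskset -c 0 dd if=/path/to/some/big/file of=/dev/null bs=1M
numastat -m | grep -i filepages   # most of the growth lands on one node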

*) What kinds of improvements (even if in very general terms)
will we see from better numa management?  Are there further
optimizations possible?

When I spread both OS cache and PostgreSQL shared memory, I got
about 2% better performance overall for a read-only load on a 4
node system which started with everything on one node.  I used
pgbench and picked a scale which put the database size at about 25%
of machine memory before I initialized the database, so that one
memory node was 100% filled with minimal "spill" to the other
nodes.  The run times between the two cases had very minimal
overlap.  The balanced memory usage had more consistent results;
the unbalanced load had more variable timings, with a rare run
showing better times than any of the balanced runs.
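
Something along these lines should reproduce the setup (the numbers
here are only illustrative; pick a scale that makes the initialized
database about a quarter of RAM, at roughly 15 MB per scale unit):

createdb pgbench
pgbench -i -s 4000 pgbench                         # ~60 GB on a 256 GB machine
pgbench -S -M prepared -c 40 -j 40 -T 1800 pgbench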

I didn't spend as much time with read/write benchmarks but those
seemed overall worse for the unbalanced load, and one outlier on the
bad side was about 20% below the (again, pretty tightly clustered)
times for the balanced load.

These tests were designed to try to create a pretty bad case for
the unbalanced load in a default cpuset configuration and just an
unlucky sizing of the working set relative to a memory node size.
At PGCon I had a discussion over lunch with someone who saw far
worse performance from unbalanced memory, but he carefully
engineered a really bad case by using one cpuset to force all data
into one node, and then another cpuset to force PostgreSQL to run
only on cores from which access to that node was relatively slow.
If I remember correctly, he saw about 20% of the throughput that way
versus using the same cores with balanced memory usage.  He
conceded that this was a pretty artificial case, and you would
have to be *trying* to hurt performance to set things up that way,
but he wanted to establish a "worst case" so that he had a hard
bounding of what the maximum possible benefit from balancing load
might be.

There is definitely a need for more benchmarks and benchmarks on
more environments, but my preliminary tests all looked favorable to
the combination of this patch and the cpuset changes.  I would have
posted this months ago if I had found enough time to do more
benchmarks and put together a nice presentation of the results, but
I figured it was a good idea to put this in front of people even
with only preliminary results, so that others who were interested
could see what results they got in their environments or with
workloads I had not considered.

I will note that given the wide differences I saw between run times
with the unbalanced memory usage, there must be some variable that
matters which I was not properly controlling.  I still haven't
figured out what that was.  It might be something as simple as a
particular process (like the checkpoint or bgwriter process?)
landing on the fully-allocated memory node versus landing somewhere
else.

I will also note that if the buffers and cache are populated by
small OLTP queries running on a variety of cores, the data can be
spread just by happenstance, and in that case this patch should not
be expected to make any difference at all.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#4Andres Freund
andres@2ndquadrant.com
In reply to: Kevin Grittner (#3)
Re: NUMA packaging and patch

On 2014-06-09 08:59:03 -0700, Kevin Grittner wrote:

*) There is a lot of advice floating around (for example here:
http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html )
to instruct operators to disable zone_reclaim.  Will your changes
invalidate any of that advice?

I expect that it will make the need for that far less acute,
although it is probably still best to disable zone_reclaim (based
on the documented conditions under which disabling it makes sense).

I think it'll still be important unless you're running an OLTP workload
(i.e. minimal per backend allocations) and your entire workload fits
into shared buffers. What zone_reclaim > 0 essentially does is to never
allocate memory from remote nodes. I.e. it will throw away all numa node
local OS cache to satisfy a memory allocation (including
pagefaults).
I honestly wouldn't expect this to make a huge difference *wrt*
zone_reclaim_mode.
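
For reference, checking and disabling it on a typical Linux box is
just:

cat /proc/sys/vm/zone_reclaim_mode   # 0 means disabled
sysctl -w vm.zone_reclaim_mode=0     # as root; add to /etc/sysctl.conf to persist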

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#5Kevin Grittner
kgrittn@ymail.com
In reply to: Andres Freund (#4)
Re: NUMA packaging and patch

Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-06-09 08:59:03 -0700, Kevin Grittner wrote:

*) There is a lot of advice floating around (for example here:
http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html )
to instruct operators to disable zone_reclaim.  Will your changes
invalidate any of that advice?

I expect that it will make the need for that far less acute,
although it is probably still best to disable zone_reclaim (based
on the documented conditions under which disabling it makes sense).

I think it'll still be important unless you're running an OLTP workload
(i.e. minimal per backend allocations) and your entire workload fits
into shared buffers. What zone_reclaim > 0 essentially does is to never
allocate memory from remote nodes. I.e. it will throw away all numa node
local OS cache to satisfy a memory allocation (including
pagefaults).

I don't think that cpuset spreading of OS buffers and cache, and
the patch to spread shared memory, will make too much difference
unless the working set is fully cached.  Where I have seen the
biggest problems is when the active set > one memory node and <
total machine RAM.  I would agree that unless this patch is
providing benefit for such a fully-cached load, it won't make any
difference regarding the need for zone_reclaim_mode.  Where the
data is heavily cached, zone_reclaim > 0 might discard some cached
pages to allow, say, a RAM sort to be done in faster memory (for
the current process's core), so it might be a wash or even make
zone_reclaim > 0 a win.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#6Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#5)
Re: NUMA packaging and patch

On Mon, Jun 9, 2014 at 1:00 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-06-09 08:59:03 -0700, Kevin Grittner wrote:

*) There is a lot of advice floating around (for example here:
http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html )
to instruct operators to disable zone_reclaim. Will your changes
invalidate any of that advice?

I expect that it will make the need for that far less acute,
although it is probably still best to disable zone_reclaim (based
on the documented conditions under which disabling it makes sense).

I think it'll still be important unless you're running an OLTP workload
(i.e. minimal per backend allocations) and your entire workload fits
into shared buffers. What zone_reclaim > 0 essentially does is to never
allocate memory from remote nodes. I.e. it will throw away all numa node
local OS cache to satisfy a memory allocation (including
pagefaults).

I don't think that cpuset spreading of OS buffers and cache, and
the patch to spread shared memory, will make too much difference
unless the working set is fully cached. Where I have seen the
biggest problems is when the active set > one memory node and <
total machine RAM.

But that's precisely the scenario where vm.zone_reclaim_mode != 0 is a
disaster. You'll end up throwing away the cached pages and rereading
the data from disk, even though the memory *could* have been kept all
in cache.

I would agree that unless this patch is
providing benefit for such a fully-cached load, it won't make any
difference regarding the need for zone_reclaim_mode. Where the
data is heavily cached, zone_reclaim > 0 might discard some cached
pages to allow, say, a RAM sort to be done in faster memory (for
the current process's core), so it might be a wash or even make
zone_reclaim > 0 a win.

I will believe that when, and only when, I see benchmarks convincingly
demonstrating it. Setting zone_reclaim_mode can only be a win if the
performance benefit from using faster memory is greater than the
performance cost of any rereading-from-disk that happens. IME, that's
a highly unusual situation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#7Josh Berkus
josh@agliodbs.com
In reply to: Kevin Grittner (#1)
Re: NUMA packaging and patch

On 06/08/2014 03:45 PM, Kevin Grittner wrote:

By default, the OS cache and buffers are allocated in the memory
node with the shortest "distance" from the CPU a process is running
on.

Note that this will stop being the default in future Linux kernels.
However, we'll have to deal with the old ones for some time to come.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


#8Kevin Grittner
kgrittn@ymail.com
In reply to: Josh Berkus (#7)
Re: NUMA packaging and patch

Josh Berkus <josh@agliodbs.com> wrote:

On 06/08/2014 03:45 PM, Kevin Grittner wrote:

By default, the OS cache and buffers are allocated in the memory
node with the shortest "distance" from the CPU a process is
running on.

Note that this will stop being the default in future Linux kernels.
However, we'll have to deal with the old ones for some time to come.

I was not aware of that.  Thanks.  Do you have a URL handy?

In any event, that is the part of the problem which I think falls
into the realm of packagers and/or sysadmins; a patch for that
doesn't seem sensible, given how cpusets are implemented.  I did
figure we would want to add some documentation around it, though.
Do you agree that is worthwhile?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#9Kohei KaiGai
kaigai@kaigai.gr.jp
In reply to: Kevin Grittner (#8)
Re: NUMA packaging and patch

Hello,

Let me comment on this patch.

It applies to the head of the master branch, builds, and passes the
regression tests.
What this patch tries to do is quite simple and obvious.
It asks the operating system to distribute physical pages across
every NUMA node on allocation.

One thing I am concerned about is that it may conflict with the
automatic NUMA balancing feature supported in recent Linux kernels,
which migrates physical pages across NUMA zones according to the
location of the tasks that reference them.
# I'm not sure whether it applies to shared memory regions.
# Please correct me if I misunderstood, but it looks to me as though
# physical pages in shared memory are also moved.
http://events.linuxfoundation.org/sites/events/files/slides/summit2014_riel_chegu_w_0340_automatic_numa_balancing_0.pdf

The interleave policy should probably work well for an OLTP workload.
But how about an OLAP workload, if physical pages are migrated
transparently by the operating system to the local node?
In the OLAP case less concurrency is required, but a query runs
complicated logic (usually including full scans) on a particular
CPU.

Doesn't it make sense to have a GUC to control the NUMA policy?
In some cases it makes sense to allocate physical memory according
to the operating system's choice.
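
A purely hypothetical sketch of what such a knob might look like in
postgresql.conf (this is not part of the patch):

numa_shared_memory_policy = interleave   # hypothetical GUC; 'local' would keep the OS default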

Thanks,


--
KaiGai Kohei <kaigai@kaigai.gr.jp>


#10Claudio Freire
klaussfreire@gmail.com
In reply to: Kohei KaiGai (#9)
Re: NUMA packaging and patch

On Thu, Jun 26, 2014 at 11:18 AM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:

One thing I concern is, it may conflict with numa-balancing
features that is supported in the recent Linux kernel; that
migrates physical pages according to the location of tasks
which references the page beyond the numa zone.
# I'm not sure whether it is applied on shared memory region.
# Please correct me if I misunderstood. But it looks to me
# physical page in shared memory is also moved.
http://events.linuxfoundation.org/sites/events/files/slides/summit2014_riel_chegu_w_0340_automatic_numa_balancing_0.pdf

Sadly, it excludes the OS cache explicitly (when it mentions libc.so),
which is one of the hottest sources of memory bandwidth consumption in
a database.


#11Kevin Grittner
kgrittn@ymail.com
In reply to: Claudio Freire (#10)
Re: NUMA packaging and patch

Claudio Freire <klaussfreire@gmail.com> wrote:

Sadly, it excludes the OS cache explicitly (when it mentions libc.so),
which is one of the hottest sources of memory bandwidth consumption in
a database.

Agreed.  On the bright side, the packagers and/or sysadmins can fix this
without any changes to the PostgreSQL code, by creating a custom cpuset
and using it during launch of the postmaster.  I went through that
exercise in my original email.  This patch complements that by
preventing one CPU from managing all of PostgreSQL shared memory, and
thus becoming a bottleneck.
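
One way to confirm the patch's effect on a running postmaster
(assuming $PGDATA points at the data directory; run as the postgres
user or root) is to look at the policy of its mappings:

PMPID=$(head -1 "$PGDATA/postmaster.pid")
grep -i interleave /proc/$PMPID/numa_maps   # the shared memory mapping should show an interleave policy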

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#12Christoph Berg
cb@df7cb.de
In reply to: Kevin Grittner (#1)
Re: NUMA packaging and patch

Re: Kevin Grittner 2014-06-09 <1402267501.41111.YahooMailNeo@web122304.mail.ne1.yahoo.com>

@@ -536,6 +539,24 @@ PGSharedMemoryCreate(Size size, bool makePrivate, int port,
*/
}

+#ifdef USE_LIBNUMA
+	/*
+	 * If this is not a private segment and we are using libnuma, make the
+	 * large memory segment interleaved.
+	 */
+	if (!makePrivate && numa_available())
+	{
+		void   *start;
+
+		if (AnonymousShmem == NULL)
+			start = memAddress;
+		else
+			start = AnonymousShmem;
+
+		numa_interleave_memory(start, size, numa_all_nodes_ptr);
+	}
+#endif

How much difference would it make if numactl --interleave=all was used
instead of using numa_interleave_memory() on the shared memory
segments? I guess that would make backend-local memory also
interleaved, but it would avoid having a dependency on libnuma in the
packages.

The numactl manpage even has this example:

numactl --interleave=all bigdatabase arguments
        Run big database with its memory interleaved on all CPUs.
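
Translated to a typical pg_ctl-based start, that would be something
like this (the data directory path is only an example):

numactl --interleave=all pg_ctl -D /var/lib/postgresql/data start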

It is probably better to have native support in the postmaster, though
this could be mentioned as an alternative in the documentation.

Christoph
--
cb@df7cb.de | http://www.df7cb.de/


#13Andres Freund
andres@2ndquadrant.com
In reply to: Christoph Berg (#12)
Re: NUMA packaging and patch

On 2014-07-01 11:01:04 +0200, Christoph Berg wrote:

Re: Kevin Grittner 2014-06-09 <1402267501.41111.YahooMailNeo@web122304.mail.ne1.yahoo.com>

How much difference would it make if numactl --interleave=all was used
instead of using numa_interleave_memory() on the shared memory
segments? I guess that would make backend-local memory also
interleaved, but it would avoid having a dependency on libnuma in the
packages.

I tested this a while ago, and it's rather painful if you have an OLAP
workload with lots of backend private memory.

The numactl manpage even has this example:

numactl --interleave=all bigdatabase arguments Run big
database with its memory interleaved on all CPUs.

It is probably better to have native support in the postmaster, though
this could be mentioned as an alternative in the documentation.

I wonder if we shouldn't backpatch such a notice.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#14Kevin Grittner
kgrittn@ymail.com
In reply to: Andres Freund (#13)
Re: NUMA packaging and patch

Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-07-01 11:01:04 +0200, Christoph Berg wrote:

How much difference would it make if numactl --interleave=all
was used instead of using numa_interleave_memory() on the shared
memory segments? I guess that would make backend-local memory
also interleaved, but it would avoid having a dependency on
libnuma in the packages.

I tested this a while ago, and it's rather painful if you have
an OLAP workload with lots of backend private memory.

I'm not surprised; I would expect it to generally have a negative
effect, which would be most pronounced with an OLAP workload.

The numactl manpage even has this example:

     numactl --interleave=all bigdatabase arguments
             Run big database with its memory interleaved on all CPUs.

It is probably better to have native support in the postmaster,
though this could be mentioned as an alternative in the
documentation.

I wonder if we shouldn't backpatch such a notice.

I would want to see some evidence that it was useful first.  In
most of my tests the benefit of interleaving just the OS cache and
PostgreSQL shared_buffers was about 2%.  That could easily be
erased if work_mem allocations and other process-local memory were
not allocated close to the process which was using it.

I expect that the main benefit of this proposed patch isn't the 2%
typical benefit I was seeing, but that it will be insurance against
occasional, much larger hits.  I haven't had much luck making these
worst case episodes reproducible, though.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#15Christoph Berg
cb@df7cb.de
In reply to: Kevin Grittner (#14)
Re: NUMA packaging and patch

Re: Kevin Grittner 2014-07-01 <1404213492.98740.YahooMailNeo@web122306.mail.ne1.yahoo.com>

Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-07-01 11:01:04 +0200, Christoph Berg wrote:

How much difference would it make if numactl --interleave=all
was used instead of using numa_interleave_memory() on the shared
memory segments? I guess that would make backend-local memory
also interleaved, but it would avoid having a dependency on
libnuma in the packages.

I've tested this a while ago, and it's rather painful if you have
a OLAP workload with lots of backend private memory.

I'm not surprised; I would expect it to generally have a negative
effect, which would be most pronounced with an OLAP workload.

Ok, then +1 on having this in core, even if it buys us a dependency on
something that isn't in the usual base system after OS install.

I wonder if we shouldn't backpatch such a notice.

I would want to see some evidence that it was useful first.  In
most of my tests the benefit of interleaving just the OS cache and
PostgreSQL shared_buffers was about 2%.  That could easily be
erased if work_mem allocations and other process-local memory were
not allocated close to the process which was using it.

I expect that the main benefit of this proposed patch isn't the 2%
typical benefit I was seeing, but that it will be insurance against
occasional, much larger hits.  I haven't had much luck making these
worst case episodes reproducible, though.

Afaict, the numactl notice will only be useful as a postscriptum to
the --with-libnuma docs, with the caveats mentioned. Or we backpatch
(something like) the full docs of the feature, with a note that it's
only 9.5+. (Or the full feature gets backpatched...)

Christoph
--
cb@df7cb.de | http://www.df7cb.de/
