RFC: seccomp-bpf support

Started by Joe Conway, over 6 years ago, 34 messages

mail@joeconway.com

2 attachment(s)

SECCOMP ("SECure COMPuting with filters") is a Linux kernel syscall
filtering mechanism which reduces the kernel attack surface by blocking
(or at least audit logging) syscalls that are normally unused.

Quoting from this link:
https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt

"A large number of system calls are exposed to every userland process
with many of them going unused for the entire lifetime of the
process. As system calls change and mature, bugs are found and
eradicated. A certain subset of userland applications benefit by
having a reduced set of available system calls. The resulting set
reduces the total kernel surface exposed to the application. System
call filtering is meant for use with those applications."

Recent security best-practices recommend, and certain highly
security-conscious organizations are beginning to require, that SECCOMP
be used to the extent possible. The major web browsers, container
runtime engines, and systemd are all examples of software that already
support seccomp.

---------
A seccomp (bpf) filter consists of a default action and a set of
rules with actions pertaining to specific syscalls (possibly with even
more specific sets of arguments). Once loaded into the kernel, a filter
is inherited by all child processes and cannot be removed. It can,
however, be overlaid with another filter. For any given syscall match,
the most restrictive (a.k.a. highest precedence) action will be taken by
the kernel. PostgreSQL has already been run "in the wild" under seccomp
control in containers, and possibly under systemd. Adding seccomp support
to PostgreSQL itself mitigates issues with these approaches, and has
several advantages:

* Container seccomp filters tend to be extremely broad/permissive,
typically allowing about 6 out of 7 of all syscalls. They must do this
because the use cases for containers vary widely.
* systemd does not implement seccomp filters by default. Packagers may
decide to do so, but there is no guarantee. Adding them post-install
potentially requires cooperation by groups outside the control of
the database admins.
* In the container and systemd case there is no particularly good way to
inspect what filters are active. It is possible to observe actions
taken, but again, control is possibly outside the database admin
group. For example, the best way to understand what happened is to
review the auditd log, which is likely not readable by the DBA.
* With built-in support, it is possible to lock down backend processes
more tightly than the postmaster.
* With built-in support, it is possible to lock down different backend
processes differently from one another, for example by using ALTER ROLE
... SET or ALTER DATABASE ... SET.
* With built-in support, it is possible to calculate and return (in the
form of an SRF) the effective filters being applied to the postmaster
and the current backend.
* With built-in support, it could be possible (this part not yet
implemented) to have separate filters for different backend types,
e.g. autovac workers, background writer, etc.
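
To illustrate the overlay rule described above (the most restrictive
action wins), here is a minimal C sketch. It is not code from the patch:
the enum and function names are hypothetical, and it only models the four
action levels discussed in this proposal (allow, log, error, kill), in
increasing precedence.

```c
#include <assert.h>

/*
 * Hypothetical action levels, ordered by increasing precedence
 * (restrictiveness). These names are illustrative only.
 */
typedef enum
{
	SKETCH_ALLOW = 0,
	SKETCH_LOG,
	SKETCH_ERROR,
	SKETCH_KILL
} sketch_action;

/*
 * When a session filter is overlaid on the global filter, the action
 * that takes effect for a syscall is whichever of the two has the
 * higher precedence.
 */
sketch_action
effective_action(sketch_action global_action, sketch_action session_action)
{
	assert(global_action >= SKETCH_ALLOW && global_action <= SKETCH_KILL);
	assert(session_action >= SKETCH_ALLOW && session_action <= SKETCH_KILL);

	return (session_action > global_action) ? session_action : global_action;
}
```

So a session filter can tighten what the global filter allows, but it can
never loosen it: a global "error" stays "error" even if the session filter
says "allow".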

---------
Attached is a patch for discussion, adding support for seccomp-bpf
(nowadays generally just called seccomp) syscall filtering, enabled at
configure time and built on libseccomp. I would like to get this in shape
to be committed by the end of the November CF if possible.

The code itself has been through several rounds of revision based on
discussions I have had with the author of libseccomp as well as a few
other folks. However, as of this moment:

* Documentation - general discussion missing entirely
* No regression tests

---------
For convenience, here are a couple of additional links to relevant
information regarding seccomp:
https://en.wikipedia.org/wiki/Seccomp
https://github.com/seccomp/libseccomp

---------
Specific feedback requested:
1. Placement of pg_get_seccomp_filter() in
src/backend/utils/adt/genfile.c
originally made sense but after several rewrites no longer does.
Ideas where it *should* go?
2. Where should a general discussion section go in the docs, if at all?
3. Currently this supports a global filter at the postmaster level,
which is inherited by all child processes, and a secondary filter
at the client backend session level. It likely makes sense to
support secondary filters for other types of child processes,
e.g. autovacuum workers, etc. Add that now (pg13), later release,
or never?
4. What is the best way to approach testing of this feature? TAP
testing perhaps?
5. Default GUC values - should we provide "starter" lists, or only a
procedure for generating a list (as below)?

---------
Notes on usage:
===============
In order to determine your minimally required allow lists, do something
like the following on a non-production server with the same architecture
as production:

0. Setup:
* install libseccomp, libseccomp-dev, and seccomp
* install auditd if not already installed
* configure postgres --with-seccomp and maybe --enable-tap-tests to
improve feature coverage (see below)

1. Modify postgresql.conf and/or create <pg_source_dir>/postgresql_tmp.conf
8<--------------------
seccomp = on
global_syscall_default = allow
global_syscall_allow = ''
global_syscall_log = ''
global_syscall_error = ''
global_syscall_kill = ''
session_syscall_default = log
session_syscall_allow = '*'
session_syscall_log = '*'
session_syscall_error = '*'
session_syscall_kill = '*'
8<--------------------

2. Modify /etc/audit/auditd.conf
* set disp_qos = 'lossless'
* set max_log_file_action = 'ignore'

3. Stop auditd, clear out all audit.logs, start auditd:
* systemctl stop auditd.service # if running
* echo -n "" > /var/log/audit/audit.log
* systemctl start auditd.service

4. Start/restart postgres.

5. Exercise postgres as much as possible (one or more of the following):
* make installcheck-world
* make check-world \
EXTRA_REGRESS_OPTS=--temp-config=<pg_source_dir>/postgresql_tmp.conf
* run your application through its paces
* other random testing of relevant postgres features

Note: at this point audit.log will start growing quickly. During `make
check-world` mine grew to just under 1 GB.

6. Process results:
a) systemctl stop auditd.service
b) Run the provided "get_syscalls.sh" script
c) Cut and paste the result as the value of session_syscall_allow.

7. Optional:
a) global_syscall_default = 'log'
b) Repeat steps 3-5
c) Repeat steps 6a and 6b
d) Cut and paste the result as the value of global_syscall_allow

8. Iterate steps 3-6b.
* Output should be empty.
* If there are any new syscalls, add to global_syscall_allow and
session_syscall_allow.
* Iterate until output of "get_syscalls.sh" script is empty.

9. Optional:
* Change global and session defaults to "error" or "kill"
* Reduce the allow lists if desired
* This can be done for specific database users, by doing
ALTER ROLE... SET session_syscall_allow to '<some reduced allow list>'

10. Adjust settings to taste, restart postgres, and monitor audit.log
going forward.
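
For readers without the attachment, the idea behind step 6b can be
sketched in C as below. This is not the attached get_syscalls.sh; it is a
hypothetical illustration which assumes that each relevant audit.log line
begins with "type=SECCOMP " and carries a " syscall=<number>" field. The
function names and the SKETCH_MAX_SYSCALL bound are made up for the
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_MAX_SYSCALL 1024	/* hypothetical bound, for this sketch only */

/*
 * Extract the syscall number from one audit.log line.
 * Returns true and stores the number if the line is a SECCOMP record
 * with a " syscall=<n>" field; returns false (leaving *syscallnum
 * untouched) otherwise.
 */
bool
seccomp_record_syscall(const char *line, long *syscallnum)
{
	const char *prefix = "type=SECCOMP ";
	const char *key = " syscall=";
	const char *field;
	char	   *end;
	long		val;

	if (strncmp(line, prefix, strlen(prefix)) != 0)
		return false;

	field = strstr(line, key);
	if (field == NULL)
		return false;
	field += strlen(key);

	val = strtol(field, &end, 10);
	if (end == field || val < 0)
		return false;		/* no digits, or a nonsensical value */

	*syscallnum = val;
	return true;
}

/*
 * Record a syscall number in a caller-provided array of
 * SKETCH_MAX_SYSCALL flags. Returns true only the first time a given
 * in-range number is recorded.
 */
bool
note_syscall(bool *seen, long syscallnum)
{
	assert(seen != NULL);

	if (syscallnum < 0 || syscallnum >= SKETCH_MAX_SYSCALL)
		return false;
	if (seen[syscallnum])
		return false;

	seen[syscallnum] = true;
	return true;
}
```

Each newly seen number could then be mapped back to a name, for example
with libseccomp's seccomp_syscall_resolve_num_arch() (which the patch
itself uses), and the names joined with commas to form the allow list.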

Below are some values from my system. Note that I have made no attempt
thus far to do static code analysis -- this list was built by running
`make check-world` several times.
8<-------------------------
seccomp = on

global_syscall_default = log
global_syscall_allow = 'accept,access,bind,brk,chmod,clone,close,connect,dup,epoll_create1,epoll_ctl,epoll_wait,exit_group,fadvise64,fallocate,fcntl,fdatasync,fstat,fsync,ftruncate,futex,getdents,getegid,geteuid,getgid,getpeername,getpid,getppid,getrandom,getrusage,getsockname,getsockopt,getuid,ioctl,kill,link,listen,lseek,lstat,mkdir,mmap,mprotect,mremap,munmap,openat,pipe,poll,prctl,pread64,prlimit64,pwrite64,read,readlink,recvfrom,recvmsg,rename,rmdir,rt_sigaction,rt_sigprocmask,rt_sigreturn,seccomp,select,sendto,setitimer,set_robust_list,setsid,setsockopt,shmat,shmctl,shmdt,shmget,shutdown,socket,stat,statfs,symlink,sync_file_range,sysinfo,umask,uname,unlink,utime,wait4,write'
global_syscall_log = ''
global_syscall_error = ''
global_syscall_kill = ''

session_syscall_default = log
session_syscall_allow = 'access,brk,chmod,close,connect,epoll_create1,epoll_ctl,epoll_wait,exit_group,fadvise64,fallocate,fcntl,fdatasync,fstat,fsync,ftruncate,futex,getdents,getegid,geteuid,getgid,getpeername,getpid,getrandom,getrusage,getsockname,getsockopt,getuid,ioctl,kill,link,lseek,lstat,mkdir,mmap,mprotect,mremap,munmap,openat,poll,pread64,pwrite64,read,readlink,recvfrom,recvmsg,rename,rmdir,rt_sigaction,rt_sigprocmask,rt_sigreturn,select,sendto,setitimer,setsockopt,shutdown,socket,stat,symlink,sync_file_range,sysinfo,umask,uname,unlink,utime,write'
session_syscall_log = '*'
session_syscall_error = '*'
session_syscall_kill = '*'
8<-------------------------
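
In the settings above, the session log/error/kill lists are '*', which
(per the patch documentation) means "use the corresponding global list
unchanged". A minimal C sketch of that lookup follows. The helper names
are hypothetical and this is not the patch's code: the patch parses lists
with SplitIdentifierString, whereas this sketch does a plain
comma-separated match with no whitespace or quoting support.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: a session list consisting of the single
 * character '*' stands for the corresponding global list.
 */
const char *
resolve_session_list(const char *session_list, const char *global_list)
{
	assert(session_list != NULL && global_list != NULL);

	if (strcmp(session_list, "*") == 0)
		return global_list;
	return session_list;
}

/*
 * Hypothetical helper: is "name" an exact element of a comma-separated
 * list such as "read,write,openat"?
 */
bool
list_contains(const char *list, const char *name)
{
	size_t		namelen;
	const char *p;

	assert(list != NULL && name != NULL);

	namelen = strlen(name);
	p = list;
	while (*p != '\0')
	{
		const char *comma = strchr(p, ',');
		size_t		len = comma ? (size_t) (comma - p) : strlen(p);

		/* compare whole elements only, so "open" does not match "openat" */
		if (len == namelen && strncmp(p, name, len) == 0)
			return true;
		if (comma == NULL)
			break;
		p = comma + 1;
	}
	return false;
}
```

For example, with session_syscall_kill = '*' and global_syscall_kill = '',
the resolved session kill list is empty, so no syscall gets an explicit
kill rule at the session level.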

That results in the following effective filters at the ("context"
equals) global and session levels:

If you made it all the way to here, thank you for your attention :-)

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development

Attachments:

get_syscalls.sh (application/x-shellscript)

seccomp-2019.08.28.00.diff (text/x-patch)

diff --git a/configure b/configure
index f14709e..18cdcd4 100755
*** a/configure
--- b/configure
*************** UUID_EXTRA_OBJS
*** 708,713 ****
--- 708,714 ----
  with_uuid
  with_systemd
  with_selinux
+ with_seccomp
  with_openssl
  with_ldap
  with_krb_srvnam
*************** with_bsd_auth
*** 853,858 ****
--- 854,860 ----
  with_ldap
  with_bonjour
  with_openssl
+ with_seccomp
  with_selinux
  with_systemd
  with_readline
*************** Optional Packages:
*** 1557,1562 ****
--- 1559,1565 ----
    --with-ldap             build with LDAP support
    --with-bonjour          build with Bonjour support
    --with-openssl          build with OpenSSL support
+   --with-seccomp          build with seccomp support
    --with-selinux          build with SELinux support
    --with-systemd          build with systemd support
    --without-readline      do not use GNU Readline nor BSD Libedit for editing
*************** $as_echo "$with_openssl" >&6; }
*** 7897,7902 ****
--- 7900,7940 ----
  
  
  #
+ # Seccomp
+ #
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with seccomp support" >&5
+ $as_echo_n "checking whether to build with seccomp support... " >&6; }
+ 
+ 
+ 
+ # Check whether --with-seccomp was given.
+ if test "${with_seccomp+set}" = set; then :
+   withval=$with_seccomp;
+   case $withval in
+     yes)
+ 
+ $as_echo "#define USE_SECCOMP 1" >>confdefs.h
+ 
+       ;;
+     no)
+       :
+       ;;
+     *)
+       as_fn_error $? "no argument expected for --with-seccomp option" "$LINENO" 5
+       ;;
+   esac
+ 
+ else
+   with_seccomp=no
+ 
+ fi
+ 
+ 
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_seccomp" >&5
+ $as_echo "$with_seccomp" >&6; }
+ 
+ 
+ #
  # SELinux
  #
  { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with SELinux support" >&5
*************** fi
*** 12407,12412 ****
--- 12445,12500 ----
  
  
  
+ if test "$with_seccomp" = yes; then
+   { $as_echo "$as_me:${as_lineno-$LINENO}: checking for seccomp_init in -lseccomp" >&5
+ $as_echo_n "checking for seccomp_init in -lseccomp... " >&6; }
+ if ${ac_cv_lib_seccomp_seccomp_init+:} false; then :
+   $as_echo_n "(cached) " >&6
+ else
+   ac_check_lib_save_LIBS=$LIBS
+ LIBS="-lseccomp  $LIBS"
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+ /* end confdefs.h.  */
+ 
+ /* Override any GCC internal prototype to avoid an error.
+    Use char because int might match the return type of a GCC
+    builtin and then its argument prototype would still apply.  */
+ #ifdef __cplusplus
+ extern "C"
+ #endif
+ char seccomp_init ();
+ int
+ main ()
+ {
+ return seccomp_init ();
+   ;
+   return 0;
+ }
+ _ACEOF
+ if ac_fn_c_try_link "$LINENO"; then :
+   ac_cv_lib_seccomp_seccomp_init=yes
+ else
+   ac_cv_lib_seccomp_seccomp_init=no
+ fi
+ rm -f core conftest.err conftest.$ac_objext \
+     conftest$ac_exeext conftest.$ac_ext
+ LIBS=$ac_check_lib_save_LIBS
+ fi
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_seccomp_seccomp_init" >&5
+ $as_echo "$ac_cv_lib_seccomp_seccomp_init" >&6; }
+ if test "x$ac_cv_lib_seccomp_seccomp_init" = xyes; then :
+   cat >>confdefs.h <<_ACEOF
+ #define HAVE_LIBSECCOMP 1
+ _ACEOF
+ 
+   LIBS="-lseccomp $LIBS"
+ 
+ else
+   as_fn_error $? "library 'libseccomp' is required for seccomp support" "$LINENO" 5
+ fi
+ 
+ fi
+ 
  # for contrib/sepgsql
  if test "$with_selinux" = yes; then
    { $as_echo "$as_me:${as_lineno-$LINENO}: checking for security_compute_create_name in -lselinux" >&5
*************** else
*** 13050,13055 ****
--- 13138,13154 ----
  fi
  
  
+ fi
+ 
+ if test "$with_seccomp" = yes ; then
+   ac_fn_c_check_header_mongrel "$LINENO" "seccomp.h" "ac_cv_header_seccomp_h" "$ac_includes_default"
+ if test "x$ac_cv_header_seccomp_h" = xyes; then :
+ 
+ else
+   as_fn_error $? "header file <seccomp.h> is required for seccomp support" "$LINENO" 5
+ fi
+ 
+ 
  fi
  
  if test "$with_libxslt" = yes ; then
diff --git a/configure.in b/configure.in
index 805cf86..65b382d 100644
*** a/configure.in
--- b/configure.in
*************** AC_MSG_RESULT([$with_openssl])
*** 842,847 ****
--- 842,856 ----
  AC_SUBST(with_openssl)
  
  #
+ # Seccomp
+ #
+ AC_MSG_CHECKING([whether to build with seccomp support])
+ PGAC_ARG_BOOL(with, seccomp, no, [build with seccomp support],
+               [AC_DEFINE([USE_SECCOMP], 1, [Define to 1 to build with seccomp support. (--with-seccomp)])])
+ AC_MSG_RESULT([$with_seccomp])
+ AC_SUBST(with_seccomp)
+ 
+ #
  # SELinux
  #
  AC_MSG_CHECKING([whether to build with SELinux support])
*************** fi
*** 1234,1239 ****
--- 1243,1252 ----
  AC_SUBST(LDAP_LIBS_FE)
  AC_SUBST(LDAP_LIBS_BE)
  
+ if test "$with_seccomp" = yes; then
+   AC_CHECK_LIB(seccomp, seccomp_init, [], [AC_MSG_ERROR([library 'libseccomp' is required for seccomp support])])
+ fi
+ 
  # for contrib/sepgsql
  if test "$with_selinux" = yes; then
    AC_CHECK_LIB(selinux, security_compute_create_name, [],
*************** if test "$with_libxml" = yes ; then
*** 1389,1394 ****
--- 1402,1411 ----
    AC_CHECK_HEADER(libxml/parser.h, [], [AC_MSG_ERROR([header file <libxml/parser.h> is required for XML support])])
  fi
  
+ if test "$with_seccomp" = yes ; then
+   AC_CHECK_HEADER(seccomp.h, [], [AC_MSG_ERROR([header file <seccomp.h> is required for seccomp support])])
+ fi
+ 
  if test "$with_libxslt" = yes ; then
    AC_CHECK_HEADER(libxslt/xslt.h, [], [AC_MSG_ERROR([header file <libxslt/xslt.h> is required for XSLT support])])
  fi
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 89284dc..c346db7 100644
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*************** include_dir 'conf.d'
*** 1859,1864 ****
--- 1859,2031 ----
         </para>
        </listitem>
       </varlistentry>
+ 
+      <varlistentry id="guc-seccomp" xreflabel="seccomp">
+       <term><varname>seccomp</varname> (<type>bool</type>)
+       <indexterm>
+        <primary><varname>seccomp</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         <varname>seccomp</varname> turns on or off seccomp syscall enforcement.
+         This parameter can only be set at server start. The default value is
+         <literal>off</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
+      <varlistentry id="guc-global-syscall-default" xreflabel="global_syscall_default">
+       <term><varname>global_syscall_default</varname> (<type>enum</type>)
+       <indexterm>
+        <primary><varname>global_syscall_default</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         <varname>global_syscall_default</varname> determines the default action taken by
+         kernel seccomp enforcement. It is applied to the postmaster and inherited by
+         all child processes.
+        </para>
+ 
+        <para>
+         Valid values are as follows, in increasing order of precedence.
+         The default value is <literal>allow</literal>, which permits every syscall
+         not named in a specific action list and takes no other action, not even logging.
+         A value of <literal>log</literal> turns on seccomp enforcement
+         in log-only mode. In this mode, disallowed kernel syscalls are logged by auditd
+         to the audit log. When set to <literal>error</literal>, disallowed kernel syscalls
+         will return with a permission denied error. Finally, <literal>kill</literal> will
+         cause the offending process to be killed as though by a
+         <literal>SIGSYS</literal> signal.
+        </para>
+ 
+        <para>
+         This parameter can only be set at server start.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
+      <varlistentry id="guc-session-syscall-default" xreflabel="session_syscall_default">
+       <term><varname>session_syscall_default</varname> (<type>enum</type>)
+       <indexterm>
+        <primary><varname>session_syscall_default</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         <varname>session_syscall_default</varname> helps determine the default action taken
+         by kernel seccomp enforcement. It is applied to client backend sessions. The
+         effective value is either this setting or that of the postmaster,
+         <varname>global_syscall_default</varname>, whichever has the higher precedence.
+        </para>
+ 
+        <para>
+         Valid values are the same as those for <varname>global_syscall_default</varname>.
+        </para>
+ 
+        <para>
+         This parameter can be set by a superuser; however, new values only take effect
+         at session start. This makes it possible to customize sessions with
+         <command>ALTER ROLE SET</command>. For example, a specific role might have it
+         set to <literal>error</literal> with a restrictive session allow list
+         (<varname>session_syscall_allow</varname>), while other roles have it set to
+         <literal>allow</literal>, assuming <varname>global_syscall_default</varname>
+         is also set to <literal>allow</literal>.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
+      <varlistentry id="guc-global-syscall-lists" xreflabel="global_syscall_lists">
+       <term><varname>global_syscall_allow</varname> (<type>string</type>)
+       <indexterm>
+        <primary><varname>global_syscall_allow</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <term><varname>global_syscall_log</varname> (<type>string</type>)
+       <indexterm>
+        <primary><varname>global_syscall_log</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <term><varname>global_syscall_error</varname> (<type>string</type>)
+       <indexterm>
+        <primary><varname>global_syscall_error</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <term><varname>global_syscall_kill</varname> (<type>string</type>)
+       <indexterm>
+        <primary><varname>global_syscall_kill</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+ 
+       <listitem>
+        <para>
+         These four configuration parameters are lists of kernel syscalls to be given
+         allow, log, error, and kill action rules in the global (postmaster) seccomp
+         filter. They are also inherited by all child processes. Any syscall not explicitly
+         enumerated in one of these lists will have an action as determined by the
+         <varname>global_syscall_default</varname> setting. These parameters can only be set
+         at server start.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
+      <varlistentry id="guc-session-syscall-lists" xreflabel="session_syscall_lists">
+       <term><varname>session_syscall_allow</varname> (<type>string</type>)
+       <indexterm>
+        <primary><varname>session_syscall_allow</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <term><varname>session_syscall_log</varname> (<type>string</type>)
+       <indexterm>
+        <primary><varname>session_syscall_log</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <term><varname>session_syscall_error</varname> (<type>string</type>)
+       <indexterm>
+        <primary><varname>session_syscall_error</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <term><varname>session_syscall_kill</varname> (<type>string</type>)
+       <indexterm>
+        <primary><varname>session_syscall_kill</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+ 
+       <listitem>
+        <para>
+         These four configuration parameters are lists of kernel syscalls to be given
+         allow, log, error, and kill action rules in the client session backend seccomp
+         filter. Any syscall not explicitly enumerated in one of these lists will have
+         a session filter action as determined by the <varname>session_syscall_default</varname>
+         setting. The actual effective action for any given syscall is the highest
+         precedence action, for that syscall, from either the session filter or the
+         global filter. These settings take effect at session start and may not be
+         changed once a session is established.
+        </para>
+ 
+        <para>
+         The intent of this feature is to allow further restriction of the syscalls
+         available in an interactive user session. It is also possible to customize
+         sessions with <command>ALTER ROLE SET</command>. For example, a specific
+         role might be allowed to use the necessary syscalls to enable an untrusted
+         procedural-language function to execute arbitrary system commands, while
+         other roles are denied that permission.
+        </para>
+ 
+        <para>
+         These lists may also be set to the single character <literal>'*'</literal>.
+         When set this way, the corresponding global action list is used without
+         modification.
+        </para>
+ 
+        <para>
+         These parameters can be changed without restarting the server, but changes only
+         take effect when a new session is started.
+        </para>
+ 
+       </listitem>
+      </varlistentry>
       </variablelist>
      </sect2>
  
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index a7abf8c..51f5964 100644
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
*************** SELECT collation for ('foo' COLLATE "de_
*** 19678,19683 ****
--- 19678,19767 ----
      </tgroup>
     </table>
  
+    <para>
+     The functions shown in <xref linkend="functions-seccomp"/>
+     print information about active seccomp filters, both at the
+     global (postmaster) level and session (client backend) level.
+     In particular, they calculate the session level based
+     on the kernel's rules for overlaying the session filter
+     on top of the global filter. Essentially, for any given syscall
+     the most restrictive (highest precedence) rule will govern
+     the action taken.
+    </para>
+ 
+    <para>
+     These functions return no rows unless <option>--with-seccomp</option>
+     was specified during <command>configure</command>.
+    </para>
+ 
+    <table id="functions-seccomp">
+     <title>Seccomp Functions</title>
+     <tgroup cols="3">
+      <thead>
+       <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry></row>
+      </thead>
+ 
+      <tbody>
+       <row>
+        <entry>
+         <indexterm><primary>pg_get_seccomp_filter</primary></indexterm>
+         <literal><function>pg_get_seccomp_filter()</function></literal>
+        </entry>
+        <entry><type>record</type></entry>
+        <entry>
+         Returns information about active seccomp filters.
+        </entry>
+       </row>
+ 
+      </tbody>
+     </tgroup>
+    </table>
+ 
+    <para>
+     <function>pg_get_seccomp_filter</function> returns a record, shown in
+     <xref linkend="functions-pg-get-seccomp-filter"/>.
+    </para>
+ 
+    <table id="functions-pg-get-seccomp-filter">
+     <title><function>pg_get_seccomp_filter</function> Columns</title>
+     <tgroup cols="3">
+      <thead>
+       <row>
+        <entry>Column Name</entry>
+        <entry>Data Type</entry>
+        <entry>Description</entry>
+       </row>
+      </thead>
+ 
+      <tbody>
+ 
+       <row>
+        <entry><literal>syscall</literal></entry>
+        <entry><type>text</type></entry>
+        <entry>Name of the kernel syscall</entry>
+       </row>
+ 
+       <row>
+        <entry><literal>syscallnum</literal></entry>
+        <entry><type>int</type></entry>
+        <entry>Kernel syscall number, or -1 for a default rule</entry>
+       </row>
+ 
+       <row>
+        <entry><literal>filter_action</literal></entry>
+        <entry><type>text</type></entry>
+        <entry>Source context -> rule action</entry>
+       </row>
+ 
+       <row>
+        <entry><literal>context</literal></entry>
+        <entry><type>text</type></entry>
+        <entry>Context in which rule is applied</entry>
+       </row>
+ 
+      </tbody>
+     </tgroup>
+    </table>
    </sect1>
  
    <sect1 id="functions-admin">
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index 4493862..939e9e3 100644
*** a/doc/src/sgml/installation.sgml
--- b/doc/src/sgml/installation.sgml
*************** su - postgres
*** 254,259 ****
--- 254,266 ----
  
      <listitem>
       <para>
+       You need <productname>seccomp</productname>, if you want to support
+       kernel syscall filtering.
+      </para>
+     </listitem>
+ 
+     <listitem>
+      <para>
        You need <application>Kerberos</application>, <productname>OpenLDAP</productname>,
        and/or <application>PAM</application>, if you want to support authentication
        using those services.
*************** su - postgres
*** 843,848 ****
--- 850,869 ----
           before proceeding.
          </para>
         </listitem>
+       </varlistentry>
+ 
+       <varlistentry>
+        <term><option>--with-seccomp</option></term>
+        <listitem>
+         <para>
+          Build with support for <indexterm><primary>seccomp</primary></indexterm>
+          kernel syscall filtering. This requires <productname>seccomp</productname>
+          packages to be installed. <filename>configure</filename> will check
+          for the required header files and libraries to make sure that
+          your <productname>seccomp</productname> installation is sufficient
+          before proceeding.
+         </para>
+        </listitem>
        </varlistentry>
  
        <varlistentry>
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index dc3f207..bbdc69b 100644
*** a/src/Makefile.global.in
--- b/src/Makefile.global.in
*************** with_perl	= @with_perl@
*** 185,190 ****
--- 185,191 ----
  with_python	= @with_python@
  with_tcl	= @with_tcl@
  with_openssl	= @with_openssl@
+ with_seccomp	= @with_seccomp@
  with_selinux	= @with_selinux@
  with_systemd	= @with_systemd@
  with_gssapi	= @with_gssapi@
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 1119e21..bb2e899 100644
*** a/src/backend/commands/variable.c
--- b/src/backend/commands/variable.c
***************
*** 17,22 ****
--- 17,25 ----
  #include "postgres.h"
  
  #include <ctype.h>
+ #ifdef USE_SECCOMP
+ #include <seccomp.h>
+ #endif
  
  #include "access/htup_details.h"
  #include "access/parallel.h"
*************** show_role(void)
*** 901,903 ****
--- 904,988 ----
  	/* Otherwise we can just use the GUC string */
  	return role_string ? role_string : "none";
  }
+ 
+ #ifdef USE_SECCOMP
+ /*
+  * check_syscall_list: GUC check_hook
+  * check various lists of syscalls used for seccomp enforcement
+  */
+ static bool
+ check_syscall_list(char **newval, void **extra, GucSource source)
+ {
+ 	char		   *rawstring = NULL;
+ 	List		   *elemlist = NIL;
+ 	ListCell	   *l;
+ 	bool			result = true;
+ 
+ 	/* Need a modifiable copy of string */
+ 	rawstring = pstrdup(*newval);
+ 
+ 	/* Parse string into list of syscalls */
+ 	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+ 	{
+ 		GUC_check_errdetail("List syntax is invalid.");
+ 		result = false;
+ 		goto out;
+ 	}
+ 
+ 	foreach(l, elemlist)
+ 	{
+ 		char   *cursyscall = (char *) lfirst(l);
+ 		int		syscallnum;
+ 
+ 		/* resolve the syscall name to its number on the current arch */
+ 		syscallnum = seccomp_syscall_resolve_name(cursyscall);
+ 		if (syscallnum < 0)
+ 		{
+ 			/* invalid syscall name */
+ 			GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
+ 			GUC_check_errdetail("Seccomp failed to resolve syscall: \"%s\"",
+ 								cursyscall);
+ 			result = false;
+ 			goto out;
+ 		}
+ 	}
+ 
+ out:
+ 	/* safe to release if NIL */
+ 	list_free(elemlist);
+ 
+ 	/* but pfree is not */
+ 	if (rawstring)
+ 		pfree(rawstring);
+ 
+ 	return result;
+ }
+ #endif
+ 
+ bool
+ check_global_syscall_list(char **newval, void **extra, GucSource source)
+ {
+ #ifdef USE_SECCOMP
+ 	return check_syscall_list(newval, extra, source);
+ #else
+ 	return true;
+ #endif
+ }
+ 
+ bool
+ check_session_syscall_list(char **newval, void **extra, GucSource source)
+ {
+ #ifdef USE_SECCOMP
+ 	/*
+ 	 * If the only character of the passed *newval string is '*'
+ 	 * then use the global allow list. Only applies to children
+ 	 * of the postmaster.
+ 	 */
+ 	if (strlen(*newval) == 1 && (*newval)[0] == '*')
+ 		return true;
+ 	else
+ 		return check_syscall_list(newval, extra, source);
+ #else
+ 	return true;
+ #endif
+ }
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 62dc93d..2216d49 100644
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
*************** PostmasterMain(int argc, char *argv[])
*** 963,968 ****
--- 963,982 ----
  	 */
  	LocalProcessControlFile(false);
  
+ #ifdef USE_SECCOMP
+ 	/*
+ 	 * If seccomp filtering is requested, load the global filter.
+ 	 * The list of allowed syscalls may be ratcheted down further
+ 	 * in specific backends based on the actual needs by backend type.
+ 	 */
+ 	if (!load_seccomp_filter("postmaster"))
+ 	{
+ 		ereport(FATAL,
+ 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 				 errmsg("failed to load global seccomp filter")));
+ 	}
+ #endif
+ 
  	/*
  	 * Initialize SSL library, if specified.
  	 */
diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c
index 5d4f26a..a9df9e4 100644
*** a/src/backend/utils/adt/genfile.c
--- b/src/backend/utils/adt/genfile.c
***************
*** 15,20 ****
--- 15,23 ----
   */
  #include "postgres.h"
  
+ #ifdef USE_SECCOMP
+ #include <seccomp.h>
+ #endif
  #include <sys/file.h>
  #include <sys/stat.h>
  #include <unistd.h>
***************
*** 28,39 ****
--- 31,46 ----
  #include "funcapi.h"
  #include "mb/pg_wchar.h"
  #include "miscadmin.h"
+ #include "nodes/bitmapset.h"
  #include "postmaster/syslogger.h"
  #include "storage/fd.h"
  #include "utils/builtins.h"
+ #include "utils/guc.h"
+ #include "utils/hsearch.h"
  #include "utils/memutils.h"
  #include "utils/syscache.h"
  #include "utils/timestamp.h"
+ #include "utils/varlena.h"
  
  typedef struct
  {
*************** pg_ls_archive_statusdir(PG_FUNCTION_ARGS
*** 669,671 ****
--- 676,1113 ----
  {
  	return pg_ls_dir_files(fcinfo, XLOGDIR "/archive_status", true);
  }
+ 
+ #define NUM_SECCOMP_FILTER_ATTS		4
+ #define NUM_SECCOMP_RULES			400
+ 
+ #ifdef USE_SECCOMP
+ typedef struct seccomp_rule
+ {
+ 	int			syscallnum;		/* syscall number */
+ 	char	   *syscall;		/* syscall name string */
+ 	int			rule_action;	/* action level for this rule */
+ 	char	   *source;			/* filter source for this rule */
+ } seccomp_rule;
+ 
+ typedef struct seccompHashEntry
+ {
+ 	int					syscallnum;
+ 	seccomp_rule	   *scr_entry;
+ } seccompHashEntry;
+ 
+ extern const struct config_enum_entry seccomp_options[];
+ 
+ static void
+ init_hash_from_bitmap(Bitmapset *A, int raction, char *source,
+ 					  HTAB *seccompHash)
+ {
+ 	bool				found;
+ 	int					syscallnum;
+ 	char			   *cursyscall;
+ 	seccompHashEntry   *hentry;
+ 
+ 	syscallnum = -1;
+ 	while ((syscallnum = bms_next_member(A, syscallnum)) >= 0)
+ 	{
+ 		seccomp_rule   *scr = palloc(sizeof(seccomp_rule));
+ 
+ 		scr->syscallnum = syscallnum;
+ 
+ 		/*
+ 		 * Resolver returns NULL on error. Given how we got here that
+ 		 * should never happen. We must free() the result to avoid leakage.
+ 		 */
+ 		cursyscall =  seccomp_syscall_resolve_num_arch(seccomp_arch_native(),
+ 													   syscallnum);
+ 		if (cursyscall)
+ 		{
+ 			scr->syscall = pstrdup(cursyscall);
+ 			free(cursyscall);
+ 		}
+ 		scr->rule_action = raction;
+ 		scr->source = source;
+ 
+ 		hentry = (seccompHashEntry *) hash_search(seccompHash,
+ 												  (const void *) &syscallnum,
+ 												  HASH_ENTER, &found);
+ 
+ 		/* should not happen */
+ 		if (found)
+ 			elog(ERROR, "duplicate syscall entry found: source \"%s\"",
+ 						 source);
+ 
+ 		hentry->syscallnum = syscallnum;
+ 		hentry->scr_entry = scr;
+ 	}
+ }
+ 
+ static void
+ ovly_hash_from_bitmap(Bitmapset *A, int raction, char *gsource,
+ 					  int sdef, char *ssource, HTAB *seccompHash)
+ {
+ 	bool				found;
+ 	int					syscallnum;
+ 	char			   *cursyscall;
+ 	seccompHashEntry   *hentry;
+ 
+ 	syscallnum = -1;
+ 	while ((syscallnum = bms_next_member(A, syscallnum)) >= 0)
+ 	{
+ 		seccomp_rule   *scr;
+ 
+ 		hentry = (seccompHashEntry *) hash_search(seccompHash,
+ 												  (const void *) &syscallnum,
+ 												  HASH_ENTER, &found);
+ 
+ 		/*
+ 		 * If an entry does not exist, we can just add it. However,
+ 		 * the default action from the session still wins if it takes
+ 		 * precedence over that of the global rule.
+ 		 *
+ 		 * If an entry does exist, we must determine whether the new
+ 		 * rule precedence overrides the old one.
+ 		 */
+ 		if (!found)
+ 		{
+ 			scr = palloc(sizeof(seccomp_rule));
+ 			scr->syscallnum = syscallnum;
+ 
+ 			/*
+ 			 * Resolver returns NULL on error. Given how we got here that
+ 			 * should never happen. We must free() the result to avoid leakage.
+ 			 */
+ 			cursyscall =  seccomp_syscall_resolve_num_arch(seccomp_arch_native(),
+ 														   syscallnum);
+ 			if (cursyscall)
+ 			{
+ 				scr->syscall = pstrdup(cursyscall);
+ 				free(cursyscall);
+ 			}
+ 			if (raction > sdef)
+ 			{
+ 				scr->rule_action = raction;
+ 				scr->source = gsource;
+ 			}
+ 			else
+ 			{
+ 				scr->rule_action = sdef;
+ 				scr->source = ssource;
+ 			}
+ 
+ 			hentry->syscallnum = syscallnum;
+ 			hentry->scr_entry = scr;
+ 		}
+ 		else
+ 		{
+ 			/* determine if adjustment is necessary */
+ 			scr = hentry->scr_entry;
+ 			if (raction > scr->rule_action)
+ 			{
+ 				/* new rule takes precedence */
+ 				scr->rule_action = raction;
+ 				scr->source = gsource;
+ 			}
+ 		}
+ 	}
+ }
+ 
+ static void
+ ovly_hash_from_default(Bitmapset *A, int raction, char *source,
+ 					  HTAB *seccompHash)
+ {
+ 	bool				found;
+ 	int					syscallnum;
+ 	seccompHashEntry   *hentry;
+ 
+ 	syscallnum = -1;
+ 	while ((syscallnum = bms_next_member(A, syscallnum)) >= 0)
+ 	{
+ 		seccomp_rule   *scr;
+ 
+ 		hentry = (seccompHashEntry *) hash_search(seccompHash,
+ 												  (const void *) &syscallnum,
+ 												  HASH_ENTER, &found);
+ 
+ 		/*
+ 		 * If an entry does not already exist at this point, something
+ 		 * is amiss. Should not happen.
+ 		 */
+ 		if (!found)
+ 			elog(ERROR, "failed to find expected session filter syscall " \
+ 						"entry: syscall number %d", syscallnum);
+ 		else
+ 		{
+ 			/* determine if adjustment is necessary */
+ 			scr = hentry->scr_entry;
+ 			if (raction > scr->rule_action)
+ 			{
+ 				/* new rule takes precedence */
+ 				scr->rule_action = raction;
+ 				scr->source = source;
+ 			}
+ 		}
+ 	}
+ }
+ 
+ static const char *
+ get_seccomp_opt_str(int val)
+ {
+ 	const struct config_enum_entry *entry;
+ 
+ 	/* stringify the enforcement action levels */
+ 	for (entry = seccomp_options; entry->name; entry++)
+ 		if (entry->val == val)
+ 			return entry->name;
+ 
+ 	return "unknown";
+ }
+ 
+ static void
+ put_global_rules(Bitmapset *A, int raction, char *source,
+ 				 TupleDesc tupdesc, Tuplestorestate *tupstore)
+ {
+ 	int					syscallnum;
+ 
+ 	syscallnum = -1;
+ 	while ((syscallnum = bms_next_member(A, syscallnum)) >= 0)
+ 	{
+ 		Datum				values[NUM_SECCOMP_FILTER_ATTS];
+ 		bool				nulls[NUM_SECCOMP_FILTER_ATTS];
+ 		char			   *cursyscall;
+ 
+ 		memset(values, 0, sizeof(values));
+ 		memset(nulls, 0, sizeof(nulls));
+ 
+ 		/*
+ 		 * Resolver returns NULL on error. Given how we got here that
+ 		 * should never happen. We must free() the result to avoid leakage.
+ 		 */
+ 		cursyscall =  seccomp_syscall_resolve_num_arch(seccomp_arch_native(),
+ 													   syscallnum);
+ 		if (cursyscall)
+ 		{
+ 			char	   *buf;
+ 
+ 			values[0] = PointerGetDatum(cstring_to_text(cursyscall));
+ 			free(cursyscall);
+ 
+ 			values[1] = Int32GetDatum(syscallnum);
+ 
+ 			buf = psprintf("%s->%s", source, get_seccomp_opt_str(raction));
+ 			values[2] = PointerGetDatum(cstring_to_text(buf));
+ 
+ 			values[3] = PointerGetDatum(cstring_to_text("global"));
+ 
+ 			/* shove row into tuplestore */
+ 			tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ 		}
+ 	}
+ }
+ #endif /* USE_SECCOMP */
+ 
+ Datum
+ pg_get_seccomp_filter(PG_FUNCTION_ARGS)
+ {
+ #ifdef USE_SECCOMP
+ 	seccomp_filter	   *g = global_filter;
+ 	seccomp_filter	   *s = session_filter;
+ 	HASHCTL         	ctl;
+ 	HTAB			   *seccompHash = NULL;
+ 	seccompHashEntry   *hentry;
+ 	seccomp_rule	   *scr = palloc(sizeof(seccomp_rule));
+ 	HASH_SEQ_STATUS		status;
+ 	int					syscallnum = -1;	/* hash key of the "<default>" entry */
+ 	char			   *gsource = "global";
+ 	char			   *ssource = "session";
+ 	int					gdef = g->def;
+ 	int					sdef = s->def;
+ 	int					mdef = (gdef > sdef) ? gdef : sdef;
+ 	char			   *msource = (gdef > sdef) ? gsource : ssource;
+ 	Bitmapset		   *gunion = NULL;
+ 	Bitmapset		   *sunion = NULL;
+ 	char			   *buf;
+ 	Datum				values[NUM_SECCOMP_FILTER_ATTS];
+ 	bool				nulls[NUM_SECCOMP_FILTER_ATTS];
+ #endif
+ 	ReturnSetInfo	   *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ 	TupleDesc			tupdesc;
+ 	Tuplestorestate	   *tupstore;
+ 	MemoryContext		per_query_ctx;
+ 	MemoryContext		oldcontext;
+ 
+ 	/* Check to see if caller supports us returning a tuplestore */
+ 	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ 				 errmsg("set-valued function called in context that cannot accept a set")));
+ 	if (!(rsinfo->allowedModes & SFRM_Materialize))
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ 				 errmsg("materialize mode required, but it is not " \
+ 						"allowed in this context")));
+ 
+ 	/* Switch into long-lived context to construct returned data structures */
+ 	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ 	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+ 
+ 	/* Build a tuple descriptor for our result type */
+ 	/* need a tuple descriptor with three TEXT columns and one INT4 column */
+ 	tupdesc = CreateTemplateTupleDesc(NUM_SECCOMP_FILTER_ATTS);
+ 	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "syscall",
+ 					   TEXTOID, -1, 0);
+ 	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "syscallnum",
+ 					   INT4OID, -1, 0);
+ 	TupleDescInitEntry(tupdesc, (AttrNumber) 3, "filter_action",
+ 					   TEXTOID, -1, 0);
+ 	TupleDescInitEntry(tupdesc, (AttrNumber) 4, "context",
+ 					   TEXTOID, -1, 0);
+ 
+ 	/* Build a tuplestore to return our results in */
+ 	tupstore = tuplestore_begin_heap(true, false, work_mem);
+ 	rsinfo->returnMode = SFRM_Materialize;
+ 	rsinfo->setResult = tupstore;
+ 	rsinfo->setDesc = tupdesc;
+ 
+ #ifdef USE_SECCOMP
+ 
+ 	/*
+ 	 * We need to iterate through 4 bitmap sets each, across two filters
+ 	 * (global and session), applying the below logic, in order to
+ 	 * determine which action applies to what syscall. The most
+ 	 * straightforward way to do that seems to be to build a hash
+ 	 * table since the two filter sets may overlap, and the syscall
+ 	 * numbers may vary with architecture.
+ 	 *
+ 	 * The aforementioned logic is:
+ 	 * 1. The most recently installed filter is evaluated first (session)
+ 	 * 2. For a given filter, each syscall action is either the action
+ 	 *    value given in a syscall-specific rule, or the default action.
+ 	 * 3. For any given syscall, the "first-seen action value of highest
+ 	 *    precedence" is applied. The precedence in order of high-to-low
+ 	 *    is: kill, error, log, allow.
+ 	 *
+ 	 * There are four combinations of the possible sets of rules to
+ 	 * consider:
+ 	 * g - global (postmaster)
+ 	 * s - session (backend)
+ 	 *
+ 	 * C1. Intersection of g + s
+ 	 * C2. In g, not in s
+ 	 * C3. In s, not in g
+ 	 * C4. Not in g or s
+ 	 *
+ 	 * C1 and C2 are handled by init_hash_from_bitmap()
+ 	 * and ovly_hash_from_bitmap(). C3 is handled by
+ 	 * ovly_hash_from_default(). C4 is covered by the final
+ 	 * "<default>" entry in the hash table.
+ 	 */
+ 	memset(&ctl, 0, sizeof(ctl));
+ 	ctl.keysize = sizeof(int);
+ 	ctl.entrysize = sizeof(seccompHashEntry);
+ 	seccompHash = hash_create("syscall rules", NUM_SECCOMP_RULES,
+ 							  &ctl, HASH_ELEM | HASH_BLOBS);
+ 
+ 	/*
+ 	 * Build up the hash table initially from the session filter.
+ 	 * We ensured no overlap of syscalls within a given filter in
+ 	 * load_seccomp_filter(), so it should be safe to just add
+ 	 * all the syscall numbers found in the 4 bitmap sets.
+ 	 */
+ 	init_hash_from_bitmap(s->kill, PG_SECCOMP_KILL, ssource, seccompHash);
+ 	init_hash_from_bitmap(s->error, PG_SECCOMP_ERROR, ssource, seccompHash);
+ 	init_hash_from_bitmap(s->log, PG_SECCOMP_LOG, ssource, seccompHash);
+ 	init_hash_from_bitmap(s->allow, PG_SECCOMP_ALLOW, ssource, seccompHash);
+ 
+ 	/*
+ 	 * Now overlay the global filter. Again we ensured no overlap
+ 	 * of syscalls within this filter in load_seccomp_filter(),
+ 	 * so it should be safe to just overlay all the syscall numbers
+ 	 * found in the 4 global bitmap sets.
+ 	 */
+ 	ovly_hash_from_bitmap(g->kill, PG_SECCOMP_KILL, gsource,
+ 						  sdef, ssource, seccompHash);
+ 	ovly_hash_from_bitmap(g->error, PG_SECCOMP_ERROR, gsource,
+ 						  sdef, ssource, seccompHash);
+ 	ovly_hash_from_bitmap(g->log, PG_SECCOMP_LOG, gsource,
+ 						  sdef, ssource, seccompHash);
+ 	ovly_hash_from_bitmap(g->allow, PG_SECCOMP_ALLOW, gsource,
+ 						  sdef, ssource, seccompHash);
+ 
+ 	/*
+ 	 * If rules from the session filter are not also explicitly
+ 	 * in the global filter, they must be compared against, and
+ 	 * possibly adjusted to, the global default action.
+ 	 */
+ 	gunion = bms_union(bms_union(bms_union(g->kill, g->error), g->log),
+ 					   g->allow);
+ 	sunion = bms_union(bms_union(bms_union(s->kill, s->error), s->log),
+ 					   s->allow);
+ 	ovly_hash_from_default(bms_difference(sunion, gunion),
+ 						   gdef, gsource, seccompHash);
+ 
+ 	/* create entry for the session default rule */
+ 	scr->syscallnum = -1;
+ 	scr->syscall = "<default>";
+ 	scr->rule_action = mdef;
+ 	scr->source = msource;
+ 	hentry = (seccompHashEntry *) hash_search(seccompHash,
+ 											  (const void *) &syscallnum,
+ 											  HASH_ENTER, NULL);
+ 	hentry->syscallnum = syscallnum;
+ 	hentry->scr_entry = scr;
+ 
+ 	/* Process the "session" results and fill the tuplestore */
+ 	hash_seq_init(&status, seccompHash);
+ 
+ 	while ((hentry = (seccompHashEntry *) hash_seq_search(&status)) != NULL)
+ 	{
+ 		char	   *buf;
+ 
+ 		memset(values, 0, sizeof(values));
+ 		memset(nulls, 0, sizeof(nulls));
+ 
+ 		scr = hentry->scr_entry;
+ 		buf = psprintf("%s->%s", scr->source,
+ 								 get_seccomp_opt_str(scr->rule_action));
+ 
+ 		values[0] = PointerGetDatum(cstring_to_text(scr->syscall));
+ 		values[1] = Int32GetDatum(scr->syscallnum);
+ 		values[2] = PointerGetDatum(cstring_to_text(buf));
+ 		values[3] = PointerGetDatum(cstring_to_text("session"));
+ 
+ 		/* shove row into tuplestore */
+ 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ 	}
+ 
+ 	/*
+ 	 * Add rows for the "global" context. This is far simpler, since
+ 	 * we can simply iterate through the global bitmaps and do not
+ 	 * need to account for rule precedence, etc., due to there
+ 	 * only being one filter (that we know about in any case).
+ 	 */
+ 	put_global_rules(g->kill, PG_SECCOMP_KILL, gsource, tupdesc, tupstore);
+ 	put_global_rules(g->error, PG_SECCOMP_ERROR, gsource, tupdesc, tupstore);
+ 	put_global_rules(g->log, PG_SECCOMP_LOG, gsource, tupdesc, tupstore);
+ 	put_global_rules(g->allow, PG_SECCOMP_ALLOW, gsource, tupdesc, tupstore);
+ 
+ 	/* create entry for the global default rule */
+ 	values[0] = PointerGetDatum(cstring_to_text("<default>"));
+ 	values[1] = Int32GetDatum(-1);
+ 
+ 	buf = psprintf("%s->%s", gsource, get_seccomp_opt_str(gdef));
+ 	values[2] = PointerGetDatum(cstring_to_text(buf));
+ 
+ 	values[3] = PointerGetDatum(cstring_to_text("global"));
+ 
+ 	/* shove row into tuplestore */
+ 	tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+ 
+ #endif	/* USE_SECCOMP */
+ 
+ 	tuplestore_donestoring(tupstore);
+ 
+ 	/* Reset context */
+ 	MemoryContextSwitchTo(oldcontext);
+ 
+ 	return (Datum) 0;
+ }
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 83c9514..1bf2e41 100644
*** a/src/backend/utils/init/miscinit.c
--- b/src/backend/utils/init/miscinit.c
***************
*** 29,34 ****
--- 29,38 ----
  #ifdef HAVE_UTIME_H
  #include <utime.h>
  #endif
+ #ifdef USE_SECCOMP
+ #include <seccomp.h>
+ #include <sys/prctl.h>
+ #endif
  
  #include "access/htup_details.h"
  #include "catalog/pg_authid.h"
***************
*** 36,41 ****
--- 40,46 ----
  #include "libpq/libpq.h"
  #include "mb/pg_wchar.h"
  #include "miscadmin.h"
+ #include "nodes/bitmapset.h"
  #include "pgstat.h"
  #include "postmaster/autovacuum.h"
  #include "postmaster/postmaster.h"
*************** pg_bindtextdomain(const char *domain)
*** 1617,1619 ****
--- 1622,2010 ----
  	}
  #endif
  }
+ 
+ /*-------------------------------------------------------------------------
+  *				seccomp filtering support
+  *-------------------------------------------------------------------------
+  */
+ 
+ /*
+  * GUC variables: lists of syscall names to be filtered at postmaster
+  * start and at backend start
+  */
+ 
+ const struct config_enum_entry seccomp_options[] = {
+ 	{"allow", PG_SECCOMP_ALLOW, false},
+ 	{"log", PG_SECCOMP_LOG, false},
+ 	{"error", PG_SECCOMP_ERROR, false},
+ 	{"kill", PG_SECCOMP_KILL, false},
+ 	{NULL, 0}
+ };
+ 
+ seccomp_filter *global_filter = NULL;
+ seccomp_filter *session_filter = NULL;
+ bool	seccomp_enabled = false;
+ int		global_syscall_default = PG_SECCOMP_ALLOW;
+ char   *global_syscall_allow_string = NULL;
+ char   *global_syscall_log_string = NULL;
+ char   *global_syscall_error_string = NULL;
+ char   *global_syscall_kill_string = NULL;
+ int		session_syscall_default = PG_SECCOMP_ALLOW;
+ char   *session_syscall_allow_string = NULL;
+ char   *session_syscall_log_string = NULL;
+ char   *session_syscall_error_string = NULL;
+ char   *session_syscall_kill_string = NULL;
+ 
+ #ifdef USE_SECCOMP
+ static bool apply_seccomp_list(scmp_filter_ctx	*ctx, const char *slist,
+ 							   uint32_t rule_action, uint32_t def_action,
+ 							   seccomp_filter *current_filter);
+ static const char *expand_seccomp_list(const char *slist, const char *glist,
+ 									   const char *saction);
+ static void set_filter_def_action(int default_action,
+ 								  seccomp_filter *current_filter,
+ 								  char *context);
+ #endif
+ 
+ /*
+  * Create and load seccomp filter for the requested context.
+  *
+  * Return false on error and let the caller decide what to do
+  * rather than throwing an ERROR (or FATAL) here.
+  */
+ bool
+ load_seccomp_filter(char *context)
+ {
+ #ifdef USE_SECCOMP
+ 	const char	   *allow_list = NULL;
+ 	const char	   *log_list = NULL;
+ 	const char	   *error_list = NULL;
+ 	const char	   *kill_list = NULL;
+ 	int				default_action;
+ 	uint32_t		def_action;
+ 	scmp_filter_ctx	ctx = NULL;
+ 	int				rc;
+ 	bool			result = true;
+ 	MemoryContext	oldcontext;
+ 	seccomp_filter *current_filter = NULL;
+ 
+ 	/* should not happen */
+ 	if (context == NULL)
+ 	{
+ 		ereport(WARNING, (errmsg("invalid seccomp context")));
+ 		return false;
+ 	}
+ 
+ 	/* if seccomp is disabled just return with success */
+ 	if (!seccomp_enabled)
+ 	{
+ 		ereport(LOG, (errmsg("seccomp disabled")));
+ 		return true;
+ 	}
+ 
+ 	/*
+ 	 * If the only character of the passed syscall_list is '*'
+ 	 * then use the global allow list. Only applies to children
+ 	 * of the postmaster.
+ 	 */
+ 	if (strcmp(context, "postmaster") != 0)
+ 	{
+ 		/* in a backend session */
+ 		/* we are going to need this later */
+ 		oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+ 		session_filter = palloc0(sizeof(seccomp_filter));
+ 		session_filter->source = pstrdup("session");
+ 		MemoryContextSwitchTo(oldcontext);
+ 		current_filter = session_filter;
+ 
+ 		allow_list = expand_seccomp_list(session_syscall_allow_string,
+ 										 global_syscall_allow_string,
+ 										 "allow");
+ 		log_list = expand_seccomp_list(session_syscall_log_string,
+ 										 global_syscall_log_string,
+ 										 "log");
+ 		error_list = expand_seccomp_list(session_syscall_error_string,
+ 										 global_syscall_error_string,
+ 										 "error");
+ 		kill_list = expand_seccomp_list(session_syscall_kill_string,
+ 										 global_syscall_kill_string,
+ 										 "kill");
+ 
+ 		default_action = session_syscall_default;
+ 		/*
+ 		 * Fastpath: if the lists were all defaulted to their
+ 		 * respective global list, and the session value of
+ 		 * default_action is also the same as the global setting,
+ 		 * just exit with success immediately. This avoids creating
+ 		 * another identical seccomp bpf filter which will just
+ 		 * slow everything down for no particular reason.
+ 		 */
+ 		if (default_action == global_syscall_default &&
+ 				allow_list == global_syscall_allow_string &&
+ 				log_list == global_syscall_log_string &&
+ 				error_list == global_syscall_error_string &&
+ 				kill_list == global_syscall_kill_string)
+ 			return true;
+ 	}
+ 	else
+ 	{
+ 		/* in the postmaster */
+ 		/* we are going to need this later */
+ 		oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+ 		global_filter = palloc0(sizeof(seccomp_filter));
+ 		global_filter->source = pstrdup("global");
+ 		MemoryContextSwitchTo(oldcontext);
+ 		current_filter = global_filter;
+ 
+ 		allow_list = global_syscall_allow_string;
+ 		log_list = global_syscall_log_string;
+ 		error_list = global_syscall_error_string;
+ 		kill_list = global_syscall_kill_string;
+ 		default_action = global_syscall_default;
+ 	}
+ 
+ 	/* Disable ptrace bypass */
+ 	rc = prctl(PR_SET_DUMPABLE, 0, 0, 0, 0);
+ 	if (rc < 0)
+ 	{
+ 		ereport(WARNING,
+ 				(errcode(ERRCODE_SYSTEM_ERROR),
+ 				 errmsg("seccomp could not set dumpable: %m")));
+ 		result = false;
+ 		goto out;
+ 	}
+ 
+ 	/* set the seccomp default action */
+ 	if (default_action == PG_SECCOMP_ERROR)
+ 		def_action = SCMP_ACT_ERRNO(EACCES);
+ 	else if (default_action == PG_SECCOMP_KILL)
+ 		def_action = SCMP_ACT_KILL;
+ 	else if (default_action == PG_SECCOMP_LOG)
+ 		def_action = SCMP_ACT_LOG;
+ 	else if (default_action == PG_SECCOMP_ALLOW)
+ 		def_action = SCMP_ACT_ALLOW;
+ 	else
+ 	{
+ 		/* unknown enforce action type */
+ 		ereport(WARNING,
+ 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 				 errmsg("seccomp default action unknown")));
+ 		result = false;
+ 		goto out;
+ 	}
+ 	/* preserve and log the setting */
+ 	set_filter_def_action(default_action, current_filter, context);
+ 
+ 	/* Initialize seccomp with default action */
+ 	ctx = seccomp_init(def_action);
+ 	if (ctx == NULL)
+ 	{
+ 		ereport(WARNING, (errcode(ERRCODE_OUT_OF_MEMORY),
+ 						  errmsg("out of memory")));
+ 		result = false;
+ 		goto out;
+ 	}
+ 
+ 	/*
+ 	 * By default, libseccomp will set up audit logging
+ 	 * such that actions KILL and LOG will get audit records,
+ 	 * however ERRNO will not. Arrange to have all not-allowed
+ 	 * syscalls logged instead.
+ 	 */
+ 	rc = seccomp_attr_set(ctx, SCMP_FLTATR_CTL_LOG, 1);
+ 	if (rc != 0)
+ 	{
+ 		ereport(WARNING,
+ 				(errcode(ERRCODE_SYSTEM_ERROR),
+ 				 errmsg("seccomp failed to set audit actions")));
+ 		result = false;
+ 		goto out;
+ 	}
+ 
+ 	if (!
+ 		 (apply_seccomp_list(&ctx, allow_list, SCMP_ACT_ALLOW,
+ 							 def_action, current_filter) &&
+ 		  apply_seccomp_list(&ctx, log_list, SCMP_ACT_LOG,
+ 							 def_action, current_filter) &&
+ 		  apply_seccomp_list(&ctx, error_list, SCMP_ACT_ERRNO(EACCES),
+ 							 def_action, current_filter) &&
+ 		  apply_seccomp_list(&ctx, kill_list, SCMP_ACT_KILL,
+ 							 def_action, current_filter)))
+ 	{
+ 		result = false;
+ 		goto out;
+ 	}
+ 
+ 	/*
+ 	 * Although libseccomp will silently throw away repeated filter
+ 	 * rules against the same syscall (unless arguments are checked,
+ 	 * which we are not supporting here), it can lead to confusing
+ 	 * results, so disallow that here.
+ 	 */
+ 	if (bms_overlap(current_filter->allow, current_filter->log) ||
+ 		bms_overlap(current_filter->error, current_filter->kill) ||
+ 		bms_overlap(bms_union(current_filter->allow, current_filter->log),
+ 					bms_union(current_filter->error, current_filter->kill)))
+ 	{
+ 		ereport(WARNING,
+ 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 				 errmsg("seccomp failed due to overlapping rule sets")));
+ 		result = false;
+ 		goto out;
+ 	}
+ 
+ 	/* Finally, actually load the filter */
+ 	rc = seccomp_load(ctx);
+ 	if (rc != 0)
+ 	{
+ 		ereport(WARNING,
+ 				(errcode(ERRCODE_SYSTEM_ERROR),
+ 				 errmsg("seccomp failed to load rule set")));
+ 		result = false;
+ 		goto out;
+ 	}
+ 
+ out:
+ 	/* safe to release if NULL/NIL */
+ 	seccomp_release(ctx);
+ 
+ 	return result;
+ #else
+ 	return false;
+ #endif
+ }
+ 
+ #ifdef USE_SECCOMP
+ static bool
+ apply_seccomp_list(scmp_filter_ctx	*ctx, const char *slist,
+ 				   uint32_t rule_action, uint32_t def_action,
+ 				   seccomp_filter *current_filter)
+ {
+ 	char		   *rawstring = NULL;
+ 	List		   *elemlist = NIL;
+ 	ListCell	   *l;
+ 	bool			result = true;
+ 	MemoryContext	oldcontext;
+ 
+ 	/*
+ 	 * libseccomp disallows the case where individual syscall rules
+ 	 * are created with the same action as the default. Therefore,
+ 	 * be careful not to add those rules to the filter we are creating.
+ 	 */
+ 	if (rule_action == def_action)
+ 		return true;
+ 
+ 	/* Need a modifiable copy */
+ 	rawstring = pstrdup(slist);
+ 
+ 	/* Parse string into list of syscalls */
+ 	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+ 	{
+ 		result = false;
+ 		goto out;
+ 	}
+ 
+ 	/* add syscall specific rules to the filter */
+ 	foreach(l, elemlist)
+ 	{
+ 		char   *cursyscall = (char *) lfirst(l);
+ 		int		syscallnum;
+ 		int		rc;
+ 
+ 		/*
+ 		 * Resolve the syscall name to its number on the current arch.
+ 		 * This should have already been validated by the GUC
+ 		 * check function.
+ 		 */
+ 		syscallnum = seccomp_syscall_resolve_name(cursyscall);
+ 		if (syscallnum < 0)
+ 		{
+ 			/* should not happen */
+ 			ereport(WARNING,
+ 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 					 errmsg("seccomp failed to resolve: syscall \"%s\"",
+ 							cursyscall)));
+ 			result = false;
+ 			goto out;
+ 		}
+ 		else
+ 		{
+ 			rc = seccomp_rule_add(*ctx, rule_action, syscallnum, 0);
+ 			if (rc != 0)
+ 			{
+ 				/* should not be reachable */
+ 				ereport(WARNING,
+ 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 						 errmsg("seccomp failed to add rule: syscall \"%s\", %d",
+ 								 cursyscall, syscallnum)));
+ 				result = false;
+ 				goto out;
+ 			}
+ 			oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+ 
+ 			if (rule_action == SCMP_ACT_ALLOW)
+ 				current_filter->allow = bms_add_member(current_filter->allow,
+ 													   syscallnum);
+ 			else if (rule_action == SCMP_ACT_LOG)
+ 				current_filter->log = bms_add_member(current_filter->log,
+ 													   syscallnum);
+ 			else if (rule_action == SCMP_ACT_ERRNO(EACCES))
+ 				current_filter->error = bms_add_member(current_filter->error,
+ 													   syscallnum);
+ 			else if (rule_action == SCMP_ACT_KILL)
+ 				current_filter->kill = bms_add_member(current_filter->kill,
+ 													   syscallnum);
+ 
+ 			MemoryContextSwitchTo(oldcontext);
+ 		}
+ 	}
+ 
+ out:
+ 	/* safe to release if still NIL */
+ 	list_free(elemlist);
+ 
+ 	/* but pfree is not */
+ 	if (rawstring)
+ 		pfree(rawstring);
+ 
+ 	return result;
+ }
+ 
+ static const char*
+ expand_seccomp_list(const char *slist, const char *glist,
+ 					const char *saction)
+ {
+ 	
+ 	if (slist && strlen(slist) == 1 && slist[0] == '*')
+ 	{
+ 		/* use the global list as promised */
+ 		ereport(LOG,
+ 				(errmsg("seccomp \"%s\" list inherited from postmaster", saction)));
+ 
+ 		return glist;
+ 	}
+ 	else
+ 		return slist;
+ }
+ 
+ static void
+ set_filter_def_action(int default_action, seccomp_filter *current_filter,
+ 					  char *context)
+ {
+ 	const struct config_enum_entry *entry;
+ 
+ 	current_filter->def = default_action;
+ 	/* stringify the enforcement action levels */
+ 	for (entry = seccomp_options; entry->name; entry++)
+ 	{
+ 		if (entry->val == default_action)
+ 		{
+ 			current_filter->def_str = entry->name;
+ 			break;
+ 		}
+ 	}
+ 	ereport(LOG,
+ 			(errmsg("seccomp default action set to \"%s\": context \"%s\"",
+ 					current_filter->def_str, context)));
+ }
+ #endif /* USE_SECCOMP */
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 43b9f17..aac1940 100644
*** a/src/backend/utils/init/postinit.c
--- b/src/backend/utils/init/postinit.c
*************** InitPostgres(const char *in_dbname, Oid
*** 1056,1061 ****
--- 1056,1076 ----
  	/* Process pg_db_role_setting options */
  	process_settings(MyDatabaseId, GetSessionUserId());
  
+ #ifdef USE_SECCOMP
+ 	/* If seccomp filtering is requested, do the backend lockdown */
+ 	if (!bootstrap &&
+ 		!IsAutoVacuumWorkerProcess() &&
+ 		 IsUnderPostmaster)
+ 	{
+ 		if(!load_seccomp_filter("session"))
+ 		{
+ 			ereport(FATAL,
+ 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 					 errmsg("failed to load session seccomp filter")));
+ 		}
+ 	}
+ #endif
+ 
  	/* Apply PostAuthDelay as soon as we've read all options */
  	if (PostAuthDelay > 0)
  		pg_usleep(PostAuthDelay * 1000000L);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 90ffd89..8b548f1 100644
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** extern const struct config_enum_entry ar
*** 478,483 ****
--- 478,484 ----
  extern const struct config_enum_entry recovery_target_action_options[];
  extern const struct config_enum_entry sync_method_options[];
  extern const struct config_enum_entry dynamic_shared_memory_options[];
+ extern const struct config_enum_entry seccomp_options[];
  
  /*
   * GUC option variables that are exported from this module
*************** static struct config_bool ConfigureNames
*** 1952,1957 ****
--- 1953,1968 ----
  		NULL, NULL, NULL
  	},
  
+ 	{
+ 		{"seccomp", PGC_POSTMASTER, RESOURCES_KERNEL,
+ 			gettext_noop("Turns on seccomp syscall enforcement."),
+ 			NULL
+ 		},
+ 		&seccomp_enabled,
+ 		false,
+ 		NULL, NULL, NULL
+ 	},
+ 
  	/* End-of-list marker */
  	{
  		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
*************** static struct config_string ConfigureNam
*** 4199,4204 ****
--- 4210,4303 ----
  		NULL, NULL, NULL
  	},
  
+ 	{
+ 		{"global_syscall_allow", PGC_POSTMASTER, RESOURCES_KERNEL,
+ 			gettext_noop("Seccomp global syscall allow list."),
+ 			NULL,
+ 			GUC_LIST_INPUT | GUC_SUPERUSER_ONLY
+ 		},
+ 		&global_syscall_allow_string,
+ 		"",
+ 		check_global_syscall_list, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"global_syscall_log", PGC_POSTMASTER, RESOURCES_KERNEL,
+ 			gettext_noop("Seccomp global syscall log list."),
+ 			NULL,
+ 			GUC_LIST_INPUT | GUC_SUPERUSER_ONLY
+ 		},
+ 		&global_syscall_log_string,
+ 		"",
+ 		check_global_syscall_list, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"global_syscall_error", PGC_POSTMASTER, RESOURCES_KERNEL,
+ 			gettext_noop("Seccomp global syscall error list."),
+ 			NULL,
+ 			GUC_LIST_INPUT | GUC_SUPERUSER_ONLY
+ 		},
+ 		&global_syscall_error_string,
+ 		"",
+ 		check_global_syscall_list, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"global_syscall_kill", PGC_POSTMASTER, RESOURCES_KERNEL,
+ 			gettext_noop("Seccomp global syscall kill list."),
+ 			NULL,
+ 			GUC_LIST_INPUT | GUC_SUPERUSER_ONLY
+ 		},
+ 		&global_syscall_kill_string,
+ 		"",
+ 		check_global_syscall_list, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"session_syscall_allow", PGC_SUSET, RESOURCES_KERNEL,
+ 			gettext_noop("Seccomp backend session syscall allow list."),
+ 			NULL,
+ 			GUC_LIST_INPUT | GUC_SUPERUSER_ONLY
+ 		},
+ 		&session_syscall_allow_string,
+ 		"*",
+ 		check_session_syscall_list, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"session_syscall_log", PGC_SUSET, RESOURCES_KERNEL,
+ 			gettext_noop("Seccomp backend session syscall log list."),
+ 			NULL,
+ 			GUC_LIST_INPUT | GUC_SUPERUSER_ONLY
+ 		},
+ 		&session_syscall_log_string,
+ 		"*",
+ 		check_session_syscall_list, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"session_syscall_error", PGC_SUSET, RESOURCES_KERNEL,
+ 			gettext_noop("Seccomp backend session syscall error list."),
+ 			NULL,
+ 			GUC_LIST_INPUT | GUC_SUPERUSER_ONLY
+ 		},
+ 		&session_syscall_error_string,
+ 		"*",
+ 		check_session_syscall_list, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"session_syscall_kill", PGC_SUSET, RESOURCES_KERNEL,
+ 			gettext_noop("Seccomp backend session syscall kill list."),
+ 			NULL,
+ 			GUC_LIST_INPUT | GUC_SUPERUSER_ONLY
+ 		},
+ 		&session_syscall_kill_string,
+ 		"*",
+ 		check_session_syscall_list, NULL, NULL
+ 	},
+ 
  	/* End-of-list marker */
  	{
  		{NULL, 0, 0, NULL, NULL}, NULL, NULL, NULL, NULL, NULL
*************** static struct config_enum ConfigureNames
*** 4537,4542 ****
--- 4636,4661 ----
  		NULL, NULL, NULL
  	},
  
+ 	{
+ 		{"global_syscall_default", PGC_POSTMASTER, RESOURCES_KERNEL,
+ 			gettext_noop("Seccomp global syscall default action."),
+ 			NULL
+ 		},
+ 		&global_syscall_default,
+ 		PG_SECCOMP_ALLOW, seccomp_options,
+ 		NULL, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"session_syscall_default", PGC_SUSET, RESOURCES_KERNEL,
+ 			gettext_noop("Seccomp backend session syscall default action."),
+ 			NULL
+ 		},
+ 		&session_syscall_default,
+ 		PG_SECCOMP_ALLOW, seccomp_options,
+ 		NULL, NULL, NULL
+ 	},
+ 
  	/* End-of-list marker */
  	{
  		{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 39fc787..82187c1 100644
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 154,159 ****
--- 154,177 ----
  
  #max_files_per_process = 1000		# min 25
  					# (change requires restart)
+ #seccomp = off				# use seccomp
+ 					# (change requires restart)
+ 
+ #global_syscall_default = allow		# postmaster default syscall action:
+ 					# allow, log, error, kill
+ #global_syscall_allow = ''			# postmaster syscall allow list
+ #global_syscall_log = ''			# postmaster syscall log list
+ #global_syscall_error = ''			# postmaster syscall error list
+ #global_syscall_kill = ''			# postmaster syscall kill list
+ 					# (global_syscall* change requires restart)
+ 
+ #session_syscall_default = allow	# session default syscall action:
+ 					# allow, log, error, kill
+ #session_syscall_allow = '*'		# backend session syscall allow list
+ #session_syscall_log = '*'			# backend session syscall log list
+ #session_syscall_error = '*'		# backend session syscall error list
+ #session_syscall_kill = '*'		# backend session syscall kill list
+ 					# session_syscall* default '*' = use global list
  
  # - Cost-Based Vacuum Delay -
  
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index cf1f409..a32522e 100644
*** a/src/include/catalog/pg_proc.dat
--- b/src/include/catalog/pg_proc.dat
***************
*** 10678,10683 ****
--- 10678,10690 ----
    proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
    proargnames => '{tablespace,name,size,modification}',
    prosrc => 'pg_ls_tmpdir_1arg' },
+ { oid => '8657', descr => 'get current effective seccomp filter actions',
+   proname => 'pg_get_seccomp_filter', prorows => '100', proretset => 't',
+   provolatile => 's', proparallel => 'r', prorettype => 'record',
+   proargtypes => '', proallargtypes => '{text,int4,text,text}',
+   proargmodes => '{o,o,o,o}',
+   proargnames => '{syscall,syscallnum,filter_action,context}',
+   prosrc => 'pg_get_seccomp_filter' },
  
  # hash partitioning constraint function
  { oid => '5028', descr => 'hash partition CHECK constraint',
diff --git a/src/include/commands/variable.h b/src/include/commands/variable.h
index 5f43414..58cf427 100644
*** a/src/include/commands/variable.h
--- b/src/include/commands/variable.h
*************** extern void assign_session_authorization
*** 34,38 ****
--- 34,40 ----
  extern bool check_role(char **newval, void **extra, GucSource source);
  extern void assign_role(const char *newval, void *extra);
  extern const char *show_role(void);
+ extern bool check_global_syscall_list(char **newval, void **extra, GucSource source);
+ extern bool check_session_syscall_list(char **newval, void **extra, GucSource source);
  
  #endif							/* VARIABLE_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index bc6e03f..1e6745f 100644
*** a/src/include/miscadmin.h
--- b/src/include/miscadmin.h
***************
*** 26,31 ****
--- 26,32 ----
  #include <signal.h>
  
  #include "datatype/timestamp.h" /* for TimestampTz */
+ #include "nodes/bitmapset.h"	/* for seccomp */
  #include "pgtime.h"				/* for pg_time_t */
  
  
*************** extern void ChangeToDataDir(void);
*** 333,338 ****
--- 334,341 ----
  extern void SwitchToSharedLatch(void);
  extern void SwitchBackToLocalLatch(void);
  
+ extern bool load_seccomp_filter(char *context);
+ 
  /* in utils/misc/superuser.c */
  extern bool superuser(void);	/* current user is superuser */
  extern bool superuser_arg(Oid roleid);	/* given user is superuser */
*************** extern void process_session_preload_libr
*** 447,452 ****
--- 450,485 ----
  extern void pg_bindtextdomain(const char *domain);
  extern bool has_rolreplication(Oid roleid);
  
+ typedef struct seccomp_filter
+ {
+ 	char		   *source;
+ 	int				def;
+ 	const char	   *def_str;
+ 	Bitmapset	   *allow;
+ 	Bitmapset	   *log;
+ 	Bitmapset	   *error;
+ 	Bitmapset	   *kill;
+ } seccomp_filter;
+ extern seccomp_filter *global_filter;
+ extern seccomp_filter *session_filter;
+ extern bool seccomp_enabled;
+ extern int global_syscall_default;
+ extern int session_syscall_default;
+ extern char *global_syscall_allow_string;
+ extern char *global_syscall_log_string;
+ extern char *global_syscall_error_string;
+ extern char *global_syscall_kill_string;
+ extern char *session_syscall_allow_string;
+ extern char *session_syscall_log_string;
+ extern char *session_syscall_error_string;
+ extern char *session_syscall_kill_string;
+ /* seccomp enforce actions in increasing order of precedence */
+ #define PG_SECCOMP_ALLOW    0  /* allow */
+ #define PG_SECCOMP_LOG      1  /* log */
+ #define PG_SECCOMP_ERROR    2  /* permission denied error */
+ #define PG_SECCOMP_KILL     3  /* kill process */
+ 
+ 
  /* in access/transam/xlog.c */
  extern bool BackupInProgress(void);
  extern void CancelBackup(void);
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index d876926..dc7fdaf 100644
*** a/src/include/pg_config.h.in
--- b/src/include/pg_config.h.in
***************
*** 353,358 ****
--- 353,361 ----
  /* Define if you have a function readline library */
  #undef HAVE_LIBREADLINE
  
+ /* Define to 1 if you have the `seccomp' library (-lseccomp). */
+ #undef HAVE_LIBSECCOMP
+ 
  /* Define to 1 if you have the `selinux' library (-lselinux). */
  #undef HAVE_LIBSELINUX
  
***************
*** 935,940 ****
--- 938,946 ----
  /* Define to 1 to build with PAM support. (--with-pam) */
  #undef USE_PAM
  
+ /* Define to 1 to build with seccomp support. (--with-seccomp) */
+ #undef USE_SECCOMP
+ 
  /* Define to 1 to use software CRC-32C implementation (slicing-by-8). */
  #undef USE_SLICING_BY_8_CRC32C

David Fetter

david@fetter.org

over 6 years ago

In reply to: Joe Conway (#1)

Re: RFC: seccomp-bpf support

On Wed, Aug 28, 2019 at 11:13:27AM -0400, Joe Conway wrote:

SECCOMP ("SECure COMPuting with filters") is a Linux kernel syscall
filtering mechanism which allows reduction of the kernel attack surface
by preventing (or at least audit logging) normally unused syscalls.

Quoting from this link:
https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt

"A large number of system calls are exposed to every userland process
with many of them going unused for the entire lifetime of the
process. As system calls change and mature, bugs are found and
eradicated. A certain subset of userland applications benefit by
having a reduced set of available system calls. The resulting set
reduces the total kernel surface exposed to the application. System
call filtering is meant for use with those applications."

Recent security best-practices recommend, and certain highly
security-conscious organizations are beginning to require, that SECCOMP
be used to the extent possible. The major web browsers, container
runtime engines, and systemd are all examples of software that already
support seccomp.

Neat!

Are the seccomp interfaces for other kernels arranged in a manner
similar enough to have a unified interface in PostgreSQL, or is this
more of a Linux-only feature?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Peter Eisentraut

peter.eisentraut@2ndquadrant.com

over 6 years ago

In reply to: Joe Conway (#1)

Re: RFC: seccomp-bpf support

On 2019-08-28 17:13, Joe Conway wrote:

* systemd does not implement seccomp filters by default. Packagers may
decide to do so, but there is no guarantee. Adding them post install
potentially requires cooperation by groups outside control of
the database admins.

Well, if we are interested in this, why don't we just ask packagers to
do so? That seems like a much easier solution.

* In the container and systemd case there is no particularly good way to
inspect what filters are active. It is possible to observe actions
taken, but again, control is possibly outside the database admin
group. For example, the best way to understand what happened is to
review the auditd log, which is likely not readable by the DBA.

Why does one need to know what filters are active (other than,
obviously, checking that the filters one has set were actually
activated)? What decisions would a DBA or database user be able to make
based on whether a filter is set or not?

* With built-in support, it is possible to lock down backend processes
more tightly than the postmaster.

Also the other way around?

* With built-in support, it is possible to lock down different backend
processes differently than each other, for example by using ALTER ROLE
... SET or ALTER DATABASE ... SET.

What are use cases for this?

Attached is a patch for discussion, adding support for seccomp-bpf
(nowadays generally just called seccomp) syscall filtering at
configure-time using libseccomp. I would like to get this in shape to be
committed by the end of the November CF if possible.

To compute the initial set of allowed system calls, you need to have
fantastic test coverage. What you don't want is some rarely used error
recovery path to cause a system crash. I wouldn't trust our current
coverage for this.

Also, the list of system calls in use changes all the time. The same
function call from PostgreSQL could be a system call or a glibc
implementation, depending on the specific installed packages or run-time
settings.

Extensions would need to maintain their own permissions list, and they
would need to be merged manually into the respective existing settings.

Without good answers to these, I suspect that this feature would go
straight to the top of the list of "if in doubt, turn off".

Overall, I think this sounds like a maintenance headache, and the
possible benefits are unclear.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Joe Conway

mail@joeconway.com

over 6 years ago

In reply to: Peter Eisentraut (#3)

Re: RFC: seccomp-bpf support

On 8/28/19 1:03 PM, Peter Eisentraut wrote:

On 2019-08-28 17:13, Joe Conway wrote:

* systemd does not implement seccomp filters by default. Packagers may
decide to do so, but there is no guarantee. Adding them post install
potentially requires cooperation by groups outside control of
the database admins.

Well, if we are interested in this, why don't we just ask packagers to
do so? That seems like a much easier solution.

For the reason listed below

* In the container and systemd case there is no particularly good way to
inspect what filters are active. It is possible to observe actions
taken, but again, control is possibly outside the database admin
group. For example, the best way to understand what happened is to
review the auditd log, which is likely not readable by the DBA.

Why does one need to know what filters are active (other than,
obviously, checking that the filters one has set were actually
activated)? What decisions would a DBA or database user be able to make
based on whether a filter is set or not?

So that when an enforcement action happens, you can understand why it
happened. Perhaps it was a bug (omission) in your allow list, or perhaps
it was an intentional rule to prevent abuse (say blocking certain
syscalls from plperlu), or it was because someone is trying to
compromise your system (e.g. some obscure and clearly not needed syscall).

* With built-in support, it is possible to lock down backend processes
more tightly than the postmaster.

Also the other way around?

As I stated in the original email, the filters can add restrictions but
never remove them.
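
The "add but never remove" behaviour can be pictured as taking the higher-precedence action of the two layers. The following is only a minimal sketch: the PG_SECCOMP_* values are copied from the patch's miscadmin.h hunk, while effective_action() is a hypothetical helper that does not appear in the patch.

```c
/* Action codes as defined in the patch, in increasing order of precedence. */
#define PG_SECCOMP_ALLOW    0	/* allow */
#define PG_SECCOMP_LOG      1	/* log */
#define PG_SECCOMP_ERROR    2	/* permission denied error */
#define PG_SECCOMP_KILL     3	/* kill process */

/*
 * Hypothetical helper: combine the action the postmaster-level filter
 * assigns to a syscall with the action a session-level filter assigns to
 * it. The more restrictive (higher-numbered) action always wins, so a
 * session filter can tighten the global filter but never loosen it.
 */
static int
effective_action(int global_action, int session_action)
{
	return (session_action > global_action) ? session_action : global_action;
}
```

With this model, a session that sets "allow" for a syscall the postmaster already maps to "error" still gets "error", which is why backends can be locked down more tightly than the postmaster but not the other way around.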

* With built-in support, it is possible to lock down different backend
processes differently than each other, for example by using ALTER ROLE
... SET or ALTER DATABASE ... SET.

What are use cases for this?

For example to allow a specific user to use plperlu to exec shell code
while others cannot.

Attached is a patch for discussion, adding support for seccomp-bpf
(nowadays generally just called seccomp) syscall filtering at
configure-time using libseccomp. I would like to get this in shape to be
committed by the end of the November CF if possible.

To compute the initial set of allowed system calls, you need to have
fantastic test coverage. What you don't want is some rarely used error
recovery path to cause a system crash. I wouldn't trust our current
coverage for this.

So if you are worried about that make your default action 'log' and
watch audit.log. There will be no errors or crashes of postgres caused
by that because there will be no change in postgres visible behavior.
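
As a rough illustration of how a 'log' default behaves, here is a sketch of a per-syscall lookup. The action codes come from the patch; demo_filter and demo_lookup() are hypothetical, use a plain 64-bit mask instead of the patch's Bitmapset, and assume that a syscall present in several lists gets the highest-precedence action (the patch text shown in this thread does not confirm that detail).

```c
#include <stdint.h>

#define PG_SECCOMP_ALLOW    0	/* allow */
#define PG_SECCOMP_LOG      1	/* log */
#define PG_SECCOMP_ERROR    2	/* permission denied error */
#define PG_SECCOMP_KILL     3	/* kill process */

/* Hypothetical, simplified filter: bit n set means syscall n is listed. */
typedef struct demo_filter
{
	int			def;		/* action for syscalls in no list */
	uint64_t	allow;
	uint64_t	log;
	uint64_t	error;
	uint64_t	kill;
} demo_filter;

/*
 * Return the action for syscall number nr (0..63 in this sketch).
 * Lists are checked from most to least restrictive; anything not listed
 * falls through to the default action.
 */
static int
demo_lookup(const demo_filter *f, int nr)
{
	uint64_t	bit = UINT64_C(1) << nr;

	if (f->kill & bit)
		return PG_SECCOMP_KILL;
	if (f->error & bit)
		return PG_SECCOMP_ERROR;
	if (f->log & bit)
		return PG_SECCOMP_LOG;
	if (f->allow & bit)
		return PG_SECCOMP_ALLOW;
	return f->def;
}
```

With def set to PG_SECCOMP_LOG and only the allow list populated, every syscall missing from the allow list resolves to "log": it still executes, and the omission shows up in the audit log rather than as an error.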

And if returning an error from a syscall causes a crash that would be a
serious bug and we should fix it.

Also, the list of system calls in use changes all the time. The same
function call from PostgreSQL could be a system call or a glibc
implementation, depending on the specific installed packages or run-time
settings.

True. That is why I did not provide an initial list and believe folks
who want to use this should develop their own.

Extensions would need to maintain their own permissions list, and they
would need to be merged manually into the respective existing settings.

People would have to generate their own list for extensions -- I don't
believe it is the extension writers' problem.

Without good answers to these, I suspect that this feature would go
straight to the top of the list of "if in doubt, turn off".

That is fine. Perhaps most people never use this, but when needed (and
increasingly will be required) it is available.

Overall, I think this sounds like a maintenance headache, and the
possible benefits are unclear.

The only people who will incur the maintenance headache are those who
need to use it. The benefits are compliance with requirements.

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development

Joe Conway

mail@joeconway.com

over 6 years ago

In reply to: David Fetter (#2)

Re: RFC: seccomp-bpf support

On 8/28/19 12:47 PM, David Fetter wrote:

On Wed, Aug 28, 2019 at 11:13:27AM -0400, Joe Conway wrote:

SECCOMP ("SECure COMPuting with filters") is a Linux kernel syscall
filtering mechanism which allows reduction of the kernel attack surface
by preventing (or at least audit logging) normally unused syscalls.

Quoting from this link:
https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt

"A large number of system calls are exposed to every userland process
with many of them going unused for the entire lifetime of the
process. As system calls change and mature, bugs are found and
eradicated. A certain subset of userland applications benefit by
having a reduced set of available system calls. The resulting set
reduces the total kernel surface exposed to the application. System
call filtering is meant for use with those applications."

Recent security best-practices recommend, and certain highly
security-conscious organizations are beginning to require, that SECCOMP
be used to the extent possible. The major web browsers, container
runtime engines, and systemd are all examples of software that already
support seccomp.

Neat!

Are the seccomp interfaces for other kernels arranged in a manner
similar enough to have a unified interface in PostgreSQL, or is this
more of a Linux-only feature?

As far as I know libseccomp is Linux specific, at least at the moment.

Joe
--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development

Tom Lane

tgl@sss.pgh.pa.us

over 6 years ago

In reply to: Peter Eisentraut (#3)

Re: RFC: seccomp-bpf support

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:

On 2019-08-28 17:13, Joe Conway wrote:

Attached is a patch for discussion, adding support for seccomp-bpf
(nowadays generally just called seccomp) syscall filtering at
configure-time using libseccomp. I would like to get this in shape to be
committed by the end of the November CF if possible.

To compute the initial set of allowed system calls, you need to have
fantastic test coverage. What you don't want is some rarely used error
recovery path to cause a system crash. I wouldn't trust our current
coverage for this.

Yeah, that seems like quite a serious problem. I think you'd want
to have some sort of static-code-analysis-based way of identifying
the syscalls in use, rather than trying to test your way to it.

Overall, I think this sounds like a maintenance headache, and the
possible benefits are unclear.

After thinking about this for awhile, I really don't follow what
threat model it's trying to protect against. Given that we'll allow
any syscall that an unmodified PG executable might use, it seems
like the only scenarios being protected against involve someone
having already compromised the server enough to have arbitrary code
execution. OK, fine, but then why wouldn't the attacker just
bypass libseccomp? Or tell it to let through the syscall he wants
to use? Having the list of allowed syscalls be determined inside
the process seems like fundamentally the wrong implementation.
I'd have expected a feature like this to be implemented by SELinux,
or some similar technology where the filtering is done by logic
that's outside the executable you wish to not trust.

(After googling for libseccomp, I see that it's supposed to not
allow syscalls to be turned back on once turned off, but that isn't
any protection against this problem. An attacker who's found an ACE
hole in Postgres can just issue ALTER SYSTEM SET to disable the
feature, then force a postmaster restart, then profit.)

I follow the idea of limiting the attack surface for kernel bugs,
but this doesn't seem like a useful implementation of that, even
ignoring the ease-of-use problems Peter mentions.

regards, tom lane

Andres Freund

andres@anarazel.de

over 6 years ago

In reply to: Joe Conway (#1)

Re: RFC: seccomp-bpf support

Hi,

On 2019-08-28 11:13:27 -0400, Joe Conway wrote:

Recent security best-practices recommend, and certain highly
security-conscious organizations are beginning to require, that SECCOMP
be used to the extent possible. The major web browsers, container
runtime engines, and systemd are all examples of software that already
support seccomp.

Maybe I'm missing something, but it's not clear to me what meaningful
attack surface can be reduced for PostgreSQL by forbidding certain
syscalls, given the wide variety of syscalls required to run postgres.
That's different from something like a browser's CSS process, or such,
which really doesn't need much beyond some IPC and memory
allocations. But postgres is going to need syscalls as broad as
fork/clone, exec, connect, shm*, etc. I guess you can argue that we'd
still reduce the attack surface for kernel escalations, but that seems
like a pretty small win compared to the cost.

* With built-in support, it is possible to lock down backend processes
more tightly than the postmaster.

Which important syscalls would you get away with removing in backends
that postmaster needs? I think the only one - which is a good one though
- that I can think of is listen(). But even that might be too
restrictive for some PLs running out of process.

My main problem with seccomp is that it's *incredibly* fragile,
especially for a program as complex as postgres. We already had seccomp
related bug reports on list, even just due to the very permissive
filtering by some container solutions.

There's regularly new syscalls (e.g. epoll_create1(), and we'll soon get
openat2()), different versions of glibc use different syscalls
(e.g. switching from open() to always using openat()), the system
configuration influences which syscalls are being used (e.g.
vsyscalls only being used for certain clock sources), and kernel
bugfixes change the exact set of syscalls being used ([1]).

[1]: https://lwn.net/Articles/795128/
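
To make the open()/openat() point concrete, the sketch below resolves a few names to syscall numbers for the platform it is compiled on. The helper name syscall_nr_or_missing() is hypothetical; the SYS_* macros come from <sys/syscall.h>, and which of them exist depends on the architecture and header version.

```c
#include <string.h>
#include <sys/syscall.h>

/*
 * Hypothetical helper: return the syscall number for name on this build,
 * or -1 if the platform headers do not define it. Newer architectures do
 * not provide open() as a syscall at all, and older headers lack openat2().
 */
static long
syscall_nr_or_missing(const char *name)
{
#ifdef SYS_open
	if (strcmp(name, "open") == 0)
		return SYS_open;
#endif
#ifdef SYS_openat
	if (strcmp(name, "openat") == 0)
		return SYS_openat;
#endif
#ifdef SYS_openat2
	if (strcmp(name, "openat2") == 0)
		return SYS_openat2;
#endif
	return -1;
}
```

An allow list that names only "open" would therefore miss the syscall actually issued on a system whose libc routes file opens through openat().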

Then there's also the issue that many extensions are going to need
additional syscalls.

Notes on usage:
===============
In order to determine your minimally required allow lists, do something
like the following on a non-production server with the same architecture
as production:

c) Cut and paste the result as the value of session_syscall_allow.

That seems nearly guaranteed to miss a significant fraction of
syscalls. There's just no way we're going to cover all the potential
paths and configurations in our testsuite.

I think if you actually wanted to do something like this, you'd need to
use static analysis to come up with a more reliable list.

Greetings,

Andres Freund

andres@anarazel.de

over 6 years ago

In reply to: Joe Conway (#4)

Re: RFC: seccomp-bpf support

Hi,

On 2019-08-28 13:28:06 -0400, Joe Conway wrote:

To compute the initial set of allowed system calls, you need to have
fantastic test coverage. What you don't want is some rarely used error
recovery path to cause a system crash. I wouldn't trust our current
coverage for this.

So if you are worried about that make your default action 'log' and
watch audit.log. There will be no errors or crashes of postgres caused
by that because there will be no change in postgres visible behavior.

But the benefit of integrating this into postgres becomes even less
clear.

And if returning an error from a syscall causes a crash that would be a
serious bug and we should fix it.

Err, there's a lot of syscall failures that'll cause PANICs, and where
there's no reasonable way around that.

Greetings,

Andres Freund

Joshua Brindle

joshua.brindle@crunchydata.com

over 6 years ago

In reply to: Tom Lane (#6)

Re: RFC: seccomp-bpf support

On Wed, Aug 28, 2019 at 1:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:

On 2019-08-28 17:13, Joe Conway wrote:

Attached is a patch for discussion, adding support for seccomp-bpf
(nowadays generally just called seccomp) syscall filtering at
configure-time using libseccomp. I would like to get this in shape to be
committed by the end of the November CF if possible.

To compute the initial set of allowed system calls, you need to have
fantastic test coverage. What you don't want is some rarely used error
recovery path to cause a system crash. I wouldn't trust our current
coverage for this.

Yeah, that seems like quite a serious problem. I think you'd want
to have some sort of static-code-analysis-based way of identifying
the syscalls in use, rather than trying to test your way to it.

Overall, I think this sounds like a maintenance headache, and the
possible benefits are unclear.

After thinking about this for awhile, I really don't follow what
threat model it's trying to protect against. Given that we'll allow
any syscall that an unmodified PG executable might use, it seems
like the only scenarios being protected against involve someone
having already compromised the server enough to have arbitrary code
execution. OK, fine, but then why wouldn't the attacker just
bypass libseccomp? Or tell it to let through the syscall he wants
to use? Having the list of allowed syscalls be determined inside
the process seems like fundamentally the wrong implementation.
I'd have expected a feature like this to be implemented by SELinux,

SELinux is generally an object model and while it does implement e.g.,
capability checks, that is not the main intent, nor is it possible for
LSMs to implement syscall filters, the features are orthogonal.

or some similar technology where the filtering is done by logic
that's outside the executable you wish to not trust.
(After googling for libseccomp, I see that it's supposed to not
allow syscalls to be turned back on once turned off, but that isn't
any protection against this problem. An attacker who's found an ACE
hole in Postgres can just issue ALTER SYSTEM SET to disable the
feature, then force a postmaster restart, then profit.)

My preference would have been to enable it unconditionally but Joe was
being more practical.

I follow the idea of limiting the attack surface for kernel bugs,
but this doesn't seem like a useful implementation of that, even
ignoring the ease-of-use problems Peter mentions.

Limiting the kernel attack surface for network facing daemons is
imperative to hardening systems. All modern attacks are chained
together so a kernel bug is useful only if you can execute code, and
PG is a decent vector for executing code.

At a minimum I would urge the community to look at adding high risk
syscalls to the kill list, systemd has some predefined sets we can
pick from like @obsolete, @cpu-emulation, @privileged, @mount, and
@module.

The goal is to prevent an ACE hole in Postgres from becoming a
complete system compromise. This may not do it alone, and security
conscious integrators will want to, for example, add seccomp filters
to systemd to prevent superuser from disabling them. The postmaster
and per-role lists can further reduce the available syscalls based on
the exact extensions and PLs being used. Each step reduces the surface
more and throwing it all out because one step can go rogue is
unsatisfying.

Thank you.

#10

Tom Lane

tgl@sss.pgh.pa.us

over 6 years ago

In reply to: Andres Freund (#7)

Re: RFC: seccomp-bpf support

Andres Freund <andres@anarazel.de> writes:

Maybe I'm missing something, but it's not clear to me what meaningful
attack surface can be reduced for PostgreSQL by forbidding certain
syscalls, given the wide variety of syscalls required to run postgres.

I think the idea is to block access to seldom-used syscalls because
those are exactly the ones most likely to have bugs. (We certainly
hope that all the ones PG uses are well-debugged...) That seems fine.
Whether the incremental protection is really worth the effort is
debatable, but as long as it's only people who think it *is* worth
the effort who have to deal with it, I don't mind.

What I don't like about this proposal is that the filter configuration is
kept somewhere where it's not at all hard for an attacker to modify it.
It can't be a GUC, indeed it can't be in any file that the server has
permissions to write on, or you might as well not bother. So I'd throw
away pretty much everything in the submitted patch, and instead think
about doing the filtering/limiting in something that's launched from the
service start script and in turn launches the postmaster. That makes it
(mostly?) Not Our Problem, but rather an independent component.
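
A launcher of that kind might look roughly like the sketch below. It is not part of the submitted patch: build_deny_filter() and launch_filtered() are hypothetical names, the filter denies a single syscall number purely for illustration, and the architecture check that a production filter needs is omitted. It uses the raw kernel interface (prctl() with a classic BPF program) rather than libseccomp.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <unistd.h>

/*
 * Fill prog[] (room for at least 4 entries) with a BPF program that makes
 * syscall number denied_nr fail with EPERM and allows everything else.
 * Returns the number of instructions written.
 * NOTE: a real filter must also validate seccomp_data.arch; that check
 * is left out to keep the sketch short.
 */
static size_t
build_deny_filter(struct sock_filter *prog, unsigned int denied_nr)
{
	size_t		n = 0;

	/* Load the syscall number. */
	prog[n++] = (struct sock_filter) BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
											  offsetof(struct seccomp_data, nr));
	/* Match: fall through to the EPERM return. No match: skip it. */
	prog[n++] = (struct sock_filter) BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
											  denied_nr, 0, 1);
	prog[n++] = (struct sock_filter) BPF_STMT(BPF_RET | BPF_K,
											  SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA));
	prog[n++] = (struct sock_filter) BPF_STMT(BPF_RET | BPF_K,
											  SECCOMP_RET_ALLOW);
	return n;
}

/*
 * Install the filter in the current process, then exec the program named
 * by argv[0] (for example the postmaster). Returns -1 only on failure.
 */
static int
launch_filtered(char *const argv[], unsigned int denied_nr)
{
	struct sock_filter insns[4];
	struct sock_fprog fprog;

	fprog.len = (unsigned short) build_deny_filter(insns, denied_nr);
	fprog.filter = insns;

	/* Needed so an unprivileged process may install a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;
	if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog) != 0)
		return -1;

	/* The filter is preserved across execve(). */
	execv(argv[0], argv);
	return -1;
}
```

Because the filter is installed before the postmaster starts and is inherited by everything it forks, nothing the server can write to (GUCs, postgresql.auto.conf) influences which syscalls are denied.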

Admittedly, you can't get per-subprocess restrictions that way, but the
incremental value from that seems *really* tiny. If listen() has a bug
you need to fix the bug, not invent this amount of rickety infrastructure
to limit who can call it.

(And, TBH, I'm still wondering why SELinux isn't the way to address this.)

regards, tom lane

#11

Andres Freund

andres@anarazel.de

over 6 years ago

In reply to: Joshua Brindle (#9)

Re: RFC: seccomp-bpf support

Hi,

On 2019-08-28 14:23:00 -0400, Joshua Brindle wrote:

or some similar technology where the filtering is done by logic
that's outside the executable you wish to not trust.
(After googling for libseccomp, I see that it's supposed to not
allow syscalls to be turned back on once turned off, but that isn't
any protection against this problem. An attacker who's found an ACE
hole in Postgres can just issue ALTER SYSTEM SET to disable the
feature, then force a postmaster restart, then profit.)

A postmaster restart might not be enough, because the postmaster's
restrictions can't be removed, once in place. But all that's needed to
circumvent that is to force postmaster to die, and rely on systemd etc to
restart it.

My preference would have been to enable it unconditionally but Joe was
being more practical.

Well, the current approach is to configure the list of allowed syscalls
in postgres. How would you ever secure that against the attacks
described by Tom? As long as the restrictions are put into place by
postgres itself, and as long as they're configurable, such attacks are
possible, no? And as long as extensions etc need different syscalls,
you'll need configurability.

I follow the idea of limiting the attack surface for kernel bugs,
but this doesn't seem like a useful implementation of that, even
ignoring the ease-of-use problems Peter mentions.

Limiting the kernel attack surface for network facing daemons is
imperative to hardening systems. All modern attacks are chained
together so a kernel bug is useful only if you can execute code, and
PG is a decent vector for executing code.

I don't really buy that in the case of postgres. Normally, in a medium to
high security world, once you have RCE in postgres, the valuable data
can already be exfiltrated. And that's game over. The only real benefit
of a kernel vulnerability is that it might allow the attack to persist
for longer - but there's already plenty you can do inside postgres, once
you have RCE.

At a minimum I would urge the community to look at adding high risk
syscalls to the kill list, systemd has some predefined sets we can
pick from like @obsolete, @cpu-emulation, @privileged, @mount, and
@module.

I can see some small value in disallowing these - but we're back to the
point where that is better done one layer *above* postgres, by a process
with more privileges than PG. Because then a PG RCE doesn't have a way
to prevent those filters from being applied.

The postmaster and per-role lists can further reduce the available
syscalls based on the exact extensions and PLs being used.

I don't buy that per-session configurable lists buy you anything
meaningful. With an RCE in one session, it's pretty trivial to corrupt
shared memory to also trigger RCE in other sessions. And there's no way
seccomp or anything like that will prevent that.

An additional reason I'm quite sceptical about more fine grained
restrictions is that I think we're going to have to go for some use of
threading in the next few years. While I think that's still far from
agreed upon, I think there's a pretty large number of "senior" hackers
that see this as the future. You can have per-thread seccomp filters,
but that's so trivial to circumvent (just overwrite some vtable like
data in another thread's data, and have it call whatever gadget you
want), that it's not even worth considering.

Greetings,

Andres Freund

#12

Joshua Brindle

joshua.brindle@crunchydata.com

over 6 years ago

In reply to: Tom Lane (#10)

Re: RFC: seccomp-bpf support

On Wed, Aug 28, 2019 at 2:30 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

<snip>

(And, TBH, I'm still wondering why SELinux isn't the way to address this.)

Just going to address this one now. SELinux is an LSM and therefore
only makes decisions when LSM hooks are invoked, which are not 1:1 for
syscalls (not even close). Further, SELinux generally determines what
objects a subject can access and only implements capabilities because
it has to, to be non-bypassable.

Seccomp filtering is an orthogonal system to SELinux and LSMs in
general. We are already doing work to further restrict PG subprocesses
with SELinux via [1] and [2], but that doesn't address the unused,
high-risk, obsolete, etc syscall issue. A prime example is madvise()
which was a catastrophic failure that 1) isn't preventable by any LSM
including SELinux, 2) isn't used by PG and is therefore a good
candidate for a kill list, and 3) a clear win in the
dont-let-PG-be-a-vector-for-kernel-compromise arena.

We are using SELinux, we are also going to use this, they work together.

#13

Andres Freund

andres@anarazel.de

over 6 years ago

In reply to: Tom Lane (#10)

Re: RFC: seccomp-bpf support

On 2019-08-28 14:30:20 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

Maybe I'm missing something, but it's not clear to me what meaningful
attack surface can be reduced for PostgreSQL by forbidding certain
syscalls, given the wide variety of syscalls required to run postgres.

I think the idea is to block access to seldom-used syscalls because
those are exactly the ones most likely to have bugs.

Yea, I can see some benefit in that. I'm just quite doubtful that the
set of syscalls pg relies on doesn't already allow any determined
attacker to trigger kernel bugs. E.g. the whole sysv ipc code is among
those seldom used areas of the code. As is the permission transfer
through unix sockets. As is forking from within a syscall. ...

(We certainly hope that all the ones PG uses are well-debugged...)

One would hope, but it's not quite my experience...

Whether the incremental protection is really worth the effort is
debatable, but as long as it's only people who think it *is* worth the
effort who have to deal with it, I don't mind.

It seems almost guaranteed that there'll be bug reports about ominous
crashes due to some less commonly used syscall being blacklisted. In a
lot of cases that'll be hard to debug. After all, we already got such
bug reports, without us providing anything builtin - and it's not like
us adding our own filter support will make container solutions not use
their filter, so there's no argument that doing so ourselves will reduce
the fragility.

Admittedly, you can't get per-subprocess restrictions that way, but the
incremental value from that seems *really* tiny. If listen() has a bug
you need to fix the bug, not invent this amount of rickety infrastructure
to limit who can call it.

And, as I mentioned in another email, once you can corrupt shared memory
in arbitrary ways, the differing restrictions aren't worth much
anyway. Postmaster might be separated out enough to survive attacks like
that, but backends definitely aren't.

Greetings,

Andres Freund

#14

Joshua Brindle

joshua.brindle@crunchydata.com

over 6 years ago

In reply to: Joshua Brindle (#12)

Re: RFC: seccomp-bpf support

On Wed, Aug 28, 2019 at 2:47 PM Joshua Brindle
<joshua.brindle@crunchydata.com> wrote:

On Wed, Aug 28, 2019 at 2:30 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

<snip>

(And, TBH, I'm still wondering why SELinux isn't the way to address this.)

Just going to address this one now. SELinux is an LSM and therefore
only makes decisions when LSM hooks are invoked, which are not 1:1 for
syscalls (not even close). Further, SELinux generally determines what
objects a subject can access and only implements capabilities because
it has to, to be non-bypassable.

Seccomp filtering is an orthogonal system to SELinux and LSMs in
general. We are already doing work to further restrict PG subprocesses
with SELinux via [1] and [2], but that doesn't address the unused,

And I forgot the citations *sigh*, actually there should have only been [1]:

1. https://commitfest.postgresql.org/24/2259/

high-risk, obsolete, etc syscall issue. A prime example is madvise()
which was a catastrophic failure that 1) isn't preventable by any LSM
including SELinux, 2) isn't used by PG and is therefore a good
candidate for a kill list, and 3) a clear win in the
dont-let-PG-be-a-vector-for-kernel-compromise arena.

We are using SELinux, we are also going to use this, they work together.

#15

Andres Freund

andres@anarazel.de

over 6 years ago

In reply to: Joshua Brindle (#12)

Re: RFC: seccomp-bpf support

Hi,

On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote:

A prime example is madvise() which was a catastrophic failure that 1)
isn't preventable by any LSM including SELinux, 2) isn't used by PG
and is therefore a good candidate for a kill list, and 3) a clear win
in the dont-let-PG-be-a-vector-for-kernel-compromise arena.

IIRC it's used by glibc as part of its malloc implementation (also
threading etc) - but not necessarily hit during the most common
paths. That's *precisely* my problem with this approach.

Greetings,

Andres Freund

#16

Tom Lane

tgl@sss.pgh.pa.us

over 6 years ago

In reply to: Andres Freund (#13)

Re: RFC: seccomp-bpf support

Andres Freund <andres@anarazel.de> writes:

On 2019-08-28 14:30:20 -0400, Tom Lane wrote:

Admittedly, you can't get per-subprocess restrictions that way, but the
incremental value from that seems *really* tiny. If listen() has a bug
you need to fix the bug, not invent this amount of rickety infrastructure
to limit who can call it.

And, as I mentioned in another email, once you can corrupt shared memory
in arbitrary ways, the differing restrictions aren't worth much
anyway. Postmaster might be separated out enough to survive attacks like
that, but backends definitely aren't.

Another point in this area is that if you did feel a need for per-process
syscall sets, having a restriction that the postmaster's allowed set be a
superset of all the childrens' allowed sets seems quite the wrong thing.
The set of calls the postmaster needs is probably a lot smaller than what
the children need, seeing that it does so little. It's just different
because it includes bind+listen which the children likely don't need.
So the hierarchical way seccomp goes about this seems fairly wrong for
our purposes regardless.

regards, tom lane
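
The superset constraint described above follows from how stacked filters combine: a child inherits its parent's filter and can only add further restrictions, so its effective allowed set is the intersection of the two. A minimal sketch of that arithmetic, assuming a toy bitmask model (the `SC_*` names are hypothetical labels for illustration, not real syscall numbers, and this does not touch the kernel):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical syscall labels for illustration; not real syscall numbers. */
enum
{
    SC_READ = 1u << 0,
    SC_WRITE = 1u << 1,
    SC_BIND = 1u << 2,
    SC_LISTEN = 1u << 3,
    SC_MMAP = 1u << 4,
    SC_SHMAT = 1u << 5
};

typedef uint32_t syscall_set;

/* A stacked filter can only narrow what the inherited filter already allows. */
static syscall_set
effective_allowed(syscall_set inherited, syscall_set own_filter)
{
    return inherited & own_filter;
}

/* True if the given call is present in the allowed set. */
static bool
is_allowed(syscall_set set, syscall_set call)
{
    return (set & call) != 0;
}
```

In this model, a postmaster limited to read, write, bind and listen cannot spawn a backend that is allowed `SC_MMAP`, even if the backend's own filter lists it. To give backends more than the postmaster needs, the postmaster's set has to be widened first.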

#17

Joshua Brindle

joshua.brindle@crunchydata.com

over 6 years ago

In reply to: Andres Freund (#15)

Re: RFC: seccomp-bpf support

On Wed, Aug 28, 2019 at 2:53 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote:

A prime example is madvise() which was a catastrophic failure that 1)
isn't preventable by any LSM including SELinux, 2) isn't used by PG
and is therefore a good candidate for a kill list, and 3) a clear win
in the dont-let-PG-be-a-vector-for-kernel-compromise arena.

IIRC it's used by glibc as part of its malloc implementation (also
threading etc) - but not necessarily hit during the most common
paths. That's *precisely* my problem with this approach.

As long as glibc handles a returned error cleanly the syscall could be
denied without harming the process and the bug would be mitigated.

seccomp also allows argument whitelisting so things can get very
granular, depending on who is setting up the lists.
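
The "returned error" behaviour mentioned above can be sketched with the raw kernel interface. This is an illustrative filter, not the one from the patch: it makes madvise() fail with EPERM and allows everything else. The function names are made up for this example, it assumes a Linux kernel with seccomp filter support, and it skips the `seccomp_data.arch` check that a real filter should perform. It does not show argument whitelisting.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>

/*
 * Install a filter that makes madvise() return EPERM and allows all other
 * syscalls. Returns 0 on success, -1 if the kernel refuses the filter.
 * Sketch only: a production filter should also validate seccomp_data.arch.
 */
static int
install_madvise_errno_filter(void)
{
    struct sock_filter insns[] = {
        /* Load the syscall number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* If it is madvise, fall through to ERRNO; otherwise skip to ALLOW. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_madvise, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short) (sizeof(insns) / sizeof(insns[0])),
        .filter = insns,
    };

    /* Required so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
        return -1;
    return 0;
}

/*
 * Try the filter in a forked child so the caller itself stays unfiltered.
 * Returns 1 if madvise() was denied with EPERM, 0 if it was not denied,
 * and -1 if the filter could not be installed or the child could not run.
 */
static int
madvise_denied_in_child(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
    {
        long pagesz = sysconf(_SC_PAGESIZE);
        void *p = mmap(NULL, (size_t) pagesz, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED || install_madvise_errno_filter() != 0)
            _exit(2);
        errno = 0;
        if (madvise(p, (size_t) pagesz, MADV_DONTNEED) == -1 && errno == EPERM)
            _exit(0);
        _exit(1);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 0)
        return 1;
    return (WEXITSTATUS(status) == 1) ? 0 : -1;
}
```

Whether the denial is harmless then depends entirely on how the caller (here, glibc's allocator) reacts to the failed call, which is the point under dispute in this subthread.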

#18

Tom Lane

tgl@sss.pgh.pa.us

over 6 years ago

In reply to: Andres Freund (#15)

Re: RFC: seccomp-bpf support

Andres Freund <andres@anarazel.de> writes:

On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote:

A prime example is madvise() which was a catastrophic failure that 1)
isn't preventable by any LSM including SELinux, 2) isn't used by PG
and is therefore a good candidate for a kill list, and 3) a clear win
in the dont-let-PG-be-a-vector-for-kernel-compromise arena.

IIRC it's used by glibc as part of its malloc implementation (also
threading etc) - but not necessarily hit during the most common
paths. That's *precisely* my problem with this approach.

I think Andres is right here. There are madvise calls in glibc:

glibc-2.28/malloc/malloc.c: __madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
glibc-2.28/malloc/arena.c: __madvise ((char *) h + new_size, diff, MADV_DONTNEED);

It appears that the first is only reachable from __malloc_trim which
we don't use, but the second is reachable from free(). However,
strace'ing says that it's never called during our standard regression
tests, confirming Andres' thought that it's in seldom-reached paths.
(I did not go through the free() logic in any detail, but it looks
like it is only reached when dealing with quite-large allocations,
which'd make sense.)

So this makes a perfect example for Peter's point that testing is
going to be a very fallible way of finding the set of syscalls that
need to be allowed. Even if we had 100.00% code coverage of PG
proper, we would not necessarily find calls like this.

regards, tom lane

#19

Andres Freund

andres@anarazel.de

over 6 years ago

In reply to: Joshua Brindle (#17)

Re: RFC: seccomp-bpf support

Hi,

On 2019-08-28 15:02:17 -0400, Joshua Brindle wrote:

On Wed, Aug 28, 2019 at 2:53 PM Andres Freund <andres@anarazel.de> wrote:

On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote:

A prime example is madvise() which was a catastrophic failure that 1)
isn't preventable by any LSM including SELinux, 2) isn't used by PG
and is therefore a good candidate for a kill list, and 3) a clear win
in the dont-let-PG-be-a-vector-for-kernel-compromise arena.

IIRC it's used by glibc as part of its malloc implementation (also
threading etc) - but not necessarily hit during the most common
paths. That's *precisely* my problem with this approach.

As long as glibc handles a returned error cleanly the syscall could be
denied without harming the process and the bug would be mitigated.

And we'd hit mysterious slowdowns in production uses of PG when seccomp
is enabled.

Greetings,

Andres Freund

#20

Joshua Brindle

joshua.brindle@crunchydata.com

over 6 years ago

In reply to: Andres Freund (#19)

Re: RFC: seccomp-bpf support

On Wed, Aug 28, 2019 at 3:22 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2019-08-28 15:02:17 -0400, Joshua Brindle wrote:

On Wed, Aug 28, 2019 at 2:53 PM Andres Freund <andres@anarazel.de> wrote:

On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote:

A prime example is madvise() which was a catastrophic failure that 1)
isn't preventable by any LSM including SELinux, 2) isn't used by PG
and is therefore a good candidate for a kill list, and 3) a clear win
in the dont-let-PG-be-a-vector-for-kernel-compromise arena.

IIRC it's used by glibc as part of its malloc implementation (also
threading etc) - but not necessarily hit during the most common
paths. That's *precisely* my problem with this approach.

As long as glibc handles a returned error cleanly the syscall could be
denied without harming the process and the bug would be mitigated.

And we'd hit mysterious slowdowns in production uses of PG when seccomp
is enabled.

It seems like complete system compromises should be prioritized over
slowdowns, and it seems very unlikely to cause a noticeable slowdown
anyway. Are there PG users that backed out all of the Linux KPTI
patches due to the slowdown?

I think we need to rein in the thread somewhat. The feature allows
end users to define some sandboxing within PG. Nothing is being forced
on anyone but we would like the capability to harden a PG installation
for many reasons already stated. This is being done in places all
across the Linux ecosystem and is IMO a very useful mitigation.

Thank you.

#21

Thomas Munro

thomas.munro@gmail.com

over 6 years ago

In reply to: Joshua Brindle (#17)

Re: RFC: seccomp-bpf support

On Thu, Aug 29, 2019 at 7:08 AM Joshua Brindle
<joshua.brindle@crunchydata.com> wrote:

On Wed, Aug 28, 2019 at 2:53 PM Andres Freund <andres@anarazel.de> wrote:

On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote:

A prime example is madvise() which was a catastrophic failure that 1)
isn't preventable by any LSM including SELinux, 2) isn't used by PG
and is therefore a good candidate for a kill list, and 3) a clear win
in the dont-let-PG-be-a-vector-for-kernel-compromise arena.

IIRC it's used by glibc as part of its malloc implementation (also
threading etc) - but not necessarily hit during the most common
paths. That's *precisely* my problem with this approach.

As long as glibc handles a returned error cleanly the syscall could be
denied without harming the process and the bug would be mitigated.

seccomp also allows argument whitelisting so things can get very
granular, depending on who is setting up the lists.

Just by the way, there may also be differences between architectures.
After some head scratching, we recently discovered[1] that default
seccomp whitelists currently cause PostgreSQL to panic for users of
Docker, Nspawn etc on POWER and ARM because of that. That's a bug
being fixed elsewhere, but it reveals another thing to be careful of
if you're trying to build your own whitelist by guessing what libc
needs to call.

[1]: /messages/by-id/CA+hUKGLiR569VHLjtCNp3NT+jnKdhy8g2sdgKzWNojyWQVt6Bw@mail.gmail.com

--
Thomas Munro
https://enterprisedb.com

#22

Andres Freund

andres@anarazel.de

over 6 years ago

In reply to: Joshua Brindle (#20)

Re: RFC: seccomp-bpf support

Hi,

On 2019-08-28 15:38:11 -0400, Joshua Brindle wrote:

It seems like complete system compromises should be prioritized over
slowdowns, and it seems very unlikely to cause a noticeable slowdown
anyway.

The point isn't really this specific issue, but that the argument that
you'll not cause problems by disabling certain syscalls, or that it's
easy to find which ones are used, just plainly isn't true.

Are there PG users that backed out all of the Linux KPTI patches due
to the slowdown?

Well, not backed out on a code level, but straight out disabled at boot
time (i.e. pti=off)? Yea, I know of several.

I think we need to rein in the thread somewhat. The feature allows
end users to define some sandboxing within PG. Nothing is being forced
on anyone

Well, we'll have to deal with the fallout of this to some degree. When
postgres breaks people will complain, when it's slow, people will
complain, ...

Greetings,

Andres Freund

#23

Peter Eisentraut

peter.eisentraut@2ndquadrant.com

over 6 years ago

In reply to: Joshua Brindle (#20)

Re: RFC: seccomp-bpf support

On 2019-08-28 21:38, Joshua Brindle wrote:

I think we need to rein in the thread somewhat. The feature allows
end users to define some sandboxing within PG. Nothing is being forced
on anyone

Features come with a maintenance cost. If we ship it, then people are
going to try it out. Then weird things will happen. They will report
mysterious bugs. They will complain to their colleagues. It all comes
with a cost.

but we would like the capability to harden a PG installation
for many reasons already stated.

Most if not all of those reasons seem to have been questioned.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#24

Alvaro Herrera

alvherre@2ndquadrant.com

over 6 years ago

In reply to: Joshua Brindle (#20)

Re: RFC: seccomp-bpf support

On 2019-Aug-28, Joshua Brindle wrote:

I think we need to rein in the thread somewhat. The feature allows
end users to define some sandboxing within PG. Nothing is being forced
on anyone but we would like the capability to harden a PG installation
for many reasons already stated.

My own objection to this line of development is that it doesn't seem
that any useful policy (allowed/denied syscall list) is part, or is
intended to be part, of the final feature. So we're shipping a hook system for
which each independent vendor is going to develop their own policy. Joe
provided an example syscall list, but it's not part of the patch proper;
and it seems, per the discussion, that the precise syscall list to use
is a significant fraction of this.

So, as part of a committable patch, IMO it'd be good to have some sort
of final list of syscalls -- maybe as part of the docbook part of the
patch.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#25

Joe Conway

mail@joeconway.com

over 6 years ago

In reply to: Peter Eisentraut (#23)

Re: RFC: seccomp-bpf support

On 8/28/19 4:07 PM, Peter Eisentraut wrote:

On 2019-08-28 21:38, Joshua Brindle wrote:

I think we need to rein in the thread somewhat. The feature allows
end users to define some sandboxing within PG. Nothing is being forced
on anyone

Features come with a maintenance cost. If we ship it, then people are
going to try it out. Then weird things will happen. They will report
mysterious bugs. They will complain to their colleagues. It all comes
with a cost.

but we would like the capability to harden a PG installation
for many reasons already stated.

Most if not all of those reasons seem to have been questioned.

Clearly Joshua and I disagree, but understand that the consensus is not
on our side. It is our assessment that PostgreSQL will be subject to
seccomp willingly or not (e.g., via docker, systemd, etc.) and the
community might be better served to get out in front and have first
class support.

But I don't want to waste any more of anyone's time on this topic,
except to ask if two strategically placed hooks are asking too much?

Specifically hooks to replace these two stanzas in the patch:

8<--------------------------
diff --git a/src/backend/postmaster/postmaster.c
b/src/backend/postmaster/postmaster.c
index 62dc93d..2216d49 100644
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
*************** PostmasterMain(int argc, char *argv[])
*** 963,968 ****
--- 963,982 ----

[...]

diff --git a/src/backend/utils/init/postinit.c
b/src/backend/utils/init/postinit.c
index 43b9f17..aac1940 100644
*** a/src/backend/utils/init/postinit.c
--- b/src/backend/utils/init/postinit.c
*************** InitPostgres(const char *in_dbname, Oid
*** 1056,1061 ****
--- 1056,1076 ----

[...]

8<--------------------------

We will continue to pursue this development for customers that require
it and plan to provide an update on our analysis and results.

We thank you for your comments and suggestions.

Joe
--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development

#26

Tom Lane

tgl@sss.pgh.pa.us

over 6 years ago

In reply to: Joe Conway (#25)

Re: RFC: seccomp-bpf support

Joe Conway <mail@joeconway.com> writes:

Clearly Joshua and I disagree, but understand that the consensus is not
on our side. It is our assessment that PostgreSQL will be subject to
seccomp willingly or not (e.g., via docker, systemd, etc.) and the
community might be better served to get out in front and have first
class support.

Sure, but ...

But I don't want to waste any more of anyone's time on this topic,
except to ask if two strategically placed hooks are asking too much?

... hooks are still implying a design with the filter control inside
Postgres. Which, as I said before, seems like a fundamentally incorrect
architecture. I'm not objecting to having such control, but I think
it has to be outside the postmaster, or it's just not a credible
security improvement. It doesn't help to say "I'm going to install
a lock to keep out a thief who *by assumption* is carrying lock
picking tools."

regards, tom lane

#27

Joshua Brindle

joshua.brindle@crunchydata.com

over 6 years ago

In reply to: Tom Lane (#26)

Re: RFC: seccomp-bpf support

On Thu, Aug 29, 2019 at 10:01 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Joe Conway <mail@joeconway.com> writes:

Clearly Joshua and I disagree, but understand that the consensus is not
on our side. It is our assessment that PostgreSQL will be subject to
seccomp willingly or not (e.g., via docker, systemd, etc.) and the
community might be better served to get out in front and have first
class support.

Sure, but ...

But I don't want to waste any more of anyone's time on this topic,
except to ask if two strategically placed hooks are asking too much?

... hooks are still implying a design with the filter control inside
Postgres. Which, as I said before, seems like a fundamentally incorrect
architecture. I'm not objecting to having such control, but I think
it has to be outside the postmaster, or it's just not a credible
security improvement. It doesn't help to say "I'm going to install
a lock to keep out a thief who *by assumption* is carrying lock
picking tools."

I recognize this discussion is over but this is a mischaracterization
of the intent. Upthread I said:
"This may not do it alone, and security
conscious integrators will want to, for example, add seccomp filters
to systemd to prevent superuser from disabling them. The postmaster
and per-role lists can further reduce the available syscalls based on
the exact extensions and PLs being used. Each step reduced the surface
more and throwing it all out because one step can go rogue is
unsatisfying."

There are no security silver bullets, each thing we do is risk
reduction, and that includes this patchset, whether you can see it or
not.

Thank you.

#28

Joe Conway

mail@joeconway.com

over 6 years ago

In reply to: Tom Lane (#26)

Re: RFC: seccomp-bpf support

On 8/29/19 10:00 AM, Tom Lane wrote:

Joe Conway <mail@joeconway.com> writes:

Clearly Joshua and I disagree, but understand that the consensus is not
on our side. It is our assessment that PostgreSQL will be subject to
seccomp willingly or not (e.g., via docker, systemd, etc.) and the
community might be better served to get out in front and have first
class support.

Sure, but ...

But I don't want to waste any more of anyone's time on this topic,
except to ask if two strategically placed hooks are asking too much?

... hooks are still implying a design with the filter control inside
Postgres. Which, as I said before, seems like a fundamentally incorrect
architecture. I'm not objecting to having such control, but I think
it has to be outside the postmaster, or it's just not a credible
security improvement.

I disagree. Once a filter is loaded there is no way to unload it short
of a postmaster restart. That is an easily detected event that can be
alerted upon, and that is definitely a security improvement.

Perhaps that is a reason to also set the session level GUC to
PGC_POSTMASTER, but that is an easy change if deemed necessary.

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development

#29

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 6 years ago

In reply to: Joe Conway (#28)

Re: RFC: seccomp-bpf support

Hi,

This patch is currently in "needs review" state, but the last message is
from August 29, and my understanding is that there have been a couple of
objections / disagreements about the architecture, difficulties with
producing the set of syscalls, and not providing any built-in policy.

I don't think we're any closer to resolving those disagreements since
August, so I think we should make some decision about this patch,
instead of just moving it from one CF to the next one. The "needs
review" status does not seem to reflect the situation.

Are there any plans to post a new version of the patch with a different
design, or something like that? If not, I propose we mark it either as
rejected or returned with feedback (and maybe get a new patch in the
future).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#30

Joe Conway

mail@joeconway.com

about 6 years ago

In reply to: Tomas Vondra (#29)

Re: RFC: seccomp-bpf support

On 1/6/20 8:37 PM, Tomas Vondra wrote:

Hi,

This patch is currently in "needs review" state, but the last message is
from August 29, and my understanding is that there have been a couple of
objections / disagreements about the architecture, difficulties with
producing the set of syscalls, and not providing any built-in policy.

I don't think we're any closer to resolving those disagreements since
August, so I think we should make some decision about this patch,
instead of just moving it from one CF to the next one. The "needs
review" status does not seem to reflect the situation.

Are there any plans to post a new version of the patch with a different
design, or something like that? If not, I propose we mark it either as
rejected or returned with feedback (and maybe get a new patch in the
future).

I assumed it was rejected.

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development

#31

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 6 years ago

In reply to: Joe Conway (#30)

Re: RFC: seccomp-bpf support

On Tue, Jan 07, 2020 at 06:02:14AM -0500, Joe Conway wrote:

On 1/6/20 8:37 PM, Tomas Vondra wrote:

Hi,

This patch is currently in "needs review" state, but the last message is
from August 29, and my understanding is that there have been a couple of
objections / disagreements about the architecture, difficulties with
producing the set of syscalls, and not providing any built-in policy.

I don't think we're any closer to resolving those disagreements since
August, so I think we should make some decision about this patch,
instead of just moving it from one CF to the next one. The "needs
review" status does not seem to reflect the situation.

Are there any plans to post a new version of the patch with a different
design, or something like that? If not, I propose we mark it either as
rejected or returned with feedback (and maybe get a new patch in the
future).

I assumed it was rejected.

I don't know. I still see it in the CF app with "needs review" status:

https://commitfest.postgresql.org/26/2263/

Barring objections, I'll mark it as rejected.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#32

Robert Haas

robertmhaas@gmail.com

about 6 years ago

In reply to: Tomas Vondra (#31)

Re: RFC: seccomp-bpf support

On Tue, Jan 7, 2020 at 7:59 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Barring objections, I'll mark it as rejected.

I think that's right. Since I just caught up on this thread, I'd like
to offer a few belated comments:

1. I don't think it would kill us to add a few hooks that would allow
this feature to be added by a loadable module. Someone may argue that
we should never add a hook unless we know exactly how it's going to be
used and agree with it as a goal, but I don't agree with that.
Refusing to add hooks just causes people to fork the server. If
somebody loads code that uses a hook at least you can tell that
they've done it by looking at shared_preload_libraries; if they fork
the server it may be much harder to tell that you're not dealing with
straight-up PostgreSQL. At any rate, ease-of-debugging considerations
for core developers should not take precedence over letting people do
with PostgreSQL what they wish.

2. I feel strongly that shipping this feature with mechanism but not
policy is unwise; I thought Alvaro articulated this problem
particularly well. I think the evidence on this thread is pretty
clear: this WILL break for some users, and it WILL need fixing. If the
mechanism is in core and the policy is not, then it seems likely that
employees at Crunchy, who apparently run into customers that need this
on a regular basis, will develop a set of best practices which will
allow them to advise customers as to what settings will or will not
work well, but because that knowledge will not be embedded in core, it
will be pretty hard for anybody else to support such customers, unless
they too have a lot of customers who want to run in this mode. I would
be a lot more supportive of this if both the mechanism and the policy
were going to ship in core and be maintained in core, with adequate
documentation.

3. The difficulty in making that happen is that the set of system
calls that need to be whitelisted seems likely to vary based on
platform, kernel version, glibc version, PostgreSQL build options,
loadable modules used, and which specific PostgreSQL features you care
about. I can't help feeling that this is designed mostly for processes
that do far simpler things than PostgreSQL. It would be interesting,
for example, to know what bash or perl does about this. They have the
same problem that PostgreSQL does, namely, that they are intended to
let users do almost arbitrary things by design -- not a totally
unlimited set of things, but an awful lot of things. Perhaps over time
this mechanism will undergo design changes, or a clearer set of best
practices will emerge, so that it's easier to see how PostgreSQL could
use this without breaking things. If indeed this is the future, you
can imagine something like glibc getting a "seccomp-clean" mode in
which it can be built, and if that happened and were widely used, then
the difficulties for PostgreSQL would be reduced. Because such
improvements typically happen over time through trial and error and
the efforts of many people, I think it is to our advantage to allow
people to experiment with the feature even as it exists today out of
core, which gets me back to point #1. I agree with Joshua Brindle's
point that holding our breath in response to a widely-adopted
technology is not a very useful response.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
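
The loadable-module arrangement from point 1 above follows the usual hook pattern: core exposes a function pointer that is NULL by default, and a module listed in shared_preload_libraries sets it at load time. A minimal self-contained sketch, assuming hypothetical names throughout (`seccomp_setup_hook`, `call_seccomp_setup_hook`, and the `example_*` functions are invented for illustration; no such hook exists in PostgreSQL, and the module body only counts calls instead of loading a real filter):

```c
#include <stddef.h>

/* Hypothetical hook type and variable; PostgreSQL has no such hook. */
typedef void (*seccomp_setup_hook_type) (const char *process_kind);

static seccomp_setup_hook_type seccomp_setup_hook = NULL;

/* Stand-in for the core call site: a no-op unless a module set the hook. */
static void
call_seccomp_setup_hook(const char *process_kind)
{
    if (seccomp_setup_hook != NULL)
        seccomp_setup_hook(process_kind);
}

/* ---- what an out-of-core module might contain ---- */

static seccomp_setup_hook_type prev_seccomp_setup_hook = NULL;
static int filters_loaded = 0;

/* Hook body: chain to any earlier hook, then do this module's work. */
static void
example_seccomp_setup(const char *process_kind)
{
    if (prev_seccomp_setup_hook != NULL)
        prev_seccomp_setup_hook(process_kind);

    /* A real module would choose and load a syscall list here. */
    (void) process_kind;
    filters_loaded++;
}

/* Modelled on a module's _PG_init(): save the old value, install ours. */
static void
example_module_init(void)
{
    prev_seccomp_setup_hook = seccomp_setup_hook;
    seccomp_setup_hook = example_seccomp_setup;
}
```

With this shape, core carries only the pointer and the call site. The filter logic and the syscall policy live in the module, which is the division of responsibility debated in the remaining messages.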

#33

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 6 years ago

In reply to: Robert Haas (#32)

Re: RFC: seccomp-bpf support

On Tue, Jan 07, 2020 at 11:33:07AM -0500, Robert Haas wrote:

On Tue, Jan 7, 2020 at 7:59 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Barring objections, I'll mark it as rejected.

I think that's right.

Done.

Since I just caught up on this thread, I'd like to offer a few belated
comments:

1. I don't think it would kill us to add a few hooks that would allow
this feature to be added by a loadable module. Someone may argue that
we should never add a hook unless we know exactly how it's going to be
used and agree with it as a goal, but I don't agree with that.
Refusing to add hooks just causes people to fork the server. If
somebody loads code that uses a hook at least you can tell that they've
done it by looking at shared_preload_libraries; if they fork the server
it may be much harder to tell that you're not dealing with straight-up
PostgreSQL. At any rate, ease-of-debugging considerations for core
developers should not take precedence over letting people do with
PostgreSQL what they wish.

Not sure I understand. I don't think anyone argued about hooks vs. forking
the server in this particular thread, but the thread is fairly long so
maybe I'm missing something.

I think the "hook issue" is that the actual code is somewhere else. On
the one hand that minimizes the dev/testing/maintenance burden for us,
on the other hand it means we're not really testing the hooks. Meh.

But in this case I think the main argument against hooks was that Tom
thinks it's not really the right way to implement this. I don't know if
he's right or not, I don't have an opinion on how to integrate seccomp.

2. I feel strongly that shipping this feature with mechanism but not
policy is unwise; I thought Alvaro articulated this problem
particularly well. I think the evidence on this thread is pretty clear:
this WILL break for some users, and it WILL need fixing. If the
mechanism is in core and the policy is not, then it seems likely that
employees at Crunchy, who apparently run into customers that need this
on a regular basis, will develop a set of best practices which will
allow them to advise customers as to what settings will or will not
work well, but because that knowledge will not be embedded in core, it
will be pretty hard for anybody else to support such customers, unless
they too have a lot of customers who want to run in this mode. I would
be a lot more supportive of this if both the mechanism and the policy
were going to ship in core and be maintained in core, with adequate
documentation.

Well, but this exact argument applies to various other approaches:

1) no hooks, forking PostgreSQL
2) hooks added, but neither code nor policy included
3) hooks added, code included, policy not included

Essentially the only case where Crunchy would not have this "lock-in"
advantage is when everything is included into PostgreSQL, at which point
we can probably make this work without hooks I suppose.

3. The difficulty in making that happen is that the set of system calls
that need to be whitelisted seems likely to vary based on platform,
kernel version, glibc version, PostgreSQL build options, loadable
modules used, and which specific PostgreSQL features you care about. I
can't help feeling that this is designed mostly for processes that do
far simpler things than PostgreSQL. It would be interesting, for
example, to know what bash or perl does about this. They have the same
problem that PostgreSQL does, namely, that they are intended to let
users do almost arbitrary things by design -- not a totally unlimited
set of things, but an awful lot of things. Perhaps over time this
mechanism will undergo design changes, or a clearer set of best
practices will emerge, so that it's easier to see how PostgreSQL could
use this without breaking things. If indeed this is the future, you can
imagine something like glibc getting a "seccomp-clean" mode in which it
can be built, and if that happened and were widely used, then the
difficulties for PostgreSQL would be reduced. Because such improvements
typically happen over time through trial and error and the efforts of
many people, I think it is to our advantage to allow people to
experiment with the feature even as it exists today out of core, which
gets me back to point #1. I agree with Joshua Brindle's point that
holding our breath in response to a widely-adopted technology is not a
very useful response.

I think this is probably the main challenge of this patch - development,
maintenance and testing of the policies. I think it's pretty clear the
community can't really support this on all possible environments, or
with third-party extensions. I don't know if it's even possible to write
a "universal policy" covering a significant range of systems. It seems
much more realistic that individual providers will develop policies for
their customers' systems.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#34

Robert Haas

robertmhaas@gmail.com

about 6 years ago

In reply to: Tomas Vondra (#33)

Re: RFC: seccomp-bpf support

On Tue, Jan 7, 2020 at 12:56 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I think the "hook issue" is that the actual code is somewhere else. On
the one hand that minimizes the dev/testing/maintenance burden for us,
on the other hand it means we're not really testing the hooks. Meh.

I don't care about testing the hooks. If we provide hooks and
someone finds them useful, great. If not, they don't have to use them.
The mere existence of this hook isn't going to put any significant
maintenance burden on the community, while the feature as a whole
probably would.

Well, but this exact argument applies to various other approaches:

1) no hooks, forking PostgreSQL
2) hooks added, but neither code nor policy included
3) hooks added, code included, policy not included

Essentially the only case where Crunchy would not have this "lock-in"
advantage is when everything is included into PostgreSQL, at which point
we can probably make this work without hooks I suppose.

Well, from my point of view, in case 1 or 2, the feature is entirely
Crunchy's. If it works great, good for them. If it sucks, it's their
problem. In case 3, the feature is ostensibly a community feature but
probably nobody other than Crunchy can actually make it work. That
latter situation seems a lot more problematic to me. I don't like
PostgreSQL features that I can't make work. If it's too complicated
for other developers to figure out, it's probably going to be a real
pain for users, too.

Putting my cards on the table, I think it's likely that the proposed
design contains a significant amount of suckitude. Serious usability
and security concerns have been raised, and I find those concerns
legitimate. On the other hand, it may still be useful to some people.
More importantly, if they can more easily experiment with it, they'll
have a chance to find out whether it sucks and, if so, make it better.
Perhaps something that we can accept into core will ultimately result.
That would be good for everybody.

Also, generally, I don't think we should block features (hooks or
otherwise) because some other company might get more benefit than our
own employer. That seems antithetical to the concept of open source.
Blocking them because they're poorly designed or will impose a burden
on the community is a different thing.

I think this is probably the main challenge of this patch - development,
maintenance and testing of the policies. I think it's pretty clear the
community can't really support this on all possible environments, or
with third-party extensions. I don't know if it's even possible to write
a "universal policy" covering a significant range of systems. It seems
much more realistic that individual providers will develop policies for
their customers' systems.

I generally agree with this, although I'm not sure that I understand
what you're advocating that we do. Accepting the feature as-is seems
like a no-go given the remarks already made, and while I don't think I
feel as strongly about it as some of the people who have already
spoken on the topic, I do share their doubts to some degree and am not
in a position to override them even if I disagreed completely. But,
hooks would give individual providers those same options, without
really burdening anyone else.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company