RFC: seccomp-bpf support

Started by Joe Conwayover 6 years ago34 messageshackers
Jump to latest
#1Joe Conway
mail@joeconway.com

SECCOMP ("SECure COMPuting with filters") is a Linux kernel syscall
filtering mechanism which allows reduction of the kernel attack surface
by preventing (or at least audit logging) normally unused syscalls.

Quoting from this link:
https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt

"A large number of system calls are exposed to every userland process
with many of them going unused for the entire lifetime of the
process. As system calls change and mature, bugs are found and
eradicated. A certain subset of userland applications benefit by
having a reduced set of available system calls. The resulting set
reduces the total kernel surface exposed to the application. System
call filtering is meant for use with those applications."

Recent security best-practices recommend, and certain highly
security-conscious organizations are beginning to require, that SECCOMP
be used to the extent possible. The major web browsers, container
runtime engines, and systemd are all examples of software that already
support seccomp.

---------
A seccomp (bpf) filter is comprised of a default action, and a set of
rules with actions pertaining to specific syscalls (possibly with even
more specific sets of arguments). Once loaded into the kernel, a filter
is inherited by all child processes and cannot be removed. It can,
however, be overlaid with another filter. For any given syscall match,
the most restrictive (a.k.a. highest precedence) action will be taken by
the kernel. PostgreSQL has already been run "in the wild" under seccomp
control in containers, and possibly systemd. Adding seccomp support into
PostgreSQL itself mitigates issues with these approaches, and has
several advantages:

* Container seccomp filters tend to be extremely broad/permissive,
typically allowing about 6 out 7 of all syscalls. They must do this
because the use cases for containers vary widely.
* systemd does not implement seccomp filters by default. Packagers may
decide to do so, but there is no guarantee. Adding them post install
potentially requires cooperation by groups outside control of
the database admins.
* In the container and systemd case there is no particularly good way to
inspect what filters are active. It is possible to observe actions
taken, but again, control is possibly outside the database admin
group. For example, the best way to understand what happened is to
review the auditd log, which is likely not readable by the DBA.
* With built-in support, it is possible to lock down backend processes
more tightly than the postmaster.
* With built-in support, it is possible to lock down different backend
processes differently than each other, for example by using ALTER ROLE
... SET or ALTER DATABASE ... SET.
* With built-in support, it is possible to calculate and return (in the
form of an SRF) the effective filters being applied to the postmaster
and the current backend.
* With built-in support, it could be possible (this part not yet
implemented) to have separate filters for different backend types,
e.g. autovac workers, background writer, etc.

---------
Attached is a patch for discussion, adding support for seccomp-bpf
(nowadays generally just called seccomp) syscall filtering at
configure-time using libseccomp. I would like to get this in shape to be
committed by the end of the November CF if possible.

The code itself has been through several rounds of revision based on
discussions I have had with the author of libseccomp as well as a few
other folks. However as of the moment:

* Documentation - general discussion missing entirely
* No regression tests

---------
For convenience, here are a couple of additional links to relevant
information regarding seccomp:
https://en.wikipedia.org/wiki/Seccomp
https://github.com/seccomp/libseccomp

---------
Specific feedback requested:
1. Placement of pg_get_seccomp_filter() in
src/backend/utils/adt/genfile.c
originally made sense but after several rewrites no longer does.
Ideas where it *should* go?
2. Where should a general discussion section go in the docs, if at all?
3. Currently this supports a global filter at the postmaster level,
which is inherited by all child processes, and a secondary filter
at the client backend session level. It likely makes sense to
support secondary filters for other types of child processes,
e.g. autovacuum workers, etc. Add that now (pg13), later release,
or never?
4. What is the best way to approach testing of this feature? Tap
testing perhaps?
5. Default GUC values - should we provide "starter" lists, or only a
procedure for generating a list (as below).

---------
Notes on usage:
===============
In order to determine your minimally required allow lists, do something
like the following on a non-production server with the same architecture
as production:

0. Setup:
* install libseccomp, libseccomp-dev, and seccomp
* install auditd if not already installed
* configure postgres --with-seccomp and maybe --enable-tap-tests to
improve feature coverage (see below)

1. Modify postgresql.conf and/or create <pg_source_dir>/postgresql_tmp.conf
8<--------------------
seccomp = on
global_syscall_default = allow
global_syscall_allow = ''
global_syscall_log = ''
global_syscall_error = ''
global_syscall_kill = ''
session_syscall_default = log
session_syscall_allow = '*'
session_syscall_log = '*'
session_syscall_error = '*'
session_syscall_kill = '*'
8<--------------------

2. Modify /etc/audit/auditd.conf
* disp_qos = 'lossless'
* change max_log_file_action = 'ignore'

3. Stop auditd, clear out all audit.logs, start auditd:
* systemctl stop auditd.service # if running
* echo -n "" > /var/log/audit/audit.log
* systemctl start auditd.service

4. Start/restart postgres.

5. Exercise postgres as much as possible (one or more of the following):
* make installcheck-world
* make check world \
EXTRA_REGRESS_OPTS=--temp-config=<pg_source_dir>/postgresql_tmp.conf
* run your application through its paces
* other random testing of relevant postgres features

Note: at this point audit.log will start growing quickly. During `make
check world` mine grew to just under 1 GB.

6. Process results:
a) systemctl stop auditd.service
b) Run the provided "get_syscalls.sh" script
c) Cut and paste the result as the value of session_syscall_allow.

7. Optional:
a) global_syscall_default = 'log'
b) Repeat steps 3-5
c) Repeat step 6a and 6b
d) Cut and paste the result as the value of global_syscall_allow

8. Iterate steps 3-6b.
* Output should be empty.
* If there are any new syscalls, add to global_syscall_allow and
session_syscall_allow.
* Iterate until output of "get_syscalls.sh" script is empty.

9. Optional:
* Change global and session defaults to "error" or "kill"
* Reduce the allow lists if desired
* This can be done for specific database users, by doing
ALTER ROLE... SET session_syscall_allow to '<some reduced allow list>'

10. Adjust settings to taste, restart postgres, and monitor audit.log
going forward.

Below are some values from my system. Note that I have made no attempt
thus far to do static code analysis -- this list was build using `make
check world` several times.
8<-------------------------
seccomp = on

global_syscall_default = log
global_syscall_allow =
'accept,access,bind,brk,chmod,clone,close,connect,dup,epoll_create1,epoll_ctl,epoll_wait,exit_group,fadvise64,fallocate,fcntl,fdatasync,fstat,fsync,ftruncate,futex,getdents,getegid,geteuid,getgid,getpeername,getpid,getppid,getrandom,getrusage,getsockname,getsockopt,getuid,ioctl,kill,link,listen,lseek,lstat,mkdir,mmap,mprotect,mremap,munmap,openat,pipe,poll,prctl,pread64,prlimit64,pwrite64,read,readlink,recvfrom,recvmsg,rename,rmdir,rt_sigaction,rt_sigprocmask,rt_sigreturn,seccomp,select,sendto,setitimer,set_robust_list,setsid,setsockopt,shmat,shmctl,shmdt,shmget,shutdown,socket,stat,statfs,symlink,sync_file_range,sysinfo,umask,uname,unlink,utime,wait4,write'
global_syscall_log = ''
global_syscall_error = ''
global_syscall_kill = ''

session_syscall_default = log
session_syscall_allow =
'access,brk,chmod,close,connect,epoll_create1,epoll_ctl,epoll_wait,exit_group,fadvise64,fallocate,fcntl,fdatasync,fstat,fsync,ftruncate,futex,getdents,getegid,geteuid,getgid,getpeername,getpid,getrandom,getrusage,getsockname,getsockopt,getuid,ioctl,kill,link,lseek,lstat,mkdir,mmap,mprotect,mremap,munmap,openat,poll,pread64,pwrite64,read,readlink,recvfrom,recvmsg,rename,rmdir,rt_sigaction,rt_sigprocmask,rt_sigreturn,select,sendto,setitimer,setsockopt,shutdown,socket,stat,symlink,sync_file_range,sysinfo,umask,uname,unlink,utime,write'
session_syscall_log = '*'
session_syscall_error = '*'
session_syscall_kill = '*'
8<-------------------------

That results in the following effective filters at the ("context"
equals) global and session levels:

8<-------------------------
select * from pg_get_seccomp_filter() order by 4,1;
syscall | syscallnum | filter_action | context
-----------------+------------+----------------+---------
accept | 43 | global->allow | global
access | 21 | global->allow | global
bind | 49 | global->allow | global
brk | 12 | global->allow | global
chmod | 90 | global->allow | global
clone | 56 | global->allow | global
close | 3 | global->allow | global
connect | 42 | global->allow | global
<default> | -1 | global->log | global
dup | 32 | global->allow | global
epoll_create1 | 291 | global->allow | global
epoll_ctl | 233 | global->allow | global
epoll_wait | 232 | global->allow | global
exit_group | 231 | global->allow | global
fadvise64 | 221 | global->allow | global
fallocate | 285 | global->allow | global
fcntl | 72 | global->allow | global
fdatasync | 75 | global->allow | global
fstat | 5 | global->allow | global
fsync | 74 | global->allow | global
ftruncate | 77 | global->allow | global
futex | 202 | global->allow | global
getdents | 78 | global->allow | global
getegid | 108 | global->allow | global
geteuid | 107 | global->allow | global
getgid | 104 | global->allow | global
getpeername | 52 | global->allow | global
getpid | 39 | global->allow | global
getppid | 110 | global->allow | global
getrandom | 318 | global->allow | global
getrusage | 98 | global->allow | global
getsockname | 51 | global->allow | global
getsockopt | 55 | global->allow | global
getuid | 102 | global->allow | global
ioctl | 16 | global->allow | global
kill | 62 | global->allow | global
link | 86 | global->allow | global
listen | 50 | global->allow | global
lseek | 8 | global->allow | global
lstat | 6 | global->allow | global
mkdir | 83 | global->allow | global
mmap | 9 | global->allow | global
mprotect | 10 | global->allow | global
mremap | 25 | global->allow | global
munmap | 11 | global->allow | global
openat | 257 | global->allow | global
pipe | 22 | global->allow | global
poll | 7 | global->allow | global
prctl | 157 | global->allow | global
pread64 | 17 | global->allow | global
prlimit64 | 302 | global->allow | global
pwrite64 | 18 | global->allow | global
read | 0 | global->allow | global
readlink | 89 | global->allow | global
recvfrom | 45 | global->allow | global
recvmsg | 47 | global->allow | global
rename | 82 | global->allow | global
rmdir | 84 | global->allow | global
rt_sigaction | 13 | global->allow | global
rt_sigprocmask | 14 | global->allow | global
rt_sigreturn | 15 | global->allow | global
seccomp | 317 | global->allow | global
select | 23 | global->allow | global
sendto | 44 | global->allow | global
setitimer | 38 | global->allow | global
set_robust_list | 273 | global->allow | global
setsid | 112 | global->allow | global
setsockopt | 54 | global->allow | global
shmat | 30 | global->allow | global
shmctl | 31 | global->allow | global
shmdt | 67 | global->allow | global
shmget | 29 | global->allow | global
shutdown | 48 | global->allow | global
socket | 41 | global->allow | global
stat | 4 | global->allow | global
statfs | 137 | global->allow | global
symlink | 88 | global->allow | global
sync_file_range | 277 | global->allow | global
sysinfo | 99 | global->allow | global
umask | 95 | global->allow | global
uname | 63 | global->allow | global
unlink | 87 | global->allow | global
utime | 132 | global->allow | global
wait4 | 61 | global->allow | global
write | 1 | global->allow | global
accept | 43 | session->log | session
access | 21 | session->allow | session
bind | 49 | session->log | session
brk | 12 | session->allow | session
chmod | 90 | session->allow | session
clone | 56 | session->log | session
close | 3 | session->allow | session
connect | 42 | session->allow | session
<default> | -1 | session->log | session
dup | 32 | session->log | session
epoll_create1 | 291 | session->allow | session
epoll_ctl | 233 | session->allow | session
epoll_wait | 232 | session->allow | session
exit_group | 231 | session->allow | session
fadvise64 | 221 | session->allow | session
fallocate | 285 | session->allow | session
fcntl | 72 | session->allow | session
fdatasync | 75 | session->allow | session
fstat | 5 | session->allow | session
fsync | 74 | session->allow | session
ftruncate | 77 | session->allow | session
futex | 202 | session->allow | session
getdents | 78 | session->allow | session
getegid | 108 | session->allow | session
geteuid | 107 | session->allow | session
getgid | 104 | session->allow | session
getpeername | 52 | session->allow | session
getpid | 39 | session->allow | session
getppid | 110 | session->log | session
getrandom | 318 | session->allow | session
getrusage | 98 | session->allow | session
getsockname | 51 | session->allow | session
getsockopt | 55 | session->allow | session
getuid | 102 | session->allow | session
ioctl | 16 | session->allow | session
kill | 62 | session->allow | session
link | 86 | session->allow | session
listen | 50 | session->log | session
lseek | 8 | session->allow | session
lstat | 6 | session->allow | session
mkdir | 83 | session->allow | session
mmap | 9 | session->allow | session
mprotect | 10 | session->allow | session
mremap | 25 | session->allow | session
munmap | 11 | session->allow | session
openat | 257 | session->allow | session
pipe | 22 | session->log | session
poll | 7 | session->allow | session
prctl | 157 | session->log | session
pread64 | 17 | session->allow | session
prlimit64 | 302 | session->log | session
pwrite64 | 18 | session->allow | session
read | 0 | session->allow | session
readlink | 89 | session->allow | session
recvfrom | 45 | session->allow | session
recvmsg | 47 | session->allow | session
rename | 82 | session->allow | session
rmdir | 84 | session->allow | session
rt_sigaction | 13 | session->allow | session
rt_sigprocmask | 14 | session->allow | session
rt_sigreturn | 15 | session->allow | session
seccomp | 317 | session->log | session
select | 23 | session->allow | session
sendto | 44 | session->allow | session
setitimer | 38 | session->allow | session
set_robust_list | 273 | session->log | session
setsid | 112 | session->log | session
setsockopt | 54 | session->allow | session
shmat | 30 | session->log | session
shmctl | 31 | session->log | session
shmdt | 67 | session->log | session
shmget | 29 | session->log | session
shutdown | 48 | session->allow | session
socket | 41 | session->allow | session
stat | 4 | session->allow | session
statfs | 137 | session->log | session
symlink | 88 | session->allow | session
sync_file_range | 277 | session->allow | session
sysinfo | 99 | session->allow | session
umask | 95 | session->allow | session
uname | 63 | session->allow | session
unlink | 87 | session->allow | session
utime | 132 | session->allow | session
wait4 | 61 | session->log | session
write | 1 | session->allow | session
(170 rows)
8<-------------------------

If you made it all the way to here, thank you for your attention :-)

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development

Attachments:

get_syscalls.shapplication/x-shellscript; name=get_syscalls.shDownload
seccomp-2019.08.28.00.difftext/x-patch; name=seccomp-2019.08.28.00.diffDownload+1521-0
#2David Fetter
david@fetter.org
In reply to: Joe Conway (#1)
Re: RFC: seccomp-bpf support

On Wed, Aug 28, 2019 at 11:13:27AM -0400, Joe Conway wrote:

SECCOMP ("SECure COMPuting with filters") is a Linux kernel syscall
filtering mechanism which allows reduction of the kernel attack surface
by preventing (or at least audit logging) normally unused syscalls.

Quoting from this link:
https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt

"A large number of system calls are exposed to every userland process
with many of them going unused for the entire lifetime of the
process. As system calls change and mature, bugs are found and
eradicated. A certain subset of userland applications benefit by
having a reduced set of available system calls. The resulting set
reduces the total kernel surface exposed to the application. System
call filtering is meant for use with those applications."

Recent security best-practices recommend, and certain highly
security-conscious organizations are beginning to require, that SECCOMP
be used to the extent possible. The major web browsers, container
runtime engines, and systemd are all examples of software that already
support seccomp.

Neat!

Are the seccomp interfaces for other kernels arranged in a manner
similar enough to have a unified interface in PostgreSQL, or is this
more of a Linux-only feature?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#3Peter Eisentraut
peter_e@gmx.net
In reply to: Joe Conway (#1)
Re: RFC: seccomp-bpf support

On 2019-08-28 17:13, Joe Conway wrote:

* systemd does not implement seccomp filters by default. Packagers may
decide to do so, but there is no guarantee. Adding them post install
potentially requires cooperation by groups outside control of
the database admins.

Well, if we are interested in this, why don't we just ask packagers to
do so? That seems like a much easier solution.

* In the container and systemd case there is no particularly good way to
inspect what filters are active. It is possible to observe actions
taken, but again, control is possibly outside the database admin
group. For example, the best way to understand what happened is to
review the auditd log, which is likely not readable by the DBA.

Why does one need to know what filters are active (other than,
obviously, checking that the filters one has set were actually
activated)? What decisions would a DBA or database user be able to make
based on whether a filter is set or not?

* With built-in support, it is possible to lock down backend processes
more tightly than the postmaster.

Also the other way around?

* With built-in support, it is possible to lock down different backend
processes differently than each other, for example by using ALTER ROLE
... SET or ALTER DATABASE ... SET.

What are use cases for this?

Attached is a patch for discussion, adding support for seccomp-bpf
(nowadays generally just called seccomp) syscall filtering at
configure-time using libseccomp. I would like to get this in shape to be
committed by the end of the November CF if possible.

To compute the initial set of allowed system calls, you need to have
fantastic test coverage. What you don't want is some rarely used error
recovery path to cause a system crash. I wouldn't trust our current
coverage for this.

Also, the list of system calls in use changes all the time. The same
function call from PostgreSQL could be a system call or a glibc
implementation, depending on the specific installed packages or run-time
settings.

Extensions would need to maintain their own permissions list, and they
would need to be merged manually into the respective existing settings.

Without good answers to these, I suspect that this feature would go
straight to the top of the list of "if in doubt, turn off".

Overall, I think this sounds like a maintenance headache, and the
possible benefits are unclear.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4Joe Conway
mail@joeconway.com
In reply to: Peter Eisentraut (#3)
Re: RFC: seccomp-bpf support

On 8/28/19 1:03 PM, Peter Eisentraut wrote:

On 2019-08-28 17:13, Joe Conway wrote:

* systemd does not implement seccomp filters by default. Packagers may
decide to do so, but there is no guarantee. Adding them post install
potentially requires cooperation by groups outside control of
the database admins.

Well, if we are interested in this, why don't we just ask packagers to
do so? That seems like a much easier solution.

For the reason listed below

* In the container and systemd case there is no particularly good way to
inspect what filters are active. It is possible to observe actions
taken, but again, control is possibly outside the database admin
group. For example, the best way to understand what happened is to
review the auditd log, which is likely not readable by the DBA.

Why does one need to know what filters are active (other than,
obviously, checking that the filters one has set were actually
activated)? What decisions would a DBA or database user be able to make
based on whether a filter is set or not?

So that when an enforement action happens, you can understand why it
happened. Perhaps it was a bug (omission) in your allow list, or perhaps
it was an intentional rule to prevent abuse (say blocking certain
syscalls from plperlu), or it was because someone is trying to
compromise you system (e.g. some obscure and clearly not needed syscall).

* With built-in support, it is possible to lock down backend processes
more tightly than the postmaster.

Also the other way around?

As I stated in the original email, the filters can add restrictions but
never remove them.

* With built-in support, it is possible to lock down different backend
processes differently than each other, for example by using ALTER ROLE
... SET or ALTER DATABASE ... SET.

What are use cases for this?

For example to allow a specific user to use plperlu to exec shell code
while others cannot.

Attached is a patch for discussion, adding support for seccomp-bpf
(nowadays generally just called seccomp) syscall filtering at
configure-time using libseccomp. I would like to get this in shape to be
committed by the end of the November CF if possible.

To compute the initial set of allowed system calls, you need to have
fantastic test coverage. What you don't want is some rarely used error
recovery path to cause a system crash. I wouldn't trust our current
coverage for this.

So if you are worried about that make your default action 'log' and
watch audit.log. There will be no errors or crashes of postgres caused
by that because there will be no change in postgres visible behavior.

And if returning an error from a syscall causes a crash that would be a
serious bug and we should fix it.

Also, the list of system calls in use changes all the time. The same
function call from PostgreSQL could be a system call or a glibc
implementation, depending on the specific installed packages or run-time
settings.

True. That is why I did not provide an initial list and believe folks
who want to use this should develop their own.

Extensions would need to maintain their own permissions list, and they
would need to be merged manually into the respective existing settings.

People would have to generate their own list for extensions -- I don't
believe it is the extension writers' problem.

Without good answers to these, I suspect that this feature would go
straight to the top of the list of "if in doubt, turn off".

That is fine. Perhaps most people never use this, but when needed (and
increasingly will be required) it is available.

Overall, I think this sounds like a maintenance headache, and the
possible benefits are unclear.

The only people who will incur the maintenance headache are those who
need to use it. The benefits are compliance with requirements.

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development

#5Joe Conway
mail@joeconway.com
In reply to: David Fetter (#2)
Re: RFC: seccomp-bpf support

On 8/28/19 12:47 PM, David Fetter wrote:

On Wed, Aug 28, 2019 at 11:13:27AM -0400, Joe Conway wrote:

SECCOMP ("SECure COMPuting with filters") is a Linux kernel syscall
filtering mechanism which allows reduction of the kernel attack surface
by preventing (or at least audit logging) normally unused syscalls.

Quoting from this link:
https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt

"A large number of system calls are exposed to every userland process
with many of them going unused for the entire lifetime of the
process. As system calls change and mature, bugs are found and
eradicated. A certain subset of userland applications benefit by
having a reduced set of available system calls. The resulting set
reduces the total kernel surface exposed to the application. System
call filtering is meant for use with those applications."

Recent security best-practices recommend, and certain highly
security-conscious organizations are beginning to require, that SECCOMP
be used to the extent possible. The major web browsers, container
runtime engines, and systemd are all examples of software that already
support seccomp.

Neat!

Are the seccomp interfaces for other kernels arranged in a manner
similar enough to have a unified interface in PostgreSQL, or is this
more of a Linux-only feature?

As far as I know libseccomp is Linux specific, at least at the moment.

Joe
--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#3)
Re: RFC: seccomp-bpf support

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:

On 2019-08-28 17:13, Joe Conway wrote:

Attached is a patch for discussion, adding support for seccomp-bpf
(nowadays generally just called seccomp) syscall filtering at
configure-time using libseccomp. I would like to get this in shape to be
committed by the end of the November CF if possible.

To compute the initial set of allowed system calls, you need to have
fantastic test coverage. What you don't want is some rarely used error
recovery path to cause a system crash. I wouldn't trust our current
coverage for this.

Yeah, that seems like quite a serious problem. I think you'd want
to have some sort of static-code-analysis-based way of identifying
the syscalls in use, rather than trying to test your way to it.

Overall, I think this sounds like a maintenance headache, and the
possible benefits are unclear.

After thinking about this for awhile, I really don't follow what
threat model it's trying to protect against. Given that we'll allow
any syscall that an unmodified PG executable might use, it seems
like the only scenarios being protected against involve someone
having already compromised the server enough to have arbitrary code
execution. OK, fine, but then why wouldn't the attacker just
bypass libseccomp? Or tell it to let through the syscall he wants
to use? Having the list of allowed syscalls be determined inside
the process seems like fundamentally the wrong implementation.
I'd have expected a feature like this to be implemented by SELinux,
or some similar technology where the filtering is done by logic
that's outside the executable you wish to not trust.

(After googling for libseccomp, I see that it's supposed to not
allow syscalls to be turned back on once turned off, but that isn't
any protection against this problem. An attacker who's found an ACE
hole in Postgres can just issue ALTER SYSTEM SET to disable the
feature, then force a postmaster restart, then profit.)

I follow the idea of limiting the attack surface for kernel bugs,
but this doesn't seem like a useful implementation of that, even
ignoring the ease-of-use problems Peter mentions.

regards, tom lane

#7Andres Freund
andres@anarazel.de
In reply to: Joe Conway (#1)
Re: RFC: seccomp-bpf support

Hi,

On 2019-08-28 11:13:27 -0400, Joe Conway wrote:

Recent security best-practices recommend, and certain highly
security-conscious organizations are beginning to require, that SECCOMP
be used to the extent possible. The major web browsers, container
runtime engines, and systemd are all examples of software that already
support seccomp.

Maybe I'm missing something, but it's not clear to me what meaningful
attack surface can be reduced for PostgreSQL by forbidding certain
syscalls, given the wide variety of syscalls required to run postgres.
That's different from something like a browser's CSS process, or such,
which really doesn't need much beyond some IPC and memory
allocations. But postgres is going to need syscalls as broad as
fork/clone, exec, connect, shm*, etc. I guess you can argue that we'd
still reduce the attack surface for kernel escalations, but that seems
like a pretty small win compared to the cost.

* With built-in support, it is possible to lock down backend processes
more tightly than the postmaster.

Which important syscalls would you get away with removing in backends
that postmaster needs? I think the only one - which is a good one though
- that I can think of is listen(). But even that might be too
restrictive for some PLs running out of process.

My main problem with seccomp is that it's *incredibly* fragile,
especially for a program as complex as postgres. We already had seccomp
related bug reports on list, even just due to the very permissive
filtering by some container solutions.

There's regularly new syscalls (e.g. epoll_create1(), and we'll soon get
openat2()), different versions of glibc use different syscalls
(e.g. switching from open() to always using openat()), the system
configuration influences which syscalls are being used (e.g. using
vsyscalls only being used for certain clock sources), and kernel.
bugfixes change the exact set of syscalls being used ([1]https://lwn.net/Articles/795128/).

[1]: https://lwn.net/Articles/795128/

Then there's also the issue that many extensions are going to need
additional syscalls.

Notes on usage:
===============
In order to determine your minimally required allow lists, do something
like the following on a non-production server with the same architecture
as production:

c) Cut and paste the result as the value of session_syscall_allow.

That seems nearly guaranteed to miss a significant fraction of
syscalls. There's just no way we're going to cover all the potential
paths and configurations in our testsuite.

I think if you actually wanted to do something like this, you'd need to
use static analysis to come up with a more reliable list.

Greetings,

Andres Freund

#8Andres Freund
andres@anarazel.de
In reply to: Joe Conway (#4)
Re: RFC: seccomp-bpf support

Hi,

On 2019-08-28 13:28:06 -0400, Joe Conway wrote:

To compute the initial set of allowed system calls, you need to have
fantastic test coverage. What you don't want is some rarely used error
recovery path to cause a system crash. I wouldn't trust our current
coverage for this.

So if you are worried about that make your default action 'log' and
watch audit.log. There will be no errors or crashes of postgres caused
by that because there will be no change in postgres visible behavior.

But the benefit of integrating this into postgres become even less
clear.

And if returning an error from a syscall causes a crash that would be a
serious bug and we should fix it.

Err, there's a lot of syscall failures that'll cause PANICs, and where
there's no reasonable way around that.

Greetings,

Andres Freund

#9Joshua Brindle
joshua.brindle@crunchydata.com
In reply to: Tom Lane (#6)
Re: RFC: seccomp-bpf support

On Wed, Aug 28, 2019 at 1:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:

On 2019-08-28 17:13, Joe Conway wrote:

Attached is a patch for discussion, adding support for seccomp-bpf
(nowadays generally just called seccomp) syscall filtering at
configure-time using libseccomp. I would like to get this in shape to be
committed by the end of the November CF if possible.

To compute the initial set of allowed system calls, you need to have
fantastic test coverage. What you don't want is some rarely used error
recovery path to cause a system crash. I wouldn't trust our current
coverage for this.

Yeah, that seems like quite a serious problem. I think you'd want
to have some sort of static-code-analysis-based way of identifying
the syscalls in use, rather than trying to test your way to it.

Overall, I think this sounds like a maintenance headache, and the
possible benefits are unclear.

After thinking about this for awhile, I really don't follow what
threat model it's trying to protect against. Given that we'll allow
any syscall that an unmodified PG executable might use, it seems
like the only scenarios being protected against involve someone
having already compromised the server enough to have arbitrary code
execution. OK, fine, but then why wouldn't the attacker just
bypass libseccomp? Or tell it to let through the syscall he wants
to use? Having the list of allowed syscalls be determined inside
the process seems like fundamentally the wrong implementation.
I'd have expected a feature like this to be implemented by SELinux,

SELinux is generally an object model and while it does implement e.g.,
capability checks, that is not the main intent, nor is it possible for
LSMs to implement syscall filters, the features are orthogonal.

or some similar technology where the filtering is done by logic
that's outside the executable you wish to not trust.
(After googling for libseccomp, I see that it's supposed to not
allow syscalls to be turned back on once turned off, but that isn't
any protection against this problem. An attacker who's found an ACE
hole in Postgres can just issue ALTER SYSTEM SET to disable the
feature, then force a postmaster restart, then profit.)

My preference would have been to enable it unconditionally but Joe was
being more practical.

I follow the idea of limiting the attack surface for kernel bugs,
but this doesn't seem like a useful implementation of that, even
ignoring the ease-of-use problems Peter mentions.

Limiting the kernel attack surface for network facing daemons is
imperative to hardening systems. All modern attacks are chained
together so a kernel bug is useful only if you can execute code, and
PG is a decent vector for executing code.

At a minimum I would urge the community to look at adding high risk
syscalls to the kill list, systemd has some predefined sets we can
pick from like @obsoluete, @cpu-emulation, @privileged, @mount, and
@module.

The goal is to prevent an ACE hole in Postgres from becoming a
complete system compromise. This may not do it alone, and security
conscious integrators will want to, for example, add seccomp filters
to systemd to prevent superuser from disabling them. The postmaster
and per-role lists can further reduce the available syscalls based on
the exact extensions and PLs being used. Each step reduced the surface
more and throwing it all out because one step can go rogue is
unsatisfying.

Thank you.

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#7)
Re: RFC: seccomp-bpf support

Andres Freund <andres@anarazel.de> writes:

Maybe I'm missing something, but it's not clear to me what meaningful
attack surface can be reduced for PostgreSQL by forbidding certain
syscalls, given the wide variety of syscalls required to run postgres.

I think the idea is to block access to seldom-used syscalls because
those are exactly the ones most likely to have bugs. (We certainly
hope that all the ones PG uses are well-debugged...) That seems fine.
Whether the incremental protection is really worth the effort is
debatable, but as long as it's only people who think it *is* worth
the effort who have to deal with it, I don't mind.

What I don't like about this proposal is that the filter configuration is
kept somewhere where it's not at all hard for an attacker to modify it.
It can't be a GUC, indeed it can't be in any file that the server has
permissions to write on, or you might as well not bother. So I'd throw
away pretty much everything in the submitted patch, and instead think
about doing the filtering/limiting in something that's launched from the
service start script and in turn launches the postmaster. That makes it
(mostly?) Not Our Problem, but rather an independent component.

Admittedly, you can't get per-subprocess restrictions that way, but the
incremental value from that seems *really* tiny. If listen() has a bug
you need to fix the bug, not invent this amount of rickety infrastructure
to limit who can call it.

(And, TBH, I'm still wondering why SELinux isn't the way to address this.)

regards, tom lane

#11Andres Freund
andres@anarazel.de
In reply to: Joshua Brindle (#9)
Re: RFC: seccomp-bpf support

Hi,

On 2019-08-28 14:23:00 -0400, Joshua Brindle wrote:

or some similar technology where the filtering is done by logic
that's outside the executable you wish to not trust.
(After googling for libseccomp, I see that it's supposed to not
allow syscalls to be turned back on once turned off, but that isn't
any protection against this problem. An attacker who's found an ACE
hole in Postgres can just issue ALTER SYSTEM SET to disable the
feature, then force a postmaster restart, then profit.)

A postmaster restart might not be enough, because the postmaster's
restrictions can't be removed, once in place. But all that's needed to
circumvent that is force postmaster to die, and rely on systemd etc to
restart it.

My preference would have been to enable it unconditionally but Joe was
being more practical.

Well, the current approach is to configure the list of allowed syscalls
in postgres. How would you ever secure that against the attacks
described by Tom? As long as the restrictions are put into place by
postgres itself, and as long they're configurable, such attacks are
possible, no? And as long as extensions etc need different syscalls,
you'll need configurability.

I follow the idea of limiting the attack surface for kernel bugs,
but this doesn't seem like a useful implementation of that, even
ignoring the ease-of-use problems Peter mentions.

Limiting the kernel attack surface for network facing daemons is
imperative to hardening systems. All modern attacks are chained
together so a kernel bug is useful only if you can execute code, and
PG is a decent vector for executing code.

I don't really buy that in case pof postgres. Normally, in a medium to
high security world, once you have RCE in postgres, the valuable data
can already be exfiltrated. And that's game over. The only real benefits
of a kernel vulnerabily is that that might allow to persist the attack
for longer - but there's already plenty you can do inside postgres, once
you have RCE.

At a minimum I would urge the community to look at adding high risk
syscalls to the kill list, systemd has some predefined sets we can
pick from like @obsoluete, @cpu-emulation, @privileged, @mount, and
@module.

I can see some small value in disallowing these - but we're back to the
point where that is better done one layer *above* postgres, by a process
with more privileges than PG. Because then a PG RCE doesn't have a way
to prevent those filters from being applied.

The postmaster and per-role lists can further reduce the available
syscalls based on the exact extensions and PLs being used.

I don't buy that per-session configurable lists buy you anything
meaningful. With an RCE in one session, it's pretty trivial to corrupt
shared memory to also trigger RCE in other sessions. And there's no way
seccomp or anything like that will prevent that.

An additional reason I'm quite sceptical about more fine grained
restrictions is that I think we're going to have to go for some use of
threading in the next few years. While I think that's still far from
agreed upon, I think there's a pretty large number of "senior" hackers
that see this as the future. You can have per-thread seccomp filters,
but that's so trivial to circumvent (just overwrite some vtable like
data in another thread's data, and have it call whatever gadget you
want), that it's not even worth considering.

Greetings,

Andres Freund

#12Joshua Brindle
joshua.brindle@crunchydata.com
In reply to: Tom Lane (#10)
Re: RFC: seccomp-bpf support

On Wed, Aug 28, 2019 at 2:30 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

<snip>

(And, TBH, I'm still wondering why SELinux isn't the way to address this.)

Just going to address this one now. SELinux is an LSM and therefore
only makes decisions when LSM hooks are invoked, which are not 1:1 for
syscalls (not even close). Further, SELinux generally determines what
objects a subject can access and only implements capabilities because
it has to, to be non-bypassable.

Seccomp filtering is an orthogonal system to SELinux and LSMs in
general. We are already doing work to further restrict PG subprocesses
with SELinux via [1] and [2], but that doesn't address the unused,
high-risk, obsolete, etc syscall issue. A prime example is madvise()
which was a catastrophic failure that 1) isn't preventable by any LSM
including SELinux, 2) isn't used by PG and is therefore a good
candidate for a kill list, and 3) a clear win in the
dont-let-PG-be-a-vector-for-kernel-compromise arena.

We are using SELinux, we are also going to use this, they work together.

#13Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#10)
Re: RFC: seccomp-bpf support

On 2019-08-28 14:30:20 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

Maybe I'm missing something, but it's not clear to me what meaningful
attack surface can be reduced for PostgreSQL by forbidding certain
syscalls, given the wide variety of syscalls required to run postgres.

I think the idea is to block access to seldom-used syscalls because
those are exactly the ones most likely to have bugs.

Yea, I can see some benefit in that. I'm just quite doubtful that the
set of syscalls pg relies on doesn't already allow any determined
attacker to trigger kernel bugs. E.g. the whole sysv ipc code is among
those seldomly used areas of the code. As is the permission transfer
through unix sockets. As is forking from within a syscall. ...

(We certainly hope that all the ones PG uses are well-debugged...)

One would hope, but it's not quite my experience...

Whether the incremental protection is really worth the effort is
debatable, but as long as it's only people who think it *is* worth the
effort who have to deal with it, I don't mind.

It seems almost guaranteed that there'll be bug reports about ominous
crashes due to some less commonly used syscall being blacklisted. In a
lot of cases that'll be hard to debug. After all, we already got such
bug reports, without us providing anything builtin - and it's not like
us adding our own filter suport will make container solutions not use
their filter, so there's no argument that doing so ourselves will reduce
the fragility.

Admittedly, you can't get per-subprocess restrictions that way, but the
incremental value from that seems *really* tiny. If listen() has a bug
you need to fix the bug, not invent this amount of rickety infrastructure
to limit who can call it.

And, as I mentioned in another email, once you can corrupt shared memory
in arbitrary ways, the differing restrictions aren't worth much
anyway. Postmaster might be separated out enough to survive attacks like
that, but backends definitely aren't.

Greetings,

Andres Freund

#14Joshua Brindle
joshua.brindle@crunchydata.com
In reply to: Joshua Brindle (#12)
Re: RFC: seccomp-bpf support

On Wed, Aug 28, 2019 at 2:47 PM Joshua Brindle
<joshua.brindle@crunchydata.com> wrote:

On Wed, Aug 28, 2019 at 2:30 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

<snip>

(And, TBH, I'm still wondering why SELinux isn't the way to address this.)

Just going to address this one now. SELinux is an LSM and therefore
only makes decisions when LSM hooks are invoked, which are not 1:1 for
syscalls (not even close). Further, SELinux generally determines what
objects a subject can access and only implements capabilities because
it has to, to be non-bypassable.

Seccomp filtering is an orthogonal system to SELinux and LSMs in
general. We are already doing work to further restrict PG subprocesses
with SELinux via [1] and [2], but that doesn't address the unused,

And I forgot the citations *sigh*, actually there should have only been [1]:

1. https://commitfest.postgresql.org/24/2259/

Show quoted text

high-risk, obsolete, etc syscall issue. A prime example is madvise()
which was a catastrophic failure that 1) isn't preventable by any LSM
including SELinux, 2) isn't used by PG and is therefore a good
candidate for a kill list, and 3) a clear win in the
dont-let-PG-be-a-vector-for-kernel-compromise arena.

We are using SELinux, we are also going to use this, they work together.

#15Andres Freund
andres@anarazel.de
In reply to: Joshua Brindle (#12)
Re: RFC: seccomp-bpf support

Hi,

On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote:

A prime example is madvise() which was a catastrophic failure that 1)
isn't preventable by any LSM including SELinux, 2) isn't used by PG
and is therefore a good candidate for a kill list, and 3) a clear win
in the dont-let-PG-be-a-vector-for-kernel-compromise arena.

IIRC it's used by glibc as part of its malloc implementation (also
threading etc) - but not necessarily hit during the most common
paths. That's *precisely* my problem with this approach.

Greetings,

Andres Freund

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#13)
Re: RFC: seccomp-bpf support

Andres Freund <andres@anarazel.de> writes:

On 2019-08-28 14:30:20 -0400, Tom Lane wrote:

Admittedly, you can't get per-subprocess restrictions that way, but the
incremental value from that seems *really* tiny. If listen() has a bug
you need to fix the bug, not invent this amount of rickety infrastructure
to limit who can call it.

And, as I mentioned in another email, once you can corrupt shared memory
in arbitrary ways, the differing restrictions aren't worth much
anyway. Postmaster might be separated out enough to survive attacks like
that, but backends definitely aren't.

Another point in this area is that if you did feel a need for per-process
syscall sets, having a restriction that the postmaster's allowed set be a
superset of all the childrens' allowed sets seems quite the wrong thing.
The set of calls the postmaster needs is probably a lot smaller than what
the children need, seeing that it does so little. It's just different
because it includes bind+listen which the children likely don't need.
So the hierarchical way seccomp goes about this seems fairly wrong for
our purposes regardless.

regards, tom lane

#17Joshua Brindle
joshua.brindle@crunchydata.com
In reply to: Andres Freund (#15)
Re: RFC: seccomp-bpf support

On Wed, Aug 28, 2019 at 2:53 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote:

A prime example is madvise() which was a catastrophic failure that 1)
isn't preventable by any LSM including SELinux, 2) isn't used by PG
and is therefore a good candidate for a kill list, and 3) a clear win
in the dont-let-PG-be-a-vector-for-kernel-compromise arena.

IIRC it's used by glibc as part of its malloc implementation (also
threading etc) - but not necessarily hit during the most common
paths. That's *precisely* my problem with this approach.

As long as glibc handles a returned error cleanly the syscall could be
denied without harming the process and the bug would be mitigated.

seccomp also allows argument whitelisting so things can get very
granular, depending on who is setting up the lists.

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#15)
Re: RFC: seccomp-bpf support

Andres Freund <andres@anarazel.de> writes:

On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote:

A prime example is madvise() which was a catastrophic failure that 1)
isn't preventable by any LSM including SELinux, 2) isn't used by PG
and is therefore a good candidate for a kill list, and 3) a clear win
in the dont-let-PG-be-a-vector-for-kernel-compromise arena.

IIRC it's used by glibc as part of its malloc implementation (also
threading etc) - but not necessarily hit during the most common
paths. That's *precisely* my problem with this approach.

I think Andres is right here. There are madvise calls in glibc:

glibc-2.28/malloc/malloc.c: __madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
glibc-2.28/malloc/arena.c: __madvise ((char *) h + new_size, diff, MADV_DONTNEED);

It appears that the first is only reachable from __malloc_trim which
we don't use, but the second is reachable from free(). However,
strace'ing says that it's never called during our standard regression
tests, confirming Andres' thought that it's in seldom-reached paths.
(I did not go through the free() logic in any detail, but it looks
like it is only reached when dealing with quite-large allocations,
which'd make sense.)

So this makes a perfect example for Peter's point that testing is
going to be a very fallible way of finding the set of syscalls that
need to be allowed. Even if we had 100.00% code coverage of PG
proper, we would not necessarily find calls like this.

regards, tom lane

#19Andres Freund
andres@anarazel.de
In reply to: Joshua Brindle (#17)
Re: RFC: seccomp-bpf support

Hi,

On 2019-08-28 15:02:17 -0400, Joshua Brindle wrote:

On Wed, Aug 28, 2019 at 2:53 PM Andres Freund <andres@anarazel.de> wrote:

On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote:

A prime example is madvise() which was a catastrophic failure that 1)
isn't preventable by any LSM including SELinux, 2) isn't used by PG
and is therefore a good candidate for a kill list, and 3) a clear win
in the dont-let-PG-be-a-vector-for-kernel-compromise arena.

IIRC it's used by glibc as part of its malloc implementation (also
threading etc) - but not necessarily hit during the most common
paths. That's *precisely* my problem with this approach.

As long as glibc handles a returned error cleanly the syscall could be
denied without harming the process and the bug would be mitigated.

And we'd hit mysterious slowdowns in production uses of PG when seccomp
is enabled.

Greetings,

Andres Freund

#20Joshua Brindle
joshua.brindle@crunchydata.com
In reply to: Andres Freund (#19)
Re: RFC: seccomp-bpf support

On Wed, Aug 28, 2019 at 3:22 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2019-08-28 15:02:17 -0400, Joshua Brindle wrote:

On Wed, Aug 28, 2019 at 2:53 PM Andres Freund <andres@anarazel.de> wrote:

On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote:

A prime example is madvise() which was a catastrophic failure that 1)
isn't preventable by any LSM including SELinux, 2) isn't used by PG
and is therefore a good candidate for a kill list, and 3) a clear win
in the dont-let-PG-be-a-vector-for-kernel-compromise arena.

IIRC it's used by glibc as part of its malloc implementation (also
threading etc) - but not necessarily hit during the most common
paths. That's *precisely* my problem with this approach.

As long as glibc handles a returned error cleanly the syscall could be
denied without harming the process and the bug would be mitigated.

And we'd hit mysterious slowdowns in production uses of PG when seccomp
is enabled.

It seems like complete system compromises should be prioritized over
slowdowns, and it seems very unlikely to cause a noticeable slowdown
anyway. Are there PG users that backed out all of the Linux KPTI
patches due to the slowdown?

I think we need to reign in the thread somewhat. The feature allows
end users to define some sandboxing within PG. Nothing is being forced
on anyone but we would like the capability to harden a PG installation
for many reasons already stated. This is being done in places all
across the Linux ecosystem and is IMO a very useful mitigation.

Thank you.

#21Thomas Munro
thomas.munro@gmail.com
In reply to: Joshua Brindle (#17)
#22Andres Freund
andres@anarazel.de
In reply to: Joshua Brindle (#20)
#23Peter Eisentraut
peter_e@gmx.net
In reply to: Joshua Brindle (#20)
#24Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Joshua Brindle (#20)
#25Joe Conway
mail@joeconway.com
In reply to: Peter Eisentraut (#23)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joe Conway (#25)
#27Joshua Brindle
joshua.brindle@crunchydata.com
In reply to: Tom Lane (#26)
#28Joe Conway
mail@joeconway.com
In reply to: Tom Lane (#26)
#29Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Joe Conway (#28)
#30Joe Conway
mail@joeconway.com
In reply to: Tomas Vondra (#29)
#31Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Joe Conway (#30)
#32Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#31)
#33Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#32)
#34Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#33)