Capsicum
Capsicum [1] is a sandboxing system based on capabilities. In short, it removes access to global namespaces, but allows sandboxed programs to access things through capability-blessed file descriptors and mapped memory that they receive or inherit. Experiment: what might be involved in sandboxing PostgreSQL? Here's some early speculation with unfinished code. Code is here, but it doesn't run correctly yet.
Current status: failing on wait4() where postmaster tries to reap child process after receiving SIGCHLD. Next step there is obvious.
- To-do list so far, crashing into problems one by one and fixing them
- Enter capabilities mode in PostmasterMain() after establishing global shmem and sockets, and then see what breaks
- Don't call procctl(PROC_PDEATHSIG_CTL) in children, it's not allowed (it probably should be: write patch for FreeBSD?); get around this by switching to SIGIO notification instead (existing patch I had for another reason by happy coincidence), and then see what breaks next
- Teach fd.c to open the data directory at startup, before we enter cap mode, with a new function called in the postmaster init sequence, and keep it around in a global variable datadir_fd
- Teach fd.c to open all files with openat(datadir_fd, ...) and see what breaks next
- Teach miscinit.c to do its open() calls via fd.c wrappers so it benefits from the above, and see what breaks next
- Teach AllocateFile() (a wrapper for fopen() that is used for example to read postgresql.conf) to do the openat(datadir_fd, ...) + fdopen dance
- Switch fork_process.c to using pdfork() and track the pd of every child
- Replace postmaster's select() + SIGCHLD handler that does traditional wait4() with new code that waits for all pds (current wait4() fails with ERR#94 'Not permitted in capability mode', unsurprisingly)
- Hrrmph, it looks like you can't use libc locales without running into the problem that it wants to open them on demand and we don't know which ones users are going to access yet; seems to require teaching libc to open /usr/share/locale, or a new interface there (?)
- Things I predicted at the thought experiment stage but haven't actually crashed into yet
- Teach latch.c to use umtx/futex instead of kill(), because that's the main place that processes signal arbitrary other processes, and that's obviously not going to work; done with existing unfinished patches that I have already been working on, by another happy coincidence
- Replace postmaster.c's kill(child) calls with pdkill()
- Replace pmsignal.c's kill() with writing a byte to the postmaster pipe? Obviously that kill() isn't going to work (AFAIK there is no special pass for killing your parent, you only get a special pass for killing your own PID). Then add the other end of that pipe to the set of fds the postmaster's main loop waits on (the rest of them being pds to learn about child death).
- Although I fixed miscinit.c to open e.g. postmaster.pid via openat() wrappers in fd.c, it also has calls to unlink(), which will need to become unlinkat().
- Surely I'm going to need to open the library directory and change dlopen() to dlopenat() in order to be able to use stored procedure languages and contrib modules
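The idea of replacing the child-to-postmaster kill() with a byte written to a pipe might be sketched like this. It uses plain fork() and waitpid() so it runs anywhere; under Capsicum the child would come from pdfork() and the parent would wait on its pd in the same poll() set. The function name and the reason code 42 are invented for the example.

```c
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Child "signals" its parent by writing one byte to a pipe; the parent
 * waits for it with poll().  Returns the byte received, or -1 on error.
 */
int
pipe_signal_demo(void)
{
    int           pipefd[2];
    pid_t         pid;
    struct pollfd pfd;
    unsigned char reason = 0;
    int           rc;

    if (pipe(pipefd) < 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
    {
        /* Child: send a one-byte reason code instead of kill(getppid()). */
        unsigned char msg = 42;     /* hypothetical reason code */

        close(pipefd[0]);
        _exit(write(pipefd[1], &msg, 1) == 1 ? 0 : 1);
    }

    /* Parent: the read end is just another fd in the main loop's wait set. */
    close(pipefd[1]);
    pfd.fd = pipefd[0];
    pfd.events = POLLIN;
    pfd.revents = 0;
    do
    {
        rc = poll(&pfd, 1, 5000);
    } while (rc < 0 && errno == EINTR);

    if (rc != 1 || !(pfd.revents & POLLIN) ||
        read(pipefd[0], &reason, 1) != 1)
    {
        close(pipefd[0]);
        waitpid(pid, NULL, 0);
        return -1;
    }

    close(pipefd[0]);
    waitpid(pid, NULL, 0);          /* reap the child */
    return reason;
}
```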
- More things to think about later
- Should this be controlled by building --with-capsicum, or just detected by configure?
- Should this be enabled with a new setting sandbox=capsicum, leaving space for other similar stuff (pledge(), ...)?
- How does this relate to the existing SELinux support? I've never looked at it.
- Does the new Linux Landlock system have the ability to work just like Capsicum? I got that impression from a Tweet, but Tweets tend to hide the true complexity of the universe
Longer version (rambling notes from before I actually started writing code, ideas may be wrong/out of date already...):
FILES
- Data and log files. A PostgreSQL server manages a set of files under one directory known as "the data directory" with various subdirectories. Capsicum doesn't let you call open("arbitrary path", ...) because that's a global namespace, but it does allow you to call openat(fd, "path under fd"), given an fd that is set up to allow access. Luckily, data files are always opened with paths relative to the data directory already, through a single wrapper BasicOpenFilePerm() [2].
Its open(...) call would become openat(data_dir_fd, ...). Likewise, all calls to opendir() go through AllocateDir() in fd.c, which needs the same type of treatment.
- There are some special cases: src/backend/utils/init/miscinit.c calls open() directly, and there are a few more cases like that. All of these can probably be changed to use BasicOpenFilePerm(). (Some tools under src/bin/ are excluded from the scope of this thought experiment.)
- Access to arbitrary file paths: A few high-privileged (meaning PostgreSQL privileges) facilities that access the filesystem directly, like COPY xxx FROM '/some/path' or SELECT pg_ls_dir('/tmp'), won't work, but preventing that sort of thing seems to be the whole point of this exercise. Or perhaps they could be constrained to the data directory, or some other blessed place.
- Shared libraries. Many parts of PostgreSQL are implemented as libraries that provide extensions like PL/pgSQL, Python, Perl etc for stored procedures, and new data types, index types, table types etc. These are opened with dlopen(), and that'd need to become dlopenat(), given a suitable fd for the library install directory.
- A small number of cases like /dev/random may need special treatment, perhaps just being opened before entering capabilities mode.
MEMORY
- System V shared memory: PostgreSQL creates a vestigial 64 byte memory segment; once it was a large segment for the buffer pool etc, but its remaining use is as a sort of interlocking to prevent two servers from running in the same data directory at the same time. This can probably be replaced with a plain old lock file in the data directory, or it could be set up before entering capabilities mode.
- Anonymous shared mmap() memory: The main shared memory region is created with mmap() in the "postmaster" (supervisor) process and then inherited by all children by forking. That requires no special treatment.
- POSIX shared memory: Extra shared memory regions, known as "dynamic shared memory" or DSM in the PostgreSQL source, are sometimes created for the lifetime of a parallel query, for extra work space. By default, this is done with POSIX shm_open("/PostgreSQL/xxxx"), which isn't allowed under Capsicum. The simplest thing would be to use dynamic_shared_memory_type=mmap: it opens temporary files under the data directory, so it would automatically work, though it may be more likely to write data back to disk than the POSIX alternative (?), which is undesirable.
PROCESSES
- The postmaster creates children with fork(), handles notification of their exit with a SIGCHLD handler, and sends signals to control them with kill(). This would need to be changed to pdfork(), poll()/kevent() to wait for their termination, and pdkill() to send signals (for example for shutdown, to notify them that a worker process they asked for has started or exited, and other rare special conditions).
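A rough sketch of the pdfork()/pdkill()/poll() combination, going by my reading of the FreeBSD pdfork(2) man page (poll() on a process descriptor raises POLLHUP when the process dies). The function name is invented, and on anything other than FreeBSD it is just a stub that fails with ENOSYS.

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

#ifdef __FreeBSD__
#include <sys/types.h>
#include <sys/procdesc.h>
#include <poll.h>
#endif

/*
 * Start a child with pdfork(), ask it to exit with pdkill(), and wait for
 * its death by polling the process descriptor.  Returns 0 on success and
 * -1 on failure (always -1 with errno = ENOSYS on non-FreeBSD systems).
 */
int
pdfork_demo(void)
{
#ifdef __FreeBSD__
    int           pd;
    pid_t         pid;
    struct pollfd pfd;
    int           rc;

    pid = pdfork(&pd, 0);
    if (pid < 0)
        return -1;
    if (pid == 0)
    {
        /* Child: sleep until signalled; SIGTERM's default action ends us. */
        for (;;)
            pause();
    }

    /* Parent: signal the child via its descriptor rather than its PID. */
    if (pdkill(pd, SIGTERM) < 0)
    {
        close(pd);
        return -1;
    }

    pfd.fd = pd;
    pfd.events = 0;             /* POLLHUP is reported regardless of events */
    pfd.revents = 0;
    do
    {
        rc = poll(&pfd, 1, 5000);
    } while (rc < 0 && errno == EINTR);

    close(pd);                  /* releases the exited child */
    return (rc == 1 && (pfd.revents & POLLHUP)) ? 0 : -1;
#else
    errno = ENOSYS;             /* process descriptors are FreeBSD-specific */
    return -1;
#endif
}
```

In the real postmaster the pds of all children would sit in one poll()/kevent() set, which is why fork_process.c would have to track them.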
IPC
- Most interlocking is done with spinlocks and "LWLocks" that use semaphores to wait, but those are the POSIX unnamed kind, so they're really just bits of shared memory used for waiting via umtx/futex. (On some OSes they are POSIX named or SysV semaphores that wouldn't be allowed under Capsicum, but not on FreeBSD or Linux.)
- Child processes also need to signal the postmaster. How can you get the fd corresponding to your parent, so you can use pdkill()? Perhaps we should just get rid of the signals in that direction, and instead use the (existing, mostly unused) pipe between parent and children?
- Some interlocking is done with what we call "latches", an abstraction that consists of a shared memory flag that can be waited on and cleared by its owner and set by any other process to wake the owner up. When the setter sets a latch, it checks if the owner is waiting and if so sends a signal with kill(SIGURG). When the owner sleeps waiting for the latch to be set, it does so using various race-free techniques described in the comment at the top of latch.c [3].
Simple latch waits could be replaced with umtx/futex waits, and I (TM) already have a prototype that I was working on for performance reasons, so that would, erm, kill two birds with one stone.
I thought there was going to be a case where we still needed signals, but I've just had a tentative epiphany. Latches can also be multiplexed with waits for sockets (which maps to kevent()/epoll_wait() containing socket + a signal event for a signal that is otherwise ignored or blocked), and that seems to be a problem for a pdfork()-based system: every backend would need the process fd for every peer, which creates an N^2 file descriptor problem. But... it might be that we really only do that so that our SIGUSR1 and SIGTERM handlers (which usually receive signals from postmaster) can prevent WaitLatchOrSocket() from waiting forever for socket data when we should be waking up to check various flags; in other words, in the multiplexing-with-sockets case, maybe the only sender of SIGURG is the same process! And capability mode always allows you to send a signal to yourself (see kern_kill() in kern_sig.c). So there may be no problem here after all (and that could be changed to EVFILT_USER or a self-pipe).