oss-fuzz and why it might not be ideal for qubes-os
For the past few weeks, I have been working on integrating the qubes-os code base with ClusterFuzz, the continuous automated fuzzing infrastructure provided by Google's oss-fuzz. While the infrastructure itself has helped find a lot of bugs in a lot of open-source software, it doesn't seem to fit very well with qubes-os. Qubes OS is a full-fledged operating system after all, and running the qubes components requires a certain level of control over the system. Unfortunately, for security purposes and possibly to achieve better automation, oss-fuzz does not provide much control over the environment in which the fuzz targets run.
Possible race conditions and subsequent timeouts in the libqubes-rpc-filecopy target
The libqubes-rpc-filecopy target has been integrated with oss-fuzz for around a month now; however, the crash stats don't tell a very positive story. A large number of crashes occur daily due to timeouts, which reduces the efficiency of fuzzing.
The timeouts occur during the fuzz target's file I/O operations.
One possible reason I can think of for the timeouts is a race condition while the fuzz target performs file I/O under /tmp/out. Since libFuzzer can run many jobs in parallel, several of them could end up waiting on the same two files for file I/O, causing contention. Of course, I am not certain that this is what is happening, but I can't find any other explanation.
Earlier I had thought that the timeouts were caused by a different issue: qrexec-lib/pack.c waiting for STDIN input (which is /dev/null when the fuzz targets run on ClusterFuzz) due to the wait_on_end() function. I fixed this by patching out the functionality of wait_on_end() and expected the fuzz targets to run properly thereafter, but unfortunately they didn't, because of the issue described above.
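If shared paths really are the culprit, one mitigation would be to give every fuzzing process its own temporary files via mkstemp(). Here is a minimal sketch of what such a target could look like (the body is purely hypothetical, not the actual libqubes-rpc-filecopy harness):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical sketch: write each input to a per-call unique file so
 * that concurrently running fuzzing processes never contend on a
 * shared /tmp path. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    char path[] = "/tmp/fuzz-XXXXXX";
    int fd = mkstemp(path);          /* unique name avoids the race */
    if (fd < 0)
        return 0;
    if (write(fd, data, size) != (ssize_t)size) {
        close(fd);
        unlink(path);
        return 0;
    }
    /* ... feed `path` to the code under test here ... */
    close(fd);
    unlink(path);                    /* clean up so files don't pile up */
    return 0;
}
```

Whether this would actually cure the ClusterFuzz timeouts is an open question, but it would at least rule out filename collisions between parallel workers.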
The solution that the oss-fuzz devs have suggested is to avoid creating files altogether. That does not seem feasible, for these reasons:
- Existing functions like do_fs_walk() would have to be patched out, essentially reducing the file copy operation to copying from one buffer to another.
- The target's coverage would shrink for the same reason.
- Such patches would hinder the continuous integration we are aiming for.
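To make concrete what such a reduction would mean, a file-less harness would collapse to something like the following (a hypothetical sketch, not a proposed target):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical file-less harness: the whole "file copy" collapses into
 * a memcpy between buffers, so directory walking (do_fs_walk), metadata
 * handling, and I/O error paths would never be exercised. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    if (size == 0 || size > 4096)
        return 0;
    uint8_t out[4096];
    memcpy(out, data, size);   /* stands in for the real copy logic */
    return 0;
}
```

This illustrates the coverage problem: everything interesting about the file-copy code lives precisely in the parts that would have to be patched out.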
Other fuzz targets: Replacing socket I/O with file I/O
A solution for the server-client architecture of most of the qubes components is a mock libvchan library with the same interface as the original one, which would replace the socket I/O with file I/O and be statically linked into the fuzz target. However, this might lead to the same race conditions as above: it too would create files on the ClusterFuzz machines, in much the same way as the target above, which the oss-fuzz devs suggest may be the cause of the timeouts after all.
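A rough sketch of what such a mock could look like follows; note that the struct and function signatures below are simplified stand-ins for illustration, not the real libvchan API:

```c
#include <stdio.h>
#include <stddef.h>

/* Sketch of a mock libvchan that swaps socket I/O for file I/O.
 * The struct layout and function signatures here are simplified
 * approximations, NOT the actual libvchan interface. */
typedef struct {
    FILE *to_peer;     /* replaces the write end of the vchan */
    FILE *from_peer;   /* replaces the read end of the vchan */
} mock_vchan;

/* Mimics a send call: returns the number of bytes written. */
static int mock_vchan_send(mock_vchan *ctrl, const void *data, size_t size) {
    size_t n = fwrite(data, 1, size, ctrl->to_peer);
    fflush(ctrl->to_peer);
    return (int)n;
}

/* Mimics a recv call: returns the number of bytes read. */
static int mock_vchan_recv(mock_vchan *ctrl, void *data, size_t size) {
    return (int)fread(data, 1, size, ctrl->from_peer);
}
```

The fuzz target would link against this instead of the real libvchan, letting the server code run against input supplied by the fuzzer rather than a live peer.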
Other fuzz targets: qubes-gui-daemon
Another fuzz target that my mentor and I had initially intended to integrate with oss-fuzz was qubes-gui-daemon. Here is where the lack of control over the fuzzing environment causes problems. For qubes-gui-daemon (or its fuzz target) to run, at least directly, we would need an X server (preferably Xvfb) running on the ClusterFuzz machines. Unfortunately, we are not even allowed to install, much less run, anything other than our fuzz target executables on ClusterFuzz. So integrating qubes-gui-daemon seems difficult unless we give up coverage and continuous integration by patching out the code that interacts with the X server. Another solution my mentor has suggested is to embed the X server in the address space of the gui-daemon and replace sockets with in-memory FIFOs, which might be possible with some dlopen() hacks involving Xvfb. But since that seems like a large endeavor without a promise of success, I should move on to the other goals of my proposal.
So what about oss-fuzz?
I am not giving up on the integration with oss-fuzz just yet. It's just that it would take more time than we had thought, so I think I should first complete the other goals (static and taint analysis) and, if time permits, continue with the oss-fuzz integration later, during or after GSoC.
However, if anyone wants to take up the task, here is a collection of links (along with this blog post itself) that I think might help:
A better fuzzing alternative?
oss-fuzz seems to present a lot of difficulties, but solving them would be worth it because of the huge computation power Google is giving us. Another fuzzing alternative my mentor has suggested, which now seems very appealing to me, is to have dedicated VMs on a qubes machine running the fuzz targets. We could even run the executables under valgrind in such VMs. We do in fact have certain tests which run on qubes machines rather than in the cloud (because Xen in the cloud is a no-go). One could dedicate as many resources as one wants to the fuzzing VM; of course, we won't come close to oss-fuzz's resources. We need to choose between control over the system and computation power. Both fuzzing solutions have their merits and demerits.
While discussing how best to have continuous static analysis for the qubes code base, my mentor and I have concluded that there are three major goals we can aim for (in order of highest benefit-to-work-required ratio):
- Integration of tools like shellcheck and scan-build in the build process.
- Continuous integration with Travis CI, using Coverity and the tools already integrated above.
- Custom static analysis passes to ensure that untrusted_* values are checked before being assigned to trusted values, and to track which values they influence. Originally Frama-C seemed useful, but we're not sure it is the best tool for this task. Suggestions are welcome :)
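For context, the untrusted_* convention that such a pass would enforce looks roughly like this (a minimal illustration I wrote for this post, not actual qubes code):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_MSG_SIZE 4096

/* Minimal illustration of the qubes untrusted_* convention: a value
 * received from a VM keeps the untrusted_ prefix until it has been
 * validated, and only then is it copied into a trusted variable.
 * A custom static-analysis pass would flag any assignment from an
 * untrusted_* name that skips the check. */
static size_t sanitize_size(uint32_t untrusted_size) {
    if (untrusted_size > MAX_MSG_SIZE)
        exit(1);                       /* reject out-of-range input */
    return (size_t)untrusted_size;     /* now safe to treat as trusted */
}
```

The pass would essentially be a data-flow check: no untrusted_* value may reach a trusted variable without passing through a sanitizing function like the one above.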
Currently, I am working on making the build process more flexible in terms of the environment in which the components are built. Many of the current qubes components hardcode CC in their Makefiles, and some components lack sufficient security protections; both issues can be addressed by allowing environment variables to be set from the builder.conf file itself. However, since we want release builds to be deterministic, this doesn't seem to be a good idea for them. For testing and dev builds, though, it will indeed be very helpful.
For instance, while building my first component (qubes-gui-daemon), setting CC=clang detected a flaw in the xside.c code.
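The kind of Makefile change involved might look like this (a sketch of the general idea, not an actual qubes Makefile):

```make
# Before: CC is pinned and cannot be changed from builder.conf
# CC = gcc

# After: ?= sets a default that the environment or command line can
# override, so `make CC=clang` or an exported CC from builder.conf
# takes effect.
CC ?= gcc
CFLAGS ?= -O2 -Wall -Wextra

%.o: %.c
	$(CC) $(CFLAGS) -c $< -o $@
```

For release builds the variables would still be pinned to keep the build deterministic; the override only matters for test/dev builds.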
There can be multiple end goals with such flexibility for the test/dev builds:
- Locally build the components and report/fix any issues, like the one above, that come up
- Integrate the test builds with Travis CI too
- Build the components with sanitizers and execute them on dedicated VMs (like the fuzzing/testing VMs mentioned above)
- Build all components and check security protections using checksec.sh
tl;dr
- oss-fuzz does not seem ideal because of the limited control over the environment in which the fuzz targets run
- libqubes-rpc-filecopy timeouts are still happening due to file creation
- a mock libvchan library can help eliminate socket I/O, but it too would require file creation, possibly leading to the same kind of timeouts
- qubes-gui-daemon can't run on oss-fuzz machines because we are not allowed to run an X server (Xvfb)
- Should explore fuzzing/testing on dedicated VMs on a qubes machine
- Static analysis: integrating tools with qubes-builder, continuous integration with Travis CI, custom static analysis passes
- Making build process more flexible for dev/test builds