Of threads, forks and PCLSRing in high-level languages
Amongst the little Common Lisp projects I have been working on at ITA, that we have published as free software, there is single-threaded-ccl, a small modification to Clozure CL (CCL) that allows it to start and run in a single thread. Why is that hack necessary at all? Because I'd like to do useful computations in forks (including but not limited to parallel compilation with POIU or XCVB or robust Erlang-style concurrency), but (a) the upstream CCL insists on always spawning ancillary threads during system initialization to manage interactive I/O, and (b) threads pretty much guarantee that most attempts to compute in a fork will result in massive instability. I'll cover the first issue quickly, but the second issue is what I will develop.
Unhappily regarding the first issue, Gary Byers, the top CCL author and maintainer, has expressed his disinterest in providing an option for single-threaded startup in the upstream CCL, even though I offered that I could do the porting and maintaining myself. Note that such supported interface would probably involve more than my small hack: my hack sacrifices some interactive features, meaning that you have to manually call finish-output to flush your output stream buffers before you may expect the user to see what you printed (which you should do to be standard-compliant anyway), and to that you need to manually poll for any Ctrl-C interrupts to have them processed. Yet supporting the option wouldn't be hard, as long as you don't care too much for interrupts: 1- you'd have to refactor and clean up the startup process, which would be a good thing anyway, and 2- you'd have to do the right thing with streams, which is just a special variable away (like Perl's $|), though it would probably be better to keep an interactive? slot in the stream class (possibly initialized by default from such special variable). Now, getting interrupts right may be tricky (and may involve some of the PCLSRing issues discussed below) but I'm sure the target audience of the single-threaded-ccl, who write system software, would be quite happy to live with that restriction. There is little to add to this issue, so I'll say no more.
The second issue is more interesting. The conclusion is that if you have threads when you fork, the only safe things to do are pretty much to either call execve(2) or to dump a core(5), unless you go through excruciating pains to otherwise make it work. These excruciating pains are what I'd like to walk you through.
The reason that fork doesn't marry well with threads is that when you fork, the child process will have only one active thread (a copy of the one that forked) in an address space that is a copy (on write by either party) of the memory of the parent process; therefore, all the data structures that are being shared by the current thread with other threads may be in whatever transient state these threads (not to speak about signal handlers!) may have been putting them at the moment the kernel copied the address space (which itself may or may not be atomic when multiprocessors are involved). The consequence is that unless the forking thread took care to hold a lock on each and every single shared data-structure that the child might possibly need to use (which might involve properly blocking signals), any of these data-structures may have been made available to the child in an unusable state that is completely unfixable (including it possibly being unsafe to attempt to reclaim associated resources).
Thus, unless you can solve this difficult concurrency problem, if you use fork and want to continue useful computations in the child, you must eschew the problem and either fork before any other thread is started, or after all other threads have exited cleanly — and you cannot just kill threads, for the same reason that they might be in the middle of modifying shared data-structures at the time you would shoot them down. Alternatively, you might not care at all about continuing any computation in the child process with any of the datastructures that these other threads may be accessing: you might be about to execve(2) a new program without the need for any of those datastructures to setup that call, or you might be about to dump a core in the middle of the (possibly suspect) computation for purely analytic purposes without the need to resume computation afterwards. Or you have indeed protected the few datastructures you care for with appropriate locks, and might not care about the minefield in the rest of your address space because you know that neither your child process nor any of the libraries that it uses will ever roam in that minefield (not even through a signal handler) nor need to reclaim the memory resources.
But what if you want to actually solve the difficult problem? What if you want to achieve multithreaded PCLSRing and synchronize all your threads in a state that is stable enough for the child to resume computations? The problem is solvable in theory (indeed was a topic of my PhD thesis), and there even exist tools to solve it in practice, though we may soon find these tools to be lacking.
Speaking of tools, some smart ass may point you at an interface function meant to help you solve the problem somewhat modularly: pthread_atfork. But the existence of this function doesn't guarantee that it will be used properly, or even that it is possible to use it properly. Indeed, you can grep(1) in vain for calls to it in most libraries, including allegedly thread-safe libraries and downright multithreaded libraries; you'll find that few if any use atfork at all. But don't blame the library authors too fast before you consider the cost/benefit analysis of even trying to use this interface properly.
First problem is, when you attempt to grab all those locks you need, you not only have to track down what all those locks are, you also have to order them properly. Indeed, if one thread tries to grab lock A then lock B, while another thread tries to grab lock B then lock A, then you are due for a deadlock; and even if you modify your code so the second thread will back out of B after failing to get A before it tries again, then for all your efforts you may avoid the deadlock but still have a livelock. The only safe and live solution is there should be a clear dependency (partial) order between locks, and that you should never try to grab a higher-level lock while holding a lower-level lock. Now, if your atfork is trying to grab all the locks, then you have to define a total order that is compatible with the previous partial order, and if two threads attempt to each compute such a total order, they better agree on the topmost element for whichever common subset of locks they are trying to grab.
In the case of simple low-level C libraries without dynamic datastructures, getting such an order should be straightforward: each library would have a global lock and only call lower-level libraries, so you'd grab the higher level library lock first, then the lower level library lock; hopefully, the way your linker works, it calls initializers for lower-level libraries first, and the atfork handlers will be registered in the correct order. But then again, are you sure your library writers got this right? Considering how little the feature is used, it doesn't encourage anyone to bother trying and debugging a non-deterministic concurrency issue. And because you don't expect others to bother, why bother yourself, when your work will be in vain unless they do their part of the job? In absence of an enforcement mechanism these things just don't happen. Since there is no automated such mechanism, and the social incentives are lacking, you better be ready to go through the source code to ensure things are done right when you need them. And when the code is proprietary, you better have a great relation with the authors to get what you need, and the authors better be just as madly clever as you are to get it to work.
But we were talking about the simple case of low-level libraries with only one or a few static global locks each. Now consider what happens if you're writing a higher-level library that has to take into account a dynamic set of locks, or worse where some user-provided callbacks or other higher-order functions may be called while holding a lock, that may themselves cause operations requiring lower locks to be called. And what if these callbacks even potentially cause the dependency order between the locks of various libraries to not be known statically at compile-time, or worse to change with phases of your program? Remember that before this ordering issue kicked in, we were assuming that you could trace the set of all the resources involved in your high-level computation. To say the least, it all requires that you should possess a jolly good understanding of all concurrency issues throughout the code. Worse even, you will have to maintain that deep understanding as your code evolves, and as the libraries you use evolve. If you are writing a small part of some biggish software involving tens of developers past, present and future, your chance of getting this right are pretty thin.
It would be nice if some automatic tool helped you get it right by construction, by tracking all the resources in use and their dependency order; but such tool doesn't exist, and even if it did, it would only be useful if it were used on each and every module that will run in the same address space as your module. There again, we're far from it. Does your favorite high-level programming language implementation properly use atfork in its runtime? Does it allow you to declare resources in a way that allows them to be automatically tracked down with their dependency order always well-established? Does it provide hooks for your application to register its own atfork handlers? I bet it doesn't. But then again, I may be proven wrong! And if enough people start using it, that might build enough pressure to get all these libraries to properly use it.
However, holding locks is only one part of the problem. Then comes the cleanup of the resources in the child process. If the fork is a one-off for a special-purpose operation, or even an n-off for a small-enough bounded number n, you may not actually care; you may tolerate that a large part of your address space be occupied with unused and unusable resources reserved for the use of now non-existent threads, that these resources will never be used, may bitrot quietly, and hopefully will quickly end up being swapped out. But if spawning threads and forking are both part of your normal computation model, you will find that after a few iterations, the descendent's memory will be all clogged with unfreed resources. Actually, address space is only one problem, and other resources may leak and run out, like file descriptors and other operating system handles and capabilities; Finally, not being able to track down these leaking resources and get rid of them might not just be a performance issue, but also be a security issue: what if you fork to run unsafe libraries, in which a malicious cracker can now inject code that captures passwords and other sensitive information?
Once again, when you deal with simple low-level libraries, the problem is relatively straightforward: in your cleanup handler, just reset your global datastructures to remove any trace of datastructures used by other threads; then be sure to restart any required ancillary thread before the next time it is used, informing it about any datastructure you kept for the current thread (if any). If you were maintaining a stateful session with a server, you may have to reinitialize a new session, any existing transaction being aborted as if a "normal" communication error had happened (and you handle these cases, don't you?). Then again, for reasons already discussed, just because it is straightforward doesn't mean it has been provided to you; after all, it is tedious code to maintain, code for which there is no expected user.
Now once you try to generalize the procedure to arbitrary higher levels of abstractions, including whichever layers of application reside above the language or module you are trying to write, then you realize that to fully and properly solve the problem of forking with threads, you have to fully solve the reputedly difficult problem of properly interrupting and killing a thread; indeed what you are trying to do is precisely cleaning up resources from threads that were interrupted at a random moment in some parent but were never otherwise properly interred.
And there the difficulty lies in that while you may have a proper understanding of your code, and do what it takes to correctly preserve its invariants and not break any of the invariants of the lower level layers that you rely upon, you cannot know what invariants matter to the application layers above you: they could be just anything. You just can't do what it takes when you can't know what it takes, and some applications will want something different and opposite from what want other applications. That is why language implementers can with a lot of work achieve enough PCLSRing to synchronize every thread in a stable state with respect to say memory allocation invariants, so that garbage collection can be successfully undertaken; but they can't make it so that any thread (much less all threads) will be in a stable enough state as to be safely killed irregardless of what the application invariants are. Surely you could easily kill the thread, unlink its stack from the GC roots, and reclaim its storage. But little good would that do if that meant your application was then stuck because that thread was in the middle of some high-level operation, and no one will be able to cleanup after this half-done operation when the thread is dead.
Was the interrupted thread in the middle of transfering money from account A to account B, with the transferred amount being currently included in both (respectively neither) account? You better not interrupt there, or money will have appeared (respectively disappeared) from your accounting system, presumably a bad thing. Hopefully, if the thread was in the middle of any transaction, it will be holding some kind of lock, and you will only be able to interrupt it when the lock is released. Now the thread may be constantly in the middle of such transactions, and the odds of interrupting it at the right moment may be slim; so you can't just poll for the thread to be in a stable state, you need the interruption to be queued and processed as soon as the thread exits its atomic operation and reaches a stable state. Except that what is stable to one upper level of the code may not be stable enough for one yet higher level of the code, so you want to queue your interruption at the highest level that matters. Finally, the thread might not naturally reach a stable state in any reasonable, predictable or otherwise satisfactory amount of time if left to its own device, especially where optimizations may reorder operations and skip any stable state. Therefore you may have to not just wait for the target thread to stabilize, but make it enter a special execution mode where it will do its best to promptly rollback or rollforward its current transaction and leave the thread in a proper state suitable for interruption.
And of course, once you've gotten the PCLSRing itself correct, you still need be able to walk the stable state of the process and do whatever cleanups and resets are necessary to account for the death of the threads that you killed. Tracing garbage collection might do a lot of work for you, but you may still have to unregister the deceased activities and their resources from whichever registries you use before the GC may be able to do its job. And the unregistration will probably be different during a fork's child cleanup from what it is in the parent: in the parent, you want to do things gracefully, completing transactions by committing or aborting them, returning resources to the system and various servers; in the child, you want to just forget about those things of which you have a duplicate, neither commit nor abort transactions that are the parent's property to complete, just relinquishing access to those resources to which you were given an undeserved duplicated handle. Thus you can't "just" unwind the stack and play the normal unwind handlers and other finalizers — unless the handlers were built in such a way that they can recognize that you're unwinding in a special "throw it away" mode. Note that this part is tricky even in the single-threaded case; the multi-threaded case only compounds it slightly in that you have to not forget to walk the resources from other threads to deallocate them, in addition to those from parts of the current thread's continuation that are irrelevant to the child. The part that is really made harder by multithreading is the PCLSRing problem; the subsequent cleanup is mostly just "normally hard".
Another approach would be to not try to free resources of interrupted activities in the current image, but only allocate fresh resources for relevant activities in a fresh image. In addition to the cost of initializing the fresh image and all its libraries (which can be paid in advance by keeping an preforked initialized server image listening), this implies the cost of serializing the relevant state in the current image, and either save it aside while the process somehow reinitializes from scratch, or send it over the wire to a newly created or forked process that is being execve'd and initialized. The cost of such a "Copying GC" approach to resource reclamation is somewhat high, but so is the cost of returning resources from other threads actually, and at least this cost doesn't depend on what a gazillion other activities written and controlled by different authors may be running in the parent image. This approach also has the advantage that you can often abstract away from the state of underlying low-level libraries you link to, and that the protocol for forkability, while more constraining than it can be without serializability, is quite simple: the ability to serialize the relevant state of the thread — which can be useful for other purposes (persistence, migration).
What might do the trick to manage issues with PCLSRing (or serialization) would be an interface for declarative resource-dependency management in your programming language API: which resource depends on which other resource having been initialized (and what does that mean for the encoding and decoding data). Did you see one? No? Too bad. Problem is, it might be non-trivial to implement right, difficult to prove correct statically, and costly to enforce dynamically. In low-level languages, things may be simple, programmers will receive no help whatsoever to get them right. Higher-level languages might conceivably offer support to automate these things, but solving the problem in the general case might be quite tricky for language implementers. With higher-order functions and modules, establishing a sound concurrency protocol may depend on clients and servers respecting the mutual contract. Now what language support is there or can there be to clearly define and enforce such contract? Is there significant research in solving that problem? Not that I know of — but I haven't followed recent developments. Region-based memory management and substructural typing look like a good trail to follow; but does it go all the way to PCLSRing? If you know, please tell me.
Meanwhile, I'll keep using the usual "solution" — let the garbage accumulate, and forget the TUNESy hopes of a system capable of holistic bookkeeping, instead manually create a controlled subsystem that does its own bookkeeping of the things I manually specify as important, so I may export them from a cluttered system and reimport them into a fresh one.