Software Task Cancellation Is Complicated

· ngp's blog


This is based largely on a twitter thread, though I've expanded it substantially. Enjoy 🙂

Background #

My day job involves developing distributed, soft realtime command and control systems for use in scientific and industrial hardware systems. One of the core concepts of these systems is that any individual component can fail at any time and in ways that a dependent processes may not be given explicit notification of the failure. This leads to a requirement of many components within the system needing to have timeouts. If some process exceeds its time window, it should cancel itself so the module can be available to service other requests in a timely manner to maintain liveness and some degree of fairness for servicing requests. Early on, we adopted a primarily single-threaded model using custom event loops and a client-server architecture using a centralized broker that implements caching and a pub-sub model for every event (or rather, "process variables" or "PVs" for short) in the system. It has been extremely successful, but not without limitations. One of the limitations brought about by the synchronous model is preventing unintentionally long tasks from starving out the servicing of other requests/events. It turns out that handling cancellation of otherwise synchronous code is not very straightforward.

The Basics #

Tasks are somewhat abstract, so let's get a basic definition out of the way:

Some arbitrary computation that may happen synchronously or asynchronously that is expected to produce some value or have some side effect.

Sound familiar? It's basically just a function. Next. I don't intend for it to mean any particular implementation, but we'll get into some details later.

To disambiguate, I'm going to simply refer to them as "functions" from here on out.

So what happens when you have a function that might take awhile, but if it doesn't complete its work in some pre-determined amount of time you want to stop performing further work and free up the resources the execution was consuming? Well, it's complicated.

In the standard approach that most programming languages support (ie. with simple synchronous functions), you simply can't. At least, not generically without modifying the code itself. If your requirements are limited exclusively to time/deadline based cancellation, you can do something like this:

[!example]-

 1#include <stdlib.h>
 2#include <time.h>
 3#include <stdio.h>
 4#include <unistd.h>
 5
 6const unsigned long COUNT = 5;
 7
 8enum long_exit {
 9  OK,
10  TIMEOUT,
11};
12
13int past_deadline(struct timespec deadline) {
14  struct timespec now;
15  if (clock_gettime(CLOCK_MONOTONIC, &now)) {
16    abort();
17  }
18  return (now.tv_sec > deadline.tv_sec ||
19          (now.tv_sec == deadline.tv_sec && now.tv_nsec > deadline.tv_nsec));
20}
21
22enum long_exit my_long_function(struct timespec deadline) {
23  for (unsigned int x = 0; x < COUNT; x++) {
24    sleep(1);
25    printf("Hello!\n");
26    struct timespec now;
27    if (past_deadline(deadline)) {
28      abort();
29    }
30  }
31  printf("Success!");
32  return OK;
33}
34
35int main() {
36  struct timespec tp;
37  if (clock_gettime(CLOCK_MONOTONIC, &tp)) {
38    abort();
39  }
40  // Add some time to now to get a deadline
41  tp.tv_sec += (COUNT - 1);
42  const enum long_exit exit = my_long_function(tp);
43}
44

This is a contrived, simple example but it gets the point across. You can periodically check if you've passed your deadline. This works because the state tracking the condition used to exit early is handled asynchronously for us by the OS: updating the (monotonic) system clock while the application runs. However, in order to support other arbitrary conditions that can be controlled by the programmer, we need some form of concurrency.

C and [[pthreads]] #

The most common form of concurrency in C programs is threading using something like pthreads or processes using fork. The latter case requires setting up some form of RPC to get working, so we'll focus on pthreads.

pthreads support the concept of cancellation naturally, but there's more than one "type" of cancellation and the API is rather confusing. To be brief, cancellation is sort of like a signal, it's typically triggered externally and asynchronously to a process. The two "types" of cancellation are called "deferred" and "asynchronous". Deferred cancellation can be thought of as similar as our example above, a condition is checked at "cancellation points" and the thread exits if it is set. Asynchronous can happen at any time. It's almost always incorrect to use the asynchronous cancellation mode.

[!warning] The condition used internally by pthread_cancel is atomic and irreversible.

[!info]+ If you'd like to find more info on pthreads and cancellation, refer to the pthreads(7) manual page using man 7 pthreads

I'm too lazy to write a code example of this, but maybe I'll find the motivation and time to do this before publishing. Please refer to the documentation for pthreads for more information.

Java #

Java has a stronger threading model than C does, or at least one that is more consistent across operating systems thanks to the JVM. [[Java]] Threads, similar to C's pthread, also has the natural concept of cancellation. Also very similar, there are 2 types of cancellation: synchronous and asynchronous. Fortunately, the Java runtime makes it a bit more manageable to

Go #

Python #

Rust #