Finding multi-threading bugs with gdb

Multi-threading is a useful tool, but programs that use multiple threads are susceptible to types of bugs that cannot occur in single-threaded programs. Race conditions can exist when multiple threads access the same data concurrently. Synchronization primitives, like mutexes and condition variables, help avoid race conditions, but can cause other problems when used incorrectly. In particular, a deadlock can occur when several threads compete for the same mutexes in such a way that none of the threads can proceed. This post demonstrates how to use gdb to find deadlocks and similar threading bugs in FlasCC programs.

What a deadlock looks like

A simple deadlock can occur with two threads, thread 1 and thread 2, and two mutexes, A and B. Suppose that thread 1 acquires mutex A while thread 2 acquires mutex B. Then thread 1 attempts to acquire mutex B, so it waits for thread 2 to release it. If thread 2 then attempts to acquire mutex A, it waits for thread 1 to release it. Neither thread can release the mutex the other needs, so both wait forever.

To discover and debug a deadlock, you can interrupt the waiting threads with a debugger and examine their state, such as their stack traces. With this information, you can confirm that the threads are in fact deadlocked and not just performing some long-running operation.

Background

Before jumping into gdb, let’s discuss how the underlying Flash debugger works and how it is relevant to threading. When you compile ActionScript code for debugging, the compiler inserts special opcodes called debuglines into the ActionScript bytecode it produces. Debuglines serve two purposes: they provide line number information for the debugger and they mark locations in the bytecode where the debugger is able to suspend execution of your program. When the Flash Player encounters a debugline, it checks whether it should suspend the program because it has hit a breakpoint, has completed a stepping operation, or for any other reason. Execution of a thread only stops when that thread encounters a debugline. [1]

Mutexes can also stop the execution of a thread: if a thread tries to acquire a mutex currently held by a different thread, it will wait until the mutex becomes available. If several threads are trying to acquire a mutex, only one will succeed and continue executing; the others will wait. Suppose that a debugger attempts to interrupt these threads at this point. The thread that has acquired the mutex and is executing will eventually encounter a debugline, suspend, and give control to the debugger. But the threads that are waiting to acquire the mutex cannot give control to the debugger, since they have not found a debugline after receiving the debugger’s request to stop. Therefore, threads that are waiting on mutexes cannot be interrupted by the debugger until they acquire the mutex, resume execution, and encounter a debugline.

Some FlasCC internals

The above discussion applies to ActionScript programs that use the flash.concurrent.Mutex class for synchronization. Most FlasCC programs will instead use the pthread version of a mutex, pthread_mutex_t. Not surprisingly, the implementation of pthread_mutex_t uses Mutex, but there isn’t a one-to-one mapping between a pthread_mutex_t and an instance of the ActionScript Mutex class. Instead, all synchronization in pthreads is handled by a single ActionScript Mutex. When several threads are waiting, because they called functions like pthread_mutex_lock or pthread_cond_wait, they each acquire the Mutex for a brief period of time. While such a thread holds the Mutex, the debugger is able to interrupt it, but doing so prevents any of the other threads from taking the Mutex. Since a waiting thread cannot be interrupted by the debugger unless it holds the Mutex, only one waiting thread can be controlled by the debugger at any point in time.
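
The basic rule that a mutex held by one thread is unavailable to every other thread can be observed from C without risking a hang. The following is a minimal sketch, not a description of FlasCC's internals: it assumes a standard POSIX pthread library and uses pthread_mutex_trylock, which returns EBUSY instead of waiting when the mutex is already held. The names probe_lock, try_once, and probe_while_held are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical mutex used only for this illustration. */
static pthread_mutex_t probe_lock = PTHREAD_MUTEX_INITIALIZER;

/* Tries to take the mutex without blocking and stores the result code. */
static void *try_once(void *result)
{
    int rc = pthread_mutex_trylock(&probe_lock);
    if (rc == 0) {
        /* Only reached if nobody else held the mutex. */
        pthread_mutex_unlock(&probe_lock);
    }
    *(int *)result = rc;
    return NULL;
}

/*
 * Holds the mutex, lets a second thread probe it, and returns the
 * result code that the second thread observed.
 */
int probe_while_held(void)
{
    pthread_t prober;
    int rc = -1;

    pthread_mutex_lock(&probe_lock);
    int err = pthread_create(&prober, NULL, try_once, &rc);
    assert(err == 0);
    pthread_join(prober, NULL);   /* the prober never blocks, so this returns */
    pthread_mutex_unlock(&probe_lock);
    return rc;
}
```

Calling probe_while_held() from main should yield EBUSY, because the calling thread still holds the mutex when the second thread probes it. If try_once called pthread_mutex_lock instead, the second thread would wait for the holder while the holder waits in pthread_join, and the program would hang.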

A simple example

With these limitations in mind, let’s debug a simple program that deadlocks. Here’s the code:

#include <pthread.h>
#include <stdio.h>
 
static pthread_mutex_t first_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t second_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ready_to_lock = PTHREAD_COND_INITIALIZER;
static pthread_mutex_t cond_mutex = PTHREAD_MUTEX_INITIALIZER;
 
void *second_thread(void *thrdata)
{   
    pthread_mutex_lock(&cond_mutex);
    pthread_mutex_lock(&second_lock);
    pthread_cond_signal(&ready_to_lock);    
    pthread_mutex_unlock(&cond_mutex);
    fprintf(stderr, "about to deadlock\n");
    pthread_mutex_lock(&first_lock);
    pthread_exit(NULL);
}   
 
void *first_thread(void *thrdata)
{
    pthread_t thread2;
    pthread_mutex_lock(&first_lock);
    pthread_mutex_lock(&cond_mutex);
    int err = pthread_create(&thread2, NULL, second_thread, NULL);
    if (err) {
        fprintf(stderr, "pthread_create failed: error %d\n", err);
    } else {
        pthread_cond_wait(&ready_to_lock, &cond_mutex);
        pthread_mutex_lock(&second_lock);
    }
    pthread_exit(NULL);
}
 
int
main(int argc, char **argv)
{
    pthread_t thread1;
    int err = pthread_create(&thread1, NULL, first_thread, NULL);
    if (err) {
        fprintf(stderr, "pthread_create failed: error %d\n", err);
    }
    pthread_exit(NULL);
}

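Note that this example uses a condition variable to ensure that the two threads deadlock each other reliably: first_thread does not try to take second_lock until second_thread has signalled that it already holds it. This makes the example reproducible, but it is not very realistic. Real deadlocks often occur only intermittently.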
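
The handshake in the example works because first_thread locks cond_mutex before it creates second_thread, so the signal cannot be sent before the wait begins. A more general form of the same handshake, shown here as a hedged sketch rather than as part of the original program, pairs the condition variable with a flag and waits in a loop, which also guards against spurious wakeups. The type handshake_t and the functions handshake_signal, handshake_wait, signaller, and run_handshake are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper type: a condition variable plus the flag it guards. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            ready;   /* predicate, protected by mutex */
} handshake_t;

/* Called by the thread that has reached the agreed point. */
static void handshake_signal(handshake_t *h)
{
    pthread_mutex_lock(&h->mutex);
    h->ready = true;
    pthread_cond_signal(&h->cond);
    pthread_mutex_unlock(&h->mutex);
}

/* Called by the waiting thread; returns only once ready has been set. */
static void handshake_wait(handshake_t *h)
{
    pthread_mutex_lock(&h->mutex);
    while (!h->ready) {
        /* Re-check the flag after every wakeup, spurious or not. */
        pthread_cond_wait(&h->cond, &h->mutex);
    }
    pthread_mutex_unlock(&h->mutex);
}

static void *signaller(void *arg)
{
    handshake_signal((handshake_t *)arg);
    return NULL;
}

/* Runs one handshake with a new thread; returns 0 on success. */
int run_handshake(void)
{
    handshake_t h;
    pthread_t t;

    pthread_mutex_init(&h.mutex, NULL);
    pthread_cond_init(&h.cond, NULL);
    h.ready = false;

    int err = pthread_create(&t, NULL, signaller, &h);
    if (err == 0) {
        handshake_wait(&h);
        pthread_join(t, NULL);
        assert(h.ready);
    }
    pthread_cond_destroy(&h.cond);
    pthread_mutex_destroy(&h.mutex);
    return err;
}
```

Because the flag records that the signal was sent, the handshake completes whether the signalling thread runs before or after the waiter reaches pthread_cond_wait.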

Start by compiling the program for debugging:

gcc -o dlock.swf -g -O0 -pthread -emit-swf dlock.c

Run the program in the debugger, but don’t bother setting any breakpoints or anything like that; we’ll just let the program run until it deadlocks and then we’ll interrupt it to see where it got stuck. Since we’ll be debugging each thread individually, make sure to turn on non-stop mode first:

(gdb) set pagination off
(gdb) set target-async on
(gdb) set non-stop on
(gdb) run
Starting program: dlock.swf

Once the text “about to deadlock” appears, press CTRL+C in gdb to interrupt the program. You should see something like this:

^C
Program received signal SIGTRAP, Trace/breakpoint trap.
0x00000000 in ?? ()
(gdb)

At this point, the debugger has interrupted the UI thread, while our deadlocked threads are still attempting to run in the background. Given that they are deadlocked, though, they can’t get very far. Let’s ask gdb to show us what each thread is doing with the info threads command:

(gdb) info threads
[New Worker 4]
[New Worker 5]
  Id   Target Id         Frame 
  3    Worker 5          (running)
  2    Worker 4          (running)
* 1    Worker 1          0x00000000 in ?? ()

We would like to interrupt each running thread and see what it’s doing. The interrupt -a command does that:

(gdb) interrupt -a
(gdb) 
[Worker 4] #2 stopped.
0x00000000 in ?? ()

Of our two threads, only one has stopped and given control over to the debugger. As discussed above, that’s because only one waiting thread can be interrupted by a debugger at a time. Thread #2 happened to stop first, but you might find when you try this example that thread #3 stops first. That’s okay, because we’ll be able to examine each thread eventually. Let’s switch to whichever thread has stopped and see what code it is executing:

(gdb) thread 2
[Switching to thread 2 (Worker 4)]
#0  0x00000000 in ?? ()
(gdb) bt
#0  0x00000000 in ?? ()
#1  0xf0001c0c in avm2_msleep () from remote:2.elf
#2  0xf0001fa3 in kmsleep () from remote:2.elf
#3  0xf001507b in __do_lock_umutex () from remote:49.elf
#4  0xf001575f in _do_lock_umutex () from remote:49.elf
#5  0xf0014d8b in ___umtx_op_wait_umutex () from remote:49.elf
#6  0xf0014423 in k_umtx_op () from remote:49.elf
#7  0xf0000ae0 in _umtx_op () from remote:2.elf
#8  0xf0010e69 in __thr_umutex_lock () from remote:27.elf
#9  0xf000efc3 in _mutex_lock_common () from remote:17.elf
#10 0xf000f0fa in __pthread_mutex_lock () from remote:17.elf
#11 0xf0011a60 in pthread_mutex_lock_exp () from remote:33.elf
#12 0xf00000e3 in first_thread (thrdata=0x0) at dlock.c:31
#13 0xf000d3c5 in _thread_start () from remote:7.elf
#14 0xf00015ce in _thread_run () from remote:2.elf
#15 0x00000000 in ?? ()

The backtrace shows that this thread has called __pthread_mutex_lock, so we know that it is in the process of attempting to lock a mutex. Since the thread is currently in avm2_msleep, we can tell that it is waiting for another thread to release the mutex. Frame #12 shows that we got here from line 31 in our program:

pthread_mutex_lock(&second_lock);

In order to look at the stack trace for thread #3, we need to allow it to acquire the synchronization mutex. Once we do that, it will run until it encounters a debugline and then give control to the debugger. While threads normally do not hold the synchronization mutex for a long time, the mutex is currently held by thread #2, which is suspended in the debugger. Resuming thread #2 will allow it to release the mutex and give thread #3 a chance to give control to the debugger:

(gdb) c&
Continuing.
(gdb) 
[Worker 5] #3 stopped.
0x00000000 in ?? ()

Now that thread #3 is stopped, we can examine its stack trace:

(gdb) thread 3
[Switching to thread 3 (Worker 5)]
#0  0x00000000 in ?? ()
(gdb) bt
#0  0x00000000 in ?? ()
#1  0xf0001c0c in avm2_msleep () from remote:2.elf
#2  0xf0001fa3 in kmsleep () from remote:2.elf
#3  0xf001507b in __do_lock_umutex () from remote:49.elf
#4  0xf001575f in _do_lock_umutex () from remote:49.elf
#5  0xf0014d8b in ___umtx_op_wait_umutex () from remote:49.elf
#6  0xf0014423 in k_umtx_op () from remote:49.elf
#7  0xf0000ae0 in _umtx_op () from remote:2.elf
#8  0xf0010e69 in __thr_umutex_lock () from remote:27.elf
#9  0xf000efc3 in _mutex_lock_common () from remote:17.elf
#10 0xf000f0fa in __pthread_mutex_lock () from remote:17.elf
#11 0xf0011a60 in pthread_mutex_lock_exp () from remote:33.elf
#12 0xf0000080 in second_thread (thrdata=0x0) at dlock.c:17
#13 0xf000d3c5 in _thread_start () from remote:7.elf
#14 0xf00015ce in _thread_run () from remote:2.elf
#15 0x00000000 in ?? ()

As expected, this thread is also waiting for a mutex, this time on line 17 of our test program:

pthread_mutex_lock(&first_lock);

In our test program, it’s easy to see that the root cause of the deadlock is faulty lock ordering. If both threads acquired the locks in the same order, the deadlock would not be possible. These sorts of problems aren’t as easy to spot by just examining the code in more complex programs, though. In such cases, gdb’s thread debugging features can be invaluable.
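
As a hedged sketch of the fix, the following variation has every thread take first_lock before second_lock. It is not the original test program: the worker function ordered_worker, the driver run_ordered_workers, and the shared_counter variable are hypothetical names added for illustration, and the code assumes a standard POSIX pthread library.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t first_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t second_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical shared state, touched only while both locks are held. */
static int shared_counter = 0;

/* Every thread runs this function, so every thread locks in the same order. */
static void *ordered_worker(void *thrdata)
{
    (void)thrdata;
    for (int i = 0; i < 1000; i++) {
        pthread_mutex_lock(&first_lock);    /* always first */
        pthread_mutex_lock(&second_lock);   /* always second */
        shared_counter++;
        pthread_mutex_unlock(&second_lock); /* release in reverse order */
        pthread_mutex_unlock(&first_lock);
    }
    return NULL;
}

/* Starts two workers, waits for both, and returns the final counter value. */
int run_ordered_workers(void)
{
    pthread_t a, b;
    int err;

    shared_counter = 0;
    err = pthread_create(&a, NULL, ordered_worker, NULL);
    assert(err == 0);
    err = pthread_create(&b, NULL, ordered_worker, NULL);
    assert(err == 0);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return shared_counter;
}
```

With a single agreed order, a thread that holds second_lock must already hold first_lock, so no thread can be waiting on first_lock while holding second_lock, and the circular wait from the example cannot form. Both workers run to completion and run_ordered_workers() returns 2000.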

Notes


[1] The debugger can also suspend a program when it throws an exception, but that case is not relevant to this discussion.