Interview With the Creator of the Automatic Speech Alignment Feature in Audition CS6
Brian King is a PhD candidate in Electrical Engineering, and for the last several summers he’s worked as an intern in Adobe’s Creative Technologies Lab, an internal “think tank” where some of the technology that makes our products work like magic is developed. Brian’s research led to the hugely popular Automatic Speech Alignment feature, also known as Rubbadub, which he initially previewed at Adobe’s MAX conference in 2011.
To use the feature, you need two selected clips in a Multitrack session. Choose Clip > Automatic Speech Alignment, confirm the settings in the dialog box (below), and click OK. As if by magic, the replacement audio lines up perfectly with the original.
In this interview, Brian talks about how the technology underpinning this feature works.
Ellen: Brian, what got you interested in your area of research, digital signal processing?
Brian: I started playing guitar when I was in second grade, and I’ve always loved music and technology. Digital signal processing really fuses my two passions.
Ellen: Since you’re almost finished with your PhD, you must be writing a thesis. What’s the topic?
Brian: It’s on single channel source separation. Let’s say you have a number of people talking in a room at the same time—you can use this algorithm to extract individual voices.
Ellen: That sounds useful!
Brian: Yeah, definitely.
Ellen: You’ve been working on features in Audition for a couple of releases now; the big one for CS6 was Automatic Speech Alignment. How did you get started solving that problem?
Brian: We wanted to make ADR easier. Let’s say we’re shooting a hit movie and we get a great take in terms of performance, but the audio is really noisy. What studios typically do is bring the actors back into the studio and have them overdub their lines, then manually align the new audio with the original performance—it’s a process known as automatic dialogue replacement.
As you can imagine, that can be a really tedious process: if the actors don’t get their timing exactly right, the new audio isn’t going to line up with the original video, and it’s going to look like an old Godzilla film or kung-fu movie.
Ellen: Because the lip movements don’t match what you’re hearing?
Brian: Exactly. Typically the actors have to record each phrase of dialogue dozens of times to get the timing as close to the original performance as possible, and then a poor studio engineer has to take all of these multiple takes and somehow find the best snippets to fuse into a coherent phrase, which is both time-consuming and kind of tedious.
So given a noisy, original dialogue recording that was taken out in the field and a high quality overdub that has no noise but really bad timing, I developed a method for automatically stretching and compressing every instant of the overdub so that it matches the timing of the original recording.
Ellen: You say every instant—how much time is that?
Brian: Right now it’s set to about every 15 milliseconds, so it’s small enough that it’s imperceptible. I think (laughter). A typical video frame corresponds to about 33 milliseconds, so 15 milliseconds is approximately one-half of a video frame.
Ellen: How do you identify the signal that matters in the original recording with all of the ambient noise?
Brian: What I’ve been working on for my PhD is source separation, but for this feature I’m using a matrix factorization method, where you learn the important features, or spectral characteristics, of the audio.
Ellen: So you build a fingerprint of the existing audio first?
Brian: What we do is identify specific spectral characteristics of the original dialogue. For example, if I say “Hello,” and you say “Hello,” a human will understand we are both saying the same words, but we can tell if it’s me saying it vs. you saying it, because my voice has a lower pitch, or maybe I have slightly different timing or other vocal characteristics.
So this feature learns the vocal characteristics of the speaker or the actor from the overdub clip, and it also learns the timing of that clip. Then when we compare it to the noisy clip, it knows what to look for. It finds the same features in the noisy clip, and for everything else that doesn’t match it says, “Oh, well that’s noise. Let’s ignore that.”
Remember Highlights magazine, which you used to find in dentists’ offices when you were a kid? You’d see this picture, and there’d be all of these hidden things in it. That’s kind of what the method I developed is doing: the overdub recording shows you exactly what you’re looking for, like the hat in the picture. Then when you get to the noisy recording, it’s as if all of those objects are still in there but hidden among all sorts of other things. Since you know what you’re looking for, you’re able to find it easily and accurately.
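Adobe hasn’t published the details of the feature, but the idea Brian sketches, learning spectral templates from the clean overdub and then looking for them in the noisy clip, maps naturally onto non-negative matrix factorization (NMF) of magnitude spectrograms. Here is a minimal NumPy sketch assuming the standard multiplicative-update rules; the function names are illustrative, not Audition’s:

```python
import numpy as np

def nmf(V, rank, n_iter=200, seed=0):
    """Factor a nonnegative matrix V (frequency bins x time frames)
    as V ~ W @ H using multiplicative updates. W holds spectral
    templates; H holds their activations over time."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + 1e-3
    H = rng.random((rank, V.shape[1])) + 1e-3
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

def activations(V, W, n_iter=200, seed=0):
    """Hold the templates W (learned from the clean overdub) fixed
    and solve only for when they occur in a noisy spectrogram V."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + 1e-3
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    return H
```

In these terms, `W` fitted on the overdub’s spectrogram plays the role of the fingerprint: running `activations` on the noisy clip lights up wherever the matching speech occurs, and energy the fixed templates can’t explain is effectively treated as noise and ignored.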
Ellen: Once the pieces that match are identified, what happens?
Brian: That’s taken care of by a dynamic time warping algorithm, so that’s fairly simple.
Ellen: A dynamic time warping algorithm? That doesn’t sound that simple!
Brian: Well, it has three stages. The first finds the important features: “These frequency characteristics at this point match these characteristics at this other point.”
Then, it figures out how to align those two points temporally. If the overdub is faster than the original, you want to stretch the overdub to slow it down, but if a word in the overdub took twice as long to say as that same word in the original reference recording, then you would compress it instead.
Then the third part of it is synthesis. Once we know how to stretch or compress each instant, we simply synthesize a new signal.
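The alignment stage Brian describes is classic dynamic time warping: build a frame-to-frame cost matrix, find the cheapest monotonic path through it, and read the stretch-or-compress decisions off that path. A toy NumPy sketch (not Audition’s implementation; in practice the feature frames would come from the 15-millisecond spectral analysis described earlier):

```python
import numpy as np

def dtw_path(ref, dub):
    """Dynamic time warping between two feature sequences of shape
    (frames, features). Returns the lowest-cost monotonic alignment
    path as a list of (ref_frame, dub_frame) pairs."""
    n, m = len(ref), len(dub)
    # Pairwise Euclidean distance between every ref frame and dub frame.
    cost = np.linalg.norm(ref[:, None, :] - dub[None, :, :], axis=2)
    # D[i, j] = cheapest cost of aligning ref[:i] with dub[:j].
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(
                D[i - 1, j - 1],  # both advance one frame
                D[i - 1, j],      # dub frame reused -> stretch dub here
                D[i, j - 1],      # dub frames squeezed -> compress dub here
            )
    # Backtrack from the end to recover the optimal path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Each `(i, j)` pair on the path says “overdub frame j belongs at original frame i”: where one overdub frame maps to several original frames it gets stretched, and where several overdub frames map to one original frame they get compressed. The synthesis stage then time-stretches each instant accordingly, for example with a phase vocoder.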
Ellen: Any tips for working with different clip lengths?
Brian: The best way to use this tool is to synchronize two clips at a time, and to work with relatively short, phrase-length snippets that have natural start and end points. You definitely wouldn’t want to take a two-hour movie and then try to align two hours of overdubbed dialogue.
Ellen: What’s the optimal length?
Brian: Anywhere from a few words to a paragraph. In most spoken dialogue there are natural breaks that are really easy to identify. “Here’s the beginning of this piece of dialogue. Here’s the end.” Just use those natural breaks. You don’t have to go mid-sentence or anything.
Ellen: So if you have a long clip, your recommendation would be to break the original noisy clip into smaller segments and then replace each small piece of dialogue? And what happens if the clips aren’t roughly the same length? Say you’ve got a 12-second original piece and a two-second replacement clip for a phrase or two of dialogue in the middle of that longer piece.
Brian: It’s all or nothing at the moment, so if you try to align a very large clip and it doesn’t work, you can try breaking it up into smaller pieces and playing around with the parameters.
Ellen: What happens if the words don’t match?
Brian: That’s an interesting question. It tries to align them as best it can. We’ve seen situations where it’s the same actor and the same dialogue, but maybe in the original he said, “This’ll be great,” and then in the overdub he says, “This will be great.” There are no frames in the original corresponding to that “wi” sound, so the algorithm essentially says, “All right, we’re going to take the ‘wi’ in the overdub and make it as fast as possible,” shrinking it to almost nothing so the rest of the alignment can happen as well as possible. In that situation it can sound kind of funny: you’ll hear a natural cadence, then this really short piece, and think, “What the heck happened there?” Often in those cases, Audition is trying to match something that wasn’t there in the first place.
Ellen: So this feature doesn’t do anything to help with overdubbing in a completely different language?
Brian: Not yet! But we’ve been looking at it.
Ellen: Are there any uses for this feature that may not be immediately apparent?
Brian: I know some people have used it for aligning different music tracks. I experimented a little bit with just aligning vocals. For example, I overdubbed myself singing a couple of different rap parts, and it was able to align those really well.
We also aligned some different guitar parts that should have the same timing. Instruments that have strong attacks like guitars, cymbals, and hi-hats work really well. Softer stuff didn’t really work that well, but it could work for aligning backing vocals in some cases—it would really depend on the characteristics of the sound.
We chatted with Brian about a number of other things he’s been working on—but unfortunately, we can’t talk about them yet! Between the fabulous developers who work on Audition and our internal research teams, we’ve got a great roadmap for Audition. But we’d love to hear from you about difficult and time-consuming problems you think our research teams might be able to help solve—so please, leave your comments or get in touch by submitting a feature request.
Audition is part of the Adobe Creative Cloud, so you can subscribe once and get access to Audition CS6, Photoshop CS6, and all of Adobe’s other creative tools.