By Lawson Hancock on September 9, 2009 5:59 PM
I'm on my way to the IBC convention, which starts this Friday and runs through next Tuesday at the RAI in Amsterdam. You can check out the latest versions of Soundbooth CS4 and Audition 3 at one of our demo stations in the Adobe booth, located in Hall 7 (7.H.23). Also, our evangelist Jason Levine will be giving an Audition presentation once a day in our theatre area. It's scheduled for the afternoon on the first few days, but times may vary, so check the daily schedule at the booth.
So if you're attending the show, please come by and say hello. I'd like to hear about your experiences with Audition, Soundbooth, or any other audio tools.
The whitepaper covers using Speech Search in Premiere Pro to transcribe the spoken words in a video file into keywords stored in the XMP metadata, which can then be exported to F4V or FLV using Adobe Media Encoder. The next step is to use Soundbooth CS4 to export the speech metadata to an XML file containing cue points that can be referenced in Flash. Flash developers can then build custom video players that use these cue points to trigger specific actions or events in ActionScript. The whitepaper also includes example ActionScript 3 code to do this.
In order to make full use of this speech metadata, it's important to understand what's in the XML generated from Soundbooth CS4.
When you select File > Export > Speech Transcription in Soundbooth CS4, an XML file is created containing cue points that conform to the Flash cue points exchange format (see example above). Each speaker and every word is stored as a Flash cue point. Each cue point stores its start time, measured in milliseconds, in its time element, and the actual word or the number of the speaker in its name element. Each cue point also carries a set of parameters, each stored as a name/value pair. The parameters are source, duration, and confidence.
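To make that structure concrete, here is a minimal Python sketch that reads such a file into plain records. The element names (FLVCoreCuePoints, CuePoint, Time, Type, Name, Parameters, Parameter, Value) are an assumption based on the general Flash cue point XML format, not copied from an actual Soundbooth export, and the sample words and numbers are invented. Check a real export before relying on any of it.

```python
import xml.etree.ElementTree as ET

# Invented sample. Element names are assumed from the Flash cue point
# XML format; a real Soundbooth CS4 export may differ in detail.
SAMPLE_XML = """<FLVCoreCuePoints>
  <CuePoint>
    <Time>0</Time>
    <Type>event</Type>
    <Name>0</Name>
    <Parameters>
      <Parameter><Name>source</Name><Value>segmentation</Value></Parameter>
      <Parameter><Name>duration</Name><Value>2500</Value></Parameter>
    </Parameters>
  </CuePoint>
  <CuePoint>
    <Time>120</Time>
    <Type>event</Type>
    <Name>hello</Name>
    <Parameters>
      <Parameter><Name>source</Name><Value>transcription</Value></Parameter>
      <Parameter><Name>duration</Name><Value>380</Value></Parameter>
      <Parameter><Name>confidence</Name><Value>87</Value></Parameter>
    </Parameters>
  </CuePoint>
</FLVCoreCuePoints>
"""


def parse_cue_points(xml_text):
    """Return one dict per cue point: start time, name, and parameters."""
    root = ET.fromstring(xml_text)
    cues = []
    for cue in root.iter("CuePoint"):
        # Collect the name/value pairs (source, duration, confidence).
        params = {
            p.findtext("Name"): p.findtext("Value")
            for p in cue.iter("Parameter")
        }
        cues.append({
            "time_ms": int(cue.findtext("Time")),  # start time in milliseconds
            "name": cue.findtext("Name"),          # the word or speaker number
            "params": params,                      # values are kept as strings
        })
    return cues


if __name__ == "__main__":
    for cue in parse_cue_points(SAMPLE_XML):
        print(cue["time_ms"], cue["name"], cue["params"])
```

The parameter values are left as strings here so that nothing is lost if an export contains a value this sketch does not expect.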
source indicates whether this cue point represents a speaker (numbered 0, 1, 2, 3, etc.) or a word. It takes one of two values: segmentation for a speaker number, or transcription for a word.
duration is the length of time, in milliseconds, that the speaker spoke or that the particular word lasted.
confidence measures how confident (from -1 to 100) the transcription engine is that the particular word is correct. The higher the value, the more confident the engine is that the word is accurate; lower values mean less confidence. The special value -1 indicates that the user has manually edited the word.
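As a rough illustration of how the three parameters work together, the Python sketch below groups words under the speaker whose time span contains them and drops words the engine was unsure about. The Cue record, the sample values, and the threshold of 60 are all hypothetical choices made for this example. The only rules taken from the descriptions above are the two source values, the millisecond durations, and the meaning of -1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cue:
    time_ms: int                      # start time in milliseconds
    name: str                         # the word, or the speaker number
    source: str                       # "segmentation" or "transcription"
    duration_ms: int                  # how long the speaker or word lasted
    confidence: Optional[int] = None  # -1 to 100 for words


# Invented sample data for illustration only.
CUES = [
    Cue(0, "0", "segmentation", 2000),
    Cue(100, "hello", "transcription", 400, 90),
    Cue(600, "whirled", "transcription", 500, 35),
    Cue(2000, "1", "segmentation", 1500),
    Cue(2200, "thanks", "transcription", 450, -1),
]


def is_trusted(cue, threshold=60):
    """Trust a word if the user edited it (-1) or it meets the threshold."""
    if cue.confidence is None:
        return False
    return cue.confidence == -1 or cue.confidence >= threshold


def words_by_speaker(cues, threshold=60):
    """Map each speaker number to the trusted words spoken in their span."""
    speakers = [c for c in cues if c.source == "segmentation"]
    result = {s.name: [] for s in speakers}
    for word in cues:
        if word.source != "transcription" or not is_trusted(word, threshold):
            continue
        for s in speakers:
            # A word belongs to the speaker whose span contains its start.
            if s.time_ms <= word.time_ms < s.time_ms + s.duration_ms:
                result[s.name].append(word.name)
                break
    return result


if __name__ == "__main__":
    print(words_by_speaker(CUES))
```

Treating -1 as trusted reflects the idea that a word corrected by hand is probably more reliable than anything the engine guessed. A real player might choose a different threshold or keep low-confidence words for display only.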
With an understanding of the XML file that Soundbooth CS4 exports, you can take advantage of the speech metadata generated in Production Premium CS4 to create searchable video experiences on the web. To see this in action, we've included an example built using this workflow on Adobe.com: http://www.adobe.com/products/creativesuite/production/videosearch/
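For a sense of what "searchable" can mean in practice, here is a small Python sketch of the lookup a player might perform: an index from each spoken word to the times at which it starts. The function names and sample words are made up for this illustration. This is not how the Adobe.com example is implemented, only one plausible way to use the word cue points.

```python
from collections import defaultdict

# Invented (start time in ms, word) pairs, as they might be read
# from the transcription cue points.
WORDS = [(100, "Hello"), (600, "world"), (2200, "hello"), (2700, "again")]


def build_index(words):
    """Map each lowercased word to the list of times where it starts."""
    index = defaultdict(list)
    for start_ms, word in words:
        index[word.lower()].append(start_ms)
    return index


def search(index, query):
    """Return the start times (ms) for a query word, ignoring case."""
    return list(index.get(query.lower(), []))


if __name__ == "__main__":
    index = build_index(WORDS)
    print(search(index, "HELLO"))
```

A player could then seek to any of the returned times when the viewer picks a search result.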