Valve finally released an updated SDK allowing the public to mod Team Fortress 2, Portal, and Half-Life 2: Episode 2 on November 7, 2007. As soon as the update came out, I gleefully booted up FacePoser to do some lip syncing for a machinima we're working on. Unfortunately, I ran into a nasty bug that halted my progress fairly quickly: the dreaded extraction error. It took me hours of poking and prodding to figure out how to get around that error, and I thought some of the other new blood in the community might like a tutorial to ease the pain. In the end, I managed to produce some pretty spectacular results, and hopefully you'll manage the same.
In FacePoser, extraction is what happens when the program attempts to automatically detect the phonetic structure of the sound file. It then plugs those detected phonemes into the file, resulting in the character's lips syncing to the audio.
If it works, extraction can save a lot of work. Even then, there are a lot of instances in which the extraction will produce less than desirable results. For instance, if the voice actor is not speaking clearly or is speaking very rapidly, extraction could result in unsynchronized lip movement or fail altogether.
As of November 11, 2007, extraction does not work at all in Windows Vista. I don't know for sure why that is, but I suspect it's because extraction requires Microsoft's Speech API 5.1, while Vista comes preloaded with Speech API 5.3. I might be off base with this, so if anyone can definitively tell me why extraction fails in Vista, I'll update the tutorial.
Rather than rely on FacePoser's automatic extraction, it's possible (and often desirable) to manually add phonemes to the file. This is extremely time consuming and can be tedious. To manually add phonemes for two and a half minutes of a rapidly delivered monologue, I put in approximately 10 hours of work. That said, the results are great, and for a lot of projects I think you'll find the end product is worth the effort.
Lip syncing without the use of automatic extraction requires four main steps:
- Sound File Preparation
- Transcribing the Script from the Audio
- Matching the Script to the Waveform
- Inputting Phonemes
Sound File Preparation
After you have your source material (for example, a monologue recorded by an actor) you'll need to prepare the sound file for use in FacePoser. Using an audio editor such as Audacity or Sound Forge, chop the whole thing into pieces no longer than six or seven seconds. Shorter files are required for automatic extraction to work; we're not using extraction, but it's still easier to process shorter sound files. The shorter the file, the easier the segment will be to work with. If possible, break the audio up between sentences. Failing that, break between words. Never chop the audio in the middle of a word, as that'll make it next to impossible to get the syncing right.
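If you'd rather script the chopping than do it by hand in an audio editor, here is a minimal Python sketch using only the standard library. It is not part of FacePoser or the SDK. It assumes you have already listened to the clip and picked cut points (in seconds) that fall between sentences or words, and that they lie inside the clip. The function names and the seven second limit constant are my own, made up for illustration.

```python
import wave

MAX_SEGMENT_SECONDS = 7.0  # upper bound suggested above; shorter is easier


def plan_segments(total_seconds, cut_points):
    """Turn hand-picked cut points (in seconds) into (start, end) pairs."""
    bounds = [0.0] + sorted(cut_points) + [total_seconds]
    return list(zip(bounds[:-1], bounds[1:]))


def too_long(segments, limit=MAX_SEGMENT_SECONDS):
    """Return the segments that run longer than the limit."""
    return [seg for seg in segments if seg[1] - seg[0] > limit]


def split_wav(src_path, cut_points, out_prefix):
    """Write one WAV per segment and return the list of files written."""
    written = []
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        total = src.getnframes() / rate
        params = src.getparams()
        segments = plan_segments(total, cut_points)
        for i, (start, end) in enumerate(segments, 1):
            first = int(start * rate)
            last = int(end * rate)
            src.setpos(first)
            frames = src.readframes(last - first)
            out_path = f"{out_prefix}_{i:02d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Same channels, sample width and rate as the source;
                # the frame count in the header is fixed up on close.
                dst.setparams(params)
                dst.writeframes(frames)
            written.append(out_path)
    return written
```

Run `too_long(plan_segments(...))` first to see whether any piece still runs past the limit and needs another cut. The script only cuts where you tell it to, so it is still up to you to keep the cut points out of the middle of a word.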
Important note: The resulting audio files must be 22050bps 11/22/44kHz 8/16 Bit PCM WAV. Always save your files in that format. (Source: Creating your first Faceposer scene) If you try another format, it won't work. For voice work, mono is generally better than stereo.
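A quick way to catch a badly saved file before FacePoser does is to inspect its header. The sketch below uses Python's standard `wave` module and is only a rough check, not an official validator. I'm assuming that "11/22/44kHz" means sample rates of 11025, 22050, or 44100 Hz and that "8/16 Bit" means 1 or 2 bytes per sample. The function names are hypothetical.

```python
import wave

ALLOWED_RATES = {11025, 22050, 44100}  # assumed meaning of 11/22/44kHz
ALLOWED_WIDTHS = {1, 2}                # bytes per sample: 8-bit or 16-bit


def format_problems(nchannels, sampwidth, framerate):
    """Return a list of human-readable problems; empty means it looks fine."""
    problems = []
    if framerate not in ALLOWED_RATES:
        problems.append(f"sample rate {framerate} Hz is not 11/22/44 kHz")
    if sampwidth not in ALLOWED_WIDTHS:
        problems.append(f"{sampwidth * 8}-bit samples; expected 8 or 16 bit")
    if nchannels != 1:
        problems.append("not mono; mono is generally better for voice work")
    return problems


def check_wav(path):
    """Open a WAV file and report anything that does not match the rules."""
    try:
        with wave.open(path, "rb") as wav:
            return format_problems(
                wav.getnchannels(), wav.getsampwidth(), wav.getframerate()
            )
    except wave.Error as exc:
        # The wave module rejects files it cannot read as PCM WAV.
        return [f"not a readable PCM WAV: {exc}"]
```

Call `check_wav` on each file in your project folder and fix anything that comes back with a non-empty list. The mono check is only a recommendation, so treat that message as a warning rather than an error.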
Place your sound files in the .\sound directory for whatever game you're working on. It'll be easier to keep a handle on the project if you use a subfolder. For example, while working on a TF2 project called "The Heavy's Fury," my directory is:
C:\Program Files\steam\SteamApps\username\team fortress 2\tf\sound\HeavysFury\
Once you've got all of your files in place, you're ready to start FacePoser. If you would like the sound file used in the tutorial example, it can be found at (URL here). I recorded it using a USB microphone, so the quality isn't great and there's electronic interference, but it's good enough for our purposes.
Transcribing the Script from the Audio
Once you've opened FacePoser, you'll be looking at the Choreography workspace. First, load the model for the character you'll be syncing audio to:
File > Load Model
Models for games other than Half-Life 2 are usually located in the [ROOT]\models\player\ directory. In this example, we'll load the Heavy. You should see a series of tabs at the bottom of the window. These tabs give you access to all of the tools in the FacePoser package. Lip syncing is done in the Phoneme Editor, so double-click that tab to open it.
Now load the file you'll be working with by using the button at the bottom of the window. You'll be presented with the waveform of the audio file. Resize the Phoneme Editor to give you as much work area as you can get without covering the 3D View, which you'll need to see the lip syncing.
Among the many quirks that you'll have to get used to when working with FacePoser is that there are different, invisible context sensitive zones in the Phoneme Editor. What that means is that you'll get a different menu right-clicking at the top of the window versus the middle or bottom. Play around with it until you get the hang of it.
Now, play the sound file a couple of times using the play button at the bottom of the window or with the space bar. You'll notice that the sound quality is pretty terrible, but that's just artifacting within the editor and quality will be much better with the finished product.
Memorize the line of spoken dialog. It's very important that you know the line exactly, or else you'll screw up later. If you have a script it can be used to learn the line, but be wary of the actor ad libbing. One misplaced word or syllable and you'll have trouble later on. If you're using the example sound clip, the line is, "I am the heavy and I am furious."
Right-click near the top of the waveform area and select "Edit Sentence Text..." Enter the text of the clip exactly as it's spoken. Don't use any punctuation such as periods or commas. Apostrophes are okay if they'll help you with pronunciation. Don't type the whole word if the actor doesn't say the whole word. For example, if the actor says "Crush 'em!" don't type "crush them".
The script as you've typed it will appear above the waveform. At this point, you'll probably get the message "Last Extraction Result: an error has occurred during extraction." Ignore the message and go on to the next step.
Matching the Script to the Waveform
Next comes my least favorite part of the process. You see those words above the waveform? You have to move them around until they match up exactly (or as close to it as humanly possible) to the audio. If you're off by even a little bit, you make the last step that much more difficult. There are a couple of tools you'll need to do this.
First, see the green box with a time code in it at the top of the window? This box is your friend. You can drag it back and forth through the clip to find the exact moment where the actor has moved from one word to the next.
Second, at the bottom of the window there is a playback speed adjustment. Slowing playback down will help you tell more precisely if a word is lined up properly.
By Ctrl-left dragging the edge of a word, you can move the edge back and forth. By Shift-left dragging, you can drag the entire word back and forth. I use Ctrl-dragging a lot more, but use whatever works best for you.
Important Note: In this process, whenever you left-click a word, you'll select that word. If you then click a second word, that word will also be selected. If more than one word is selected, you'll be unable to move the edge of a word, and dragging a whole word will drag every selected word. This can seriously screw things up, so be prepared to make liberal use of undo. To unselect all words, press Esc.
Generally, this is how I work on this part. I drag the green time code marker slowly forward and back until I find the moment where two phonetic sounds are overlapping. For example, between the words "heavy" and "and", there is a precise moment where the /iy/ sound from "heavy" and the /ae/ sound from "and" overlap. Find that moment and leave the time code marker there. Next, Ctrl-drag the line between those two words until they line up with the time code marker. Continue this process until all of the words line up.
Note: There are two special circumstances. The first is when the boundary between words is indistinct, such as in our example between the words "I" and "am". In those cases, just do your best and estimate where the boundary is. Play it back several times until you're sure you have it in the right spot. The second special circumstance is when there is a substantial pause between two words. In that case, it would look pretty silly for the animation to seem to connect the words. To separate the words, left-click to select both words, right-click on them, and select "Separate Words". You can reverse the process by clicking "Merge Words".
Once all of the words are lined up, play the clip several times at varying speeds to ensure that everything lines up.
Important Note: Sometimes, the playback bugs out and will not play the clip at the correct speed, making it impossible to judge the accuracy of the lineup. Sometimes playing the clip repeatedly will fix the problem, but sometimes you'll have to just save and restart the program.
Once everything is as close to accurate as you can manage, go on to the next step.
Inputting Phonemes
At last! The fun part! At least, I think it's fun, but I'm a linguistics nerd.
Before you continue, you need to have a solid understanding of what phonemes are and how they work. A decent primer on phonemes can be found here or at Wikipedia. You really only need a basic understanding of phonetics to get good results, but the more you understand, the faster things will go. A solid understanding will also give you better results.
Select the first word of your sentence by left clicking it. Then right click and select "Add phoneme to 'word'...". This will bring up the Phoneme/Viseme Properties window. Here you can choose the phonemes that make up the word.
If you've never worked with phonemes before, this can be a bit confusing. As a friend said when I was showing him this, "I don't speak phoneme." Fortunately, Valve has provided a cheat sheet. As you mouse over the different phonemes, the bottom of the window will give you an example of how the phoneme sounds. For example, if you mouse over 'uw', the bottom of the window will read 'tOO: high backed rounded vowel'. This indicates that /uw/ is the OO sound, as in too, you, moo, etcetera.
The best way to get the right sounds is to say the word very slowly out loud. Break the word into its component parts to see what sounds you make when you say it. For example, let's take the word 'furious,' the most complex word in our example sentence. Pay attention to the consonant sounds you make and the vowel sounds. Written phonetically using Valve's phonetic system, it would look like this:
/f/ /y/ /er/ /iy/ /ah/ /s/
Say the word over and over while mousing over each sound to understand why it is written that way.
After you've entered the phonemes and pressed the OK button, the phonemes will appear below the waveform. Now you can press the play button to see your character lip-syncing to the audio! If you did step three well and matched up the words with the waveform accurately, you'll find that for the most part the phonemes will match up very well. In some cases where the actor draws one part of a word out longer than the others, however, you'll have to manually adjust the phonemes to match up more accurately. Do this the same way you moved the words around earlier: Ctrl-drag the edges of the phonemes to places where they best match up with the audio clip.
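To see why the word boundaries from step three matter so much, it helps to think of the phonemes as slices of the word's time span. The Python sketch below assumes the phonemes start out evenly spread across the word. That is my assumption for illustration, not confirmed FacePoser behavior, and the start and end times are made-up numbers rather than measurements from the example clip.

```python
def spread_phonemes(word_start, word_end, phonemes):
    """Divide a word's time span into equal slices, one per phoneme.

    Returns a list of (phoneme, start, end) tuples in seconds.
    """
    step = (word_end - word_start) / len(phonemes)
    return [
        (name, word_start + i * step, word_start + (i + 1) * step)
        for i, name in enumerate(phonemes)
    ]


# 'furious' as written out above; the timings are hypothetical.
FURIOUS = ["f", "y", "er", "iy", "ah", "s"]

for name, start, end in spread_phonemes(1.8, 2.4, FURIOUS):
    print(f"/{name}/  {start:.2f}s to {end:.2f}s")
```

If the word's edges are off, every slice inside it is off too. And if the actor holds one sound longer than the rest, equal slices can't be right, which is when you need to drag the phoneme edges by hand.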
There are a lot of things to look out for during this step, such as:
If you have more than one word in the sentence selected, the "Add phoneme to 'word'..." option will not appear in the right-click menu. Press Esc to quickly deselect all words. Alternatively, you can use the Tab key to move from one word to the next.
There are a few tricky sounds that will take some practice to accurately describe phonetically. Some examples include:
The long 'I' sound, as in fly, my, tie, etcetera, is not a single phoneme, unlike most vowel sounds. Instead, the sound is made up of two phonemes: /aa/ and /iy/. Say the two separately and slowly, and then together quickly, to understand why.
The letters 'th' in the English language can make two different sounds, one unvoiced and the other voiced. The unvoiced sound can be noted in the words 'theocracy' or 'Bethesda,' and is represented phonetically by the symbol /th/. The voiced sound can be heard in the words 'them' or 'there,' and is represented by the symbol /dh/. Don't confuse the two.
The [sil] symbol will help you to make the lip syncing more convincing in certain situations. When a long vowel sound is followed by a sharp consonant, such as in the word 'nap' or 'not,' a short period of silence falls between the two. Accurately modeling that silence will give you fantastic results. Overusing it will look dumb. Try to find a good balance.
Finally, the most important piece of advice I know to give is this: model the phonemes pronounced by the actor, not the phonemes you would use. Trust me, you'll catch yourself doing this wrong all the time. Continually play the file while inputting phonemes to help keep your work accurate.
Once you have all of your phonemes inputted for the entire line, play it several times to make sure everything looks good. Playing it backwards can help as well, as it lets you see where the mouth moves when you aren't expecting a certain result.
Once everything looks good, save the file. To use it, simply import the .wav file into the Choreography timeline! For more information on using the Choreography tool, check out RomeoJGuy's great tutorial here.
Thanks for checking out this tutorial! I hope it's been useful.
This tutorial was written in its entirety by Spencer Williams, AKA Devilturnip. If you have any questions or anything to add, shoot me an email at devilturnip-at-deadworkers.com.
For more Team Fortress 2 tips, news, and machinima info, check out the Dead Workers Party podcast Control Point at tf2.deadworkers.com!