What is VCV?

VCV is also known as 連続音 (renzokuon), or triphones, because of its "stringed" nature: each sample is one vowel, followed by a consonant and one more vowel (a diphone). This is what sets VCV apart from CV, and it is why VCV can produce a much smoother, more realistic sound. VCV was the second recording method to come around, and it is also characterized by the use of multiple pitches to obtain even more realism.

Using VCV

VCV is relatively easy to use, but fitting the .ust to the UTAU is a must in order to prevent slurred syllables; you can read more about fitting a .ust to an UTAU here. If you're new to UTAU, I'd recommend spending time on CV and getting comfortable before moving on to VCV.
We'll be using the same .ust from the CV guide, Frog Song / カエルの歌 (kaeru no uta).

We can see that this .ust is in CV format because there's one character per note. VCV format is different, as can be seen below. Converting the .ust to VCV is a must, or the UTAU won't be able to sing it (unless CV is also configured into the voicebank, which more and more UTAU users do). Even then, you wouldn't get the smooth, realistic sound you get with VCV, so just convert it.

To convert a .ust to VCV format you can either have UTAU automatically convert it, or use a plugin to convert. We'll go through both methods here.

Converting through UTAU is simple: click the "A" button at the bottom left of the screen, and UTAU will read the voicebank and convert the notes. The only caveat is that this feature is included only in Shareware UTAU.

Converting a .ust with a plugin is still easy and simple, though. Press Ctrl+A to select all the notes, open the Tools menu, hover over Plugin, and then click the plugin. Be sure to remember to fit the .ust to the voicebank.

"歌詞を連続音にする" means "To make lyrics VCV"

The plugin, often called a "Diphonizer", "Converter", or "CV to VCV", converts the notes directly instead of trying to read the voicebank like the shareware method does. The result looks slightly different, and I actually happen to like it better.

We can see that it's formatted as [- か][a え][e る][u の][o う][u た][a が]. This is because each note has the vowel from the previous note placed at the front of it, except for the first note, which can have either blank space or a breath before it. You'll also notice that it's in hiragana; almost all VCV voicebanks are aliased in hiragana, and personally I don't recommend romaji-aliased VCV. More info on aliasing here.
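The rule above (each note takes the previous note's vowel as a prefix) can be sketched in a few lines of Python. This is only my own illustration of the idea, not how UTAU or any particular plugin does it: the names `VOWEL_ROWS`, `trailing_vowel`, and `cv_to_vcv` are made up for this example, the kana table covers only basic hiragana, and I'm assuming a rest (`R`) resets the prefix to `-`.

```python
# Hiragana grouped by the vowel they end in. Only basic kana are listed,
# which is an assumption made to keep this sketch short.
VOWEL_ROWS = {
    "a": "あかさたなはまやらわがざだばぱゃ",
    "i": "いきしちにひみりぎじぢびぴ",
    "u": "うくすつぬふむゆるぐずづぶぷゅ",
    "e": "えけせてねへめれげぜでべぺ",
    "o": "おこそとのほもよろをごぞどぼぽょ",
    "n": "ん",
}
KANA_TO_VOWEL = {kana: vowel for vowel, row in VOWEL_ROWS.items() for kana in row}


def trailing_vowel(lyric):
    """Return the vowel a CV lyric ends in, judged by its last kana."""
    return KANA_TO_VOWEL[lyric[-1]]


def cv_to_vcv(lyrics):
    """Prefix each CV lyric with the previous note's vowel ("-" at the start)."""
    converted = []
    previous = "-"
    for lyric in lyrics:
        if lyric == "R":          # rest: keep it, and start fresh afterwards
            converted.append(lyric)
            previous = "-"
            continue
        converted.append(f"{previous} {lyric}")
        previous = trailing_vowel(lyric)
    return converted


print(cv_to_vcv(["か", "え", "る", "の", "う", "た", "が"]))
# ['- か', 'a え', 'e る', 'u の', 'o う', 'u た', 'a が']
```

A real converter also has to deal with lyrics this table doesn't know about, so treat this as a picture of the rule rather than a replacement for the plugin.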

You can hear how much smoother VCV is than CV.

Recording VCV

Because VCV is triphonic, you record "strings" of syllables. This is done to get triphonic combinations for each syllable, meaning that there are seven different versions of one syllable, each preceded by either a blank space or one of a, i, u, e, o, or n. This is what makes VCV smooth and realistic, and why it takes longer to record than other methods. However, VCV is still very easy to record, and it doesn't take much time at all once you're used to it.

example: かかきかく ( ka ka ki ka ku )

In this example, we can see that multiple syllables are recorded in one sample. Because VCV is triphonic, this sample is only used for certain instances of these syllables: this recording makes up the [- か][a か][a き][i か][a く] sounds in a voicebank.

This example is also considered to be "5 mora", as it is made up of five syllables. While most VCV voicebanks are either 5 or 7 mora, you can find voicebanks recorded in practically any number of mora.
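To make the "seven versions per syllable" idea concrete, here is a small Python sketch that splits a recording string into its aliases, counts the mora, and lists which of the seven versions of a syllable are still missing. The function names are hypothetical, and I'm using romaji purely so the vowel can be read off the last letter; an actual voicebank would normally use hiragana aliases.

```python
# The seven possible prefixes: blank space ("-") plus a, i, u, e, o and n.
PREFIXES = ["-", "a", "i", "u", "e", "o", "n"]


def string_to_aliases(recording):
    """Turn a space-separated romaji recording string into its VCV aliases."""
    aliases = []
    previous = "-"
    for syllable in recording.split():
        aliases.append(f"{previous} {syllable}")
        previous = syllable[-1]   # last letter is the vowel (or "n")
    return aliases


def missing_variants(syllable, recordings):
    """List the versions of `syllable` that none of the recordings provide."""
    have = {alias for rec in recordings for alias in string_to_aliases(rec)}
    wanted = [f"{prefix} {syllable}" for prefix in PREFIXES]
    return [alias for alias in wanted if alias not in have]


aliases = string_to_aliases("ka ka ki ka ku")
print(len(aliases), aliases)
# 5 ['- ka', 'a ka', 'a ki', 'i ka', 'a ku']
print(missing_variants("ka", ["ka ka ki ka ku"]))
# ['u ka', 'e ka', 'o ka', 'n ka']
```

So this one 5 mora sample gives three of the seven versions of "ka"; the rest have to come from other strings in the recording list.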


OTO'ing VCV is remarkably simple once you have figured out what values to use; it's literally drag and drop after that point. "Values" here means the millisecond numbers for the Consonant, Cutoff, Preutterance, and Overlap. The settings below are an example of values that can be used. These values can also be chosen based on the recording tempo, so that they better fit the UTAU's samples and make oto'ing easier (see: Setparam).
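As a rough illustration of the tempo idea, the Python sketch below writes oto.ini-style lines for one recording string. It assumes the string was recorded to a metronome with one mora per beat, so each entry's offset is just one beat later than the last, while the same four values are reused for every entry. The file name, the first offset, and all of the msec numbers are placeholders I picked for the example, not recommended settings. The field order (`wav=alias,offset,consonant,cutoff,preutterance,overlap`) is the oto.ini layout as I understand it, with a negative cutoff measured from the offset.

```python
def beat_ms(bpm):
    """Length of one beat in milliseconds at the given tempo."""
    return 60000 / bpm


def oto_lines(wav, aliases, bpm, first_offset_ms,
              consonant, cutoff, preutterance, overlap):
    """Build one oto.ini line per alias, shifting the offset by one beat each."""
    lines = []
    for index, alias in enumerate(aliases):
        offset = round(first_offset_ms + index * beat_ms(bpm))
        lines.append(
            f"{wav}={alias},{offset},{consonant},{cutoff},{preutterance},{overlap}"
        )
    return lines


# Hypothetical file name and placeholder values, for illustration only.
for line in oto_lines("_かかきかく.wav",
                      ["- か", "a か", "a き", "i か", "a く"],
                      bpm=120, first_offset_ms=500,
                      consonant=150, cutoff=-450,
                      preutterance=250, overlap=80):
    print(line)
```

Lines generated like this are only a starting point; each entry still needs to be checked by eye and nudged into place, since nobody records perfectly on the beat.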

Listed below are examples of how to OTO VCV, but please refer to the OTO'ing page if you would like more information.

Starting sound (labeled "- sound" or "sound"): Consonant goes after the consonant, and Overlap is placed on the silence or breath.

Strung sound (labeled "vowel sound", where the sound is a diphone/consonant-vowel): Consonant goes after the consonant, and Overlap is placed towards the end of the vowel from the previous note. This is mainly what VCV voicebanks are made up of.

Ending sound (labeled "vowel R", "vowel ・", "vowel_hh", etc.): Usually an ending vowel followed by either blank space or air. These are extras, but they are widely used and included in VCV (well, actually in all types of voicebanks), adding more depth and realism to the singing.