Separation

The VAUX format models spoken content and speaker identity separately, which allows us to isolate one speaker from another, even when they’re speaking simultaneously. We began looking into this in October and found that we could take an audio file of two people speaking over one another and tease out the separate utterances after the fact. Some samples are below.

[Audio samples, Examples 1 to 3. Each example has three clips: Input, Speaker 1, Speaker 2.]

Multi Language

Because our AI thinks about speech differently from humans, we were curious to see how well it could handle languages other than English, which is what it has primarily been trained on. The samples below are the result of only brief exposure to new languages.

[Audio samples, Examples 1 to 3. Each example has four clips: Input, Reconstruction, Sample of Target, Output.]

Compression

The VAUX format currently achieves over 60x compression and, unlike other formats, it does so without compromising on reconstruction quality. The examples below let you compare the input with the output. What’s not obvious is that, before it gets rebuilt by our decoder, the file is 64x smaller than it was to begin with. This is exciting to think about in the context of not just storage but also transmission!
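To put the 64x figure in perspective, here is a rough back-of-the-envelope sketch. The input format (one minute of 16 kHz, 16-bit mono PCM) and the 64 kbit/s link are assumptions chosen purely for illustration; they are not details of the VAUX format, and the function names are hypothetical.

```python
def pcm_size_bytes(duration_s: float,
                   sample_rate_hz: int = 16_000,
                   bytes_per_sample: int = 2,
                   channels: int = 1) -> int:
    """Size of uncompressed PCM audio (assumed input format, not VAUX-specific)."""
    return int(duration_s * sample_rate_hz * bytes_per_sample * channels)


def compressed_size_bytes(raw_bytes: int, ratio: float = 64.0) -> float:
    """Size after compression at the given ratio (64x is the figure quoted above)."""
    return raw_bytes / ratio


def transfer_time_s(size_bytes: float, link_bits_per_s: float) -> float:
    """Time to send a payload over a link of the given speed, ignoring overhead."""
    return size_bytes * 8 / link_bits_per_s


raw = pcm_size_bytes(60)            # one minute of assumed 16 kHz, 16-bit mono audio
small = compressed_size_bytes(raw)  # the same minute at 64x compression
link = 64_000                       # hypothetical 64 kbit/s connection

print(f"raw: {raw} bytes, {transfer_time_s(raw, link):.2f} s to send")
print(f"compressed: {small:.0f} bytes, {transfer_time_s(small, link):.2f} s to send")
```

Under these assumptions, a minute of audio shrinks from about 1.9 MB to 30 kB, and the time to send it over the hypothetical link drops from four minutes to under four seconds.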

[Audio samples, Examples 1 to 3. Each example has two clips: Input, Output.]

Transfer

As mentioned above, the VAUX format separates identity information from information about the contents of an utterance. This means we can take an original speaker and their original speech, and transfer the contents of that speech to the voice of a new speaker.
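Conceptually, the idea can be sketched as swapping one part of an encoded utterance while keeping the other. The snippet below is only an illustration: the `EncodedUtterance` type, its two fields, and the `transfer_voice` function are hypothetical names, and the actual VAUX representation and decoder are not described here.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class EncodedUtterance:
    """Hypothetical encoded form: content and identity stored separately."""
    content: Tuple[float, ...]   # stand-in for "what was said"
    identity: Tuple[float, ...]  # stand-in for "who said it"


def transfer_voice(source: EncodedUtterance,
                   target: EncodedUtterance) -> EncodedUtterance:
    """Keep the source's content, but take the identity from the target sample."""
    return replace(source, identity=target.identity)


# Toy values only; real encodings would come from an encoder.
original = EncodedUtterance(content=(0.1, 0.2, 0.3), identity=(1.0, 0.0))
target_sample = EncodedUtterance(content=(0.9, 0.8), identity=(0.0, 1.0))

converted = transfer_voice(original, target_sample)
print(converted)
```

In this sketch the converted utterance carries the original content with the target's identity; a decoder would then turn it back into audio.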

[Audio samples, Examples 1 to 3. Each example has three clips: Input, Sample of Target, Output.]

Copyright © 2019 VAUX - All Rights Reserved