Exploring AI Voice Clones: From Concept to Conversation

27.06.2024 · 9 min read
  • Felix Weiland
  • Jakob Friberg

Imagine having a conversation with an AI that sounds just like you, mimics your mannerisms, and understands your conversational style. With the technology that is widely available today, how close can we get to recreating a real conversation between two people? This intriguing question drove us to explore the possibilities of fine-tuning a language model on our own dialogues and combining it with a voice clone in natural conversations.

Introduction

In this blog post, we’ll take you through our journey of creating personalized AI clones that could engage in realistic conversations and replicate our voices with uncanny accuracy. From the challenges of data collection to the intricacies of fine-tuning large language models and generating lifelike voice clones, we’ll share our insights, successes, and the challenges we faced along the way.

Our goal was simple: to see how closely AI can emulate the nuances of our own one-on-one interactions. By the end of this post, you’ll hopefully leave with a deeper understanding of the potential and limitations of current AI technologies in creating personalized conversational agents. Look at the result in the video below before diving into how we brought these digital doppelgängers to life.

 

Data Collection

The initial challenge for any project like this is gathering the right data. We opted for the most straightforward approach: recording and transcribing a conversation between the two of us. Each transcript was saved with timestamps and tagged with the speaker’s identity, and all audio files were stored accordingly. Given our goal of capturing the tonality and style of a conversation, using an actual dialogue between the two of us seemed like a simple yet powerful strategy.

Our research indicated that we needed at least half an hour of high-quality audio and approximately 50 to 100 entries per person to ensure the models picked up on our speaking style well enough.

Capturing tonality is one thing, but another foundational part is the model’s understanding of who we are as people. Information about what we work with, where we live, and similar facts sets the stage for these models and gives them key information to use in the conversations. Without this information, the models are just guessing at all the facts.

We decided against creating a knowledge base for both of us since the concept of knowing specifics about someone could expand indefinitely. Knowing the basics is one thing, but what if they started talking about restaurants they like, someone's grandma, or what they were up to last weekend? This would probably also require some retrieval system to work well. Instead, we just went with the fundamental information about ourselves in the system prompt.
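As an illustration, a system prompt along the lines of the snippet below is enough to anchor the model. The details here are made-up placeholders to show the idea, not our actual prompt.

# Illustrative example of the kind of system prompt we used - the details
# below are placeholders, not our actual prompt.
system_prompt = (
    "You are Jakob. You work with data and AI at an app development company "
    "and you live in Stockholm. You are having a relaxed one-on-one conversation "
    "with your colleague Felix. Answer briefly and informally, the way you would "
    "speak out loud."
)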

We used high-quality microphones to achieve great audio quality, set up a small Python service to manage the recordings, and used FasterWhisper for the transcriptions. When everything was up and running, we just started talking like usual. Or maybe not quite, knowing that the way you talk will eventually be your AI does something to you…
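As a rough sketch of what the transcription step can look like with FasterWhisper, something like the following works, assuming one audio file per speaker. The file names and model size are placeholders rather than our exact setup.

import json
from faster_whisper import WhisperModel

# Model size, device and compute type are placeholders - pick what your hardware allows.
model = WhisperModel("large-v3", device="cpu", compute_type="int8")

def transcribe_speaker(audio_path, speaker):
    """Transcribe one speaker's recording, keeping timestamps and speaker identity."""
    segments, _info = model.transcribe(audio_path, beam_size=5)
    return [
        {"speaker": speaker, "start": seg.start, "end": seg.end, "text": seg.text.strip()}
        for seg in segments
    ]

entries = transcribe_speaker("felix.wav", "Felix") + transcribe_speaker("jakob.wav", "Jakob")
entries.sort(key=lambda e: e["start"])  # interleave the two speakers by time

with open("transcript.json", "w") as f:
    json.dump(entries, f, ensure_ascii=False, indent=2)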

Fine-tuning LLM

We decided to fine-tune a GPT-3.5 Instruct model due to its solid performance, OpenAI’s open-source Python SDK, and our experience fine-tuning OpenAI models for other projects.

For those of you who are new to fine-tuning, or who have not worked with LLM inference through APIs: whenever you send a request to the model, you pass a system message and a user message, and the model produces the assistant’s response. In fine-tuning, we also provide the assistant’s response in the series of messages. In that way, we replace the ‘vanilla’ model response with our own. Here is a simple example:

# Regular structure
messages = [
    {'role': 'system', 'content': system_prompt},
    {'role': 'user', 'content': (
        "Okay, so we need to wait between in between yeah, okay. "
        "Yeah, but I mean, I think you covered it really well. "
        "Data and AI we can start anywhere. "
        "It's so No, I'm just kidding."
    )},
]

# Example response from GPT-4:
# "Absolutely! Starting anywhere with data and AI is totally fine.
#  It's a broad field with so many possibilities. Let's dive in
#  whenever you're ready!"


# When fine-tuning we also provide the answer from
# the conversation transcript in the messages object
messages = [
    {'role': 'system', 'content': system_prompt},
    {'role': 'user', 'content': (
        "Okay, so we need to wait between in between yeah, okay. "
        "Yeah, but I mean, I think you covered it really well. "
        "Data and AI we can start anywhere. "
        "It's so No, I'm just kidding."
    )},
    {'role': 'assistant', 'content': (
        "Yeah, maybe RAG? What do you think about it?"
    )},
]

Another key thing is memory, in the sense of the models being aware of what the conversation has been about. It should be a fluid conversation that builds on earlier responses and proceeds accordingly. As these models are stateless - meaning that the model knows nothing about our previous calls whenever we make a new one - we have to construct the message history in every call to give it this awareness.

Due to this, managing conversation history was a key consideration when constructing the fine-tuning dataset. Since most of our phrases are only properly comprehended in the context of the previous ones, we needed to make sure that enough context was provided for the model to understand ‘why’ it should respond the way it does. Providing too little context can actually harm performance.

We also decided to create a separate fine-tuned model for each of us. Even though introducing more conversational content could be beneficial for shaping this kind of dialogue-style response, we wanted to avoid contaminating one model with the other person’s way of speaking. Tailoring a model to one person requires more resources, but experimenting with this seemed like a good solution.

For each of the fine-tuned models, we created a set of conversation objects, with one of us assigned the role of the assistant. The other person’s transcript served as the user message, and the response from the person we wanted to mimic became the assistant message. We also crafted a system prompt to give the model the correct setting and its role in the conversation. All parts of the conversation were recreated using a sliding window technique, where we looped over all messages and created objects with two to five messages added as history. So, any response could appear in many different places in the training data, both as the message to be trained on and as part of the conversation history. The idea was to tune the model for both longer and shorter conversations. In retrospect, this might not have been the best approach, as the model trains on the same response multiple times in different settings, which made it somewhat overfitted.
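To make the sliding window idea concrete, here is a simplified sketch of how such training examples could be assembled. The transcript snippet, window sizes, and the commented-out fine-tuning call (which assumes OpenAI’s v1 Python SDK and a fine-tunable GPT-3.5 model name) are illustrative rather than our exact implementation.

import json

# Time-ordered transcript entries, as produced in the transcription step
# (the lines below are illustrative placeholders).
entries = [
    {"speaker": "Felix", "text": "Okay, so where do we start? Data and AI, we can start anywhere."},
    {"speaker": "Jakob", "text": "Yeah, maybe RAG? What do you think about it?"},
    {"speaker": "Felix", "text": "Yeah, that could be interesting."},
    {"speaker": "Jakob", "text": "Okay, let's start there then."},
]
system_prompt = "You are Jakob, having an informal conversation with Felix about data and AI."

def build_examples(entries, system_prompt, assistant_name="Jakob",
                   min_window=2, max_window=5):
    """Sliding window: every reply by `assistant_name` becomes a training target,
    with the preceding messages included as conversation history."""
    examples = []
    for end, entry in enumerate(entries):
        if entry["speaker"] != assistant_name:
            continue  # only train on the person this model should mimic
        for window in range(min_window, max_window + 1):
            start = end - window + 1
            if start < 0:
                continue
            messages = [{"role": "system", "content": system_prompt}]
            for e in entries[start:end + 1]:
                role = "assistant" if e["speaker"] == assistant_name else "user"
                messages.append({"role": role, "content": e["text"]})
            examples.append({"messages": messages})
    return examples

with open("jakob_train.jsonl", "w") as f:
    for example in build_examples(entries, system_prompt):
        f.write(json.dumps(example, ensure_ascii=False) + "\n")

# Submitting the file for fine-tuning (assumes OpenAI's v1 Python SDK and a
# fine-tunable GPT-3.5 model name - check the current docs before relying on this):
# from openai import OpenAI
# client = OpenAI()
# training_file = client.files.create(file=open("jakob_train.jsonl", "rb"), purpose="fine-tune")
# client.fine_tuning.jobs.create(training_file=training_file.id, model="gpt-3.5-turbo")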

Creating Voice Clones

We used ElevenLabs to create our respective voice clones, as their model produces very high-quality output, latency is quite low, and they offer the option of making a “professional voice clone.” That is done by training a dedicated model on a larger voice dataset. They recommend providing at least 30 minutes and up to three hours of audio for the best result. We had half an hour each, and since it all came from the same kind of situation we wanted to recreate - a conversation - we thought we would have a good foundation to capture our conversational tonality.

Creating a voice using ElevenLabs is easy. You upload the recordings, add some high-level information about what kind of voice it is, and then wait. For their instant voice clone, results arrive in a matter of seconds; the professional voice clone takes a few hours.
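For reference, the upload can also be scripted against ElevenLabs’ public “add voice” endpoint rather than done through the web interface. The sketch below is based on our understanding of that API; the key, file name, and description are placeholders, so check the current documentation before relying on the exact fields.

import requests

ELEVENLABS_API_KEY = "YOUR_ELEVENLABS_API_KEY"  # placeholder

# Endpoint and field names based on ElevenLabs' public "add voice" API -
# double-check the current docs, as the API may have changed.
response = requests.post(
    "https://api.elevenlabs.io/v1/voices/add",
    headers={"xi-api-key": ELEVENLABS_API_KEY},
    data={
        "name": "Jakob clone",
        "description": "Conversational voice recorded in a dialogue setting.",  # illustrative
    },
    files=[("files", open("jakob.wav", "rb"))],
)
response.raise_for_status()
voice_id = response.json()["voice_id"]
print("Created voice:", voice_id)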

We had worked with voice clones before and had high expectations of the results. But hearing yourself with this level of precision was really strange. These voice clones are eerily similar to their real counterparts.

When we had the conversational models and our voice clones set up, all we needed to do was hook them up in a conversation with each other.

Simulating an Interaction

To simulate interaction between our AI voice clones, we set up two different instances of our models to talk with each other in the following way:

  1. Initialize Conversation Topic: We start by initializing a conversation with a specific theme to provide context for the interaction.
  2. Generate Text: One AI clone generates a text response based on the theme and the conversation history.
  3. Generate Sound: The generated text is then converted into speech using the voice clone.
  4. Alternate and Repeat: The roles switch, and the other AI clone generates its response based on the message produced by the first clone. This process continues, alternating between the two clones, producing a conversation.

This kind of setup requires us to keep a separate history object for each model, since both respond as the assistant but should always interpret the other’s output as a user message. Besides that, we just passed the history and the last message between the two models.
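A stripped-down version of that loop might look like the sketch below. The fine-tuned model names, voice IDs, system prompts, and number of turns are placeholders, and the text-to-speech call assumes ElevenLabs’ standard TTS endpoint; treat it as a sketch of the flow rather than our exact code.

import requests
from openai import OpenAI

client = OpenAI()
ELEVENLABS_API_KEY = "YOUR_ELEVENLABS_API_KEY"  # placeholder

# Placeholder IDs for the two fine-tuned models and their voice clones.
CLONES = [
    {"name": "Felix", "model": "ft:gpt-3.5-turbo:felix", "voice_id": "FELIX_VOICE_ID",
     "system": "You are Felix, chatting informally with Jakob.", "history": []},
    {"name": "Jakob", "model": "ft:gpt-3.5-turbo:jakob", "voice_id": "JAKOB_VOICE_ID",
     "system": "You are Jakob, chatting informally with Felix.", "history": []},
]

def speak(text, voice_id, out_path):
    """Convert text to speech via ElevenLabs' TTS endpoint (check current docs)."""
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        headers={"xi-api-key": ELEVENLABS_API_KEY},
        json={"text": text},
    )
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)

last_message = "Let's talk about writing a blog post on our AI clones."
for turn in range(8):  # arbitrary number of turns
    speaker = CLONES[turn % 2]
    # Each clone keeps its own history: its own lines as 'assistant',
    # the other clone's lines as 'user'.
    speaker["history"].append({"role": "user", "content": last_message})
    completion = client.chat.completions.create(
        model=speaker["model"],
        messages=[{"role": "system", "content": speaker["system"]}] + speaker["history"],
    )
    reply = completion.choices[0].message.content
    speaker["history"].append({"role": "assistant", "content": reply})
    speak(reply, speaker["voice_id"], f"turn_{turn:02d}_{speaker['name']}.mp3")
    print(f"{speaker['name']}: {reply}")
    last_message = reply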

The results varied; most responses didn’t keep the conversation moving forward. In many of the conversations, we got stuck on specific paths, repeating similar content. They were quite creative at times, though, and we laughed a lot. Tonality was far from 100%, but we definitely saw some clear improvements compared to the base GPT-3.5-instruct model we fine-tuned. Comparing the generated output to our recorded audio, we could easily tell that it had picked up on our style of talking: for instance, we both use the word “yeah” a lot and often say that something is “very interesting”. Below is a conversation between our clones about writing a blog post on this topic:

Felix: Hey, I saw your blog post about
creating ourselves as AI clones.
What made you want to write about that?

Jakob: Well, I think it's interesting
to think about our future selves,
and what we're like now compared to that.
And I've been thinking a bit about AI
lately as well, so it was just natural
for me at this point.

Felix: Yeah, totally. I saw that you
were writing about it also in the
context of relationships. That was
interesting as well.
What are your thoughts on that?

Jakob: I think it's interesting that
when we're talking about our future
selves, like how much of ourselves can
we offload to an AI? And I mean,
if you have a partner in the future who
has an AI clone of themselves, what
does that mean for your relationship?
Like can you just take the AI out on a date
and like leave your partner at home?

Felix: Yeah, it's like stuff from Black Mirror.
Have you seen that episode? Where they
have like the little thing in their ear
that records everything and they can replay
conversations and stuff. I think that
would be interesting as well to see
how much of our relationships we could
kind of quantify in a way.
How do you think about those things?

Jakob: I mean, that's very interesting.
And I think it's also like when you think
about your partner in the future having
an AI clone of themselves and what does
that mean for your relationship?
Like how independent can they be from
you in a way? Like if they can recall
everything you've said to them, if
they can mimic their personality
completely and stuff like that.

Lessons Learned

There is a lot to learn from experimentation and exploration like this. And since there is no established recipe for achieving our desired end result, it was quite exciting to try out different approaches. Based on our final concept, we found three major takeaways for future projects.

Providing knowledge and context throughout is key

Without at least some level of guidance, the model will not be aware of who it is. The adjusted weights might still nudge the model in the right direction, but on their own they are highly inaccurate. We handled this by adding the information to the system prompt. Keeping this part of the prompt fixed throughout the whole dataset worked fine, but it would be interesting to make it more dynamic. Imagine adding a RAG system that can fetch the correct information and reduce redundancy.

And since all conversations happen in some context, making sure that we set the stage and provide adequate information to the models throughout the conversation is key. Most messages by themselves are not intuitively understood, but within the context of the conversation they are.

Evaluating performance is hard

What is tonality when you boil it down to its smallest common components? And how can we assess it, especially at scale? Generally, when we are developing products using these technologies, we are always keen on finding evaluation methods and ways to iterate and measure that it’s improving with further training or prompt engineering. It can be difficult to define exactly what is accurate and what is inaccurate tonality since it’s a fuzzy distinction dependent on the context in which a phrase is said, the subject's past experiences, culture, and so on.

We worked exclusively with manual evaluation of our own responses, which was quite time-consuming. Finding ways to handle this would let us iterate much more quickly.

Data, data, and more data

No model is ever better than the data that goes into it. Listening back to the recordings, we realized that our conversation was quite unnatural. We explicitly agreed with each other frequently and were a bit too excited about what the other person said. It was a good, polite discussion, but it didn’t capture our natural style of conversation. Performant models require high-quality datasets, so to get our models to imitate us more closely, the conversations used for training should be better aligned with our natural way of speaking - and preferably include more examples.

 

Conclusion

Generative tools like ChatGPT have made it easier than ever for anyone to create high-quality digital content quickly. You no longer need to be an expert to write an engaging blog post, develop software, or produce multimedia content. With improving capabilities and decreasing costs of generative AI, we’re inching closer to autonomous systems being able to perform these tasks without a human in the loop. What happens then? Our experiment with AI voice cloning is a silly example. Still, it reveals how a combination of generative AI modalities and simple request chaining could let us produce digital media autonomously.

We’re seeing new types of user experiences beyond chatbots and productivity boosters - experiences that are tailor-made for you. Services like Suno and Udio, which generate full songs using AI, and Spotify’s personal DJ feature are already out there exploring hyper-personalized digital services. Another intriguing example is the Infinite Conversation project, which presents an endless AI-generated podcast between a famous director and a philosopher. Imagine listening to a custom-made episode from your favorite podcaster, tailored to precisely the topic you’re in the mood for.

Most of us have based our view of LLMs on widely available models like GPT-4 and Gemini. Fine-tuning models for tonality might not be the most relevant use case, but there are so many applications where this approach could be useful. We believe fine-tuned models will serve an important purpose for many services in the coming years.

We will investigate these new technologies as part of our mission to provide top-of-the-line products for everyday users. If you want to know more or research what Artificial Intelligence means for you and your brand, don’t hesitate to contact us at curious@bontouch.com so we can explore together.