Matt Hocking is the co-founder and CEO of WellSaid Labs, a leading enterprise-grade AI voice generation company. He has more than 15 years of experience leading teams and delivering technology solutions at scale.
Your background is fairly entrepreneurial. How did you initially get involved in AI?
I guess I’ve always considered myself pretty entrepreneurial. I started my first business out of college and, with a background in product design, have found myself gravitating toward helping folks with early-stage ideas. Throughout my career, I’ve been lucky enough to work with a number of startups that have gone on to have some pretty incredible runs. Those experiences gave me first-hand exposure to a lot of great founders, which in turn inspired me to pursue my own ideas as a founder. AI was relatively new to me when I joined AI2; however, that experience gave me an opportunity to apply my product and startup lens to some truly amazing research and to imagine how these new advancements could help a lot of folks in the coming years. My goal since the beginning has been to develop real businesses for real people, and I believe AI has the potential to create a lot of exciting opportunities and efficiencies in our future if applied thoughtfully.
Could you share the story of how the idea for WellSaid Labs was conceived when you were an entrepreneur in residence at The Allen Institute for AI?
I joined The Allen Institute for Artificial Intelligence (AI2) as an Entrepreneur in Residence in 2018. Arguably the most innovative incubator in the world, AI2 houses some of the brightest minds in AI, who turn research at the edge of what’s possible today into tangible products that solve problems around the globe. My background in design and technology nurtured a long-time interest in the creative fields, and with the AI boom we are all witnessing today, I wanted to explore a way to connect the two. I was introduced to Michael Petrochuk (WellSaid Labs co-founder and CTO) while developing an interactive healthcare app that guided patients through various sensitive scenarios. While developing the content for that experience, my team worked with voice talent to pre-record thousands of lines of voiceover for the avatar. When I was exposed to some of the breakthroughs Michael had achieved during his research, we both quickly saw how human-parity text-to-speech (TTS) could transform not only the product I was working on but also a number of other applications and industries. Technology and tooling had struggled to keep up with the needs of producers creating with voice as a medium. We saw a path to putting this technology in the hands of all creators, allowing voice to be an integral part of all stories.
WellSaid Labs is one of the few companies that provides voice actors with an avenue into the AI voiceover space. Why did you believe it was important to integrate real voices into the product?
Our answer to this is two-pronged: first, we wanted to create solutions that complemented professional voice actors’ capabilities, expanding opportunities for voice work. And second, we strive for the highest level of human quality in our products. Our voice actors are long-term collaborative partners and receive compensation and revenue share for both their voice data and the subsequent content produced with it. Every voice actor we hire to create an AI voice avatar based on the likeness of their voice is paid based on how much their voice is used on our platform. We encourage talent to partner with us; fair compensation for their contributions is incredibly important to us.
To offer the highest level of human-quality products on the market, we must be rigorous about where we get our data. This process gives us more control over quality, as we train our deep learning models to speak both at human parity and in specific, contextually relevant styles. We don’t just create a voice that recites the provided input; our models offer a variety of voice styles that perform what is on the page. Whether users are creating voiceover with an avatar from our library or with a custom-built voice for their brand, we use real voice data to ensure a seamless process and an easy-to-use platform. If our customers had to manipulate and edit our voices in post-production, getting the desired output would be clunky and slow. Our voices take the context of the written content and provide a contextually accurate reading. We offer voices for all types of use cases – whether it’s reading the news, making an audio ad, or automating call center support – so partnering with professional voice talent specific to each use case provides us with both the context and high-quality voice data.
We regularly update and add new styles and accents to our avatar library to ensure that we represent the voices of our customers. In WellSaid Labs’ Studio, customers and brands can audition different voices based on region, style, and use case, allowing for a more seamless, unified production of audio content personalized to the maker’s needs. Once an initial recording is sampled, users can cue specific words, spellings, and pronunciations so the AI consistently reads them exactly as intended.
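To picture how such a cue might work, here is a minimal sketch of applying a saved pronunciation cue to a script before synthesis. The cue table and the apply_cues helper are hypothetical illustrations, not WellSaid Labs’ actual Studio API.

```python
import re

# Hypothetical cues a maker might save in a project: a tricky brand
# term or name mapped to the respelling they want read aloud.
PRONUNCIATION_CUES = {
    "WellSaid": "wel-SED",
    "Petrochuk": "PET-roh-chuk",
}

def apply_cues(script: str, cues: dict[str, str]) -> str:
    """Substitute cued words with their respellings so the voice
    reads them the same way in every take."""
    for word, respelling in cues.items():
        script = re.sub(rf"\b{re.escape(word)}\b", respelling, script)
    return script

print(apply_cues("Welcome to WellSaid Labs.", PRONUNCIATION_CUES))
# Welcome to wel-SED Labs.
```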
WellSaid Labs is staking its claim as the first ethical AI voice platform. Why are AI ethics important to you?
As AI adoption increases and becomes more mainstream, fears of harmful use cases and bad actors are at the center of every conversation – and these concerns are unfortunately validated by real-world occurrences. AI voice is no exception; nearly every day, a new report of a celebrity, public figure, or politician being deepfaked for advertisements or political purposes makes news headlines. Though formal federal regulation of this technology is still evolving, detecting and combating malicious actors and uses of synthetic voice will only become more difficult as the technology continues to advance.
Coming from AI2, where AI ethics is a core principle, Michael and I had these conversations on day one. Developing AI speech technology comes with significant responsibilities regarding consent, privacy, and overall safety. We know that we, as developers, must build our technology safely, address ethical concerns, and lay the groundwork for the future development of synthetic voices. We recognize that AI speech technology can be misused, and we embrace our responsibility to reduce that risk in our product. We need to lay this foundation from day one rather than run fast and make mistakes along the way. That wouldn’t be doing right by our enterprise customers and voice actors, who count on us to build a high-quality, trustworthy product.
We fully support the call for legislation in this field; however, we will not wait for federal regulations to be enacted. We have always prioritized and will continue to prioritize practices that support privacy, security, transparency, and accountability.
We strictly abide by our company’s ethical code of intent, which grounds every decision we make in responsible innovation. This is in the best interest of our global customers – enterprise brands.
How do you develop an ethical AI voice platform?
WellSaid Labs has been committed to ethical innovation from the start. We center trust and transparency through the use of in-house data models, explicit consent requirements, our content moderation program, and our commitment to brand protection. At WellSaid, we lean on the principles of Responsible AI to shape our decisions and designs, and those principles extend to the use of our voices. Our code of ethics expresses these principles as Accountability, Transparency, Privacy and Security, and Fairness.
Accountability: We maintain strict standards for appropriate content, prohibiting the use of our voices for content that is harmful, hateful, fraudulent, or intended to incite violence. Our Trust & Safety team upholds these standards with a rigorous content moderation program, blocking and removing users who attempt to violate our Terms of Service.
Transparency: We require explicit consent before building a synthetic voice with someone’s voice data. Users are not able to upload voice data from politicians, celebrities, or anyone else to create a clone of their voice unless we have that person’s explicit, written consent.
Privacy and Security: We protect the identities of our voice actors by using stock images and aliases to represent the synthetic voices. We also encourage them to exercise caution about how and with whom they share their association with WellSaid Labs or other synthetic voice companies to reduce the opportunity for misuse of their voice.
Fairness: We compensate all voice actors who provide voice data for our platform, and we provide them with ongoing revenue share for the use of the synthetic voice we build with their data.
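To make the revenue-share mechanics concrete, here is a minimal sketch of usage-based accounting. The rate, names, and record structure are hypothetical assumptions for illustration, not WellSaid Labs’ actual terms or code.

```python
from dataclasses import dataclass

# Hypothetical per-minute share rate, for illustration only.
REVENUE_SHARE_PER_MINUTE = 0.25  # dollars

@dataclass
class UsageRecord:
    voice_actor: str
    minutes_rendered: float  # audio generated with this actor's avatar

def monthly_payouts(records: list[UsageRecord]) -> dict[str, float]:
    """Aggregate rendered minutes per actor and apply the share rate."""
    payouts: dict[str, float] = {}
    for r in records:
        payouts[r.voice_actor] = (payouts.get(r.voice_actor, 0.0)
                                  + r.minutes_rendered * REVENUE_SHARE_PER_MINUTE)
    return payouts

usage = [UsageRecord("Ava", 2000.0), UsageRecord("Noah", 758.5)]
print(monthly_payouts(usage))  # {'Ava': 500.0, 'Noah': 189.625}
```

The key property is that payment scales with actual platform usage of each actor’s voice rather than being a one-time buyout.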
Along with these principles, we also strictly respect intellectual property. We do not claim ownership over the content provided by our users or voice actors. We prioritize integrity, fairness, and transparency in everything we do, ensuring that our synthetic speech technology is used responsibly and ethically. We actively seek partnerships with voices from diverse backgrounds and experiences to ensure that we provide a voice for everyone.
Our commitment to responsible innovation and to developing AI voice technology with ethics in mind sets us apart from others in the space who are seeking to capitalize on a new, unregulated industry through any means. Our early investments in ethics, safety, and privacy establish trust and loyalty among our voice actors and customers, who increasingly seek ethically made products and services from the companies at the forefront of innovation.
WellSaid Labs has created its own in-house AI model that enabled its AI voices to achieve human parity, and it has achieved this by bringing to AI speech the imperfections humans bring to conversation. What is it about these imperfections that makes the AI better, and how are they implemented?
WellSaid Labs isn’t just another TTS generator. Where early TTS technology was unable to recognize human speech qualities like pitch, tone, and dialect that convey the context and emotion behind the words, WellSaid voices have achieved human parity, bringing uniquely human imperfections to AI-generated speech.
Our primary measure of voice quality is and has always been human naturalness. This guiding belief has shaped our technology at every stage, from the script libraries we’ve built to the instructions we give talent and, more recently, how we iterate on our core TTS algorithms.
We train on authentic human vocalizations. Our voice talent reads their scripts authentically and engagingly when they record for us. Speech perfection, on the other hand, is a mechanical concept that leads to a robotically flawless, unnatural output. When professional voice talent performs, their rate of speech fluctuates. Their loudness moves in conjunction with the content they are reading. Their vocal pitch may rise in a passage requiring an excited read and fall again in a more somber line. These dynamic variations make up an engaging human vocal performance.
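Those dynamic variations are measurable. As a rough illustration – not WellSaid Labs’ actual training pipeline – the open-source librosa library can trace the pitch and loudness contours of a recorded take; the filename below is a placeholder.

```python
import librosa
import numpy as np

# Load a recorded take (placeholder path), resampled to 22.05 kHz mono.
y, sr = librosa.load("voice_actor_take.wav", sr=22050)

# Pitch contour: fundamental frequency rising and falling with the read.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Loudness contour: frame-level RMS energy moving with the content.
rms = librosa.feature.rms(y=y)[0]

print(f"pitch range: {np.nanmin(f0):.0f}-{np.nanmax(f0):.0f} Hz")
print(f"loudness variation (RMS std): {np.std(rms):.4f}")
```

Nearly flat contours on both measures would indicate exactly the robotically flawless read described above; the natural rise and fall is what models trained on authentic performances learn to reproduce.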
By building AI processes that work in coordination with the dynamic performances of our professional talent, we have built a truly natural TTS platform. We developed the first long-form TTS system with predictive controls throughout the entire creative process. Our phonetic library holds a diverse collection of audio data, allowing users to incorporate specific vocal cues, such as pronunciation guidance and other controls, into the model during the production phase. In one platform, WellSaid users can record, edit, and stylize their voiceover without needing to import external data.
Could you discuss some of the challenges behind building a text-to-speech (TTS) AI company?
The development of AI voice technology has created an entirely new set of obstacles for both its producers and consumers. One of the main challenges is not getting caught up in the noise and hype that flood the AI sector. Because this is a new, buzzy technology, many organizations are trying to cash in on short-term AI voiceover developments. We want to provide a voice for everyone, guided by central ethical principles and authenticity. This adherence to authenticity can delay the development and deployment of our technologies, but it solidifies the safety and security of WellSaid voices and their data.
Another challenge in building our TTS platform was developing specific consent guidelines to ensure that organizations or individual actors can’t misuse our technology. To meet this challenge, we seek out collaborative, long-term partnerships and stay fully involved in voiceover development to increase accountability, transparency, and user security. We actively seek partnerships with voice talent from various backgrounds, organizations, and experiences to ensure that WellSaid Labs’ library of voices reflects its creators and audiences. These processes are designed to be intentional and detail-oriented so that our technology is used as safely and ethically as possible, which can slow the development and launch timeline.
What is your vision for the future of generative AI voices?
For the longest time, AI speech technology had not reached a high enough quality to enable companies to create meaningful content at scale. Now that audio production no longer requires expensive equipment and hardware, all written content can be produced and published in an audio format to create engaging, multi-modal experiences.
Today, AI voices can produce human-like audio and capture the nuance required to make digital storytelling more accessible and natural. The future of generative AI voice will be all-encompassing audible experiences that touch every aspect of our lives. As technology continues to advance, we will see increasingly natural and expressive synthetic voices blur the line between human and machine-generated speech – opening new doors for business, communications, accessibility, and how we interact with the world around us.
Businesses will find enhanced personalization in AI voice interfaces and use them to make interactions with virtual assistants more immersive and user-friendly. These enhancements are already happening, from intelligent call center agents to fast-food drive-thrus. Content creation – including advertising, product marketing, news narration, podcasts, audiobooks, and other multimedia – will see increased efficiency from tools that develop engaging content, ultimately increasing lift and revenue for organizations, especially now that multilingual models can expand a company’s reach from a single point of origin to a global presence. Production teams will find great value in synthetic voices, creating voices tailor-made to a brand’s needs or customized to the listener.
Before the introduction of AI, TTS technology lacked the crucial human emotion, intonation, and pronunciation abilities required to tell a full story at scale and with ease. Now, AI-powered TTS offers more immersive and accessible experiences, including real-time speech capabilities and interactive conversational agents.
Achieving human-like speech has been a journey, but now that it’s attainable, we’re witnessing the full scope of what AI voice can do to create real business value for organizations.
Thank you for the great interview. Readers who wish to learn more should visit WellSaid Labs.