John Flynn comes from the world of film, having worked in the editing pipeline of Hollywood blockbusters like the Harry Potter series, The Dark Knight, and Bohemian Rhapsody.
Zeena Qureshi has years of sales experience at tech startups. She also has a background in teaching speech and language therapy to children with autism, which gave her a different perspective on speech expertise. The duo became good friends during their stint at talent investor Entrepreneur First.
“At Entrepreneur First, John and I were friends first, but as the programme drew to a close, John showed me this incredible demo of an artificial voice that sounded perfect. You could even hear the breath, I couldn’t believe it,” Qureshi mentions in an official blog post.
“We knew that current text-to-speech solutions sounded robotic, lacking natural performance and quality. We also knew that speech synthesis was very subjective, unlike speech recognition, which is more objective. So we set out to fix this problem,” she adds.
Equipped with Flynn’s technical acumen and Qureshi’s business savvy, the duo founded Sonantic in 2018.
Yesterday, the UK-based startup launched what it calls the first AI-powered speech technology with true emotional depth, conveying complex human emotions from fear and sadness to joy and surprise.
Capturing the nuances of the human voice
According to the company, its approach builds on the existing text-to-speech framework, and it is this that separates a standard robotic voice from one that sounds genuinely human. “Creating that ‘believability’ factor is at the core of Sonantic’s voice platform, which captures the nuances of the human voice,” Sonantic mentions in a press release.
Sonantic partners with experienced actors to create voice models. Clients can choose from existing voice models or work with Sonantic to build custom voices for unique characters. Project scripts are then uploaded to Sonantic’s platform, where a client’s audio team can choose from a variety of high-fidelity speech synthesis options, including pitch, pacing, projection, and an array of emotions, claims the company.
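Sonantic has not published its API specification, but to make this workflow concrete, a request to a platform of this kind might look like the following sketch in Python. The endpoint URL, field names, and parameter ranges are illustrative assumptions, not Sonantic’s actual interface.

```python
import requests  # standard HTTP client; the endpoint below is a placeholder

# Hypothetical payload: one line of a script plus performance controls.
# Field names and value ranges are assumptions made for illustration.
payload = {
    "voice_model": "custom_character_01",  # an existing or custom-built voice
    "text": "We can't stay here. Not after what we saw.",
    "performance": {
        "emotion": "fear",    # e.g. fear, sadness, joy, surprise
        "pitch": 0.9,         # relative pitch scaling
        "pacing": 1.1,        # relative speaking rate
        "projection": 0.7,    # from soft murmur up to stage projection
    },
}

# The client's audio team submits the script line and receives rendered audio.
response = requests.post(
    "https://api.example-voice-platform.com/v1/render",  # placeholder URL
    json=payload,
    headers={"Authorization": "Bearer <API_KEY>"},
)
response.raise_for_status()
with open("line_001.wav", "wb") as f:
    f.write(response.content)
```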
Notably, the actors receive a profit share every time their voice model is used in a project.
Obsidian partnership
Sonantic has also partnered with Obsidian, a AAA gaming studio and subsidiary of Xbox Game Studios, to test the technology, and has released a demo video highlighting the partnership to showcase its voice-on-demand capability.
“Working in game development, we could send a script through Sonantic’s API — and what we would get back is no longer just robotic dialogue: it is human conversation. This technology can empower our creative process and ultimately help us to tell our story,” says Obsidian Entertainment Audio Director Justin E. Bell. Sonantic claims the new capability will slash the gaming studio’s production timelines and associated costs.
First Cry
The company’s official launch follows last year’s beta release, which was captured in a video entitled “Faith: The First AI That Can Cry.”
Both the co-founders agree that hearing that cry was an incredible moment for their team last year. The duo believes that the launch of the full spectrum of human emotion is an exciting milestone not just for them, but for the entertainment industry. “The possibilities for studios are endless. With a technology this comprehensive, it frees them up to experiment with scripts and produce work in an unprecedented timeframe, converting months of work down to minutes,” the co-founders mention in a joint statement.
The company’s ultimate mission is to work with both studios and professional actors to build the entertainment products of the future.
Building a company in six weeks
According to Qureshi, while setting up Sonantic, time was of the essence. “The program (Entrepreneur First) lasts six months, and the first two are all about finding your teammate. John and I were up against the clock because we met on the last day of team building and had a month to prove our business before going in front of the investment committee. The timing was difficult as Christmas was right in the middle, but we used that to our advantage.”
“John built a live demo that I could share with both customers and investors. Within six weeks, we managed to found a company, build a prototype, and most importantly, get several AAA pilot customers onboard,” she adds.
The Algorithm
Talking about the company’s algorithm, Qureshi tells Silicon Canals, “We’ve developed our algorithms to focus on the nuances and subtleties of the human voice that most algorithms miss. The devil is in the details, so we’ve done a lot of work to make sure that small things are mapped and calculated, like a voice quiver for sadness, exertion for anger and varied pitch patterns (to name a few).”
She continues, “Even a casual listener is very sensitive to small changes in voice quality. That’s how we know if someone is being slightly sarcastic or deadly serious, so modelling microscopic details is key to doing a great actor’s voice justice.”
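To make one of those details concrete: a “voice quiver” can be thought of, in signal-processing terms, as a slow wobble superimposed on the pitch (F0) contour. The sketch below is a simplified illustration of that general idea, not Sonantic’s method; the tremor rate and modulation depth are assumptions.

```python
import numpy as np

def add_quiver(f0, frame_rate=200.0, quiver_hz=6.0, depth=0.03):
    """Superimpose a tremor-like wobble on a pitch (F0) contour.

    f0:         fundamental-frequency values in Hz, one per frame
    frame_rate: contour frames per second
    quiver_hz:  tremor rate; roughly 4-8 Hz reads as an unsteady voice
    depth:      modulation depth as a fraction of F0 (3% here)
    """
    t = np.arange(len(f0)) / frame_rate
    # Sinusoidal tremor plus mild randomness so it does not sound mechanical
    wobble = depth * np.sin(2 * np.pi * quiver_hz * t)
    wobble += 0.01 * np.random.randn(len(f0))
    return f0 * (1.0 + wobble)

# Example: a flat 180 Hz contour over one second gains a sadness-like quiver
contour = np.full(200, 180.0)
quivering = add_quiver(contour)
```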
The training data comes from actors, via a voice engine that helps them train their own voice models. Notably, the company claims its algorithms are never trained on publicly available data without the voice owner’s permission.
So, what is the algorithm learning as it encounters data?
“The algorithms first learn how to speak generally; for example, making roughly the correct vowel and consonant sounds. Then as training progresses, the models learn to pronounce better; for example, t’s and d’s start to get sharper. Up to this point, it’s similar to the way a child learns. From here, more subtle things like pitch inflections, emotional elements are modelled,” explains Qureshi.
“At this later stage, the models start to sound exactly like the actor on which the model was based. We work with fantastic actors who are very talented performers; they form the base of every model,” she further explains.
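That progression resembles a staged, curriculum-style training schedule. The sketch below illustrates the general idea; the stage names, epoch counts, and loss terms are placeholders of our own, not Sonantic’s training recipe.

```python
# Staged training: coarse phonetics first, then pronunciation sharpening,
# then actor-specific prosody and emotion. All numbers are illustrative.

def spectral_loss(model, batch):
    """Stub for a term rewarding roughly correct vowel/consonant sounds."""
    return 0.0

def prosody_loss(model, batch):
    """Stub for a term rewarding pitch inflections and emotional nuance."""
    return 0.0

def step(model, loss):
    """Stub for one optimiser update."""
    pass

STAGES = [
    # (name, epochs, loss weights)
    ("coarse_phonetics",    50, {"spectral": 1.0, "prosody": 0.0}),
    ("pronunciation",       30, {"spectral": 1.0, "prosody": 0.2}),
    ("prosody_and_emotion", 40, {"spectral": 0.5, "prosody": 1.0}),
]

def train(model, data, stages=STAGES):
    for name, epochs, weights in stages:
        for _ in range(epochs):
            for batch in data:
                # Early stages emphasise getting the sounds roughly right;
                # later stages emphasise the actor-specific subtleties.
                loss = (weights["spectral"] * spectral_loss(model, batch)
                        + weights["prosody"] * prosody_loss(model, batch))
                step(model, loss)
```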
Sonantic has filed for three patents for its technology.
Sonantic’s business model
The company has a B2B enterprise SaaS (Software-as-a-Service) model and licenses its technology exclusively to entertainment studios. The platform offers different licensing tiers depending on the volume of text to be rendered.
In addition, Sonantic generates revenue by creating custom voice models for its clients. The company claims to have over 1,000 companies on its waitlist.
Further development
Qureshi tells Silicon Canals, “We will continue to build out more voices, features with controls, and languages, as the possibilities of dialogue are endless.”
The company also plans to build the next generation of voice with its studio partners, going beyond what is possible today with capabilities such as runtime generation of content.
For instance, if a character is running in a game, they should sound out of breath and react like a human would to their state of being and environment.
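A minimal sketch of how game state might drive those voice parameters at runtime is shown below; the state fields and the mapping are our own illustration of the idea, not a shipped Sonantic feature.

```python
from dataclasses import dataclass

@dataclass
class CharacterState:
    stamina: float       # 0.0 (exhausted) .. 1.0 (fresh)
    is_running: bool
    threat_level: float  # 0.0 (calm) .. 1.0 (mortal danger)

def voice_params(state: CharacterState) -> dict:
    """Map game state to hypothetical voice-performance controls."""
    exertion = (1.0 - state.stamina) + (0.3 if state.is_running else 0.0)
    return {
        "breathiness": min(1.0, exertion),          # audible breath when winded
        "pacing": 1.0 + 0.25 * min(1.0, exertion),  # shorter, faster phrasing
        "emotion": "fear" if state.threat_level > 0.7 else "neutral",
    }

# A sprinting, low-stamina character under threat sounds winded and afraid
print(voice_params(CharacterState(stamina=0.2, is_running=True, threat_level=0.9)))
```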
Sonantic is looking for talent
The company has 12 employees, including Flynn as co-founder & CTO and Qureshi as co-founder & CEO. The team comprises three deep learning speech researchers, three engineers, one full-time actress, one casting & performance director, one VP of customers, and one operations associate. Flynn manages the tech team, including research, acting, and engineering, while Qureshi works with the business team on customers, strategy, marketing, and sales.
The company is currently hiring talented speech researchers and engineers, particularly those with experience in this niche technology.
Back in March 2020, the company secured €2.3M in a funding round led by EQT Ventures, with participation from existing backers Entrepreneur First, AME Cloud Ventures, and Bart Swanson of Horizons Ventures.