Anyone who looks at the past understands the value of technological innovation in building progress. We ate a lot of rice and beans to get here. And I, of course, think no differently. So, for the sake of clarity:
- I want to write to organize what I'm seeing
- I want to encourage more practical and detailed (and perhaps less apocalyptic) discussions
- My bias is that of an announcer, former audio producer, former musician and curious person
- The idea is to ground the discussion in what we have now and suggest a nearer future, avoiding the complexity of debates about a distant one.
- This title is clickbait (what matters is not whether we are better than machines, but what the differences are). Still, I'll say up front that I think we are better. Not incredibly better, but better enough that we won't lose our jobs to robots
With that out of the way, let's get to the point:
1. INTERPRETATION x INTERPRETATION
In speech synthesis (the process in which the machine generates a voice from text, also called Text To Speech, or TTS), it is still not possible to make the voice carry meaning beyond what is in the text itself, namely: sarcasm, friendliness, confidence, hesitation, colloquiality, etc.
The voice neural network works like this: it looks at the recorded material of a speaker and makes connections (cross-referencing a linguistic system database with another database). It notices that the speaker ends sentences in a certain way; that questions follow a certain melody; that the pronunciation of "seisceintos" is very São Paulo. These are phenomena that always (or almost always) occur in the material analyzed. From then on, when you give it a text, it applies those rules and "reassembles" the speaker's voice by crossing the context it learned with the context it finds in the text.
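To make that division of labor concrete, here is a deliberately simplified sketch in Python. Every function here is a placeholder of my own (no real library is implied); in actual systems each stage is a neural network trained on the speaker's recordings, but the order of the stages is roughly this.

```python
# Deliberately simplified TTS pipeline: all bodies are placeholders; the point
# is only the sequence of stages described in the text above.

def grapheme_to_phoneme(text: str) -> list[str]:
    # Linguistic front end: maps written words to sounds.
    return list(text)  # placeholder

def predict_prosody(phonemes: list[str]) -> list[float]:
    # Learned patterns: how sentences end, the melody of questions, accent traits.
    return [0.0] * len(phonemes)  # placeholder

def acoustic_model(phonemes: list[str], prosody: list[float]) -> list[list[float]]:
    # "Reassembles" the speaker's voice as a spectrogram, crossing the patterns
    # learned from the recordings with the context found in the input text.
    return [[0.0] for _ in phonemes]  # placeholder

def vocoder(spectrogram: list[list[float]]) -> bytes:
    # Turns the spectrogram into an audible waveform.
    return b""  # placeholder

def text_to_speech(text: str) -> bytes:
    phonemes = grapheme_to_phoneme(text)
    prosody = predict_prosody(phonemes)
    return vocoder(acoustic_model(phonemes, prosody))
```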
In other words, anything that deviates from the training material does not enter the model. Everything that is merely possible, but never recorded, stays out. Example:
A swim in the sea? No. A shower is what we've got.
The synthetic voice in this case would look at similar contexts in the training material and choose a somewhat literal intonation for the "No" (after all, most "no's" are fairly literal). The machine cannot weigh possibilities such as: does this character want to underline the absurdity of the suggestion? Are they too lazy to leave the house? Do they have the mischievous smile of someone who already knows the answer? And even if it had the answers to those questions, which is it: a disgusted, a lazy or a mischievous "no"?
Interestingly, a single word describes two distinct processes in an announcer's work: interpreting. First you have to interpret the message (what is the context, what do I want to convey, who is it for) and only then interpret it in the performer's sense: execute it, take it from the world of ideas to paper, or to the air. And these two processes are completely different things for the machine.
But let's look a little further and imagine mechanisms built from some of what already exists. What if I trained the machine on special material? One bank of disgusted readings, another of lazy ones and another of mischievous ones. Then the user could choose: this "no" is disgusted. Play. Hmm, I didn't like it. Let's try lazy. Play. Hmm, next!
Now imagine a system in which the user had to choose from a list of 200 (or was it 2,000?) feelings/meanings for each word and/or phrase. And combined feelings/meanings (disgusted and lazy). Not practical.
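Here is a minimal sketch of that workflow, assuming a purely hypothetical `synthesize(text, style=...)` call (no real TTS library is implied). Imagine the style list with 200 or 2,000 entries instead of 3: someone still has to audition every option by ear, phrase by phrase.

```python
# Hypothetical style-selection workflow; every name here is invented for
# illustration only.

STYLES = ["disgusted", "lazy", "mischievous"]  # imagine 2,000 of these

def synthesize(text: str, style: str) -> bytes:
    """Placeholder for an imagined TTS engine with a style control."""
    return b""  # would return audio rendered in the requested style

script = ["A swim in the sea?", "No.", "A shower is what we've got."]

for phrase in script:
    for style in STYLES:
        audio = synthesize(phrase, style=style)
        # listen(audio)  # "Hmm, I didn't like it. Next!"
```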
So let's go even further and imagine a machine capable of doing this on its own: it would have to look at all the audiovisual content produced in recent years, and at how human beings in general understood each of those intonations, rhythms and inflections, in order to make choices. And those understandings change all the time. We are talking about an amount of data very different from what is used to train synthetic voice models.
Furthermore, imagine training 2,000 models with different feelings/meanings for a single voice. It is no accident that ChatGPT's knowledge only goes up to 2021: this technology has limits, at least today and in the near future.
So here are two important points. First: the way the technology is structured today, it will not be capable of reproducing extra-textual feelings/meanings, even with all the processing power in the world. Second: even if it were, we would still need someone thinking about which choices lead to the desired result (is that "no" lazy or disgusted?). A machine is not capable of that. A human being is, but it would be far more practical for that human to switch on a microphone and record than to sit there managing levers.
2. SATURATION/MID
If everyone has the same tool available, everything produced with it ends up sounding very similar. As a result, the perception of what that tool produces changes quickly and beyond anyone's control. What was new becomes old. What was different becomes more of the same. What could convey credibility starts to sound fake. Very quickly.
It only takes hearing the same voice in Instagram/TikTok content three times to feel put off by it and by the channel using it: even if it sounds "natural" in some contexts, sheer exposure already tells me something is off. Am I being deceived? Is this message true? Why didn't the channel owner use their own voice?
3. THE MYTH OF PRODUCTIVITY GAINS
Unlike other areas of audiovisual production, voiceovers do not take long to produce. So if we compare generating a voice through synthesis with recording and editing a voiceover, neither process is dramatically faster than the other, especially once you take into account the quality the announcer's creative process adds to the audiovisual piece.
In the best-case scenario (I open the website, paste my script into a little box and download a synthetic voiceover), I would only save a few hours compared with asking an announcer to record and waiting for the file to arrive. And the work is the same: "machine, give me a voiceover" versus "announcer, give me a voiceover".
And what about long pieces? In a few minutes I can generate a voiceover for an audiobook that would take an announcer days to record. Still, even setting aside the difference in quality, all generative AI output needs to be reviewed. The system still has flaws, books are not all written the same way, the punctuation is sometimes odd (there is a big difference between written and spoken language, and the announcer applies that filter all the time), there are neologisms, and a human would need to review the work before publication (as with any material produced by generative AI). And this review work is different from what is normally done on an announcer's recording. It means listening through, hunting for inconsistencies and looking for solutions that may not exist: a given paragraph may be punctuated in a way the synthetic voice simply cannot read correctly. What do I do then? Change the punctuation? Is an adjustment even possible?
And guess how long it takes to review a 10-hour book? At least 10 hours (if everything is correct). So once again: what is the real gain?
4. CONTROL — TIMBRE
An essential part of an announcer's job is controlling the exposure of their voice. When we voice a campaign with a large reach/impact, our voice stays tied to that message/brand for a long time, and this has several consequences (deliberate or not). Making good choices means building a collective perception of what your voice stands for; after all, millions of people have heard your voice in this or that context (a career is defined this way). And the brand knows it: when it hires an announcer, it borrows everything that voice carries.
Without that control, who guarantees your brand's voice will not be used by your competitor? That the voice you chose for the video at your sustainability event was not also used by an extremist political channel that denies global warming and went viral (in a bad way)? That the voice in your audiobook is not the one already used a thousand times on TikTok, which no one can stand anymore?
5. CRIME 1 — THE LAWS WE HAVE
We are not unprotected. While more specific legislation is not yet in place, we have the Advertising Self-Regulation Code (from Conar), the Consumer Protection Code, the Child and Adolescent Statute, the Copyright Law and the General Data Protection Law to guide us. Be very careful: using someone else's work or voice without authorization has long been a crime.
In our case in particular, I would like to highlight:
- Personality rights: under the Federal Constitution and the Brazilian Civil Code, the attributes of a natural person (including their voice) cannot be used without due authorization from their holder. This covers both the use of a voice to train a generative model and any synthetic voice that ends up imitating someone's voice (regardless of the process used).
- Business secrets: here I quote the manual of the Brazilian Association of Advertisers: "The protection of business secrets may be lost if the secret is made public. Therefore, considering the feedback characteristic of interactions and outputs generated from interaction with users, it is extremely important that information, practices and/or procedures that constitute business secrets are not inserted into Generative AI platforms." Would you paste a campaign script that has not even reached your client yet into a machine when we don't know how the data placed there is used?
6. CRIME 2 — WHAT’S COMING
In the coming years we will see important regulations:
- In Brazil there are several bills (PLs) on the subject in progress: 5051/2019, 21/2020, 240/2020 and 872/2021, and above all 2338/2023, introduced by Rodrigo Pacheco.
- Derivative works: lawsuits under way around the world may conclude that material generated with some type of generative AI trained on a large database is, in fact, a derivative work of that very database. Among other things, this would mean that using such material requires authorization from the authors of the original works in that database, as well as proper compensation.
7. ART
An indelible characteristic of art is the connection we have with the artist: I see how this human being expresses themselves and that carries meaning for me, because I am made of the same stuff and feel the same things. In that way the artist helps me understand and express myself.
What generative AI produces has been relegated to pastiche, which tells us nothing. It feels empty, because we need the complete cycle in order to relate: who created it? In what context?
Can we say the same about mass culture, with its trends and rehashes? Is everything a cheap copy of everything that came before? My two cents: it's not the same thing. Even if mass communication contains pastiche, there are still a million human choices in the process: people thinking about how other people will understand a given message (and that is what keeps us connected).
This may not be the most important argument in this text, given how differently people connect with an advertising or informational piece versus a work exhibited in a museum, but it is still needed to navigate the current aversion to AI in general and the uncanny valley in particular.
8. ROBOT’S VOICE HAS NO TREBLE — THE 10kHz LIMIT
The public databases used to develop artificial voice systems over the last 10 years have sampling rates of 22.05 kHz or 24 kHz. By the Nyquist theorem (the highest representable frequency is half the sampling rate), the generated voices therefore carry content only up to about 11 to 12 kHz. A typical recording (and our hearing) goes up to 20 kHz.
Those missing 8 to 9 kHz mean the synthetic voice has essentially no treble. And it is precisely in the treble that important information about audio quality lives: harmonics and formant detail that characterize timbre and help distinguish one voice from another.
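The arithmetic is easy to check. Here is a tiny Python snippet (my own illustration, not part of any TTS toolkit) that prints the Nyquist limit for the sampling rates mentioned above and how much of a roughly 20 kHz hearing range is left uncovered.

```python
# Nyquist theorem: a signal sampled at rate fs can only represent
# frequencies up to fs / 2. Compare common TTS dataset rates with
# full-band audio (illustrative numbers only).
for fs in (22_050, 24_000, 44_100, 48_000):
    nyquist = fs / 2
    missing = max(20_000 - nyquist, 0)  # treble absent relative to ~20 kHz hearing
    print(f"{fs/1000:>5.2f} kHz sampling -> content up to {nyquist/1000:.2f} kHz "
          f"(about {missing/1000:.2f} kHz of audible treble missing)")
```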
That is below the current standard for audiovisual post-production. To work within this limit we would either have to accept the loss of quality or improve the databases. For productions that are paid to deliver the best possible quality, I don't see that concession being made any time soon; for smaller ones it may not be such a problem. As for improving the databases: synthesis generated at 48 kHz does exist, but it is not yet a market standard, because doubling the sampling rate doubles the amount of data and greatly increases processing (which is already heavy).
*As we get older we tend to lose the ability to hear high frequencies. At 38, after plenty of abuse (and care), I no longer hear above 17 kHz. My 65-year-old father doesn't hear above 9 kHz. So when you listen to a synthetic voice and compare it with a natural one, keep in mind what you are not hearing (but your child, and your audience, are). Trust good, trained ears!
9. ARTIFACTS/GLITCHES
Listening more carefully to the available tools (going a little past that first listen where we say "wow, that sounds like a real person talking!"), it is still possible to hear "artifacts": noises or inconsistencies in the audio that reveal the voice is synthetic. Sometimes they show up as a melodic jump impossible for a human voice, or as extra noises that are not part of the voice at all (acoustic and vocoder degradations). They cannot be removed in editing.
They are very particular to this kind of technology and, although they appear less often than in other systems, they still show up from time to time.
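As a rough illustration of one artifact type mentioned above (the implausible melodic jump), here is a sketch of my own using librosa's pYIN pitch tracker. It is not a method from any TTS vendor or a reliable detector; the 12-semitone-per-frame threshold is an arbitrary value chosen only to show the idea of flagging moments worth auditioning by ear.

```python
# Flag frame-to-frame pitch jumps that a human voice is unlikely to produce.
import numpy as np
import librosa

def flag_pitch_jumps(path: str, jump_semitones: float = 12.0):
    y, sr = librosa.load(path, sr=None)
    # pYIN returns NaN for unvoiced frames; 60-500 Hz is a rough speech range.
    f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=60, fmax=500, sr=sr)
    with np.errstate(invalid="ignore", divide="ignore"):
        # Difference in semitones between consecutive frames (NaN drops out).
        semitones = 12 * np.log2(f0[1:] / f0[:-1])
        suspicious = np.flatnonzero(np.abs(semitones) > jump_semitones)
    return librosa.frames_to_time(suspicious, sr=sr)  # moments to audition by ear

# Example (hypothetical file name):
# print(flag_pitch_jumps("synthetic_voiceover.wav"))
```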
In the near future these problems may disappear, but today most of the available tools have them, especially if less-than-ideal material is used to train the models (a likely scenario in lower-budget production).
Today, in this context of artificial voices in communication and all the negative reaction they have generated, my reading is that these artifacts are the stamp of artificiality, the mark of "fake" and everything that can imply.
*It is important to say that I am comparing this with the quality of voiceovers recorded in a studio (once again: the audiovisual standard).
10. A NEW MARKET
When we talk about the job market we picture a fixed set of vacancies, when in fact it is just a reflection of what we want to consume and do. In other words, a given piece with a synthetic voice did not necessarily replace an announcer: that piece might not even exist if it weren't for today's technology.
So we don't know what audiovisual production will look like even a short time from now, or what its demands for voice will be. In fact, if I may be optimistic: if the cost of certain audiovisual production processes falls, the number of productions is very likely to increase. And whatever holds value in this new scenario will get its share of the segment's new pie, without fail.
A real example: a client who five years ago would never have had the budget for a short video about their product can now hire a production company working with new, cheaper processes. The production company makes the video and adds an artificial voice. The client rejects the voice and goes looking for a real announcer on their own, but approves the video. In other words, a new client entered the voiceover market in the context of a reduction in the cost of audiovisual production. And a new value structure gets built around what really matters, what is not replaceable.
11. CYBORG: SYNTHESIS x CONVERSION
What if we deliberately trained our own systems, with our own voices? Would we become omnipresent cyborgs, selling our voices for a school performance in Roraima, the Rio de Janeiro subway and an event in the interior of Mato Grosso?
It's possible: there are companies today that do exactly this (train a system on your voice so you can exploit it commercially). There are several ways to configure the system to veto certain uses or content, or you can require that each contracted use be authorized by you personally.
But remember that everything above still applies: it will be a voice without treble and without interpretation; there is the risk of authorizing its use in a piece that speaks badly of your favorite client; your voice could be labeled "that robot voice" or "the voice people use when they want to talk about politics", and so on. And, of course, you will be competing with the hundreds of voices already on the market for free (not to mention the convenience of picking a synthetic voice inside the platform itself, which is what happens on TikTok, for example, and should spread to other content production software). Will my timbre be worth more than the others? By offering my client a cheaper (and worse) solution, might I not end up making my own work precarious?
I don't think anyone has the answers to these questions. It's a risk: whoever takes it now will reap either the burden or the bonus of that decision.
But there are still ways to work with generative voice. One of them is being hired by a company that wants to sell voices and needs to build its own database. That material is used to create the foundation of the system; the company's products will not necessarily carry the timbres of the people who took part in its database, which matters a lot for the announcer, right? But be very careful with contracts, so you don't hand over the use of your voice indefinitely.
And, last but not least, besides speech synthesis (voice created from a text) there is also voice conversion, or voice replacement. Here the system is trained in the same way, but the voice is generated from another voice. The system, therefore, is not making the interpretive choices: the source voice is. The output keeps the same rhythm, intonation, intention, delivery, volume and so on, but changes the timbre; that is, it turns one person's voice into another person's.
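A purely conceptual sketch of that distinction, with made-up function names (no real library is implied and all bodies are placeholders): in conversion, everything that carries the interpretation comes from the source recording, and only the timbre comes from the target voice model.

```python
# Conversion vs. synthesis, schematically.

def extract_phonetic_content(recording: bytes):
    return []  # placeholder: what is being said

def extract_prosody(recording: bytes):
    return []  # placeholder: rhythm, intonation, intention, volume

def resynthesize(content, prosody, target_voice_model) -> bytes:
    return b""  # placeholder: rebuild the audio with the target timbre

def synthesize_from_text(text: str, voice_model) -> bytes:
    # TTS: the machine has to invent the prosody (the interpretation) on its own.
    return b""  # placeholder

def convert_voice(source_recording: bytes, target_voice_model) -> bytes:
    # Conversion: the human performer in `source_recording` keeps control of the
    # interpretation; only the timbre is swapped.
    content = extract_phonetic_content(source_recording)
    prosody = extract_prosody(source_recording)
    return resynthesize(content, prosody, target_voice_model)
```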
The result of conversion is much better than synthesis, because the voice professional keeps control of the interpretation. But note that even if I could turn my voice into James Earl Jones's, his lawyers wouldn't like it very much. Or rather, they would like it very much. Beyond that (criminal) use, I can suggest a few:
- Perhaps it will prove useful in expanding the tools available for character creation. In dubbing and original voice acting, actors have historically pushed their timbres and interpretations to extremes to distinguish between the many characters they play, a style of interpretation that became known as "caricata". With conversion, we could explore other kinds of control in these segments.
- In dubbing specifically, this technology cuts both ways: if Tom Hanks allows his English voice to be replaced by "his own" voice in Portuguese, what is the advantage of hiring a voice actor? At the same time, the result would be different (better?) if I hired a voice actor to "localize" the interpretation into Brazilian Portuguese (with all its particularities) and then, perhaps more as a whim than out of usefulness, replaced that voice actor's timbre with Tom Hanks's voice.
- You can train a model on your own voice in order to convert a job you recorded in Portuguese into another language. Although each language has its own inflections and characteristics, part of the interpretation from the original version would carry over into the target language.
12. OBJECTIVE OF THE TECHNOLOGY
The academic community developing this kind of technology is not interested in replacing announcers. Of course, the same technology can be used by bad actors, but its genesis, besides being noble, is not focused on the details described here. The objectives are simply different: the researchers are not concerned with selling, grabbing attention or speaking to a specific audience; in short, with communicating the way an announcer inside a creative/production context can communicate.
Overall, this technology is aimed at helping people with disabilities and automating unhealthy (and debatable) call-center-type work. In that context, intelligibility, processing cost and naturalness/expressiveness are the key metrics, and, as we know, those are not enough for audiovisual work.
13. CONCLUSION
The concept of machine learning messes with us. "Really? You give the machine a pile of data and it draws its own conclusions?" Inevitably, if we stop to think without doing the research, we fall into the snowball (or slippery slope) fallacy so common in these discussions, along the lines of:
“Soon the machine will be able to interpret the text (insert your work here) as well as a human, because the machine never stops learning”
And this happens mainly because we don’t consider that:
a. there is a limit to processing and databases (in capital and in natural resources)
b. there are laws limiting what can be done
c. there are humans guiding the way
We are talking here about technical and contextual limits (market expectations). These limits will keep shifting, and we need to stay alert. But broadly speaking, voice synthesis (TTS) today is not good enough to compete with announcers, considering the segments that already exist and current technology.
For now, its use is confined to free, quick content, celebrity and influencer applications, and specific systems. That is no small thing, and in the more distant future (I promised not to look that far, but I'll make an exception here) this space will probably be bigger. At the same time, I don't see how announcers could compete there, except by taking part in building these tools, training models and negotiating the use of their voices under the new terms of generative AI (terms that are still being thought out, discussed and developed).
There is also potential to explore in the voice conversion model, whether in extending the ability to work with characters, with people who have passed away and with derivative uses, or in applications for other languages and dubbing. The near future of these techniques is both uncertain and promising.
And the market is an amalgam of opinions. It isn't true that we won't lose any work to the machine: I have already heard of a case in which a human announcer was passed over for a closed customer-service system (there hasn't been time yet to see how customers react to it; let's wait!). But as I write this, I am in the middle of a job in which I have already recorded a three-word sentence about 50 times and the client still feels it isn't quite there (we are scheduling a session so they can direct the intention). There is no consensus, and no abrupt break.
It's up to us announcers to stay tuned in to the discussions under way (and, above all, not to sign anything that isn't crystal clear). And for audiovisual work: announcers are better than machines (and be careful, when using this technology deliberately, not to get involved in a rights-violation crime).
Announcer and member of the collective presidency of Voice Club