Written by Alberto Betella, co-founder of RSS.com
In today’s audio advertising landscape, we have a broad spectrum of strategies ranging from Run of Network (RON) ads to host-read ads. RON ads are generic and typically have the lowest efficacy, while host-read ads have consistently proven to be more effective because they establish a deeper connection with the audience. The reason for this effectiveness lies in the emotional bond created between the host and their listeners, known as a parasocial relationship. This bond, rooted in trust, is strengthened by how the host’s voice conveys emotion and credibility.
Parasocial Relationships: The Emotional Bond with Audiences
The concept of parasocial relationships was introduced in the mid-1950s to describe how mass media create the illusion of a face-to-face relationship between a media personality and their audience. As media consumption shifted from radio and television to more interactive platforms like podcasts and social media, the phenomenon intensified. The connection develops over time as individuals consume media content, creating a sense of familiarity and emotional attachment. Podcast listeners, for instance, come to feel that they know the host, even though the relationship is entirely one-sided: the host has no real interaction with any individual listener, yet listeners develop a strong sense of intimacy, as though they had a personal relationship with the host. As a result, listeners often trust and relate to podcast hosts, which amplifies the effectiveness of ads delivered by those hosts [Brinson et al., 2023].
The Science of Emotional Engagement in Audio
The roots of parasocial relationships are deeply embedded in neurobiology [Betella et al., 2014]. Research shows that specific features of speech, such as pitch, timbre, accent, rhythm, and intonation, collectively known as prosodic features, play a crucial role in emotional processing. These elements of speech stimulate emotional centers in the brain, including the amygdala and the anterior cingulate cortex, which are responsible for emotional responses and empathy [Wildgruber et al., 2006]. Prosodic features are key to expressing and interpreting emotions across different languages and cultures, which makes prosody a universal tool in emotional communication.
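To make the notion of prosodic features concrete, here is a minimal sketch that extracts two of the simplest ones, the pitch (F0) contour and frame-level energy, from a short speech clip using the open-source librosa library. The file name host_clip.wav is a placeholder, and real Emotion AI systems model far more dimensions (rhythm, intonation, voice quality) than these two.

```python
# A minimal sketch: extract two basic prosodic features (pitch contour and
# energy) from a speech clip. "host_clip.wav" is a placeholder file name.
import librosa
import numpy as np

# Load a mono speech clip at its native sampling rate
audio, sr = librosa.load("host_clip.wav", sr=None, mono=True)

# Fundamental frequency (F0) contour: the acoustic correlate of pitch.
# pyin returns NaN for unvoiced frames, which we drop before summarizing.
f0, voiced_flag, voiced_prob = librosa.pyin(
    audio, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)
f0_voiced = f0[~np.isnan(f0)]

# Root-mean-square energy per frame: a rough proxy for loudness and stress
rms = librosa.feature.rms(y=audio)[0]

print(f"Mean pitch: {f0_voiced.mean():.1f} Hz, "
      f"range: {f0_voiced.min():.1f}-{f0_voiced.max():.1f} Hz")
print(f"Mean RMS energy: {rms.mean():.4f}")
```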
In short, it is not just what is being said in audio ads that makes them effective: it is how it is said. Prosody acts as the emotional vehicle, fostering the deep connections that make host-read ads resonate so powerfully with listeners. This emotional resonance is precisely why host-read ads drive higher engagement and effectiveness compared to more generic, impersonal ad formats.
Emotion AI: A Game-Changer for Scalable, Personalized Ads
Yet, despite their effectiveness, host-read ads are difficult to scale, particularly in podcasting. They require significant time and effort, especially when multiple variants are needed to address different audiences or shows. This scalability issue is where Emotion AI can play a transformative role.
Emotion AI, also known as Affective Computing [Picard, 1997], refers to technologies that can recognize, interpret, simulate, and even induce human emotions. When applied to podcast advertising, Emotion AI opens up the possibility of creating ads that combine the personal touch of host-read ads with the scalability typically associated with programmatic ad delivery.
Voice cloning can technically involve the basic replication of an utterance, i.e., a segment of speech such as a word, phrase, or sentence. However, simply cloning an utterance lacks the emotional depth needed for truly effective ads. This is where Emotion AI elevates voice cloning: it goes beyond the basic replication of speech by focusing on the prosodic features that preserve the emotional signature of the host. It is these prosodic elements that trigger parasocial relationships, fostering trust and emotional engagement with the listener.
It is the application of Emotion AI that makes voice cloning truly effective, capturing the emotional essence of the speaker and triggering a parasocial relationship between a podcast host and their listeners.
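To illustrate what "preserving the emotional signature" could mean in practice, the sketch below compares the pitch contours of an original host recording and a cloned rendition of the same sentence, as one rough similarity signal. This is a toy check under stated assumptions: the file names are placeholders, and a real evaluation would time-align the two recordings (for example with dynamic time warping) and consider many more prosodic dimensions than pitch alone.

```python
# A hypothetical sanity check: does a cloned rendition roughly follow the
# host's pitch contour? File names and the frame-truncation alignment are
# illustrative assumptions, not a production-grade prosody metric.
import librosa
import numpy as np
from scipy.stats import pearsonr

def pitch_contour(path: str) -> np.ndarray:
    """Extract the F0 contour of a speech file (NaN for unvoiced frames)."""
    audio, sr = librosa.load(path, sr=22050, mono=True)
    f0, _, _ = librosa.pyin(audio, fmin=65.0, fmax=2093.0, sr=sr)
    return f0

original = pitch_contour("host_original.wav")
cloned = pitch_contour("host_cloned.wav")

# Crude alignment: truncate to the shorter contour and keep only frames
# that are voiced in both recordings (a real system would use DTW here).
n = min(len(original), len(cloned))
mask = ~np.isnan(original[:n]) & ~np.isnan(cloned[:n])
r, _ = pearsonr(original[:n][mask], cloned[:n][mask])

print(f"Pitch-contour correlation: {r:.2f}")  # closer to 1.0 = more similar
```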
Blending the Host’s Authenticity with the Scalability of AI
Imagine a technology that combines the best of both worlds: emotionally engaging ads that feel personal to each audience, even at scale. Once a voice model of the host has been created, whether by fully replicating the host’s voice or by mimicking its key prosodic features, a text-to-speech (TTS) engine can generate virtually unlimited ad variations from that model. Brands can thus scale host-read ads programmatically, using dynamic text placeholders that are updated in real time with contextual data to deliver highly personalized, context-aware ads [Betella & Richardson, 2024].
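As a rough sketch of this dynamic-insertion idea, the snippet below fills an ad template with per-request contextual data and hands the resulting script to a voice model for synthesis. Only the templating logic is meant literally: HostVoiceTTS and its synthesize() method are hypothetical stand-ins for whichever TTS engine performs the actual voice rendering.

```python
# A minimal sketch of programmatic host-read ad generation: an ad template
# with dynamic placeholders is filled with contextual data at delivery time,
# then rendered by a voice model cloned from the host. HostVoiceTTS and
# synthesize() are hypothetical stand-ins for a real TTS engine.
from dataclasses import dataclass

AD_TEMPLATE = (
    "Hey, it's {host_name}. If you're listening from {listener_city}, "
    "check out {brand}'s {season} sale and use code {promo_code} at checkout."
)

@dataclass
class HostVoiceTTS:
    """Placeholder for a TTS engine conditioned on a host's voice model."""
    voice_model_path: str

    def synthesize(self, text: str) -> bytes:
        # In a real system this would return audio carrying the host's
        # prosodic signature; here it is deliberately left unimplemented.
        raise NotImplementedError

def render_ad(tts: HostVoiceTTS, context: dict) -> bytes:
    """Fill the template with per-request context and synthesize the audio."""
    script = AD_TEMPLATE.format(**context)
    return tts.synthesize(script)

# Example per-listener context, e.g. resolved server-side at ad request time
context = {
    "host_name": "Alex",
    "listener_city": "Austin",
    "brand": "Acme",
    "season": "fall",
    "promo_code": "POD20",
}
```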
This represents a new, hybrid approach to audio advertising. It captures the nuances of the host’s voice, such as accent, rhythm, and intonation, while preserving the emotional connection and authenticity listeners expect. As a result, personalized, emotionally resonant ads can be deployed programmatically across multiple shows and formats.
While ideal for podcasts, this new ad format also extends naturally to other broadcasting types, including live radio and digital audio streams, where the human voice is central to delivering an engaging, compelling message. This isn’t just an idea for the future; it’s technology already coming to life today.
Ethical Considerations and Transparency
Host-read ads already exist as an effective format for both brands and podcast hosts, offering a genuine connection that benefits all parties. By applying Emotion AI principles to clone or mimic a host’s voice, we can scale the essence of host-read ads, achieving significant time and cost savings for creators and advertisers alike.
With this in mind, as with any AI-driven solution, ethical considerations must be at the forefront. The same ethical guidelines that apply to traditional host-read ads should govern AI-generated ones. Transparency is critical: listeners must be aware that they are hearing an ad, even if it sounds like it was read by the host. This aligns with existing regulations and best practices in advertising (including Section 5 of the Federal Trade Commission Act in the United States).
On top of that, the ethical deployment of AI in audio advertising must ensure that voice cloning and prosody replication are used responsibly, with hosts’ consent and audience trust being paramount. Maintaining transparency about AI’s role in these ads is key to avoiding any erosion of trust.
Conclusion
The host’s voice will remain unmatched in its ability to create emotional connections with listeners. The future of audio advertising will be defined by the perfect blend of emotional authenticity and scalability, with Emotion AI as the driving force behind this shift. This technology empowers advertisers to reach wider audiences with messages that truly resonate, marking not just an evolution but a revolution where creativity and technology come together to forge deeper, more meaningful connections than ever before.
About the Author
Alberto Betella is the co-founder of RSS.com and holds a PhD in Emotion AI. Prior to RSS.com, he served as Chief Technology Officer at Alpha, the European Moonshot Factory created by the telco giant Telefonica. Back in 2006, he developed Podcast Generator, one of the very first open-source web apps for podcast self-hosting, empowering a wide community of podcasters for over a decade.
References
[Betella & Richardson, 2024] Betella, A., & Richardson, B. (2024). Adaptive Text-to-Speech Synthesis for Dynamic Advertising Insertion in Podcasts and Broadcasts. U.S. Patent No. 12,106,330 B1. Issued October 1, 2024. https://patents.google.com/patent/US12106330B1/
[Betella et al., 2014] Betella, A., et al. (2014). Inference of human affective states from psychophysiological measurements extracted under ecologically valid conditions. Frontiers in Neuroscience, 8, 286. https://doi.org/10.3389/fnins.2014.00286
[Brinson et al., 2023] Brinson, N. H., et al. (2023). Consumer response to podcast advertising: The interactive role of persuasion knowledge and parasocial relationships. Journal of Consumer Marketing, 40(7), 971–982. https://doi.org/10.1108/JCM-01-2023-5819
[Picard, 1997] Picard, R. W. (1997). Affective computing. MIT Press.
[Wildgruber et al., 2006] Wildgruber, D., et al. (2006). Distinct frontal regions subserve evaluation of linguistic and emotional aspects of speech intonation. Cerebral Cortex, 16(10), 1230–1236. https://doi.org/10.1093/cercor/bhh099