![Why the telephony stack is the real bottleneck in voice AI [Q&A]](https://betanews.com/wp-content/uploads/2025/11/AI-robot-call-center-640x440.jpg)
Everyone talks about better LLMs and the need for more ‘human’ voices when calling a company, but none of this matters if your agent can’t survive a real world phone call.
We spoke to Alexey Aylarov, Voximplant CEO and co-founder, who believes that the real constraint is outdated telephony infrastructure that was never designed for real-time AI.
BN: What are some of the common misconceptions about Voice AI Agents?
AA: Many people believe that a Voice AI Agent is simply a ChatGPT with a voice. In reality, production-ready agents are way more complex. To operate in the real world and be able to call or take calls from real customers, AI agents require a whole infrastructure beyond the language model.
A Voice AI Agent needs a real phone number, an LLM that understands and interprets intent and generates responses, a Speech-to-Text (STT) engine to convert caller audio into text for the LLM, and Text-to-Speech (TTS) to convert the agent’s response into natural speech. It also needs conversational turn-taking capability, so the system knows when the human is speaking and when the AI should speak. And finally, it needs a telephony gateway so Voice AI can interact with the global phone network.
Developers often focus on selecting the right LLM or the latest speech provider, but to create a production-ready Voice AI Agent, you need to manage multiple components working together in real time. We call this orchestration.
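As a rough illustration of the components described above, here is a minimal sketch of a single conversational turn passing through STT, an LLM, and TTS. Every name in it (`VoiceAgent`, `handle_turn`, the `fake_*` functions) is hypothetical, and the stand-in functions replace real vendor calls. Turn-taking and the telephony gateway are left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type aliases for the three speech/AI stages.
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass
class VoiceAgent:
    """Wires the three stages together; each one can be any callable."""
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def handle_turn(self, caller_audio: bytes) -> bytes:
        """Run one turn: caller audio in, agent audio out."""
        transcript = self.stt(caller_audio)   # audio -> text
        reply_text = self.llm(transcript)     # text -> response text
        return self.tts(reply_text)           # response text -> audio


# Stand-ins for real providers (assumptions, for demonstration only).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


agent = VoiceAgent(stt=fake_stt, llm=fake_llm, tts=fake_tts)
reply = agent.handle_turn(b"I need to reschedule")
```

In a real deployment each stage would stream audio and text incrementally instead of waiting for a complete utterance, which is a large part of what makes orchestration hard.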
BN: How are B2C businesses using Voice AI today?
AA: Text-based chats are convenient for quick queries and follow-ups, but 80 percent of inbound customer interactions still come through voice, meaning calls, especially in urgent situations.
For instance, Voice AI can handle scheduling tasks end-to-end, meaning booking, cancelling, or rescheduling an appointment, or even calling a client who didn’t show up. It can support sales by calling inbound leads the moment a customer submits a form, or by qualifying them and scheduling a meeting.
Voice AI Agents can also replace legacy Interactive Voice Response (IVR) systems. That's what you reach when you call your bank or carrier and an automated voice asks you to “Select two if you have a question about billing.” Nobody likes those, so an AI is actually an improvement in this case. It can triage calls, answer FAQs, and route calls to a human agent when necessary. In logistics, it can call a customer when the courier is running late or confirm the delivery time.
Today, Voice AI is becoming both the communication and the execution layer. It’s the fastest-growing entry point for B2C businesses, and soon it will eliminate the need for traditional call centers.
BN: What is driving the adoption of AI inside voice call flows?
AA: First, it’s the ability of AI Agents to handle customer communication 24/7 across any time zone. Companies don’t need to build call centers across continents, and customers can get their issues solved instantly, without waiting until business hours. AI voice agents respond faster and reduce both abandonment and repeat calls. And the economics are compelling.
BN: What has fundamentally changed in customer support with the rise of AI Agents?
AA: Customers now expect businesses to be instantly reachable across multiple channels. They want to send a voice message on WhatsApp or call through Facebook Messenger, and they are not willing to wait 15 minutes for a response. At the same time, generative AI has matured enough to deliver this at scale in real time.
But legacy call centers and manual CRMs weren’t built for this new reality. They rely on outdated workflows, lack context, and can’t meet today’s need for speed and personalization. Enterprises need infrastructure that unifies AI models, telephony, and real-time voice systems under a single programmable layer.
BN: With the rise of AI tools, more people refer to orchestration. Why is it critical? Can you explain why platforms emerge and what this means in practice?
AA: Voice AI requires telephony, speech engines, an LLM, compliance, and more. Each country and provider has its own requirements and constraints. For example, countries like the UAE and Saudi Arabia restrict most VoIP traffic and require licenses, which makes cloud telephony deployments legally challenging in the region.
No single vendor can solve everything end-to-end, and enterprises need the ability to mix and match the best components. Orchestration platforms emerged to fill the gaps and coordinate all these components. Since LLMs are constantly evolving, the voice AI pipeline must be designed for easy switching.
With Voice AI, orchestration unifies telephony, AI, and real-time communication in a single programmable environment.
BN: As AI becomes more sophisticated, what safeguards or frameworks should developers and B2C businesses consider?
AA: Always assume you will want to swap LLM and speech vendors as they progress and develop. Make sure your pipeline allows for that. Then your Voice AI Agent will be able to evolve as models improve, new features appear, and prices change.
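One way to keep vendors swappable is to select them by name from configuration, so that no vendor is hard-coded at the call site. The sketch below assumes a hypothetical registry (`LLM_PROVIDERS`, `register`, `build_llm`) and two made-up vendors. It illustrates the idea only and does not describe any particular platform.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a config name to an LLM callable.
LLM_PROVIDERS: Dict[str, Callable[[str], str]] = {}


def register(name: str):
    """Decorator that adds a provider function to the registry."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        LLM_PROVIDERS[name] = fn
        return fn
    return decorator


@register("vendor_a")
def vendor_a(prompt: str) -> str:
    return f"[A] {prompt}"  # stand-in for a real API call


@register("vendor_b")
def vendor_b(prompt: str) -> str:
    return f"[B] {prompt}"  # stand-in for a real API call


def build_llm(config: dict) -> Callable[[str], str]:
    """Pick the provider named in the config; raises KeyError if unknown."""
    return LLM_PROVIDERS[config["llm"]]


# Swapping vendors is a one-line config change; calling code is untouched.
llm = build_llm({"llm": "vendor_a"})
answer = llm("What are your opening hours?")
```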
Also, don’t expect these systems to always act reliably or to have the same uptime as you get from other cloud services. Always configure a backup.
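A backup can be as simple as trying providers in order until one succeeds. The following is a minimal sketch with hypothetical names (`with_fallback`, `flaky_primary`, `steady_backup`). A production setup would likely add timeouts, health checks, and logging.

```python
from typing import Callable, Optional, Sequence


def with_fallback(providers: Sequence[Callable[[str], str]]) -> Callable[[str], str]:
    """Return a callable that tries each provider in order until one succeeds."""
    def call(text: str) -> str:
        last_error: Optional[Exception] = None
        for provider in providers:
            try:
                return provider(text)
            except Exception as exc:  # any failure moves on to the next provider
                last_error = exc
        raise RuntimeError("all providers failed") from last_error
    return call


# Stand-ins for a failing primary vendor and a working backup (assumptions).
def flaky_primary(text: str) -> str:
    raise TimeoutError("primary unavailable")


def steady_backup(text: str) -> str:
    return f"backup: {text}"


synthesize = with_fallback([flaky_primary, steady_backup])
result = synthesize("hello")
```

Here the primary always fails, so the call is served by the backup without the caller noticing.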
Another safeguard is to verify the global footprint of the platform you are using. Don’t assume all vendors have the same performance, reach, or compatibility everywhere. Voice infrastructure constraints vary a lot by region.
Expect continued improvements in quality and performance for speech and AI. Start with the easiest, highest ROI use cases, and then expand as the technology improves. This is crucial for handling scaling and the ever-increasing volume of calls.
BN: Looking ahead, what emerging AI capabilities do you believe hold the most promise? How will voice automation reshape customer experience economics?
AA: State-of-the-art systems will make AI sound very natural, to the point where it will be hard to distinguish it from a human. That applies not just to voice quality, but also to reasoning, adaptability, and data grounding.
Legacy IVRs, such as rigid phone menus, will vanish and be replaced by conversational Voice AI. Real-time automation will expand outbound capabilities: we will see more proactive callbacks, logistics coordination, and dispute resolution without requiring human intervention.
Image credit: phonlamai/depositphotos.com

