VoiceBot

VoiceBot is an optional feature that enables automated voice conversations with callers. It can be used for self-service scenarios, call qualification, or as a fully autonomous voice agent.

VoiceBot handles the voice channel — converting between speech and text — so that conversation backends only need to work with plain text, similar to a chatbot.

Architecture

The VoiceBot pipeline is divided into three steps, named after human organs:

| Step | Name | Function |
|------|------|----------|
| 👂 | Ears (Speech-to-Text) | Converts the caller's voice into text in real time |
| 🧠 | Brain (Conversation) | Receives text, processes it, and generates a text response |
| 👄 | Mouth (Text-to-Speech) | Converts the text response back into voice for the caller |

This separation allows complex voice interactions to be built on top of backends that only handle text. For example, a CRM system does not need to deal with audio streams, codecs, or real-time timing — it simply receives a text message and returns a text reply.

The Ears and Mouth steps also include internal features that improve the user experience, such as background audio, natural pauses, and timing adjustments to make the conversation feel more human.
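
As a rough illustration of this separation, the following minimal Python sketch models each step as a plain function and chains them for a single caller turn. All names here (`handle_turn`, `fake_ears`, `echo_brain`, `fake_mouth`) are hypothetical and are not part of UCS; the stand-in steps only encode and decode bytes rather than calling real speech services.

```python
from typing import Callable

# Hypothetical type aliases: each step is modeled as a plain function.
Ears = Callable[[bytes], str]    # Speech-to-Text: audio in, text out
Brain = Callable[[str], str]     # Conversation: text in, text out
Mouth = Callable[[str], bytes]   # Text-to-Speech: text in, audio out


def handle_turn(audio: bytes, ears: Ears, brain: Brain, mouth: Mouth) -> bytes:
    """Run one caller turn through the three steps in order."""
    caller_text = ears(audio)        # Ears: voice -> text
    reply_text = brain(caller_text)  # Brain: text -> text
    return mouth(reply_text)         # Mouth: text -> voice


# Stand-in steps for illustration only; real steps would call STT/TTS services.
def fake_ears(audio: bytes) -> str:
    return audio.decode("utf-8")


def echo_brain(text: str) -> str:
    return f"You said: {text}"


def fake_mouth(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio = handle_turn(b"hello", fake_ears, echo_brain, fake_mouth)
```

Note that `echo_brain` never touches audio: it could be swapped for any text-only backend without changing the other two steps.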

Brain Backends

The Brain step supports multiple backends:

  • Internal (LLM) — uses Azure OpenAI (e.g. GPT-4.1) as the conversation engine. The entire dialog logic runs within UCS.
  • External (integrated system) — delegates the conversation to an external system (e.g. OSL or another CRM/dialog platform). UCS sends the caller's text to the external system and receives a text response. The external system controls the dialog flow.
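
The difference between the two backends can be sketched as two classes behind one text-in, text-out interface. This is an assumption-laden illustration, not the UCS implementation: `BrainBackend`, `InternalBrain`, `ExternalBrain`, `complete`, and `send` are hypothetical names, and the lambdas stand in for a real LLM call and a real external connection.

```python
from typing import Callable, Protocol


class BrainBackend(Protocol):
    """Anything that turns the caller's text into a text reply."""

    def reply(self, caller_text: str) -> str: ...


class InternalBrain:
    """Internal backend: the dialog logic lives here and an LLM produces the reply.

    `complete` is a hypothetical stand-in for a call to an LLM deployment.
    """

    def __init__(self, complete: Callable[[str], str], instructions: str) -> None:
        self._complete = complete
        self._instructions = instructions

    def reply(self, caller_text: str) -> str:
        prompt = f"{self._instructions}\nCaller: {caller_text}"
        return self._complete(prompt)


class ExternalBrain:
    """External backend: forwards the text; the external system owns the dialog flow.

    `send` is a hypothetical stand-in for the integrator's connection.
    """

    def __init__(self, send: Callable[[str], str]) -> None:
        self._send = send

    def reply(self, caller_text: str) -> str:
        return self._send(caller_text)


def answer(brain: BrainBackend, caller_text: str) -> str:
    """The rest of the pipeline depends only on the shared interface."""
    return brain.reply(caller_text)


# Stubs for illustration only.
internal = InternalBrain(
    complete=lambda prompt: f"LLM saw {len(prompt.splitlines())} lines",
    instructions="Be brief.",
)
external = ExternalBrain(send=lambda text: f"external reply to: {text}")
```

In this sketch the internal backend builds the prompt itself, while the external one passes the caller's text through unchanged, mirroring where the dialog logic runs in each case.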

Azure Services

VoiceBot uses the same Azure resources as AI Transcription. If you have already set up AI Transcription, no additional Azure resources are required.

If you are setting up VoiceBot without AI Transcription, create the following items as described in the AI Transcription — Azure Services section:

| Item | Required for |
|------|--------------|
| App Registration | Authentication of UCS against Azure APIs |
| Azure OpenAI | Brain: internal LLM backend (only if not using an external system) |
| Azure AI Services | Ears: real-time Speech-to-Text; optionally Mouth: Text-to-Speech |
| Budget | Cost control for Azure services |

Third-Party Services

In addition to Azure, the Mouth step can use a third-party Text-to-Speech provider:

  • ElevenLabs — offers high-quality, natural-sounding voices. Requires a separate ElevenLabs account and API key.

The choice of TTS provider is configured during activation.

Activating the Feature

Once all required services are provisioned, send the following information to INSOFT for activation:

Azure credentials (if not already provided for AI Transcription):

| Parameter | Source |
|-----------|--------|
| Tenant ID | App Registration → Overview → Directory (tenant) ID |
| Client ID | App Registration → Overview → Application (client) ID |
| Client Secret | App Registration → Certificates & secrets → Value |
| Azure OpenAI Endpoint | Azure OpenAI → Overview → Endpoint |
| OpenAI Deployment Name | Azure OpenAI → Azure AI Foundry → Deployments |
| OpenAI API Version | Azure OpenAI → Azure AI Foundry → Deployments → Deployment details |
| AI Services Endpoint | Azure AI Services → Overview → Endpoint |

VoiceBot-specific configuration:

| Parameter | Description |
|-----------|-------------|
| Brain backend | internal (LLM) or external (integrated system) |
| TTS provider | azure or elevenlabs |
| ElevenLabs API Key | Only if using ElevenLabs as the TTS provider |
| External system details | Only if using an external brain backend; connection details provided by the integrator |
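
Before sending the activation request, it may help to check that the conditional items are present. The sketch below encodes the dependencies from the VoiceBot-specific parameters in a small Python check. The `VoiceBotConfig` class, its field names, and `check_activation_request` are hypothetical and do not reflect any format INSOFT requires; only the allowed values (`internal`/`external`, `azure`/`elevenlabs`) come from the list above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VoiceBotConfig:
    """Hypothetical container for the VoiceBot-specific activation parameters."""

    brain_backend: str                              # "internal" or "external"
    tts_provider: str                               # "azure" or "elevenlabs"
    elevenlabs_api_key: Optional[str] = None        # only for ElevenLabs TTS
    external_system_details: Optional[str] = None   # only for an external brain


def check_activation_request(cfg: VoiceBotConfig) -> List[str]:
    """Return a list of problems; an empty list means nothing is missing."""
    problems: List[str] = []
    if cfg.brain_backend not in ("internal", "external"):
        problems.append("Brain backend must be 'internal' or 'external'")
    if cfg.tts_provider not in ("azure", "elevenlabs"):
        problems.append("TTS provider must be 'azure' or 'elevenlabs'")
    if cfg.tts_provider == "elevenlabs" and not cfg.elevenlabs_api_key:
        problems.append("ElevenLabs API Key is required for the elevenlabs TTS provider")
    if cfg.brain_backend == "external" and not cfg.external_system_details:
        problems.append("External system details are required for an external brain backend")
    return problems


# Example: an external brain with ElevenLabs TTS but no key and no details.
issues = check_activation_request(VoiceBotConfig("external", "elevenlabs"))
```

An internal brain with Azure TTS needs neither optional field, so `check_activation_request(VoiceBotConfig("internal", "azure"))` returns an empty list.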