VoiceBot

VoiceBot is an optional feature that enables automated voice conversations with callers. It can be used for self-service scenarios, call qualification, or as a fully autonomous voice agent.

VoiceBot handles the voice channel — converting between speech and text — so that conversation backends only need to work with plain text, similar to a chatbot.

Architecture

The VoiceBot pipeline is divided into three steps, named after human organs:

Step	Name	Function
👂	Ears (Speech-to-Text)	Converts the caller's voice into text in real time
🧠	Brain (Conversation)	Receives text, processes it, and generates a text response
👄	Mouth (Text-to-Speech)	Converts the text response back into voice for the caller

This separation allows complex voice interactions to be built on top of backends that only handle text. For example, a CRM system does not need to deal with audio streams, codecs, or real-time timing — it simply receives a text message and returns a text reply.
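
The flow of a single caller turn through the three steps can be sketched as follows. This is a minimal illustration, not the actual VoiceBot implementation: the type aliases, `handle_turn`, and the stand-in stages are hypothetical names, and the "audio" is simply UTF-8 bytes so the example runs without any speech service.

```python
from typing import Callable

# Hypothetical stage signatures; the real VoiceBot interfaces are not documented here.
Ears = Callable[[bytes], str]    # audio in, text out (Speech-to-Text)
Brain = Callable[[str], str]     # text in, text out (conversation backend)
Mouth = Callable[[str], bytes]   # text in, audio out (Text-to-Speech)


def handle_turn(audio: bytes, ears: Ears, brain: Brain, mouth: Mouth) -> bytes:
    """Run one caller turn through the three steps in order."""
    caller_text = ears(audio)
    reply_text = brain(caller_text)   # the backend only ever sees plain text
    return mouth(reply_text)


# Stand-in stages for illustration only: "audio" is just UTF-8 bytes here.
def fake_ears(audio: bytes) -> str:
    return audio.decode("utf-8")


def echo_brain(text: str) -> str:
    return f"You said: {text}"


def fake_mouth(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio = handle_turn(b"hello", fake_ears, echo_brain, fake_mouth)
```

Because the Brain is just a text-to-text function in this sketch, it can be replaced without touching the audio handling on either side.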

The Ears and Mouth steps also include built-in features, such as background audio, natural pauses, and timing adjustments, that make the conversation feel more human.

Brain Backends

The Brain step supports multiple backends:

Internal (LLM) — uses Azure OpenAI (e.g. GPT-4.1) as the conversation engine. The entire dialog logic runs within UCS.
External (integrated system) — delegates the conversation to an external system (e.g. OSL or another CRM/dialog platform). UCS sends the caller's text to the external system and receives a text response. The external system controls the dialog flow.
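
One way to picture the two backends is as interchangeable implementations of the same text-in, text-out interface. The sketch below assumes this shape for illustration only: `Brain`, `InternalBrain`, `ExternalBrain`, `make_brain`, and the `transport` callable are hypothetical names, and no real LLM or external system is called.

```python
from typing import Callable, Protocol


class Brain(Protocol):
    """Hypothetical interface: caller text in, reply text out."""

    def reply(self, caller_text: str) -> str: ...


class InternalBrain:
    """Stand-in for the internal backend, where an LLM call produces the reply."""

    def __init__(self, complete: Callable[[str], str]) -> None:
        self._complete = complete  # hypothetical wrapper around an LLM call

    def reply(self, caller_text: str) -> str:
        return self._complete(caller_text)


class ExternalBrain:
    """Stand-in for the external backend, where the remote system decides the reply."""

    def __init__(self, send: Callable[[str], str]) -> None:
        self._send = send  # hypothetical client for the integrated system

    def reply(self, caller_text: str) -> str:
        return self._send(caller_text)


def make_brain(backend: str, transport: Callable[[str], str]) -> Brain:
    """Pick a backend by its configuration value, 'internal' or 'external'."""
    if backend == "internal":
        return InternalBrain(transport)
    if backend == "external":
        return ExternalBrain(transport)
    raise ValueError(f"unknown brain backend: {backend!r}")


# Usage with a dummy transport that upper-cases the caller's text.
brain = make_brain("external", lambda text: text.upper())
answer = brain.reply("where is my order")
```

In both cases the rest of the pipeline only calls `reply`, which reflects the point above: the difference between the backends is where the dialog logic lives, not how the voice channel talks to it.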

Azure Services

VoiceBot uses the same Azure resources as AI Transcription. If you have already set up AI Transcription, no additional Azure resources are required.

If you are setting up VoiceBot without AI Transcription, create the following items as described in the AI Transcription — Azure Services section:

Item	Required for
App Registration	Authentication of UCS against Azure APIs
Azure OpenAI	Brain — internal LLM backend (only if not using an external system)
Azure AI Services	Ears — real-time Speech-to-Text; optionally Mouth — Text-to-Speech
Budget	Cost control for Azure services
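
The dependency between the Brain backend and the Azure items in the table can be expressed as a small helper. This is a sketch based only on the table above; `required_azure_items` is a hypothetical function, not part of UCS, and it treats Budget as always listed because the table does not make it conditional.

```python
from typing import List


def required_azure_items(brain_backend: str) -> List[str]:
    """List the Azure items from the table that a given Brain backend needs."""
    if brain_backend not in ("internal", "external"):
        raise ValueError(f"unknown brain backend: {brain_backend!r}")
    # Needed in every case: authentication, Speech-to-Text, cost control.
    items = ["App Registration", "Azure AI Services", "Budget"]
    # Azure OpenAI is only needed when the conversation runs on the internal LLM.
    if brain_backend == "internal":
        items.insert(1, "Azure OpenAI")
    return items


items_for_external = required_azure_items("external")
items_for_internal = required_azure_items("internal")
```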

Third-Party Services

In addition to Azure, the Mouth step can use a third-party Text-to-Speech provider:

ElevenLabs — offers high-quality, natural-sounding voices. Requires a separate ElevenLabs account and API key.

The choice of TTS provider is configured during activation.

Activating the Feature

Once all required services are provisioned, send the following information to INSOFT for activation:

Azure credentials (if not already provided for AI Transcription):

Parameter	Source
Tenant ID	App Registration → Overview → Directory (tenant) ID
Client ID	App Registration → Overview → Application (client) ID
Client Secret	App Registration → Certificates & secrets → Value
Azure OpenAI Endpoint	Azure OpenAI → Overview → Endpoint
OpenAI Deployment Name	Azure OpenAI → Azure AI Foundry → Deployments
OpenAI API Version	Azure OpenAI → Azure AI Foundry → Deployments → Deployment details
AI Services Endpoint	Azure AI Services → Overview → Endpoint

VoiceBot-specific configuration:

Parameter	Description
Brain backend	`internal` (LLM) or `external` (integrated system)
TTS provider	`azure` or `elevenlabs`
ElevenLabs API Key	Only if using ElevenLabs as the TTS provider
External system details	Only if using an external brain backend — connection details provided by the integrator
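
Before sending the VoiceBot-specific configuration, it can help to check that the conditional parameters are consistent. The sketch below encodes the rules from the table above; `VoiceBotConfig`, its field names, and `validate` are hypothetical and only mirror the table rows, they are not an INSOFT or UCS format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VoiceBotConfig:
    # Field names are hypothetical; they mirror the rows of the table above.
    brain_backend: str                              # "internal" or "external"
    tts_provider: str                               # "azure" or "elevenlabs"
    elevenlabs_api_key: Optional[str] = None        # only for the elevenlabs provider
    external_system_details: Optional[str] = None   # only for the external backend


def validate(config: VoiceBotConfig) -> List[str]:
    """Return a list of problems; an empty list means the request looks complete."""
    problems: List[str] = []
    if config.brain_backend not in ("internal", "external"):
        problems.append("brain_backend must be 'internal' or 'external'")
    if config.tts_provider not in ("azure", "elevenlabs"):
        problems.append("tts_provider must be 'azure' or 'elevenlabs'")
    if config.tts_provider == "elevenlabs" and not config.elevenlabs_api_key:
        problems.append("elevenlabs_api_key is required for the elevenlabs provider")
    if config.brain_backend == "external" and not config.external_system_details:
        problems.append("external_system_details is required for the external backend")
    return problems


complete = validate(VoiceBotConfig(brain_backend="internal", tts_provider="azure"))
incomplete = validate(VoiceBotConfig(brain_backend="external", tts_provider="elevenlabs"))
```

Here `complete` is an empty list, while `incomplete` reports the missing ElevenLabs API key and the missing external system details.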
