VoiceBot
VoiceBot is an optional feature that enables automated voice conversations with callers. It can be used for self-service scenarios, call qualification, or as a fully autonomous voice agent.
VoiceBot handles the voice channel — converting between speech and text — so that conversation backends only need to work with plain text, similar to a chatbot.
Architecture
The VoiceBot pipeline is divided into three steps, named after human organs:
| Step | Name | Function |
|---|---|---|
| 👂 | Ears (Speech-to-Text) | Converts the caller's voice into text in real time |
| 🧠 | Brain (Conversation) | Receives text, processes it, and generates a text response |
| 👄 | Mouth (Text-to-Speech) | Converts the text response back into voice for the caller |
This separation allows complex voice interactions to be built on top of backends that only handle text. For example, a CRM system does not need to deal with audio streams, codecs, or real-time timing — it simply receives a text message and returns a text reply.
The Ears and Mouth steps also include internal features that improve the user experience, such as background audio, natural pauses, and timing adjustments to make the conversation feel more human.
Brain Backends
The Brain step supports multiple backends:
- Internal (LLM) — uses Azure OpenAI (e.g. GPT-4.1) as the conversation engine. The entire dialog logic runs within UCS.
- External (integrated system) — delegates the conversation to an external system (e.g. OSL or another CRM/dialog platform). UCS sends the caller's text to the external system and receives a text response. The external system controls the dialog flow.
Azure Services
VoiceBot uses the same Azure resources as AI Transcription. If you have already set up AI Transcription, no additional Azure resources are required.
If you are setting up VoiceBot without AI Transcription, create the following items as described in the AI Transcription — Azure Services section:
| Item | Required for |
|---|---|
| App Registration | Authentication of UCS against Azure APIs |
| Azure OpenAI | Brain — internal LLM backend (only if not using an external system) |
| Azure AI Services | Ears — real-time Speech-to-Text; optionally Mouth — Text-to-Speech |
| Budget | Cost control for Azure services |
Third-Party Services
In addition to Azure, the Mouth step can use a third-party Text-to-Speech provider:
- ElevenLabs — offers high-quality, natural-sounding voices. Requires a separate ElevenLabs account and API key.
The choice of TTS provider is configured during activation.
Activating the Feature
Once all required services are provisioned, send the following information to INSOFT for activation:
Azure credentials (if not already provided for AI Transcription):
| Parameter | Source |
|---|---|
| Tenant ID | App Registration → Overview → Directory (tenant) ID |
| Client ID | App Registration → Overview → Application (client) ID |
| Client Secret | App Registration → Certificates & secrets → Value |
| Azure OpenAI Endpoint | Azure OpenAI → Overview → Endpoint |
| OpenAI Deployment Name | Azure OpenAI → Azure AI Foundry → Deployments |
| OpenAI API Version | Azure OpenAI → Azure AI Foundry → Deployments → Deployment details |
| AI Services Endpoint | Azure AI Services → Overview → Endpoint |
VoiceBot-specific configuration:
| Parameter | Description |
|---|---|
| Brain backend | internal (LLM) or external (integrated system) |
| TTS provider | azure or elevenlabs |
| ElevenLabs API Key | Only if using ElevenLabs as the TTS provider |
| External system details | Only if using an external brain backend — connection details provided by the integrator |