Voice Chat
Voice chat is the always-on, telephone-style conversation mode users get by swiping right from text chat. The audio stream is bidirectional and continuous — the user just talks and the AI talks back — and the conversation is grounded in the same documents and project configuration the text chat uses.
Voice today is purpose-built for natural Q&A against your project’s knowledge base. Tool Registry, navigation suggestions, and runtime context personalization are only partly supported or not yet available in voice; see Current Capabilities for what works today.
How Users Access It
Voice mode is reached by swiping right on the chat page. The chat container is a horizontally-paged scroll view: the left page is the text chat you already know; the right page is the voice page.
The first swipe to the voice page opens the WebSocket session and starts the mic. Swiping back to text pauses the mic but keeps the session warm — re-entering within ~10 seconds reuses the same connection. After ~10 seconds idle, the session closes; the next swipe opens a fresh one.
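The warm-session rule above can be sketched as a small pure function. This is an illustration only, not the SDK's internals: `WARM_WINDOW_MS`, `VoiceSessionSnapshot`, and `canReuseSession` are hypothetical names, and the exact 10000 ms value is an assumption based on the "~10 seconds" figure.

```typescript
// Assumed value: the docs only say "~10 seconds".
const WARM_WINDOW_MS = 10000;

// Hypothetical snapshot of what the SDK would need to remember.
interface VoiceSessionSnapshot {
  socketOpen: boolean;
  // Time (ms) at which the user swiped back to text, or null while on the voice page.
  leftVoicePageAtMs: number | null;
}

// True when a swipe to the voice page at `nowMs` can reuse the existing connection.
function canReuseSession(session: VoiceSessionSnapshot, nowMs: number): boolean {
  if (!session.socketOpen) return false;
  if (session.leftVoicePageAtMs === null) return true;
  return nowMs - session.leftVoicePageAtMs < WARM_WINDOW_MS;
}
```

A swipe 4 seconds after leaving would reuse the connection; a swipe 11 seconds after leaving would open a fresh one.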
There is no “tap-to-talk” or push-to-talk button. The mic is on the entire time the user is on the voice page; the AI uses server-side voice activity detection to know when the user has finished speaking.
Enabling Voice
Voice is gated at three layers, all of which must allow it for the mic to appear:
| Layer | Where | What it does |
|---|---|---|
| Subscription plan | Set by Qafka | The project’s plan must include voice. If the plan disables it, the WebSocket connection is rejected with 4003 VOICE_NOT_ALLOWED regardless of any other setting. |
| Project override | Dashboard › Chat Theme | Per-project toggle that respects the plan ceiling. Use it to turn voice off for a specific project on a voice-enabled plan. |
| SDK prop | <Qafka voiceEnabled={false} /> | Per-app override. Default true. Use it when a project is voice-eligible but a particular build (e.g. accessibility-restricted variant) shouldn’t expose it. |
If any layer disables voice, the SDK hides the voice page entirely — no swipe affordance, no mic icon, no surprise tap.
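As a rough mental model, the three layers combine with a logical AND. The sketch below assumes hypothetical field names (`planIncludesVoice`, `projectVoiceEnabled`, `sdkVoiceEnabled`); only the `voiceEnabled` default of true and the 4003 close code come from the table above.

```typescript
// Close code documented above for a plan that does not include voice.
const VOICE_NOT_ALLOWED_CODE = 4003;

// Hypothetical shape; the real SDK may represent these gates differently.
interface VoiceGates {
  planIncludesVoice: boolean;   // subscription plan, set by Qafka
  projectVoiceEnabled: boolean; // per-project dashboard toggle
  sdkVoiceEnabled?: boolean;    // <Qafka voiceEnabled /> prop, defaults to true
}

// The voice page is shown only when every layer allows it.
function isVoicePageVisible(gates: VoiceGates): boolean {
  const sdkAllows = gates.sdkVoiceEnabled ?? true;
  return gates.planIncludesVoice && gates.projectVoiceEnabled && sdkAllows;
}
```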
Voice State Machine
The voice page has five states the SDK exposes and the voice components render:
| State | Meaning |
|---|---|
| idle | No active connection. The voice page is visible but the mic isn’t capturing. |
| connecting | WebSocket handshake + audio engine warmup. Brief. |
| listening | Mic is open, capturing user audio, waiting for VAD to detect end of speech. |
| thinking | User has finished speaking; the AI is processing (possibly invoking a tool). No audio plays. |
| speaking | The AI is speaking back. Audio is rendering through the device speaker; transcript updates token-by-token. |
Customize how each state looks by passing voiceComponents (see the widget reference). The components receive the live state, amplitude (0–1, useful for animation speed), and theme.
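If you want to reason about the states in your own components, a typed transition table can help. The five state names come from the table above; the allowed transitions are an assumption inferred from the state descriptions, not a documented contract.

```typescript
type VoiceState = 'idle' | 'connecting' | 'listening' | 'thinking' | 'speaking';

// Assumed transitions; the SDK does not document an official transition table.
const ALLOWED_TRANSITIONS: Record<VoiceState, VoiceState[]> = {
  idle: ['connecting'],
  connecting: ['listening', 'idle'],         // idle on handshake failure
  listening: ['thinking', 'idle'],
  thinking: ['speaking', 'listening', 'idle'],
  speaking: ['listening', 'idle'],           // reply finished or user barged in
};

function canTransition(from: VoiceState, to: VoiceState): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}
```

A table like this is mostly useful for driving enter and exit animations in a custom VoiceIndicator, where you care which state you came from.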
Barge-in
The user can interrupt the AI mid-sentence. As soon as the SDK detects user speech, the in-flight audio response is cancelled and the AI starts a new turn from the new question. Useful for “no, wait, I meant…” follow-ups; less useful when the AI is reading a long answer the user actually wants — keep replies concise via your voice instructions.
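One way to picture barge-in is a playback queue tagged with a turn counter: when user speech is detected, unplayed audio is dropped and late chunks from the cancelled reply are ignored. The class below is a simplified sketch with hypothetical names (`ReplyPlayback`, `bargeIn`), using strings in place of audio buffers; the SDK's actual mechanism is not documented here.

```typescript
// Stand-in for a playback buffer; chunks are strings here instead of PCM data.
class ReplyPlayback {
  private queue: string[] = [];
  private activeTurn = 0;

  // Queue a chunk only if it belongs to the turn that is still active.
  enqueue(turn: number, chunk: string): boolean {
    if (turn !== this.activeTurn) return false; // late chunk from a cancelled reply
    this.queue.push(chunk);
    return true;
  }

  // Called when user speech is detected: drop unplayed audio and start a new turn.
  bargeIn(): number {
    this.queue = [];
    this.activeTurn += 1;
    return this.activeTurn;
  }

  // Copy of the chunks still waiting to be played.
  pending(): string[] {
    return this.queue.slice();
  }
}
```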
Voice Instructions
Voice mode uses a separate per-project instructions field, edited from the Project Overview. It’s appended to the system prompt only when voice mode is active, and exists because voice prompting needs are different from text:
- Pronunciation rules (“BMW iX3 should be read as ‘i-iks-üç’”, “TCKN as ‘t-c-k-n’”)
- Length limits (“Keep answers to 2–3 sentences”)
- Tone (“Friendly but professional, use ‘siz’”)
- Language enforcement (“Always reply in Turkish, even when the user code-switches”)
The text-chat critical instructions still apply on top of voice instructions — if you’ve already written “always reply in Turkish” there, you don’t need to repeat it. Voice instructions are for things only relevant to spoken interaction.
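The layering described above might look roughly like this on the backend. It is a sketch under assumptions: `PromptParts` and `buildSystemPrompt` are hypothetical names, and the section order and separator are guesses. The only documented behavior is that voice instructions are appended solely when voice mode is active and that critical instructions apply in both modes.

```typescript
// Hypothetical inputs; field names are illustrative only.
interface PromptParts {
  baseSystemPrompt: string;
  criticalInstructions: string; // shared by text and voice
  voiceInstructions: string;    // per-project, voice only
}

function buildSystemPrompt(parts: PromptParts, mode: 'text' | 'voice'): string {
  const sections = [parts.baseSystemPrompt, parts.criticalInstructions];
  if (mode === 'voice') {
    // Appended only for voice sessions.
    sections.push(parts.voiceInstructions);
  }
  // Skip empty sections so a blank field adds no stray separators.
  return sections.filter((s) => s.trim() !== '').join('\n\n');
}
```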
Customizing the Voice Page
The voice page is built from three replaceable slots. Provide any subset via the voiceComponents prop on <Qafka />; each missing slot falls back to the SDK’s default.
| Slot | What it renders | Common use cases |
|---|---|---|
| VoiceIndicator | The animated visual that signals what the voice session is doing right now (idle pulse, listening waveform, speaking blob). Defaults to a built-in animated indicator. | Replace with your brand’s Lottie animation, an audio-reactive blob driven by amplitude, or static iconography. |
| VoiceBackground | The container behind the indicator + transcript. Defaults to a solid theme-colored view. | Replace with a gradient, animated background, video, or anything that responds to state (e.g. fade in when speaking starts). |
| VoiceTranscript | The text area that streams the user’s question and the AI’s reply. Defaults to a centered theme-typed text block. | Replace to control typography, layout (above vs below the indicator), or to render something completely different: captions, a chat-style alternation, or nothing at all. |
What Each Slot Receives
All three slots receive the live voice state, the current audio amplitude (0–1), and the active theme. VoiceBackground additionally receives children (the indicator and transcript it wraps). VoiceTranscript additionally receives transcript (the AI’s reply, streaming) and userTranscript (the user’s last spoken question).
```tsx
import { Text } from 'react-native'
import LottieView from 'lottie-react-native'
import { LinearGradient } from 'expo-linear-gradient'
import { Qafka } from '@qafka/react-native'

<Qafka
  voiceComponents={{
    VoiceIndicator: ({ state, amplitude, theme }) => (
      <LottieView
        source={require('./voice-blob.json')}
        autoPlay
        loop
        // Explicit height, because slot containers are intrinsic-sized
        style={{ height: 220 }}
        // Speed up the animation while the AI is speaking
        speed={state === 'speaking' ? 1.5 + amplitude : 1}
      />
    ),
    VoiceBackground: ({ state, children }) => (
      <LinearGradient
        colors={state === 'listening' ? ['#1e293b', '#0f172a'] : ['#0f172a', '#020617']}
        style={{ flex: 1 }}
      >
        {children}
      </LinearGradient>
    ),
    // Show the user's words until the AI starts replying, then the reply
    VoiceTranscript: ({ transcript, userTranscript, state, theme }) => (
      <Text style={{ color: theme.colors.text, fontSize: 18, padding: 24 }}>
        {state === 'listening' || state === 'thinking'
          ? userTranscript
          : transcript}
      </Text>
    ),
  }}
/>
```
Sizing Gotcha: Always Set an Explicit Height
The voice slot containers are intentionally flex: 0 (intrinsic-sized) so they don’t fight the rest of your layout. If your custom component relies on flex: 1 to fill space, it will collapse to zero height because there’s no parent flex hint to expand into. Always give your slot component an explicit height, minHeight, or fixed dimensions:
```tsx
// ❌ Collapses to nothing: no parent to flex into
VoiceIndicator: () => <View style={{ flex: 1 }}><LottieView ... /></View>

// ✅ Renders at the size you intended
VoiceIndicator: () => <View style={{ height: 220 }}><LottieView ... /></View>
```
This applies to all three slots, but it’s most commonly hit on VoiceIndicator (Lottie/animation views often need a fixed canvas).
When to Customize vs Use Defaults
The defaults are deliberately neutral so they pass for most apps without intervention. Reach for voiceComponents when:
- Your app has a strong brand identity that the default visual breaks (most common)
- You want voice mode to feel like a “moment” rather than “a chat with audio” (animated background, full-bleed indicator, narrative typography)
- You need accessibility tweaks the default doesn’t expose (high-contrast transcripts, larger text, custom motion preferences)
For everyday tone/color tuning, edit the chat theme instead — it’s automatically applied to the default voice components without writing a single line of code.
Conversation Persistence
A voice conversation is the same database object as a text conversation — same conversationId, same place in the Conversations list, same retention. Voice messages carry a transport: 'realtime' flag so they’re distinguishable from text messages within a session.
The user can swipe back and forth between text and voice within a single session and the AI keeps the conversation context — both transports write to and read from the same message history.
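If you export or inspect conversations, the transport flag lets you separate spoken turns from typed ones. The record shape below is hypothetical; only `conversationId` and `transport: 'realtime'` come from the description above, and how text messages are marked is not documented, so the flag is treated as optional.

```typescript
// Hypothetical message record; only conversationId and transport are documented.
interface StoredMessage {
  conversationId: string;
  role: 'user' | 'assistant';
  text: string;
  transport?: 'realtime'; // present on voice messages
}

// Split one conversation's history into voice and non-voice messages.
function splitByTransport(history: StoredMessage[]): { voice: StoredMessage[]; other: StoredMessage[] } {
  const voice = history.filter((m) => m.transport === 'realtime');
  const other = history.filter((m) => m.transport !== 'realtime');
  return { voice, other };
}
```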
Current Capabilities
What voice supports today:
- Spoken Q&A grounded in your project’s documents (RAG-driven via a built-in search_knowledge_base tool the AI invokes when it needs facts)
- Continuous mic + server-side VAD + barge-in
- Transcript stream alongside audio (renders in your VoiceTranscript component)
- Per-project voice instructions
- Voice-only access control via plan / project / SDK prop
What’s partial or in progress — voice today is more limited than text chat, and these gaps are actively being closed:
- Tool Registry: Only the built-in RAG tool runs in voice. Custom tools defined in Dashboard › Tools work in text chat but are not yet routed to voice sessions. Tool definitions, custom/server/custom-with-ai execution modes, and the onToolSuggested callback in voice are part of the next voice phase.
- Navigation suggestions: Voice can’t suggest screen navigation today. Text chat is the channel for navigation-driven flows.
- Runtime context personalization: The SDK forwards context to the voice session, but the backend doesn’t yet inject userContext keys (user name, current screen, active campaign) into the voice system prompt. Greetings and replies are not yet personalized in voice the way they are in text.
- External suggestions (WhatsApp, phone, app store): Same as navigation, text-only for now.
For these features, fall back to text chat. The voiceComponents and toolRenderMode SDK surface is in place ahead of full backend support so partner apps don’t have to change their integration when these features ship.
Imperative Control
For programmatic control outside the swipe UX (e.g. a custom “talk to assistant” floating button elsewhere in your app), drive voice via the widget ref:
```tsx
import { useRef } from 'react'
import { Qafka, type QafkaHandle } from '@qafka/react-native'

const qafkaRef = useRef<QafkaHandle>(null)

<Qafka ref={qafkaRef} />

// Open / close the voice session
await qafkaRef.current?.connectVoice()
await qafkaRef.current?.disconnectVoice()

// Mute / unmute the mic without ending the session; the AI keeps speaking
await qafkaRef.current?.pauseMic()
await qafkaRef.current?.resumeMic()
```
pauseMic is useful for push-to-talk overlays: on press, call resumeMic(); on release, call pauseMic(). The session stays open, so it’s faster than connecting from scratch.
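A push-to-talk overlay along those lines could be wired up as follows. This is a minimal sketch: `MicControls` is a local stand-in for the two QafkaHandle methods used, and `createPushToTalk` is a hypothetical helper, not part of the SDK. It assumes the voice session is already connected and the mic starts paused.

```typescript
// Local stand-in for the two QafkaHandle methods this helper needs.
interface MicControls {
  pauseMic(): Promise<void>;
  resumeMic(): Promise<void>;
}

interface PushToTalkHandlers {
  onPressIn(): void;
  onPressOut(): void;
}

// getHandle is typically () => qafkaRef.current, which is null before mount.
function createPushToTalk(
  getHandle: () => MicControls | null,
  onError: (err: unknown) => void = () => {},
): PushToTalkHandlers {
  return {
    onPressIn() {
      const handle = getHandle();
      if (handle) handle.resumeMic().catch(onError); // open the mic while pressed
    },
    onPressOut() {
      const handle = getHandle();
      if (handle) handle.pauseMic().catch(onError); // mute again on release
    },
  };
}
```

The returned handlers can be passed to the onPressIn and onPressOut props of a React Native Pressable.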
Connection Lifecycle
| Event | What happens |
|---|---|
| User swipes to voice page | Start the audio pipeline, open a WebSocket if none is open, then start the mic |
| User swipes back to text page | Pause the mic; keep the WebSocket open for ~10s in case they swipe back |
| ~10s idle on text page | Close the WebSocket; next swipe to voice opens a fresh session |
| 60s of silence in voice mode | Close the WebSocket; show a toast/notice in the voice page |
| App goes to background | Mic and WebSocket close immediately |
| App returns to foreground on the voice page | Open a new WebSocket and start the mic |
| Network drop | Close cleanly and surface the error to the voice page |
The audio pipeline starts before the WebSocket connection deliberately — Gemini Live can stream the first audio bytes faster than the OS audio engine warms up, and starting it after the WS leads to dropped initial chunks.
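The lifecycle can be summarized as a mapping from events to ordered steps, with the audio pipeline first on entry as explained above. The event and step names below are hypothetical labels for the table rows, not SDK identifiers, and restarting the audio pipeline on foreground is an assumption.

```typescript
type LifecycleEvent =
  | 'swipeToVoice' | 'swipeToText' | 'textIdleTimeout' | 'silenceTimeout'
  | 'appBackgrounded' | 'appForegroundedOnVoice' | 'networkDrop';

type Step =
  | 'startAudioPipeline' | 'openWebSocket' | 'startMic' | 'pauseMic'
  | 'stopMic' | 'closeWebSocket' | 'showNotice' | 'surfaceError';

// Returns the ordered steps for an event, given whether a socket is already open.
function planSteps(event: LifecycleEvent, socketOpen: boolean): Step[] {
  switch (event) {
    case 'swipeToVoice': {
      // Audio engine first, so it is warm before the first reply bytes arrive.
      const steps: Step[] = ['startAudioPipeline'];
      if (!socketOpen) steps.push('openWebSocket');
      steps.push('startMic');
      return steps;
    }
    case 'swipeToText':
      return ['pauseMic']; // socket stays warm for ~10s
    case 'textIdleTimeout':
      return ['closeWebSocket'];
    case 'silenceTimeout':
      return ['closeWebSocket', 'showNotice'];
    case 'appBackgrounded':
      return ['stopMic', 'closeWebSocket'];
    case 'appForegroundedOnVoice':
      // Assumption: the audio pipeline is restarted along with the new socket.
      return ['startAudioPipeline', 'openWebSocket', 'startMic'];
    case 'networkDrop':
      return ['closeWebSocket', 'surfaceError'];
  }
}
```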
Server-Side Components
The voice session is a Qafka-managed WebSocket between the SDK and our backend. The backend then proxies to Google’s Gemini Live API directly (not through LiteLLM, because LiteLLM doesn’t yet support Gemini Live’s native function-calling protocol). This means:
- Your API key stays server-side — clients never reach Gemini directly
- Rate limits, attestation, and budget caps run in the same backend code path as text chat
- RAG retrieval, conversation persistence, and system prompt assembly all happen on the backend
What’s Out of Scope
Voice is intentionally narrower than text in a few places, by design:
- No file upload in voice — file inputs are a text-mode interaction
- No camera / video input — audio only
- No multi-model routing — voice is Gemini Live only
- No audio recording / replay — only transcripts persist; raw audio is not stored