Voice Chat

Voice chat is the always-on, telephone-style conversation mode users get by swiping right from text chat. The audio stream is bidirectional and continuous — the user just talks and the AI talks back — and the conversation is grounded in the same documents and project configuration the text chat uses.

Voice today is purpose-built for natural Q&A against your project’s knowledge base. Tool Registry, navigation suggestions, and runtime context personalization are partially supported and improving rapidly — see Current Capabilities for what works today.

How Users Access It

Voice mode is reached by swiping right on the chat page. The chat container is a horizontally-paged scroll view: the left page is the text chat you already know; the right page is the voice page.

The first swipe to the voice page opens the WebSocket session and starts the mic. Swiping back to text pauses the mic but keeps the session warm — re-entering within ~10 seconds reuses the same connection. After ~10 seconds idle, the session closes; the next swipe opens a fresh one.
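
The keep-warm rule can be pictured as a simple time-window check. The sketch below is illustrative only: `WARM_WINDOW_MS`, `SessionSnapshot`, and `decideOnVoiceEntry` are hypothetical names rather than SDK APIs, and the 10-second value is just the approximate figure quoted above.

```typescript
// Hypothetical sketch, not part of the Qafka SDK.
const WARM_WINDOW_MS = 10_000; // approximate keep-warm window quoted in the text

interface SessionSnapshot {
  socketOpen: boolean;
  // Timestamp (ms) when the user left the voice page, or null if still on it.
  leftVoicePageAt: number | null;
}

type EntryDecision = 'reuse-session' | 'open-fresh-session';

// Decide what a swipe to the voice page should do at time `now` (ms).
function decideOnVoiceEntry(session: SessionSnapshot, now: number): EntryDecision {
  if (!session.socketOpen) return 'open-fresh-session';
  if (session.leftVoicePageAt === null) return 'reuse-session';
  const idleMs = now - session.leftVoicePageAt;
  return idleMs < WARM_WINDOW_MS ? 'reuse-session' : 'open-fresh-session';
}
```

Passing `now` in as a parameter, instead of reading the clock inside the function, keeps the rule easy to unit-test.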

There is no “tap-to-talk” or push-to-talk button. The mic is on the entire time the user is on the voice page; the AI uses server-side voice activity detection to know when the user has finished speaking.

Enabling Voice

Voice is gated at three layers, all of which must allow it for the mic to appear:

Layer	Where	What it does
Subscription plan	Set by Qafka	The project’s plan must include voice. If the plan disables it, the WebSocket connection is rejected with `4003 VOICE_NOT_ALLOWED` regardless of any other setting.
Project override	Dashboard › Chat Theme	Per-project toggle that respects the plan ceiling. Use it to turn voice off for a specific project on a voice-enabled plan.
SDK prop	`<Qafka voiceEnabled={false} />`	Per-app override. Default `true`. Use it when a project is voice-eligible but a particular build (e.g. accessibility-restricted variant) shouldn’t expose it.

If any layer disables voice, the SDK hides the voice page entirely — no swipe affordance, no mic icon, no surprise tap.
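
In effect, the three layers combine as a logical AND. The following is a minimal sketch of that rule, assuming each layer can be reduced to a boolean; `VoiceGates` and `isVoiceAvailable` are hypothetical names used for illustration, not SDK exports.

```typescript
// Hypothetical sketch of the three-layer gate; names are illustrative only.
interface VoiceGates {
  planAllowsVoice: boolean;     // subscription plan, set by Qafka
  projectVoiceEnabled: boolean; // per-project toggle in the dashboard
  sdkVoiceEnabled?: boolean;    // voiceEnabled prop; the text says it defaults to true
}

// Voice is available only when every layer allows it.
function isVoiceAvailable(gates: VoiceGates): boolean {
  const sdkEnabled = gates.sdkVoiceEnabled ?? true;
  return gates.planAllowsVoice && gates.projectVoiceEnabled && sdkEnabled;
}
```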

Voice State Machine

The voice page has five states the SDK exposes and the voice components render:

State	Meaning
`idle`	No active connection. The voice page is visible but the mic isn’t capturing.
`connecting`	WebSocket handshake + audio engine warmup. Brief.
`listening`	Mic is open, capturing user audio, waiting for VAD to detect end of speech.
`thinking`	User has finished speaking; the AI is processing (possibly invoking a tool). No audio plays.
`speaking`	The AI is speaking back. Audio is rendering through the device speaker; transcript updates token-by-token.

Customize how each state looks by passing voiceComponents (see the widget reference). The components receive the live state, amplitude (0–1, useful for animation speed), and theme.
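
To reason about how the five states relate, it can help to write them down as a transition table. The sketch below is an interpretation of the state descriptions above, not the SDK's actual implementation: the event names are invented for this example, and the real SDK may move between states differently.

```typescript
type VoiceState = 'idle' | 'connecting' | 'listening' | 'thinking' | 'speaking';

// Hypothetical event names, chosen for this sketch only.
type VoiceEvent =
  | 'connect'
  | 'connected'
  | 'userFinishedSpeaking'
  | 'aiStartedSpeaking'
  | 'aiFinishedSpeaking'
  | 'userInterrupted'
  | 'disconnect';

// Assumed transitions, inferred from the state table; unlisted pairs are no-ops.
const TRANSITIONS: Record<VoiceState, Partial<Record<VoiceEvent, VoiceState>>> = {
  idle: { connect: 'connecting' },
  connecting: { connected: 'listening', disconnect: 'idle' },
  listening: { userFinishedSpeaking: 'thinking', disconnect: 'idle' },
  thinking: { aiStartedSpeaking: 'speaking', disconnect: 'idle' },
  speaking: {
    aiFinishedSpeaking: 'listening',
    userInterrupted: 'listening', // barge-in: the user starts a new turn
    disconnect: 'idle',
  },
};

function nextVoiceState(state: VoiceState, event: VoiceEvent): VoiceState {
  return TRANSITIONS[state][event] ?? state;
}
```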

Barge-in

The user can interrupt the AI mid-sentence. As soon as the SDK detects user speech, the in-flight audio response is cancelled and the AI starts a new turn from the new question. Useful for “no, wait, I meant…” follow-ups; less useful when the AI is reading a long answer the user actually wants — keep replies concise via your voice instructions.
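
How the SDK cancels in-flight audio internally is not described here. One common way to get this behavior, shown below purely as an illustration, is to tag each audio chunk with a turn id and drop anything that belongs to a cancelled turn. `PlaybackQueue`, `AudioChunk`, and the method names are hypothetical.

```typescript
// Hypothetical sketch: a playback queue that drops audio from cancelled turns.
interface AudioChunk {
  turnId: number;
  samples: Uint8Array;
}

class PlaybackQueue {
  private currentTurn = 0;
  private queue: AudioChunk[] = [];

  // Called when a new AI response starts; returns the id to tag its chunks with.
  beginTurn(): number {
    this.currentTurn += 1;
    return this.currentTurn;
  }

  // Queue a chunk for playback unless it belongs to a cancelled or stale turn.
  enqueue(chunk: AudioChunk): boolean {
    if (chunk.turnId !== this.currentTurn) return false;
    this.queue.push(chunk);
    return true;
  }

  // Barge-in: discard pending audio and invalidate the in-flight turn,
  // so chunks that arrive late for it are rejected by enqueue().
  onUserSpeechDetected(): void {
    this.queue = [];
    this.currentTurn += 1;
  }

  get pending(): number {
    return this.queue.length;
  }
}
```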

Voice Instructions

Voice mode uses a separate per-project instructions field, edited from the Project Overview. It’s appended to the system prompt only when voice mode is active, and exists because voice prompting needs are different from text:

Pronunciation rules (“BMW iX3 should be read as ‘i-iks-üç’”, “TCKN as ‘t-c-k-n’”)
Length limits (“Keep answers to 2–3 sentences”)
Tone (“Friendly but professional, use ‘siz’”)
Language enforcement (“Always reply in Turkish, even when the user code-switches”)

The text-chat critical instructions still apply on top of voice instructions — if you’ve already written “always reply in Turkish” there, you don’t need to repeat it. Voice instructions are for things only relevant to spoken interaction.
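
The layering described above can be sketched as follows. Prompt assembly actually happens on the backend, and its real format and ordering are not documented here, so treat `ProjectPrompts`, `assembleSystemPrompt`, and the blank-line separator as assumptions made for illustration.

```typescript
// Hypothetical sketch of how the instruction layers might combine.
interface ProjectPrompts {
  basePrompt: string;
  criticalInstructions: string; // text-chat critical instructions, always applied
  voiceInstructions: string;    // applied only in voice mode
}

type ChatMode = 'text' | 'voice';

function assembleSystemPrompt(prompts: ProjectPrompts, mode: ChatMode): string {
  const parts = [prompts.basePrompt, prompts.criticalInstructions];
  if (mode === 'voice') {
    parts.push(prompts.voiceInstructions);
  }
  // Skip empty sections so an unset field does not leave stray separators.
  return parts.filter((part) => part.trim() !== '').join('\n\n');
}
```

Because the critical instructions are included in both modes, a rule such as “Always reply in Turkish” only needs to be written once.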

Customizing the Voice Page

The voice page is built from three replaceable slots. Provide any subset via the voiceComponents prop on <Qafka />; each missing slot falls back to the SDK’s default.

Slot	What it renders	Common use cases
`VoiceIndicator`	The animated visual that signals what the voice session is doing right now (idle pulse, listening waveform, speaking blob). Defaults to a built-in animated indicator.	Replace with your brand’s Lottie animation, an audio-reactive blob driven by `amplitude`, or static iconography.
`VoiceBackground`	The container behind the indicator + transcript. Defaults to a solid theme-colored view.	Replace with a gradient, animated background, video, or anything that responds to `state` (e.g. fade in when speaking starts).
`VoiceTranscript`	The text area that streams the user’s question and the AI’s reply. Defaults to a centered theme-typed text block.	Replace to control typography, layout (above vs below the indicator), or to render something completely different — captions, a chat-style alternation, or nothing at all.

What Each Slot Receives

All three slots receive the live voice state, the current audio amplitude (0–1), and the active theme. VoiceTranscript additionally receives transcript (the AI’s reply, streaming) and userTranscript (the user’s last spoken question).


import { Text } from 'react-native'
import LottieView from 'lottie-react-native'
import { LinearGradient } from 'expo-linear-gradient'
import { Qafka } from '@qafka/react-native'

<Qafka
  voiceComponents={{
    VoiceIndicator: ({ state, amplitude, theme }) => (
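      <LottieView
        source={require('./voice-blob.json')}
        autoPlay
        loop
        // Explicit height: slot containers are intrinsic-sized (see the sizing gotcha below)
        style={{ height: 220 }}
        // Speed up the animation while the AI is speaking
        speed={state === 'speaking' ? 1.5 + amplitude : 1}
      />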
    ),
    VoiceBackground: ({ state, children }) => (
      <LinearGradient
        colors={state === 'listening' ? ['#1e293b', '#0f172a'] : ['#0f172a', '#020617']}
        style={{ flex: 1 }}
      >
        {children}
      </LinearGradient>
    ),
    VoiceTranscript: ({ transcript, userTranscript, state, theme }) => (
      <Text style={{ color: theme.colors.text, fontSize: 18, padding: 24 }}>
        {state === 'listening' || state === 'thinking'
          ? userTranscript
          : transcript}
      </Text>
    ),
  }}
/>
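
If you want to keep slot logic testable outside of React, the decisions made in the example above can be pulled into plain functions. The prop interfaces below are assumptions based on the description in this section (the SDK's exported type names are not shown here), and `visibleTranscript` and `indicatorSpeed` are hypothetical helpers.

```typescript
type VoiceState = 'idle' | 'connecting' | 'listening' | 'thinking' | 'speaking';

// Assumed prop shapes, reconstructed from the prose; not the SDK's exported types.
interface VoiceSlotProps {
  state: VoiceState;
  amplitude: number; // documented range 0 to 1
  theme: unknown;
}

interface VoiceTranscriptProps extends VoiceSlotProps {
  transcript: string;     // the AI's streaming reply
  userTranscript: string; // the user's last spoken question
}

// Show the user's words while they speak or the AI thinks; otherwise the AI's reply.
function visibleTranscript(
  props: Pick<VoiceTranscriptProps, 'state' | 'transcript' | 'userTranscript'>,
): string {
  return props.state === 'listening' || props.state === 'thinking'
    ? props.userTranscript
    : props.transcript;
}

// Animation speed: faster while speaking, scaled by a clamped amplitude.
function indicatorSpeed(state: VoiceState, amplitude: number): number {
  const clamped = Math.min(1, Math.max(0, amplitude));
  return state === 'speaking' ? 1.5 + clamped : 1;
}
```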

Sizing Gotcha — Always Set an Explicit Height

The voice slot containers are intentionally flex: 0 (intrinsic-sized) so they don’t fight the rest of your layout. If your custom component relies on flex: 1 to fill space, it will collapse to zero height because there’s no parent flex hint to expand into. Always give your slot component an explicit height, minHeight, or fixed dimensions:


// ❌ Collapses to nothing — no parent to flex into
VoiceIndicator: () => <View style={{ flex: 1 }}><LottieView ... /></View>
 
// ✅ Renders at the size you intended
VoiceIndicator: () => <View style={{ height: 220 }}><LottieView ... /></View>

This applies to all three slots, but it’s most commonly hit on VoiceIndicator (Lottie/animation views often need a fixed canvas).

When to Customize vs Use Defaults

The defaults are deliberately neutral so they work for most apps without intervention. Reach for voiceComponents when:

Your app has a strong brand identity that the default visual breaks (most common)
You want voice mode to feel like a “moment” rather than “a chat with audio” (animated background, full-bleed indicator, narrative typography)
You need accessibility tweaks the default doesn’t expose (high-contrast transcripts, larger text, custom motion preferences)

For everyday tone/color tuning, edit the chat theme instead — it’s automatically applied to the default voice components without writing a single line of code.

Conversation Persistence

A voice conversation is the same database object as a text conversation — same conversationId, same place in the Conversations list, same retention. Voice messages carry a transport: 'realtime' flag so they’re distinguishable from text messages within a session.

The user can swipe back and forth between text and voice within a single session and the AI keeps the conversation context — both transports write to and read from the same message history.
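
If you process exported or fetched message history yourself, the transport flag lets you separate spoken turns from typed ones. The message shape below is an assumption for illustration: only `conversationId` and `transport: 'realtime'` come from this page, and the sketch assumes text messages simply lack the flag.

```typescript
// Hypothetical message shape; only conversationId and transport: 'realtime'
// are taken from the documentation.
interface ConversationMessage {
  conversationId: string;
  role: 'user' | 'assistant';
  text: string;
  transport?: 'realtime'; // assumed to be absent on text messages
}

// Split one conversation's history into voice and text messages, preserving order.
function splitByTransport(messages: ConversationMessage[]): {
  voice: ConversationMessage[];
  text: ConversationMessage[];
} {
  const voice: ConversationMessage[] = [];
  const text: ConversationMessage[] = [];
  for (const message of messages) {
    (message.transport === 'realtime' ? voice : text).push(message);
  }
  return { voice, text };
}
```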

Current Capabilities

What voice supports today:

Spoken Q&A grounded in your project’s documents (RAG-driven via a built-in search_knowledge_base tool the AI invokes when it needs facts)
Continuous mic + server-side VAD + barge-in
Transcript stream alongside audio (renders in your VoiceTranscript component)
Per-project voice instructions
Voice-specific access control via plan, project, and SDK prop

What’s partial or in progress — voice today is more limited than text chat, and these gaps are actively being closed:

Tool Registry — Only the built-in RAG tool runs in voice. Custom tools defined in Dashboard › Tools work in text chat but are not yet routed to voice sessions. Tool definitions, custom/server/custom-with-ai execution modes, and the onToolSuggested callback in voice are part of the next voice phase.
Navigation suggestions — Voice can’t suggest screen navigation today. Text chat is the channel for navigation-driven flows.
Runtime context personalization — The SDK forwards context to the voice session, but the backend doesn’t yet inject userContext keys (user name, current screen, active campaign) into the voice system prompt. Greetings and replies are not yet personalized in voice the way they are in text.
External suggestions (WhatsApp, phone, app store) — Same as navigation: text-only for now.

For these features, fall back to text chat. The voiceComponents and toolRenderMode SDK surface is in place ahead of full backend support so partner apps don’t have to change their integration when these features ship.

Imperative Control

For programmatic control outside the swipe UX (e.g. a custom “talk to assistant” floating button elsewhere in your app), drive voice via the widget ref:


import { useRef } from 'react'
import { Qafka, type QafkaHandle } from '@qafka/react-native'
 
const qafkaRef = useRef<QafkaHandle>(null)
 
<Qafka ref={qafkaRef} />
 
// Open / close the voice session
await qafkaRef.current?.connectVoice()
await qafkaRef.current?.disconnectVoice()
 
// Mute / unmute the mic without ending the session — the AI keeps speaking
await qafkaRef.current?.pauseMic()
await qafkaRef.current?.resumeMic()

pauseMic is useful for push-to-talk overlays: capture the press, resumeMic(); release, pauseMic(). The session stays open, so it’s faster than connecting from scratch.
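
A push-to-talk overlay along those lines might be wired up as in the sketch below. `MicControls` is a local stand-in for the two ref methods shown above, and `createPushToTalkHandlers` is a hypothetical helper, not an SDK export.

```typescript
// Local stand-in for the subset of the widget ref used here.
interface MicControls {
  pauseMic(): Promise<void>;
  resumeMic(): Promise<void>;
}

// Build press handlers that open the mic while the button is held.
// getControls is a function so the latest ref value is read on every press.
function createPushToTalkHandlers(
  getControls: () => MicControls | null,
  onError: (error: unknown) => void = () => {},
) {
  return {
    onPressIn: (): void => {
      getControls()?.resumeMic().catch(onError);
    },
    onPressOut: (): void => {
      getControls()?.pauseMic().catch(onError);
    },
  };
}
```

In an app, you would typically create the handlers with `createPushToTalkHandlers(() => qafkaRef.current)` and spread them onto a React Native `Pressable`. For a true push-to-talk feel, you would probably also call `pauseMic()` once right after `connectVoice()` so the mic starts muted; that step is an assumption about your flow, not something the SDK does for you.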

Connection Lifecycle

Event	What happens
User swipes to voice page	Start the audio pipeline, open the WebSocket if none is open, then start the mic
User swipes back to text page	Pause the mic; keep the WebSocket open for ~10s in case they swipe back
~10s idle on text page	Close the WebSocket; next swipe to voice opens a fresh session
60s of silence in voice mode	Close the WebSocket; show a toast/notice on the voice page
App goes to background	Mic and WebSocket close immediately
App returns to foreground on the voice page	Open a new WebSocket and start the mic
Network drop	Close cleanly and surface the error to the voice page

The audio pipeline starts before the WebSocket connection deliberately — Gemini Live can stream the first audio bytes faster than the OS audio engine warms up, and starting it after the WS leads to dropped initial chunks.
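
The ordering can be summarized as a small planning function. This is only a sketch of the sequence described above: the step names and `planVoiceStartup` are hypothetical, and whether the audio pipeline needs restarting on a warm re-entry is an assumption.

```typescript
// Hypothetical step names for the startup sequence.
type StartupStep = 'startAudioPipeline' | 'openWebSocket' | 'startMic';

// Return the startup steps in order: audio pipeline first, WebSocket second
// (skipped when a warm session is reused), mic last.
function planVoiceStartup(socketAlreadyOpen: boolean): StartupStep[] {
  const steps: StartupStep[] = ['startAudioPipeline'];
  if (!socketAlreadyOpen) {
    steps.push('openWebSocket');
  }
  steps.push('startMic');
  return steps;
}
```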

Server-Side Components

The voice session is a Qafka-managed WebSocket between the SDK and our backend. The backend then proxies to Google’s Gemini Live API directly (not through LiteLLM, because LiteLLM doesn’t yet support Gemini Live’s native function-calling protocol). This means:

Your API key stays server-side — clients never reach Gemini directly
Rate limits, attestation, and budget caps run in the same backend code path as text chat
RAG retrieval, conversation persistence, and system prompt assembly all happen on the backend

What’s Out of Scope

Voice is intentionally narrower than text in a few places, by design:

No file upload in voice — file inputs are a text-mode interaction
No camera / video input — audio only
No multi-model routing — voice is Gemini Live only
No audio recording / replay — only transcripts persist; raw audio is not stored