    QAFKA

    On This Page

    • How Users Access It
    • Enabling Voice
    • Voice State Machine
    • Barge-in
    • Voice Instructions
    • Customizing the Voice Page
    • What Each Slot Receives
    • Sizing Gotcha — Always Set an Explicit Height
    • When to Customize vs Use Defaults
    • Conversation Persistence
    • Current Capabilities
    • Imperative Control
    • Connection Lifecycle
    • Server-Side Components
    • What’s Out of Scope

    Voice Chat

    Voice chat is the always-on, telephone-style conversation mode users get by swiping right from text chat. The audio stream is bidirectional and continuous — the user just talks and the AI talks back — and the conversation is grounded in the same documents and project configuration the text chat uses.

    Voice today is purpose-built for natural Q&A against your project’s knowledge base. The Tool Registry, navigation suggestions, and runtime context personalization are not yet fully supported in voice; see Current Capabilities for what works today.

    How Users Access It

    Voice mode is reached by swiping right on the chat page. The chat container is a horizontally-paged scroll view: the left page is the text chat you already know; the right page is the voice page.

    The first swipe to the voice page opens the WebSocket session and starts the mic. Swiping back to text pauses the mic but keeps the session warm — re-entering within ~10 seconds reuses the same connection. After ~10 seconds idle, the session closes; the next swipe opens a fresh one.

    There is no “tap-to-talk” or push-to-talk button. The mic is on the entire time the user is on the voice page; the AI uses server-side voice activity detection to know when the user has finished speaking.

    Enabling Voice

    Voice is gated at three layers, all of which must allow it for the mic to appear:

    | Layer | Where | What it does |
    | --- | --- | --- |
    | Subscription plan | Set by Qafka | The project’s plan must include voice. If the plan disables it, the WebSocket connection is rejected with 4003 VOICE_NOT_ALLOWED regardless of any other setting. |
    | Project override | Dashboard › Chat Theme | Per-project toggle that respects the plan ceiling. Use it to turn voice off for a specific project on a voice-enabled plan. |
    | SDK prop | <Qafka voiceEnabled={false} /> | Per-app override. Default true. Use it when a project is voice-eligible but a particular build (e.g. accessibility-restricted variant) shouldn’t expose it. |

    If any layer disables voice, the SDK hides the voice page entirely — no swipe affordance, no mic icon, no surprise tap.
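
The three layers combine as a logical AND. The sketch below illustrates that rule only; it is not the SDK's implementation. The field names `planAllowsVoice` and `projectAllowsVoice` are hypothetical, and only the `voiceEnabled` prop and its default of `true` come from the table above.

```typescript
// Hypothetical shape for illustration; only `voiceEnabled` is a documented prop name.
interface VoiceGates {
  planAllowsVoice: boolean;    // set by Qafka through the subscription plan
  projectAllowsVoice: boolean; // per-project toggle in the dashboard
  voiceEnabled?: boolean;      // SDK prop, treated as true when omitted
}

// Voice is available only when every layer allows it.
function isVoiceAvailable(gates: VoiceGates): boolean {
  const sdkAllows = gates.voiceEnabled ?? true;
  return gates.planAllowsVoice && gates.projectAllowsVoice && sdkAllows;
}
```

With this rule, a single `false` at any layer is enough to hide the voice page.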

    Voice State Machine

    The SDK exposes five voice states, which the voice components render:

    | State | Meaning |
    | --- | --- |
    | idle | No active connection. The voice page is visible but the mic isn’t capturing. |
    | connecting | WebSocket handshake + audio engine warmup. Brief. |
    | listening | Mic is open, capturing user audio, waiting for VAD to detect end of speech. |
    | thinking | User has finished speaking; the AI is processing (possibly invoking a tool). No audio plays. |
    | speaking | The AI is speaking back. Audio is rendering through the device speaker; transcript updates token-by-token. |

    Customize how each state looks by passing voiceComponents (see the widget reference). The components receive the live state, amplitude (0–1, useful for animation speed), and theme.
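
As a rough illustration, the five states can be modeled as a string union. The state names come from the table above; the transition map is an assumption inferred from the state descriptions, not an official list, and `canTransition` is a hypothetical helper.

```typescript
// The five states named in the table above.
type VoiceState = "idle" | "connecting" | "listening" | "thinking" | "speaking";

// Assumed transitions, inferred from the state descriptions; the SDK may allow others.
const NEXT_STATES: Record<VoiceState, readonly VoiceState[]> = {
  idle: ["connecting"],
  connecting: ["listening", "idle"],
  listening: ["thinking", "idle"],
  thinking: ["speaking", "idle"],
  speaking: ["listening", "idle"], // end of reply or barge-in hands the turn back to the user
};

// Returns true when the assumed map allows moving from one state to another.
function canTransition(from: VoiceState, to: VoiceState): boolean {
  return NEXT_STATES[from].includes(to);
}
```

A custom component can switch on `state` the same way, since every value it receives is one of these five strings.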

    Barge-in

    The user can interrupt the AI mid-sentence. As soon as the SDK detects user speech, the in-flight audio response is cancelled and the AI starts a new turn from the new question. This is useful for “no, wait, I meant…” follow-ups, but it can also cut off a long answer the user actually wants to hear, so keep replies concise via your voice instructions.

    Voice Instructions

    Voice mode uses a separate per-project instructions field, edited from the Project Overview. It’s appended to the system prompt only when voice mode is active, and exists because voice prompting needs are different from text:

    • Pronunciation rules (“BMW iX3 should be read as ‘i-iks-üç’”, “TCKN as ‘t-c-k-n’”)
    • Length limits (“Keep answers to 2–3 sentences”)
    • Tone (“Friendly but professional, use ‘siz’”)
    • Language enforcement (“Always reply in Turkish, even when the user code-switches”)

    The text-chat critical instructions still apply on top of voice instructions — if you’ve already written “always reply in Turkish” there, you don’t need to repeat it. Voice instructions are for things only relevant to spoken interaction.

    Customizing the Voice Page

    The voice page is built from three replaceable slots. Provide any subset via the voiceComponents prop on <Qafka />; each missing slot falls back to the SDK’s default.

    | Slot | What it renders | Common use cases |
    | --- | --- | --- |
    | VoiceIndicator | The animated visual that signals what the voice session is doing right now (idle pulse, listening waveform, speaking blob). Defaults to a built-in animated indicator. | Replace with your brand’s Lottie animation, an audio-reactive blob driven by amplitude, or static iconography. |
    | VoiceBackground | The container behind the indicator + transcript. Defaults to a solid theme-colored view. | Replace with a gradient, animated background, video, or anything that responds to state (e.g. fade in when speaking starts). |
    | VoiceTranscript | The text area that streams the user’s question and the AI’s reply. Defaults to a centered theme-typed text block. | Replace to control typography, layout (above vs below the indicator), or to render something completely different: captions, a chat-style alternation, or nothing at all. |

    What Each Slot Receives

    All three slots receive the live voice state, the current audio amplitude (0–1), and the active theme. VoiceTranscript additionally receives transcript (the AI’s reply, streaming) and userTranscript (the user’s last spoken question).

        import { Text } from 'react-native'
        import LottieView from 'lottie-react-native'
        import { LinearGradient } from 'expo-linear-gradient'
        import { Qafka } from '@qafka/react-native'

        <Qafka
          voiceComponents={{
            VoiceIndicator: ({ state, amplitude, theme }) => (
              <LottieView
                source={require('./voice-blob.json')}
                autoPlay
                loop
                // Speed up animation when actively speaking
                speed={state === 'speaking' ? 1.5 + amplitude : 1}
                // Explicit height so the slot does not collapse (see the sizing section)
                style={{ height: 220 }}
              />
            ),
            VoiceBackground: ({ state, children }) => (
              <LinearGradient
                colors={state === 'listening' ? ['#1e293b', '#0f172a'] : ['#0f172a', '#020617']}
                // If this collapses in your layout, set an explicit height (see the sizing section)
                style={{ flex: 1 }}
              >
                {children}
              </LinearGradient>
            ),
            VoiceTranscript: ({ transcript, userTranscript, state, theme }) => (
              <Text style={{ color: theme.colors.text, fontSize: 18, padding: 24 }}>
                {state === 'listening' || state === 'thinking' ? userTranscript : transcript}
              </Text>
            ),
          }}
        />
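
The props described above can be sketched as TypeScript types, with the two decisions the example makes pulled out into plain functions. The type and function names here (`VoiceSlotProps`, `VoiceTranscriptProps`, `visibleTranscript`, `indicatorSpeed`) are hypothetical; the SDK's exported types may be named and shaped differently.

```typescript
type VoiceState = "idle" | "connecting" | "listening" | "thinking" | "speaking";

// Hypothetical prop types based on the description above.
interface VoiceSlotProps {
  state: VoiceState;
  amplitude: number; // 0 to 1
  theme: unknown;    // the active theme; its shape is not specified here
}

interface VoiceTranscriptProps extends VoiceSlotProps {
  transcript: string;     // the AI's streaming reply
  userTranscript: string; // the user's last spoken question
}

// Same choice as the example: show the user's words while listening or thinking.
function visibleTranscript(
  props: Pick<VoiceTranscriptProps, "state" | "transcript" | "userTranscript">
): string {
  const showUser = props.state === "listening" || props.state === "thinking";
  return showUser ? props.userTranscript : props.transcript;
}

// Same speed rule as the example, with amplitude clamped to the 0..1 range.
function indicatorSpeed(state: VoiceState, amplitude: number): number {
  const clamped = Math.min(1, Math.max(0, amplitude));
  return state === "speaking" ? 1.5 + clamped : 1;
}
```

Keeping these rules in plain functions makes them easy to unit test without rendering the widget.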

    Sizing Gotcha — Always Set an Explicit Height

    The voice slot containers are intentionally flex: 0 (intrinsic-sized) so they don’t fight the rest of your layout. If your custom component relies on flex: 1 to fill space, it will collapse to zero height because there’s no parent flex hint to expand into. Always give your slot component an explicit height, minHeight, or fixed dimensions:

        // ❌ Collapses to nothing: no parent to flex into
        VoiceIndicator: () => <View style={{ flex: 1 }}><LottieView ... /></View>

        // ✅ Renders at the size you intended
        VoiceIndicator: () => <View style={{ height: 220 }}><LottieView ... /></View>

    This applies to all three slots, but it’s most commonly hit on VoiceIndicator (Lottie/animation views often need a fixed canvas).

    When to Customize vs Use Defaults

    The defaults are deliberately neutral so they pass for most apps without intervention. Reach for voiceComponents when:

    • Your app has a strong brand identity that the default visual breaks (most common)
    • You want voice mode to feel like a “moment” rather than “a chat with audio” (animated background, full-bleed indicator, narrative typography)
    • You need accessibility tweaks the default doesn’t expose (high-contrast transcripts, larger text, custom motion preferences)

    For everyday tone/color tuning, edit the chat theme instead — it’s automatically applied to the default voice components without writing a single line of code.

    Conversation Persistence

    A voice conversation is the same database object as a text conversation — same conversationId, same place in the Conversations list, same retention. Voice messages carry a transport: 'realtime' flag so they’re distinguishable from text messages within a session.

    The user can swipe back and forth between text and voice within a single session and the AI keeps the conversation context — both transports write to and read from the same message history.
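
Because both transports share one history, the `transport` flag is what separates spoken turns from typed ones. The sketch below assumes a hypothetical message shape; only `conversationId` and the `transport: 'realtime'` value are described above, and the real API response may look different.

```typescript
// Hypothetical message shape for illustration.
interface StoredMessage {
  conversationId: string;
  role: "user" | "assistant";
  content: string;
  transport?: string; // 'realtime' on voice messages; other values are not documented here
}

// Keeps only the messages that were exchanged over voice.
function voiceMessages(history: readonly StoredMessage[]): StoredMessage[] {
  return history.filter((message) => message.transport === "realtime");
}
```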

    Current Capabilities

    What voice supports today:

    • Spoken Q&A grounded in your project’s documents (RAG-driven via a built-in search_knowledge_base tool the AI invokes when it needs facts)
    • Continuous mic + server-side VAD + barge-in
    • Transcript stream alongside audio (renders in your VoiceTranscript component)
    • Per-project voice instructions
    • Voice-only access control via plan / project / SDK prop

    What’s partial or in progress — voice today is more limited than text chat, and these gaps are actively being closed:

    • Tool Registry — Only the built-in RAG tool runs in voice. Custom tools defined in Dashboard › Tools work in text chat but are not yet routed to voice sessions. Tool definitions, custom/server/custom-with-ai execution modes, and the onToolSuggested callback in voice are part of the next voice phase.
    • Navigation suggestions — Voice can’t suggest screen navigation today. Text chat is the channel for navigation-driven flows.
    • Runtime context personalization — The SDK forwards context to the voice session, but the backend doesn’t yet inject userContext keys (user name, current screen, active campaign) into the voice system prompt. Greetings and replies are not yet personalized in voice the way they are in text.
    • External suggestions (WhatsApp, phone, app store) — Same as navigation: text-only for now.

    For these features, fall back to text chat. The voiceComponents and toolRenderMode SDK surface is in place ahead of full backend support so partner apps don’t have to change their integration when these features ship.

    Imperative Control

    For programmatic control outside the swipe UX (e.g. a custom “talk to assistant” floating button elsewhere in your app), drive voice via the widget ref:

        import { useRef } from 'react'
        import { Qafka, type QafkaHandle } from '@qafka/react-native'

        const qafkaRef = useRef<QafkaHandle>(null)

        <Qafka ref={qafkaRef} />

        // Open / close the voice session
        await qafkaRef.current?.connectVoice()
        await qafkaRef.current?.disconnectVoice()

        // Mute / unmute the mic without ending the session (the AI keeps speaking)
        await qafkaRef.current?.pauseMic()
        await qafkaRef.current?.resumeMic()

    pauseMic and resumeMic are useful for push-to-talk overlays: call resumeMic() when the press starts and pauseMic() when it is released. The session stays open, so this is faster than connecting from scratch.
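
One possible way to wire that push-to-talk pattern is sketched below. `MicControl` and `createPushToTalk` are hypothetical names; the two method names come from the snippet above, and the `Promise<void>` return type is an assumption based on the calls being awaited.

```typescript
// Minimal slice of the widget handle used here (return types are assumed).
interface MicControl {
  pauseMic(): Promise<void>;
  resumeMic(): Promise<void>;
}

// Builds press handlers for a push-to-talk button.
// `getHandle` is read on every press so a ref that is still null is tolerated.
function createPushToTalk(getHandle: () => MicControl | null) {
  return {
    // Press starts: open the mic.
    onPressIn: async (): Promise<void> => {
      await getHandle()?.resumeMic();
    },
    // Press released: mute the mic but keep the session open.
    onPressOut: async (): Promise<void> => {
      await getHandle()?.pauseMic();
    },
  };
}
```

In a component, the returned handlers could be passed to a React Native `Pressable` as `onPressIn` and `onPressOut`, with `() => qafkaRef.current` as `getHandle`.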

    Connection Lifecycle

    | Event | What happens |
    | --- | --- |
    | User swipes to voice page | Start the audio pipeline, open the WebSocket if none is open, then start the mic |
    | User swipes back to text page | Pause the mic; keep the WebSocket open for ~10s in case they swipe back |
    | 10s idle on text page | Close the WebSocket; next swipe to voice opens a fresh session |
    | 60s of silence in voice mode | Close the WebSocket; show a toast/notice in the voice page |
    | App goes to background | Mic and WebSocket close immediately |
    | App returns to foreground on the voice page | Open a new WebSocket and start the mic |
    | Network drop | Close cleanly and surface the error to the voice page |

    The audio pipeline deliberately starts before the WebSocket connection: Gemini Live can stream the first audio bytes faster than the OS audio engine warms up, so starting the pipeline after the WebSocket opens leads to dropped initial chunks.
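
The lifecycle can be read as a mapping from events to ordered actions. The sketch below is an illustration, not the SDK's internals: every event and action name is a hypothetical label, and the swipe-to-voice order follows the paragraph above (audio pipeline first).

```typescript
// Hypothetical labels for the lifecycle events and the actions they trigger.
type LifecycleEvent =
  | "swipeToVoice"
  | "swipeToText"
  | "textIdleTimeout"
  | "voiceSilenceTimeout"
  | "appBackgrounded"
  | "appForegroundedOnVoice"
  | "networkDrop";

type LifecycleAction =
  | "startAudioPipeline"
  | "openSocket"
  | "startMic"
  | "pauseMic"
  | "stopMic"
  | "closeSocket"
  | "showNotice"
  | "surfaceError";

// Approximate idle window on the text page before the session closes (about 10 s).
const TEXT_IDLE_MS = 10000;

// Events whose actions do not depend on whether a socket is already open.
const FIXED_ACTIONS: Record<Exclude<LifecycleEvent, "swipeToVoice">, LifecycleAction[]> = {
  swipeToText: ["pauseMic"],
  textIdleTimeout: ["closeSocket"],
  voiceSilenceTimeout: ["closeSocket", "showNotice"],
  appBackgrounded: ["stopMic", "closeSocket"],
  appForegroundedOnVoice: ["openSocket", "startMic"],
  networkDrop: ["closeSocket", "surfaceError"],
};

function actionsFor(event: LifecycleEvent, socketOpen: boolean): LifecycleAction[] {
  if (event === "swipeToVoice") {
    // Audio pipeline first, then the socket only if none is open, then the mic.
    return socketOpen
      ? ["startAudioPipeline", "startMic"]
      : ["startAudioPipeline", "openSocket", "startMic"];
  }
  return FIXED_ACTIONS[event];
}

// True while a swipe back to voice would still reuse the warm session.
function canReuseSession(msSinceLeavingVoice: number): boolean {
  return msSinceLeavingVoice < TEXT_IDLE_MS;
}
```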

    Server-Side Components

    The voice session is a Qafka-managed WebSocket between the SDK and our backend. The backend then proxies to Google’s Gemini Live API directly (not through LiteLLM, because LiteLLM doesn’t yet support Gemini Live’s native function-calling protocol). This means:

    • Your API key stays server-side — clients never reach Gemini directly
    • Rate limits, attestation, and budget caps run in the same backend code path as text chat
    • RAG retrieval, conversation persistence, and system prompt assembly all happen on the backend

    What’s Out of Scope

    Voice is intentionally narrower than text in a few places:

    • No file upload in voice — file inputs are a text-mode interaction
    • No camera / video input — audio only
    • No multi-model routing — voice is Gemini Live only
    • No audio recording / replay — only transcripts persist; raw audio is not stored
    Last updated on June 3, 2026

    MIT 2026 QAFKA