Voice Control for Cline: VS Code + ElevenLabs MCP

By Anis Marrouchi & AI Bot


Introduction

Cline is a powerful AI coding agent within VS Code. While text input is standard, wouldn't it be convenient to issue commands using your voice? This tutorial demonstrates how we added voice command capabilities to Cline by creating a dedicated VS Code extension (cline-voice-assistant) that leverages an ElevenLabs MCP (Model Context Protocol) server for accurate speech-to-text (STT) transcription.

What this solution provides:

  • Hands-free Interaction: Trigger voice recording via a command or keybinding.
  • Accurate Transcription: Utilizes ElevenLabs' STT API via a local MCP server.
  • Seamless Integration: Sends the transcribed text directly to the main Cline extension for processing.

How it Works:

  1. The user runs the Cline: Start Voice Command command in VS Code (provided by cline-voice-assistant).
  2. The extension uses the sox command-line tool to record audio from the default microphone, saving it to a temporary file.
  3. The extension connects to a locally running elevenlabs-mcp-server using the MCP SDK.
  4. It calls the elevenlabs_stt tool on the MCP server, passing the path to the recorded audio file.
  5. The MCP server sends the audio to the ElevenLabs API and returns the transcription.
  6. The cline-voice-assistant extension retrieves the API exported by the main Cline extension (saoudrizwan.claude-dev).
  7. It uses the sendMessage method from the Cline API to send the transcribed text to the main Cline chat interface.
  8. Cline processes the text as if it were typed, and the response appears in the chat window.

This tutorial focuses on voice input. The response from Cline will still be text-based in the chat window. Adding voice output (Text-to-Speech) for Cline's responses would require further modifications, potentially to the main Cline extension itself.

Step-by-Step Guide

Let's walk through the key steps involved in creating this voice assistant setup.

Prerequisites

  • Cline Extension: The main Cline VS Code extension (saoudrizwan.claude-dev) must be installed.
  • Node.js & npm: Required for running MCP servers and building extensions.
  • sox: A command-line audio utility. Install it (e.g., on macOS: brew install sox).
  • ElevenLabs Account & API Key: Sign up at ElevenLabs and get an API key.
  • Saiku Project: This tutorial assumes you are working within the Saiku project structure (/Users/macbookpro/Developer/saiku in this example).

Create ElevenLabs MCP Server

We need a server to handle STT requests using the ElevenLabs API.

  1. Create Server Project:
    cd /path/to/your/mcp/servers # e.g., /Users/macbookpro/Documents/Cline/MCP
    npx @modelcontextprotocol/create-server elevenlabs-mcp-server
    cd elevenlabs-mcp-server
    npm install elevenlabs form-data node-fetch@2 @types/node-fetch@2
  2. Implement Server (src/index.ts): Create src/index.ts with Node.js code to:
    • Import necessary modules (@modelcontextprotocol/sdk, elevenlabs, fs, path, os, child_process, form-data, node-fetch).
    • Read the ELEVENLABS_API_KEY from environment variables.
    • Define MCP tools: elevenlabs_stt, elevenlabs_tts, elevenlabs_tts_and_play.
    • Implement the handleSttRequest function:
      • Takes a filePath argument.
      • Reads the audio file.
      • Creates FormData with the file and model_id: 'scribe_v1'.
      • Makes a POST request to https://api.elevenlabs.io/v1/speech-to-text with the API key and form data.
      • Parses the JSON response and returns the transcribed text.
    • Implement handlers for TTS tools (using elevenlabs SDK or fetch).
    • Start the server using StdioServerTransport. (Refer to the development log for the full server code we created earlier)
  3. Build Server:
    npm run build --prefix /path/to/your/mcp/servers/elevenlabs-mcp-server
  4. Configure in Cline Settings: Add the server to your cline_mcp_settings.json (usually in .../User/globalStorage/saoudrizwan.claude-dev/settings/):
    {
      "mcpServers": {
        // ... other servers
        "elevenlabs-mcp-server": {
          "command": "node",
          "args": ["/full/path/to/elevenlabs-mcp-server/build/index.js"],
          "env": {
            "ELEVENLABS_API_KEY": "YOUR_ELEVENLABS_API_KEY"
          },
          "disabled": false,
          "autoApprove": []
        }
      }
    }
    Replace the path and API key accordingly. Cline should automatically start this server.

Create Voice Assistant VS Code Extension

This extension handles recording and orchestrates the STT process.

  1. Scaffold Extension (Manual): Since the yo code generator had issues in our setup, we created the structure manually inside /Users/macbookpro/Developer/saiku/extensions/:
    • Create directory cline-voice-assistant.
    • Create package.json (define name, command cline-voice-assistant.startVoiceCommand, main entry ./out/extension.js, dependencies like @modelcontextprotocol/sdk, devDependencies like @types/vscode, typescript).
    • Create tsconfig.json (configure tsc to compile to out/ directory, module commonjs).
    • Create src/ directory.
    • Copy cline.d.ts from extensions/cline/src/exports/ to extensions/cline-voice-assistant/src/.
    • Create basic README.md and .gitignore. (Refer to the development log for the exact file contents)
  2. Install Dependencies:
    npm install --prefix extensions/cline-voice-assistant
  3. Implement Extension Logic (src/extension.ts): Create src/extension.ts with code to:
    • Import vscode, path, os, fs, exec, MCP Client, StdioClientTransport, and the copied ClineAPI type.
    • Define constants for the temporary audio file path and the main Cline extension ID.
    • Implement the activate function:
      • Get the main Cline extension's exported API (vscode.extensions.getExtension(...).activate()). Handle errors if not found or API is invalid.
      • Read the shared cline_mcp_settings.json to find the elevenlabs-mcp-server configuration (command, args, env). Handle errors if not found.
      • Instantiate the MCP Client.
      • Instantiate StdioClientTransport using the configuration read from settings (merge process.env with server-specific env, add --stdio to args).
      • Connect the client to the transport. Handle connection errors.
      • Register the cline-voice-assistant.startVoiceCommand command:
        • Check if MCP client and Cline API are available.
        • Execute sox -d /tmp/cline_command.wav silence 1 0.1 1% 1 1.0 1% to record audio. Handle errors.
        • Call mcpClient.callTool({ name: 'elevenlabs_stt', arguments: { file_path: ... } }). Handle errors and parse the result.
        • Call clineApi.sendMessage(transcribedText). Handle errors.
      • Register the command disposable and the transport close logic for deactivation.
    • Export the activate and deactivate functions. (Refer to the development log for the full extension code we created earlier)
  4. Compile Extension:
    npm run compile --prefix extensions/cline-voice-assistant
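The settings lookup in step 3 can be isolated into pure helpers, which keeps the extension's activate function small. This is a sketch assuming the cline_mcp_settings.json shape shown earlier; findServerConfig and buildTransportOptions are hypothetical names, and the resulting options object would be passed to the MCP SDK's StdioClientTransport:

```typescript
interface McpServerConfig {
  command: string;
  args: string[];
  env?: Record<string, string>;
}

interface McpSettings {
  mcpServers?: Record<string, McpServerConfig>;
}

// Look up one server's launch configuration in the parsed settings file.
function findServerConfig(settings: McpSettings, name: string): McpServerConfig {
  const config = settings.mcpServers?.[name];
  if (!config) throw new Error(`MCP server "${name}" not found in settings`);
  return config;
}

// Merge the server-specific env over the base environment and append
// --stdio, producing the launch options for StdioClientTransport.
function buildTransportOptions(
  config: McpServerConfig,
  baseEnv: Record<string, string | undefined>
) {
  return {
    command: config.command,
    args: [...config.args, "--stdio"],
    env: { ...baseEnv, ...config.env } as Record<string, string>,
  };
}
```

Reading the shared settings file this way means the extension and Cline always agree on how the elevenlabs-mcp-server is launched, with no duplicated configuration.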

Package and Install

  1. Install vsce:
    npm install -g @vscode/vsce
    (The vsce package on npm was renamed to @vscode/vsce; the CLI command is still vsce.)
  2. Package:
    cd extensions/cline-voice-assistant
    vsce package
    (You might need to confirm 'y' if it warns about missing LICENSE/README). This creates cline-voice-assistant-0.0.1.vsix.
  3. Install:
    code --install-extension extensions/cline-voice-assistant/cline-voice-assistant-0.0.1.vsix --force
    Alternatively, install manually via the Extensions view (... > "Install from VSIX...").
  4. Restart VS Code: Restart VS Code completely.

Usage

  1. Ensure sox is installed and the elevenlabs-mcp-server is configured in Cline's MCP settings (Cline launches it automatically).
  2. Open the Command Palette (Cmd+Shift+P or Ctrl+Shift+P).
  3. Run Cline: Start Voice Command.
  4. Speak your command.
  5. The transcribed text appears in the Cline chat window, followed by Cline's text response. Check Developer Tools (Help > Toggle Developer Tools > Console) for logs or errors.

Conclusion

By creating a dedicated VS Code extension and leveraging an ElevenLabs MCP server, we've successfully enabled voice command input for Cline. This setup uses sox for recording, the MCP server for ElevenLabs STT, and the main Cline extension's API to process the transcribed text. While the response remains text-based, this provides a significant convenience for hands-free interaction.

Future Possibilities

This setup provides a solid foundation for voice input. Here are some potential next steps:

  • Voice Output: Modify the main Cline extension (extensions/cline/) to check if input came via voice and, if so, use the elevenlabs_tts_and_play MCP tool to speak the response instead of just displaying text. This requires understanding and modifying the Cline extension's core logic.
  • Alternative STT: Replace the ElevenLabs MCP server with one using a different STT service (like Whisper, either local via whisper.cpp or the OpenAI API).
  • Integrated Recording: Replace the sox dependency by implementing recording directly within the VS Code extension using Webview APIs (MediaRecorder), making the setup more self-contained.
  • UI Button: Add a microphone button to the Cline UI instead of relying on the Command Palette.

If you're interested in enhancing Cline's capabilities, consider:

  • Forking the Project: Explore the Saiku codebase (https://github.com/nooqta/saiku) and experiment with your own modifications.
  • Contributing: If you develop improvements, consider contributing back to the main project following their contribution guidelines.

Happy voice coding!

