Building a Voice-Driven AI Agent with ServiceNow and ElevenLabs: A Fun PoC Journey
Have you ever dreamt of having an AI voice assistant to handle your service requests, like creating tickets in ServiceNow while chatting back with you in a soothing voice? Well, I did, and guess what? I made it happen! Let me take you on a journey where code meets creativity, and humor keeps the bugs away (mostly).
The Idea
The goal was simple: create an AI-powered voice agent that:
1. Listens to your commands.
2. Understands your intent (well, most of the time).
3. Creates ServiceNow tickets.
4. Responds with a lovely voice, saying, “Your wish is my command!”
Sounds cool, right? Let’s dive into the how!
Step 1: Setting Up the Tech Stack
For this PoC, I used:
- Node.js for the backend.
- Express.js to handle API requests.
- OpenAI GPT-4 to process user input and extract ticket details.
- ElevenLabs for Text-to-Speech (TTS).
- HTML/CSS/JavaScript for a sleek front-end.
- The browser’s Web Speech API for transcription (Google’s Speech-to-Text API would work just as well).
- A generous helping of coffee ☕.
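Before we get to the code, here is the .env file the middleware below expects. The variable names come straight from the code; the values are placeholders you will fill in with your own credentials, and the ServiceNow URL assumes the standard Table API endpoint for the incident table (adjust it if your instance exposes something else).
.env
# OpenAI
OPENAI_API_KEY=<your-openai-api-key>
ORG_ID=<your-openai-org-id>
PROJECT_ID=<your-openai-project-id>
# ElevenLabs
ELEVENLABS_API_KEY=<your-elevenlabs-api-key>
# ServiceNow (Table API endpoint for the incident table)
SERVICENOW_URL=https://<your-instance>.service-now.com/api/now/table/incident
SERVICENOW_USER=<your-servicenow-user>
SERVICENOW_PASS=<your-servicenow-password>
# Optional: server port (defaults to 5000)
PORT=5000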
Step 2: The Middleware — A Magical Bridge
The middleware connects the front-end with OpenAI, ServiceNow, and ElevenLabs. It processes user input, extracts intent, and makes the magic happen. Here’s the code for our middleware:
app.js
import express from 'express';
import bodyParser from 'body-parser';
import { createServiceNowTicket } from './serviceNowMiddleware.js';
import OpenAI from "openai";
import dotenv from 'dotenv';
import cors from "cors";
import axios from "axios";
dotenv.config();
// PoC shortcut: this disables TLS certificate verification globally (handy for self-signed certs on a dev instance). Never ship this to production.
process.env.NODE_TLS_REJECT_UNAUTHORIZED = '0';
const app = express();
const openai = new OpenAI({
organization: process.env.ORG_ID,
apiKey: process.env.OPENAI_API_KEY,
project: process.env.PROJECT_ID
});
app.use(cors()); // Allow the front-end (opened from a file or another origin) to call this API
app.use(bodyParser.json());
// Process Text from Front-End
app.post('/process_text', async (req, res) => {
try {
const { text } = req.body;
// Step 1: Use OpenAI GPT-4 for intent analysis
const nlpResponse = await openai.chat.completions.create({
model: 'gpt-4',
messages: [
{ role: 'system', content: 'You are a virtual assistant integrated with ServiceNow. Your purpose is to help users create incident tickets in ServiceNow. When a user describes an issue, extract the relevant details (like description and priority) and return a structured JSON object. Only respond with the JSON object containing: - "description": A short description of the issue. - "priority": A number from 1 to 5 indicating the priority (1 = high, 5 = low). Do not provide any other text or explanations.' },
{ role: 'user', content: text },
],
});
const gptOutput = nlpResponse.choices[0].message.content;
console.log('GPT-4 Response:', gptOutput);
let description, priority;
// Step 2: Parse GPT-4 output or handle non-JSON response
try {
const parsedOutput = JSON.parse(gptOutput);
description = parsedOutput.description;
priority = parsedOutput.priority;
if (!description || !priority) {
throw new Error('Missing required fields in JSON.');
}
} catch (error) {
console.warn('Failed to parse GPT output:', error.message);
// Fallback: Use user input as a placeholder for description
description = text;
priority = null; // Indicate missing priority
}
// Step 3: Handle missing priority or description
if (!priority) {
const followUpText = `I noticed you didn’t specify a priority for your ticket. Can you provide one? (e.g., High, Medium, Low)`;
return generateTTSResponse(res, followUpText); // Ask the user for more details
}
// Step 4: Create a ServiceNow ticket
const ticketNumber = await createServiceNowTicket(description, priority);
// Step 5: Generate success response
const responseText = `Your ticket has been created successfully. The ticket number is ${ticketNumber}.`;
return generateTTSResponse(res, responseText);
} catch (error) {
console.error('Error processing text:', error.message);
// Graceful fallback for unexpected errors
const fallbackResponseText = 'I encountered an issue processing your request. Please try again later.';
return generateTTSResponse(res, fallbackResponseText);
}
});
const generateTTSResponse = async (res, text) => {
try {
// ElevenLabs TTS API Request
const response = await axios.post(
`https://api.elevenlabs.io/v1/text-to-speech/<your-voice-id>`,
{
text,
model_id: "eleven_monolingual_v1", // Use correct model ID
voice_settings: {
stability: 0.5,
similarity_boost: 0.75,
},
},
{
headers: {
'Content-Type': 'application/json',
'xi-api-key': process.env.ELEVENLABS_API_KEY,
},
responseType: 'arraybuffer',
}
);
// Convert audio to Base64
const audioBase64 = Buffer.from(response.data).toString('base64');
// Send both text and audio (as Base64)
res.json({
status: 'success',
text: text,
audio: `data:audio/mpeg;base64,${audioBase64}`, // Data URI for audio
});
} catch (error) {
console.error('Error generating TTS response:', error.message);
res.status(500).send('Failed to generate audio response.');
}
};
// Serve any saved audio files from the responses folder (optional here, since the response above already inlines the audio as Base64)
app.use(express.static('responses'));
// Start the Server
const PORT = process.env.PORT || 5000;
app.listen(PORT, () => console.log(`Server running on port ${PORT}`));
You can grab a voice ID from the Voices section of your ElevenLabs dashboard, or pull the list through the API.
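If you prefer doing that programmatically, here is a minimal sketch (reusing the ELEVENLABS_API_KEY from .env) that calls the ElevenLabs voices endpoint and prints each voice name with its ID. The file name listVoices.js is just my suggestion.
listVoices.js
import axios from 'axios';
import dotenv from 'dotenv';
dotenv.config();
// Fetch all voices available to the account and print name + voice_id
const { data } = await axios.get('https://api.elevenlabs.io/v1/voices', {
  headers: { 'xi-api-key': process.env.ELEVENLABS_API_KEY },
});
data.voices.forEach((voice) => console.log(`${voice.name}: ${voice.voice_id}`));
Run it with node listVoices.js (it uses top-level await, so it needs to run as an ES module, same as app.js).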
serviceNowMiddleware.js
import axios from 'axios';
import dotenv from 'dotenv';
dotenv.config();
const { SERVICENOW_URL, SERVICENOW_USER, SERVICENOW_PASS } = process.env;
export const createServiceNowTicket = async (description, priority) => {
try {
const response = await axios.post(
SERVICENOW_URL,
{
short_description: description,
priority: priority || '3',
},
{
auth: {
username: SERVICENOW_USER,
password: SERVICENOW_PASS,
},
headers: {
'Content-Type': 'application/json',
},
}
);
console.log(response.data); // Full Table API response; the ticket number lives under result.number
return response.data.result.number;
} catch (error) {
console.error('ServiceNow API Error:', error.message);
throw new Error('Failed to create ServiceNow ticket');
}
};
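A quick note on SERVICENOW_URL: this helper assumes it points at the Table API endpoint for the incident table (https://<your-instance>.service-now.com/api/now/table/incident). A successful POST returns the new record under result, which is where result.number comes from. Here is a heavily trimmed example of the response body; a real response carries many more fields:
{
  "result": {
    "number": "INC0010005",
    "short_description": "Laptop will not boot",
    "priority": "1",
    "sys_id": "..."
  }
}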
Step 3: The Front-End
Here’s the HTML/CSS/JavaScript for our voice agent:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>AI Voice Agent</title>
<style>
/* Overall Styling */
body {
margin: 0;
padding: 0;
height: 100vh;
display: flex;
flex-direction: column;
justify-content: center;
align-items: center;
background: linear-gradient(to bottom, lightgreen, white, lightgreen);
font-family: Arial, sans-serif;
}
/* Title Styling */
h1 {
position: absolute;
top: 20px;
font-size: 2.5rem;
text-align: center;
color: #333;
}
/* Button Styling */
#toggleButton {
width: 120px;
height: 120px;
border: none;
border-radius: 50%;
background: #4caf50;
color: white;
font-size: 1.2rem;
cursor: pointer;
box-shadow: 0 0 20px rgba(0, 0, 0, 0.2);
transition: all 0.3s ease-in-out;
}
/* Blinking Glow Effect in Recording State */
#toggleButton.recording {
box-shadow: 0 0 20px 5px rgba(0, 255, 0, 0.6), 0 0 40px 10px rgba(0, 255, 0, 0.4);
animation: blink 1s infinite;
}
@keyframes blink {
0%, 100% {
box-shadow: 0 0 20px 5px rgba(0, 255, 0, 0.6), 0 0 40px 10px rgba(0, 255, 0, 0.4);
}
50% {
box-shadow: 0 0 10px 3px rgba(0, 255, 0, 0.3), 0 0 20px 5px rgba(0, 255, 0, 0.2);
}
}
/* Output Text Styling */
#output {
margin-top: 20px;
font-size: 1.2rem;
color: #555;
text-align: center;
}
/* Response Text Styling */
#response {
margin-top: 10px;
font-size: 1.2rem;
color: #333;
text-align: center;
}
#botanalysing {
opacity: 0;
}
</style>
<script>
let isRecording = false;
let recognition;
function toggleRecording() {
if (!isRecording) {
startRecording();
} else {
stopRecording();
}
}
function startRecording() {
// Web Speech API: standard in Chromium-based browsers, prefixed as webkitSpeechRecognition
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
if (!SpeechRecognition) {
  alert('Speech recognition is not supported in this browser. Try Chrome or Edge.');
  return;
}
recognition = new SpeechRecognition();
recognition.lang = 'en-US';
recognition.start();
recognition.onresult = async (event) => {
document.getElementById('botanalysing').style.opacity = 1;
const speechText = event.results[0][0].transcript;
document.getElementById('output').innerText = `You said: ${speechText}`;
const response = await fetch('http://localhost:5000/process_text', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ text: speechText })
});
if (response.ok) {
document.getElementById('botanalysing').style.opacity = 0;
const responseData = await response.json();
// Display the AI's response text
document.getElementById('response').innerText = `AI Response: ${responseData.text}`;
// Play the audio response
const audio = new Audio(responseData.audio);
audio.play();
} else {
console.error('Error generating speech');
}
};
recognition.onerror = (event) => {
document.getElementById('botanalysing').style.opacity = 0;
console.error('Speech recognition error:', event.error);
stopRecording();
};
// Recognition stops on its own after one result, so reset the button whenever it ends
recognition.onend = () => stopRecording();
isRecording = true;
const button = document.getElementById('toggleButton');
button.innerText = 'Stop';
button.classList.add('recording');
}
function stopRecording() {
if (recognition) recognition.stop();
isRecording = false;
const button = document.getElementById('toggleButton');
button.innerText = 'Start';
button.classList.remove('recording');
}
</script>
</head>
<body>
<h1>AI Voice Agent</h1>
<button id="toggleButton" onclick="toggleRecording()">
Start
</button>
<p id="output"></p>
<p id="response"></p>
<p id="botanalysing">Bot is analyzing your query. Please wait...</p>
</body>
</html>
Step 4: Testing & Debugging
- Frontend: Open the HTML in a browser, click “Start,” and speak to your agent.
- Backend: Use a tool like Postman to hit the API endpoints directly (see the setup note after this list).
- Logs: Keep an eye on the server console while debugging; it logs the raw GPT-4 output as well as any OpenAI, ElevenLabs, or ServiceNow errors.
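One setup gotcha worth calling out: both backend files use ES module imports, so the project’s package.json needs "type": "module" (or the files need a .mjs extension). A minimal package.json for this PoC could look like the sketch below; the version ranges are just recent examples, not hard requirements. After npm install, start the server with node app.js and POST a body like {"text": "My laptop won't boot and I need it for a demo today"} to /process_text.
package.json
{
  "name": "voice-agent-poc",
  "type": "module",
  "main": "app.js",
  "dependencies": {
    "axios": "^1.7.0",
    "body-parser": "^1.20.2",
    "cors": "^2.8.5",
    "dotenv": "^16.4.5",
    "express": "^4.19.2",
    "openai": "^4.52.0"
  }
}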
Step 5: Lessons Learned
1. Speech Recognition is Fun: Watching your app transcribe speech is incredibly satisfying.
2. JSON Parsing Woes: GPT sometimes likes to go rogue, so always handle invalid JSON (see the sketch right after this list).
3. TTS is Addictive: Hearing your AI respond in a natural voice feels like magic.
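On point 2, here is roughly what “handle invalid JSON” can grow into. This is only a sketch, and the extractJson helper is my own addition rather than part of the code above: it strips markdown code fences and pulls out the first {...} block before parsing, which covers most of GPT’s creative formatting.
// Sketch: extract a JSON object from a chatty GPT reply before parsing it
const extractJson = (raw) => {
  // Drop markdown code fences such as ```json ... ```
  const cleaned = raw.replace(/```(?:json)?/gi, '').trim();
  // Grab everything from the first { to the last }, in case the model wrapped the JSON in prose
  const match = cleaned.match(/\{[\s\S]*\}/);
  if (!match) throw new Error('No JSON object found in model output.');
  return JSON.parse(match[0]);
};
// Usage inside /process_text:
// const { description, priority } = extractJson(gptOutput);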
Conclusion
This PoC showcases the power of modern APIs to create interactive, voice-driven applications. Whether for customer support or productivity tools, the possibilities are endless. So, what are you waiting for? Let your voice be heard — literally!
Try this out and let me know how it goes! Or better, build something even cooler and share it with the world.