Building a Voice-Driven AI Agent with ServiceNow and ElevenLabs: A Fun PoC Journey

Souvik Majumder
6 min read · Dec 20, 2024


Have you ever dreamt of having an AI voice assistant to handle your service requests, like creating tickets in ServiceNow while chatting back with you in a soothing voice? Well, I did, and guess what? I made it happen! Let me take you on a journey where code meets creativity, and humor keeps the bugs away (mostly).

The Idea

The goal was simple: create an AI-powered voice agent that:

1. Listens to your commands.
2. Understands your intent (well, most of the time).
3. Creates ServiceNow tickets.
4. Responds with a lovely voice, saying, “Your wish is my command!”

Sounds cool, right? Let’s dive into the how!

Step 1: Setting Up the Tech Stack

For this PoC, I used:

- Node.js for the backend.
- Express.js to handle API requests.
- OpenAI GPT-4 to process user input and extract ticket details.
- ElevenLabs for Text-to-Speech (TTS).
- HTML/CSS/JavaScript for a sleek front-end.
- Google’s Speech-to-Text API (or a browser’s Web Speech API) for transcription.
- A generous helping of coffee ☕.

Step 2: The Middleware — A Magical Bridge

The middleware connects the front-end with OpenAI, ServiceNow, and ElevenLabs. It processes user input, extracts intent, and makes the magic happen. Here’s the code for our middleware:

app.js

import express from 'express';
import bodyParser from 'body-parser';
import { createServiceNowTicket } from './serviceNowMiddleware.js';
import OpenAI from 'openai';
import dotenv from 'dotenv';
import cors from 'cors';
import axios from 'axios';

dotenv.config();

// PoC only: this disables TLS certificate verification for outgoing requests.
// Never do this in production.
process.env.NODE_TLS_REJECT_UNAUTHORIZED = '0';

const app = express();
const openai = new OpenAI({
  organization: process.env.ORG_ID,
  apiKey: process.env.OPENAI_API_KEY,
  project: process.env.PROJECT_ID,
});

app.use(cors());
app.use(bodyParser.json());

// Process text from the front-end
app.post('/process_text', async (req, res) => {
  try {
    const { text } = req.body;

    // Step 1: Use OpenAI GPT-4 for intent analysis
    const nlpResponse = await openai.chat.completions.create({
      model: 'gpt-4',
      messages: [
        {
          role: 'system',
          content:
            'You are a virtual assistant integrated with ServiceNow. Your purpose is to help users create incident tickets in ServiceNow. When a user describes an issue, extract the relevant details (like description and priority) and return a structured JSON object. Only respond with the JSON object containing: - "description": A short description of the issue. - "priority": A number from 1 to 5 indicating the priority (1 = high, 5 = low). Do not provide any other text or explanations.',
        },
        { role: 'user', content: text },
      ],
    });

    const gptOutput = nlpResponse.choices[0].message.content;
    console.log('GPT-4 Response:', gptOutput);

    let description, priority;

    // Step 2: Parse the GPT-4 output, or handle a non-JSON response
    try {
      const parsedOutput = JSON.parse(gptOutput);
      description = parsedOutput.description;
      priority = parsedOutput.priority;

      if (!description || !priority) {
        throw new Error('Missing required fields in JSON.');
      }
    } catch (error) {
      console.warn('Failed to parse GPT output:', error.message);

      // Fallback: use the raw user input as the description
      description = text;
      priority = null; // Indicates missing priority
    }

    // Step 3: Handle a missing priority
    if (!priority) {
      const followUpText = `I noticed you didn’t specify a priority for your ticket. Can you provide one? (e.g., High, Medium, Low)`;
      return generateTTSResponse(res, followUpText); // Ask the user for more details
    }

    // Step 4: Create a ServiceNow ticket
    const ticketNumber = await createServiceNowTicket(description, priority);

    // Step 5: Generate the success response
    const responseText = `Your ticket has been created successfully. The ticket number is ${ticketNumber}.`;
    return generateTTSResponse(res, responseText);
  } catch (error) {
    console.error('Error processing text:', error.message);
    // Graceful fallback for unexpected errors
    const fallbackResponseText = 'I encountered an issue processing your request. Please try again later.';
    return generateTTSResponse(res, fallbackResponseText);
  }
});

const generateTTSResponse = async (res, text) => {
  try {
    // ElevenLabs TTS API request
    const response = await axios.post(
      `https://api.elevenlabs.io/v1/text-to-speech/<your-voice-id>`,
      {
        text,
        model_id: 'eleven_monolingual_v1',
        voice_settings: {
          stability: 0.5,
          similarity_boost: 0.75,
        },
      },
      {
        headers: {
          'Content-Type': 'application/json',
          'xi-api-key': process.env.ELEVENLABS_API_KEY,
        },
        responseType: 'arraybuffer',
      }
    );

    // Convert the audio to Base64
    const audioBase64 = Buffer.from(response.data).toString('base64');

    // Send both the text and the audio (as a data URI)
    res.json({
      status: 'success',
      text: text,
      audio: `data:audio/mpeg;base64,${audioBase64}`,
    });
  } catch (error) {
    console.error('Error generating TTS response:', error.message);
    res.status(500).send('Failed to generate audio response.');
  }
};

// Serve static audio files (only needed if you save responses to disk)
app.use(express.static('responses'));

// Start the server
const PORT = process.env.PORT || 5000;
app.listen(PORT, () => console.log(`Server running on port ${PORT}`));

You can find your voice ID in the ElevenLabs dashboard, under the Voices section.
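Alternatively, the ElevenLabs API exposes a `GET /v1/voices` endpoint that lists the voices available to your account. Here's a small sketch of fetching them; the `listVoiceIds` and `extractVoiceIds` helper names are my own, not part of any SDK:

```javascript
// Pull { name, voice_id } pairs out of an ElevenLabs /v1/voices response body.
const extractVoiceIds = (data) =>
  (data.voices || []).map(({ name, voice_id }) => ({ name, voice_id }));

// Fetch the voices available to your account (Node 18+ ships a global fetch).
const listVoiceIds = async (apiKey) => {
  const response = await fetch('https://api.elevenlabs.io/v1/voices', {
    headers: { 'xi-api-key': apiKey },
  });
  return extractVoiceIds(await response.json());
};
```

Run it once with your API key, pick a voice you like, and paste its `voice_id` into the TTS URL above.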

serviceNowMiddleware.js

import axios from 'axios';
import dotenv from 'dotenv';

dotenv.config();

const { SERVICENOW_URL, SERVICENOW_USER, SERVICENOW_PASS } = process.env;

export const createServiceNowTicket = async (description, priority) => {
  try {
    const response = await axios.post(
      SERVICENOW_URL,
      {
        short_description: description,
        priority: priority || '3', // Default to medium priority
      },
      {
        auth: {
          username: SERVICENOW_USER,
          password: SERVICENOW_PASS,
        },
        headers: {
          'Content-Type': 'application/json',
        },
      }
    );
    console.log(response.data);
    return response.data.result.number;
  } catch (error) {
    console.error('ServiceNow API Error:', error.message);
    throw new Error('Failed to create ServiceNow ticket');
  }
};
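For reference, the environment variables the two files above expect could look something like this (every value below is a placeholder; the ServiceNow Table API endpoint for incidents typically has the form `https://<instance>.service-now.com/api/now/table/incident`):

```
# .env — placeholder values for this PoC
OPENAI_API_KEY=sk-...
ORG_ID=org-...
PROJECT_ID=proj_...
ELEVENLABS_API_KEY=...
SERVICENOW_URL=https://your-instance.service-now.com/api/now/table/incident
SERVICENOW_USER=admin
SERVICENOW_PASS=your-password
PORT=5000
```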

Step 3: The Front-End

Here’s the HTML/CSS/JavaScript for our voice agent:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>AI Voice Agent</title>
  <style>
    /* Overall Styling */
    body {
      margin: 0;
      padding: 0;
      height: 100vh;
      display: flex;
      flex-direction: column;
      justify-content: center;
      align-items: center;
      background: linear-gradient(to bottom, lightgreen, white, lightgreen);
      font-family: Arial, sans-serif;
    }

    /* Title Styling */
    h1 {
      position: absolute;
      top: 20px;
      font-size: 2.5rem;
      text-align: center;
      color: #333;
    }

    /* Button Styling */
    #toggleButton {
      width: 120px;
      height: 120px;
      border: none;
      border-radius: 50%;
      background: #4caf50;
      color: white;
      font-size: 1.2rem;
      cursor: pointer;
      box-shadow: 0 0 20px rgba(0, 0, 0, 0.2);
      transition: all 0.3s ease-in-out;
    }

    /* Blinking glow effect while recording */
    #toggleButton.recording {
      box-shadow: 0 0 20px 5px rgba(0, 255, 0, 0.6), 0 0 40px 10px rgba(0, 255, 0, 0.4);
      animation: blink 1s infinite;
    }

    @keyframes blink {
      0%, 100% {
        box-shadow: 0 0 20px 5px rgba(0, 255, 0, 0.6), 0 0 40px 10px rgba(0, 255, 0, 0.4);
      }
      50% {
        box-shadow: 0 0 10px 3px rgba(0, 255, 0, 0.3), 0 0 20px 5px rgba(0, 255, 0, 0.2);
      }
    }

    /* Output Text Styling */
    #output {
      margin-top: 20px;
      font-size: 1.2rem;
      color: #555;
      text-align: center;
    }

    /* Response Text Styling */
    #response {
      margin-top: 10px;
      font-size: 1.2rem;
      color: #333;
      text-align: center;
    }

    #botanalysing {
      opacity: 0;
    }
  </style>
  <script>
    let isRecording = false;
    let recognition;

    function toggleRecording() {
      if (!isRecording) {
        startRecording();
      } else {
        stopRecording();
      }
    }

    function startRecording() {
      recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();
      recognition.lang = 'en-US';
      recognition.start();

      recognition.onresult = async (event) => {
        document.getElementById('botanalysing').style.opacity = 1;

        const speechText = event.results[0][0].transcript;
        document.getElementById('output').innerText = `You said: ${speechText}`;

        const response = await fetch('http://localhost:5000/process_text', {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({ text: speechText })
        });

        if (response.ok) {
          document.getElementById('botanalysing').style.opacity = 0;

          const responseData = await response.json();

          // Display the AI's response text
          document.getElementById('response').innerText = `AI Response: ${responseData.text}`;

          // Play the audio response
          const audio = new Audio(responseData.audio);
          audio.play();
        } else {
          console.error('Error generating speech');
        }
      };

      recognition.onerror = (event) => {
        document.getElementById('botanalysing').style.opacity = 0;

        console.error('Speech recognition error:', event.error);
        stopRecording();
      };

      isRecording = true;
      const button = document.getElementById('toggleButton');
      button.innerText = 'Stop';
      button.classList.add('recording');
    }

    function stopRecording() {
      if (recognition) recognition.stop();
      isRecording = false;
      const button = document.getElementById('toggleButton');
      button.innerText = 'Start';
      button.classList.remove('recording');
    }
  </script>
</head>
<body>
  <h1>AI Voice Agent</h1>
  <button id="toggleButton" onclick="toggleRecording()">Start</button>
  <p id="output"></p>
  <p id="response"></p>
  <p id="botanalysing">Bot is analyzing your query. Please wait...</p>
</body>
</html>
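One caveat with this front-end: the Web Speech API's `SpeechRecognition` is not available in every browser (Firefox, notably, lacks it), so the `new (window.SpeechRecognition || ...)()` line can throw. A small guard, written here as a testable helper (the `pickRecognizer` name is my own), avoids that:

```javascript
// Return the SpeechRecognition constructor if the environment provides one,
// otherwise null. Also checks the webkit-prefixed name used by Chrome/Safari.
const pickRecognizer = (globalLike) =>
  globalLike.SpeechRecognition || globalLike.webkitSpeechRecognition || null;

// In startRecording(), you would then do something like:
//   const SR = pickRecognizer(window);
//   if (!SR) { alert('Speech recognition is not supported in this browser.'); return; }
//   recognition = new SR();
```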

Step 4: Testing & Debugging

- Frontend: Open the HTML in a browser, click “Start,” and speak to your agent.
- Backend: Use tools like Postman to test API endpoints.
- Logs: Keep an eye on the console for debugging errors.
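For the Postman check, send `POST http://localhost:5000/process_text` with a JSON body like `{"text": "My laptop won't boot, this is urgent"}`. A successful reply from the middleware above has this shape (the ticket number is illustrative and the audio data URI is truncated):

```json
{
  "status": "success",
  "text": "Your ticket has been created successfully. The ticket number is INC0010001.",
  "audio": "data:audio/mpeg;base64,..."
}
```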

Step 5: Lessons Learned

1. Speech Recognition is Fun: Watching your app transcribe speech is incredibly satisfying.
2. JSON Parsing Woes: GPT sometimes likes to go rogue — always handle invalid JSON.
3. TTS is Addictive: Hearing your AI respond in a natural voice feels like magic.
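On lesson 2: one trick that helps in practice is extracting the first `{...}` span before calling `JSON.parse`, since the model occasionally wraps its JSON in markdown fences or adds a sentence around it. A small sketch of such a helper (my own addition, not part of the PoC code above):

```javascript
// Try to recover a JSON object from a model reply that may wrap it in
// markdown fences or surrounding prose. Returns null if nothing parses.
const extractJson = (reply) => {
  // Grab from the first '{' to the last '}' (greedy match across newlines).
  const match = reply.match(/\{[\s\S]*\}/);
  if (!match) return null;
  try {
    return JSON.parse(match[0]);
  } catch {
    return null;
  }
};
```

In the `/process_text` handler, you could call `extractJson(gptOutput)` in place of the bare `JSON.parse` and fall back to the follow-up question whenever it returns null.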

Conclusion

This PoC showcases the power of modern APIs to create interactive, voice-driven applications. Whether for customer support or productivity tools, the possibilities are endless. So, what are you waiting for? Let your voice be heard — literally!

Try this out and let me know how it goes! Or better, build something even cooler and share it with the world.
