Building a Voice-Driven AI Agent with ServiceNow and ElevenLabs: A Fun PoC Journey
Have you ever dreamt of having an AI voice assistant to handle your service requests, like creating tickets in ServiceNow while chatting back with you in a soothing voice? Well, I did, and guess what? I made it happen! Let me take you on a journey where code meets creativity, and humor keeps the bugs away (mostly).
The Idea
The goal was simple: create an AI-powered voice agent that:
1. Listens to your commands.
2. Understands your intent (well, most of the time).
3. Creates ServiceNow tickets.
4. Responds with a lovely voice, saying, “Your wish is my command!”
Sounds cool, right? Let’s dive into the how!
Step 1: Setting Up the Tech Stack
For this PoC, I used:
- Node.js for the backend.
- Express.js to handle API requests.
- OpenAI GPT-4 to process user input and extract ticket details.
- ElevenLabs for Text-to-Speech (TTS).
- HTML/CSS/JavaScript for a sleek front-end.
- The browser’s Web Speech API for transcription (Google’s Speech-to-Text API would work just as well).
- A generous helping of coffee ☕.
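Before we get to the code, here is the .env file the middleware below expects. The variable names come straight from the code; the values are placeholders you will fill in with your own credentials, and the ServiceNow URL assumes the standard Table API endpoint for the incident table (adjust it if your instance exposes something else).
.env
# OpenAI
OPENAI_API_KEY=<your-openai-api-key>
ORG_ID=<your-openai-org-id>
PROJECT_ID=<your-openai-project-id>
# ElevenLabs
ELEVENLABS_API_KEY=<your-elevenlabs-api-key>
# ServiceNow (Table API endpoint for the incident table)
SERVICENOW_URL=https://<your-instance>.service-now.com/api/now/table/incident
SERVICENOW_USER=<your-servicenow-user>
SERVICENOW_PASS=<your-servicenow-password>
# Optional: server port (defaults to 5000)
PORT=5000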
Step 2: The Middleware — A Magical Bridge
The middleware connects the front-end with OpenAI, ServiceNow, and ElevenLabs. It processes user input, extracts intent, and makes the magic happen. Here’s the code for our middleware:
app.js
import express from 'express';
import bodyParser from 'body-parser';
import { createServiceNowTicket } from './serviceNowMiddleware.js';
import OpenAI from "openai";
import dotenv from 'dotenv';
import cors from "cors";
import axios from "axios";
dotenv.config();
// PoC shortcut: this disables TLS certificate verification globally (handy for self-signed certs on a dev instance). Never ship this to production.
process.env.NODE_TLS_REJECT_UNAUTHORIZED = '0';
const app = express();
const openai = new OpenAI({
organization: process.env.ORG_ID,
apiKey: process.env.OPENAI_API_KEY,
project: process.env.PROJECT_ID
});
app.use(cors()); // Allow the front-end (opened from a file or another origin) to call this API
app.use(bodyParser.json());
// Process Text from Front-End
app.post('/process_text', async (req, res) => {
try {
const { text } = req.body;
// Step 1: Use OpenAI GPT-4 for intent analysis
const nlpResponse = await openai.chat.completions.create({
model: 'gpt-4',
messages: [
{ role: 'system', content: 'You are a virtual assistant integrated with ServiceNow. Your purpose is to help users create incident tickets in ServiceNow. When a user describes an issue, extract the relevant details (like description and priority) and return a structured JSON object. Only respond with the JSON object containing: - "description": A short description of the issue. - "priority": A number from 1 to 5 indicating the priority (1 = high, 5 = low). Do not provide any other text or explanations.' },
{ role: 'user', content: text },
],
});
const gptOutput = nlpResponse.choices[0].message.content;
console.log('GPT-4 Response:', gptOutput);
let description, priority;
// Step 2: Parse GPT-4 output or handle non-JSON response
try {
const parsedOutput = JSON.parse(gptOutput);
description = parsedOutput.description;
priority = parsedOutput.priority;
if (!description || !priority) {
throw new Error('Missing required fields in JSON.');
}
} catch (error) {
console.warn('Failed to parse GPT output:', error.message);
// Fallback: Use user input as a placeholder for description
description = text;
priority = null; // Indicate missing priority
}
// Step 3: Handle missing priority or description
if (!priority) {
const followUpText = `I noticed you didn’t specify a priority for your ticket. Can you provide one? (e.g., High, Medium, Low)`;
return generateTTSResponse(res, followUpText); // Ask the user for more details
}
// Step 4: Create a ServiceNow ticket
const ticketNumber = await createServiceNowTicket(description, priority);
// Step 5: Generate success response
const responseText = `Your ticket has been created successfully. The ticket number is ${ticketNumber}.`;
return generateTTSResponse(res, responseText);
} catch (error) {
console.error('Error processing text:', error.message);
// Graceful fallback for unexpected errors
const fallbackResponseText = 'I encountered an issue processing your request. Please try again later.';
return generateTTSResponse(res, fallbackResponseText);
}
});
const generateTTSResponse = async (res, text) => {
try {
// ElevenLabs TTS API Request
const response = await axios.post(
`https://api.elevenlabs.io/v1/text-to-speech/<your-voice-id>`,
{
text,
model_id: "eleven_monolingual_v1", // Use correct model ID
voice_settings: {
stability: 0.5,
similarity_boost: 0.75,
},
},
{
headers: {
'Content-Type': 'application/json',
'xi-api-key': process.env.ELEVENLABS_API_KEY,
},
responseType: 'arraybuffer',
}
);
// Convert audio to Base64
const audioBase64 = Buffer.from(response.data).toString('base64');
// Send both text and audio (as Base64)
res.json({
status: 'success',
text: text,
audio: `data:audio/mpeg;base64,${audioBase64}`, // Data URI for audio
});
} catch (error) {
console.error('Error generating TTS response:', error.message);
res.status(500).send('Failed to generate audio response.');
}
};
// Serve any saved audio files from the responses folder (optional here, since the response above already inlines the audio as Base64)
app.use(express.static('responses'));
// Start the Server
const PORT = process.env.PORT || 5000;
app.listen(PORT, () => console.log(`Server running on port ${PORT}`));
You can grab a voice ID from the Voices section of your ElevenLabs dashboard, or pull the list through the API.
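If you prefer doing that programmatically, here is a minimal sketch (reusing the ELEVENLABS_API_KEY from .env) that calls the ElevenLabs voices endpoint and prints each voice name with its ID. The file name listVoices.js is just my suggestion.
listVoices.js
import axios from 'axios';
import dotenv from 'dotenv';
dotenv.config();
// Fetch all voices available to the account and print name + voice_id
const { data } = await axios.get('https://api.elevenlabs.io/v1/voices', {
  headers: { 'xi-api-key': process.env.ELEVENLABS_API_KEY },
});
data.voices.forEach((voice) => console.log(`${voice.name}: ${voice.voice_id}`));
Run it with node listVoices.js (it uses top-level await, so it needs to run as an ES module, same as app.js).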
serviceNowMiddleware.js
import axios from 'axios';
import dotenv from 'dotenv';
dotenv.config();
const { SERVICENOW_URL, SERVICENOW_USER, SERVICENOW_PASS } = process.env;
export const createServiceNowTicket = async (description, priority) => {
try {
const response = await axios.post(
SERVICENOW_URL,
{
short_description: description,
priority: priority || '3',
},
{
auth: {
username: SERVICENOW_USER,
password: SERVICENOW_PASS,
},
headers: {
'Content-Type': 'application/json',
},
}
);
console.log(response.data); // Full Table API response; the ticket number lives under result.number
return response.data.result.number;
} catch (error) {
console.error('ServiceNow API Error:', error.message);
throw new Error('Failed to create ServiceNow ticket');
}
};
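A quick note on SERVICENOW_URL: this helper assumes it points at the Table API endpoint for the incident table (https://<your-instance>.service-now.com/api/now/table/incident). A successful POST returns the new record under result, which is where result.number comes from. Here is a heavily trimmed example of the response body; a real response carries many more fields:
{
  "result": {
    "number": "INC0010005",
    "short_description": "Laptop will not boot",
    "priority": "1",
    "sys_id": "..."
  }
}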
Step 3: The Front-End
Here’s the HTML/CSS/JavaScript for our voice agent:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>AI Voice Agent</title>
<style>
/* Overall Styling */
body {
margin: 0;
padding: 0;
height: 100vh;
display: flex;
flex-direction: column;
justify-content: center;
align-items: center;
background: linear-gradient(to bottom, lightgreen, white, lightgreen);
font-family: Arial, sans-serif;
}
/* Title Styling */
h1 {
position: absolute;
top: 20px;
font-size: 2.5rem;
text-align: center;
color: #333;
}
/* Button Styling */
#toggleButton {
width: 120px;
height: 120px;
border: none;
border-radius: 50%;
background: #4caf50;
color: white;
font-size: 1.2rem;
cursor: pointer;
box-shadow: 0 0 20px rgba(0, 0, 0, 0.2);
transition: all 0.3s ease-in-out;
}
/* Blinking Glow Effect in Recording State */
#toggleButton.recording {
box-shadow: 0 0 20px 5px rgba(0, 255, 0, 0.6), 0 0 40px 10px rgba(0, 255, 0, 0.4);
animation: blink 1s infinite;
}
@keyframes blink {
0%, 100% {
box-shadow: 0 0 20px 5px rgba(0, 255, 0, 0.6), 0 0 40px 10px rgba(0, 255, 0, 0.4);
}
50% {
box-shadow: 0 0 10px 3px rgba(0, 255, 0, 0.3), 0 0 20px 5px rgba(0, 255, 0, 0.2);
}
}
/* Output Text Styling */
#output {
margin-top: 20px;
font-size: 1.2rem;
color: #555;
text-align: center;
}
/* Response Text Styling */
#response {
margin-top: 10px;
font-size: 1.2rem;
color: #333;
text-align: center;
}
#botanalysing {
opacity: 0;
}
</style>
<script>
let isRecording = false;
let recognition;
function toggleRecording() {
if (!isRecording) {
startRecording();
} else {
stopRecording();
}
}
function startRecording() {
// Web Speech API: standard in Chromium-based browsers, prefixed as webkitSpeechRecognition
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
if (!SpeechRecognition) {
  alert('Speech recognition is not supported in this browser. Try Chrome or Edge.');
  return;
}
recognition = new SpeechRecognition();
recognition.lang = 'en-US';
recognition.start();
recognition.onresult = async (event) => {
document.getElementById('botanalysing').style.opacity = 1;
const speechText = event.results[0][0].transcript;
document.getElementById('output').innerText = `You said: ${speechText}`;
const response = await fetch('http://localhost:5000/process_text', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ text: speechText })
});
if (response.ok) {
document.getElementById('botanalysing').style.opacity = 0;
const responseData = await response.json();
// Display the AI's response text
document.getElementById('response').innerText = `AI Response: ${responseData.text}`;
// Play the audio response
const audio = new Audio(responseData.audio);
audio.play();
} else {
console.error('Error generating speech');
}
};
recognition.onerror = (event) => {
document.getElementById('botanalysing').style.opacity = 0;
console.error('Speech recognition error:', event.error);
stopRecording();
};
// Recognition stops on its own after one result, so reset the button whenever it ends
recognition.onend = () => stopRecording();
isRecording = true;
const button = document.getElementById('toggleButton');
button.innerText = 'Stop';
button.classList.add('recording');
}
function stopRecording() {
if (recognition) recognition.stop();
isRecording = false;
const button = document.getElementById('toggleButton');
button.innerText = 'Start';
button.classList.remove('recording');
}
</script>
</head>
<body>
<h1>AI Voice Agent</h1>
<button id="toggleButton" onclick="toggleRecording()">
Start
</button>
<p id="output"></p>
<p id="response"></p>
<p id="botanalysing">Bot is analyzing your query. Please wait...</p>
</body>
</html>
Step 4: Testing & Debugging
- Frontend: Open the HTML in a browser, click “Start,” and speak to your agent.
- Backend: Use a tool like Postman to hit the API endpoints directly (see the setup note after this list).
- Logs: Keep an eye on the server console while debugging; it logs the raw GPT-4 output as well as any OpenAI, ElevenLabs, or ServiceNow errors.
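One setup gotcha worth calling out: both backend files use ES module imports, so the project’s package.json needs "type": "module" (or the files need a .mjs extension). A minimal package.json for this PoC could look like the sketch below; the version ranges are just recent examples, not hard requirements. After npm install, start the server with node app.js and POST a body like {"text": "My laptop won't boot and I need it for a demo today"} to /process_text.
package.json
{
  "name": "voice-agent-poc",
  "type": "module",
  "main": "app.js",
  "dependencies": {
    "axios": "^1.7.0",
    "body-parser": "^1.20.2",
    "cors": "^2.8.5",
    "dotenv": "^16.4.5",
    "express": "^4.19.2",
    "openai": "^4.52.0"
  }
}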
Step 5: Lessons Learned
1. Speech Recognition is Fun: Watching your app transcribe speech is incredibly satisfying.
2. JSON Parsing Woes: GPT sometimes likes to go rogue, so always handle invalid JSON (see the sketch right after this list).
3. TTS is Addictive: Hearing your AI respond in a natural voice feels like magic.
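On point 2, here is roughly what “handle invalid JSON” can grow into. This is only a sketch, and the extractJson helper is my own addition rather than part of the code above: it strips markdown code fences and pulls out the first {...} block before parsing, which covers most of GPT’s creative formatting.
// Sketch: extract a JSON object from a chatty GPT reply before parsing it
const extractJson = (raw) => {
  // Drop markdown code fences such as ```json ... ```
  const cleaned = raw.replace(/```(?:json)?/gi, '').trim();
  // Grab everything from the first { to the last }, in case the model wrapped the JSON in prose
  const match = cleaned.match(/\{[\s\S]*\}/);
  if (!match) throw new Error('No JSON object found in model output.');
  return JSON.parse(match[0]);
};
// Usage inside /process_text:
// const { description, priority } = extractJson(gptOutput);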
Conclusion
This PoC showcases the power of modern APIs to create interactive, voice-driven applications. Whether for customer support or productivity tools, the possibilities are endless. So, what are you waiting for? Let your voice be heard — literally!
Try this out and let me know how it goes! Or better, build something even cooler and share it with the world.