If you've ever tried to programmatically access YouTube video transcripts, you know the pain. There's no official endpoint in the YouTube Data API v3 for captions text. You either scrape, reverse-engineer undocumented endpoints, or give up.
I didn't want to give up. I was building ScripTube (scriptube.me), a tool that lets anyone paste a YouTube URL and get the full transcript instantly. Here's how I built the backend API using Next.js API routes.
The Problem
YouTube's Data API lets you list caption tracks for a video, but actually downloading the caption text requires OAuth on behalf of the video owner. That's useless if you want transcripts of videos you don't own.
The workaround: YouTube serves auto-generated and manual captions to every viewer through an internal endpoint. Several open-source libraries tap into this.
The Stack
- Next.js 14 (App Router)
- youtube-transcript npm package (or youtube-transcript-api for Python)
- Vercel for deployment
- Rate limiting via Upstash Redis
Step 1: The API Route
Create app/api/transcript/route.ts:
import { NextRequest, NextResponse } from 'next/server';
import { YoutubeTranscript } from 'youtube-transcript';

export async function POST(req: NextRequest) {
  try {
    const { url } = await req.json();

    if (!url) {
      return NextResponse.json(
        { error: 'URL is required' },
        { status: 400 }
      );
    }

    const videoId = extractVideoId(url);
    if (!videoId) {
      return NextResponse.json(
        { error: 'Invalid YouTube URL' },
        { status: 400 }
      );
    }

    const transcript = await YoutubeTranscript.fetchTranscript(videoId);
    const formatted = transcript
      .map((entry) => entry.text)
      .join(' ');

    return NextResponse.json({
      videoId,
      transcript: formatted,
      segments: transcript,
    });
  } catch (error) {
    return NextResponse.json(
      { error: 'Failed to fetch transcript. It may not be available.' },
      { status: 500 }
    );
  }
}
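The catch block above collapses every failure into a 500, but "captions are disabled" is really a 404-style condition, not a server error. A sketch of a helper that maps caught errors to friendlier statuses — note that `classifyTranscriptError` is a hypothetical name, and the error messages it matches on are assumptions about what youtube-transcript throws, so check the version you install:

```typescript
// Hedged sketch: map a caught error to a user-facing status.
// The message patterns below are assumptions about the errors
// thrown by the youtube-transcript package, not guarantees.
function classifyTranscriptError(err: unknown): { status: number; error: string } {
  const msg = err instanceof Error ? err.message : String(err);
  if (/disabled/i.test(msg)) {
    // Creator turned captions off for this video.
    return { status: 404, error: 'Captions are disabled for this video.' };
  }
  if (/unavailable|not.*available/i.test(msg)) {
    // No caption track exists (e.g. no auto-generation for the language).
    return { status: 404, error: 'No transcript available for this video.' };
  }
  // Anything else: treat as a genuine server-side failure.
  return { status: 500, error: 'Failed to fetch transcript. It may not be available.' };
}
```

In the route, `catch (error)` would then become `const { status, error: message } = classifyTranscriptError(error);` followed by a `NextResponse.json` call with that status.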
Step 2: Extracting the Video ID
YouTube URLs come in many flavors — youtube.com/watch?v=, youtu.be/, youtube.com/embed/, URLs with extra parameters. You need a robust parser:
function extractVideoId(url: string): string | null {
  const patterns = [
    /(?:youtube\.com\/watch\?v=)([a-zA-Z0-9_-]{11})/,
    /(?:youtu\.be\/)([a-zA-Z0-9_-]{11})/,
    /(?:youtube\.com\/embed\/)([a-zA-Z0-9_-]{11})/,
    /(?:youtube\.com\/v\/)([a-zA-Z0-9_-]{11})/,
  ];

  for (const pattern of patterns) {
    const match = url.match(pattern);
    if (match) return match[1];
  }

  // Maybe they just pasted the video ID directly
  if (/^[a-zA-Z0-9_-]{11}$/.test(url)) return url;

  return null;
}
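A quick way to convince yourself the parser handles the common URL shapes is to run it against a few samples. The function is repeated here so the snippet runs standalone:

```typescript
// Same parser as above, duplicated so this snippet is self-contained.
function extractVideoId(url: string): string | null {
  const patterns = [
    /(?:youtube\.com\/watch\?v=)([a-zA-Z0-9_-]{11})/,
    /(?:youtu\.be\/)([a-zA-Z0-9_-]{11})/,
    /(?:youtube\.com\/embed\/)([a-zA-Z0-9_-]{11})/,
    /(?:youtube\.com\/v\/)([a-zA-Z0-9_-]{11})/,
  ];
  for (const pattern of patterns) {
    const match = url.match(pattern);
    if (match) return match[1];
  }
  // Bare 11-character ID pasted directly
  if (/^[a-zA-Z0-9_-]{11}$/.test(url)) return url;
  return null;
}

// Each of these should resolve to the same 11-character ID:
console.log(extractVideoId('https://www.youtube.com/watch?v=dQw4w9WgXcQ&t=42s')); // dQw4w9WgXcQ
console.log(extractVideoId('https://youtu.be/dQw4w9WgXcQ'));                      // dQw4w9WgXcQ
console.log(extractVideoId('dQw4w9WgXcQ'));                                       // dQw4w9WgXcQ
console.log(extractVideoId('https://example.com/not-youtube'));                   // null
```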
Step 3: Rate Limiting
Without rate limiting, your API will get hammered. I use Upstash Redis with a sliding window:
import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(10, '60 s'),
  analytics: true,
});

// Inside your route handler:
// x-forwarded-for can be a comma-separated chain of proxies;
// the client's address is the first entry.
const ip = req.headers.get('x-forwarded-for')?.split(',')[0].trim() ?? '127.0.0.1';
const { success } = await ratelimit.limit(ip);

if (!success) {
  return NextResponse.json(
    { error: 'Rate limit exceeded. Try again in a minute.' },
    { status: 429 }
  );
}
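For local development you may not want to stand up Redis at all. A minimal in-memory fixed-window limiter works as a stand-in — this is a sketch, not the Upstash API, and it only counts per process, which is exactly why the production version needs Redis across serverless instances:

```typescript
// Hedged sketch: per-process fixed-window rate limiter for local dev.
// Does NOT share state across serverless instances -- production
// should use the Upstash version above.
const hits = new Map<string, { count: number; windowStart: number }>();

function localRatelimit(ip: string, limit = 10, windowMs = 60_000): boolean {
  const now = Date.now();
  const entry = hits.get(ip);
  // New IP, or the previous window has expired: start a fresh window.
  if (!entry || now - entry.windowStart >= windowMs) {
    hits.set(ip, { count: 1, windowStart: now });
    return true;
  }
  entry.count += 1;
  return entry.count <= limit; // false once the window's quota is spent
}
```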
Step 4: Formatting the Output
Raw transcript data comes back as an array of segments, each with text, offset, and duration. Most users want clean, readable text; power users want timestamps. I serve both:
const plainText = transcript
  .map((s) => s.text)
  .join(' ')
  .replace(/\s+/g, ' ')
  .trim();

const withTimestamps = transcript.map((s) => ({
  time: formatTimestamp(s.offset),
  text: s.text,
}));

function formatTimestamp(ms: number): string {
  const totalSeconds = Math.floor(ms / 1000);
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return `${minutes}:${seconds.toString().padStart(2, '0')}`;
}
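One wrinkle: for a 3-hour video, minutes-only output reads as "185:23". If that bothers you, a variant that adds an hours field past the one-hour mark looks like this — `formatTimestampLong` is my own name for it, and it assumes the offset is in milliseconds, same as formatTimestamp above:

```typescript
// Hedged variant: hour-aware timestamps for long videos.
// Assumes `ms` is the segment offset in milliseconds.
function formatTimestampLong(ms: number): string {
  const totalSeconds = Math.floor(ms / 1000);
  const hours = Math.floor(totalSeconds / 3600);
  const minutes = Math.floor((totalSeconds % 3600) / 60);
  const seconds = totalSeconds % 60;
  const mm = minutes.toString().padStart(2, '0');
  const ss = seconds.toString().padStart(2, '0');
  // Only show the hours field when it's non-zero.
  return hours > 0 ? `${hours}:${mm}:${ss}` : `${minutes}:${ss}`;
}

console.log(formatTimestampLong(83_000));    // 1:23
console.log(formatTimestampLong(7_283_000)); // 2:01:23
```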
Step 5: The Frontend
The frontend is dead simple — one input, one button, one output area:
'use client';

import { useState } from 'react';

export default function TranscriptExtractor() {
  const [url, setUrl] = useState('');
  const [transcript, setTranscript] = useState('');
  const [loading, setLoading] = useState(false);

  const fetchTranscript = async () => {
    setLoading(true);
    try {
      const res = await fetch('/api/transcript', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ url }),
      });
      const data = await res.json();
      setTranscript(data.transcript || data.error);
    } catch {
      setTranscript('Network error. Please try again.');
    } finally {
      // Reset the spinner even if the request throws.
      setLoading(false);
    }
  };

  return (
    <div>
      <input
        value={url}
        onChange={(e) => setUrl(e.target.value)}
        placeholder="Paste YouTube URL..."
      />
      <button onClick={fetchTranscript} disabled={loading}>
        {loading ? 'Extracting...' : 'Get Transcript'}
      </button>
      {transcript && <pre>{transcript}</pre>}
    </div>
  );
}
Gotchas I Hit
Not all videos have transcripts. Some creators disable captions entirely. Auto-generation isn't available for every language. Handle this gracefully.
Auto-generated captions have errors. Especially with technical terms, proper nouns, and heavy accents. There's not much you can do about this — it's YouTube's ASR quality.
YouTube occasionally changes internal endpoints. The youtube-transcript package has broken before when YouTube updated their frontend. Pin your dependency versions and monitor for issues.
Long videos = large payloads. A 3-hour video transcript can be 50K+ words. Consider pagination or streaming for very long content.
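The pagination idea from that last gotcha can be sketched in a few lines. `paginateSegments` and the `Segment` interface are my own illustrative names; the shape mirrors the segment objects used earlier:

```typescript
// Hedged sketch: chunk a long transcript into pages so a 3-hour video
// doesn't ship as one huge JSON payload. `Segment` mirrors the
// { text, offset, duration } shape used earlier in the post.
interface Segment {
  text: string;
  offset: number;
  duration: number;
}

function paginateSegments(segments: Segment[], page: number, perPage = 500) {
  const start = (page - 1) * perPage; // pages are 1-indexed
  return {
    page,
    totalPages: Math.max(1, Math.ceil(segments.length / perPage)),
    segments: segments.slice(start, start + perPage),
  };
}
```

The route handler could then accept a `page` field in the request body and return one chunk at a time, with `totalPages` letting the client know when to stop.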
Deployment
Deploy to Vercel with vercel --prod. The API routes become serverless functions automatically. Set your Upstash environment variables in the Vercel dashboard.
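One deployment detail worth knowing: serverless functions have an execution time cap, and a slow transcript fetch for a long video can bump into the default. Next.js lets a route opt into a longer limit via route segment config (the ceiling depends on your Vercel plan):

```typescript
// In app/api/transcript/route.ts -- raise the function's max execution
// time for slow transcript fetches (subject to your Vercel plan limits).
export const maxDuration = 30; // seconds
```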
This is essentially what powers ScripTube (scriptube.me). The architecture is simple — the complexity is in the edge cases: URL parsing, error handling, rate limiting, and formatting.
If you're building something similar, start simple. One input. One button. Ship it. Iterate based on what users actually need.