If you've ever tried to programmatically access YouTube video transcripts, you know the pain. There's no official endpoint in the YouTube Data API v3 for captions text. You either scrape, reverse-engineer undocumented endpoints, or give up.
I didn't want to give up. I was building ScripTube (scriptube.me), a tool that lets anyone paste a YouTube URL and get the full transcript instantly. Here's how I built the backend API using Next.js API routes.
The Problem
YouTube's Data API lets you list caption tracks for a video, but actually downloading the caption text requires OAuth on behalf of the video owner. That's useless if you want transcripts of videos you don't own.
The workaround: YouTube serves auto-generated and manual captions to every viewer through an internal endpoint. Several open-source libraries tap into this.
The Stack
- Next.js 14 (App Router)
- youtube-transcript npm package (or youtube-transcript-api for Python)
- Vercel for deployment
- Rate limiting via Upstash Redis
Step 1: The API Route
Create app/api/transcript/route.ts:
import { NextRequest, NextResponse } from 'next/server';
import { YoutubeTranscript } from 'youtube-transcript';

export async function POST(req: NextRequest) {
  try {
    const { url } = await req.json();

    if (!url) {
      return NextResponse.json(
        { error: 'URL is required' },
        { status: 400 }
      );
    }

    const videoId = extractVideoId(url);
    if (!videoId) {
      return NextResponse.json(
        { error: 'Invalid YouTube URL' },
        { status: 400 }
      );
    }

    const transcript = await YoutubeTranscript.fetchTranscript(videoId);
    const formatted = transcript
      .map((entry) => entry.text)
      .join(' ');

    return NextResponse.json({
      videoId,
      transcript: formatted,
      segments: transcript,
    });
  } catch (error) {
    return NextResponse.json(
      { error: 'Failed to fetch transcript. It may not be available.' },
      { status: 500 }
    );
  }
}
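The catch block above collapses every failure into a 500, but "captions are disabled" is really a 404-style condition, not a server error. A sketch of a helper that maps caught errors to friendlier statuses — note that `classifyTranscriptError` is a hypothetical name, and the error messages it matches on are assumptions about what youtube-transcript throws, so check the version you install:

```typescript
// Hedged sketch: map a caught error to a user-facing status.
// The message patterns below are assumptions about the errors
// thrown by the youtube-transcript package, not guarantees.
function classifyTranscriptError(err: unknown): { status: number; error: string } {
  const msg = err instanceof Error ? err.message : String(err);
  if (/disabled/i.test(msg)) {
    // Creator turned captions off for this video.
    return { status: 404, error: 'Captions are disabled for this video.' };
  }
  if (/unavailable|not.*available/i.test(msg)) {
    // No caption track exists (e.g. no auto-generation for the language).
    return { status: 404, error: 'No transcript available for this video.' };
  }
  // Anything else: treat as a genuine server-side failure.
  return { status: 500, error: 'Failed to fetch transcript. It may not be available.' };
}
```

In the route, `catch (error)` would then become `const { status, error: message } = classifyTranscriptError(error);` followed by a `NextResponse.json` call with that status.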
Step 2: Extracting the Video ID
YouTube URLs come in many flavors — youtube.com/watch?v=, youtu.be/, youtube.com/embed/, URLs with extra parameters. You need a robust parser:
function extractVideoId(url: string): string | null {
  const patterns = [
    /(?:youtube\.com\/watch\?v=)([a-zA-Z0-9_-]{11})/,
    /(?:youtu\.be\/)([a-zA-Z0-9_-]{11})/,
    /(?:youtube\.com\/embed\/)([a-zA-Z0-9_-]{11})/,
    /(?:youtube\.com\/v\/)([a-zA-Z0-9_-]{11})/,
  ];

  for (const pattern of patterns) {
    const match = url.match(pattern);
    if (match) return match[1];
  }

  // Maybe they just pasted the video ID directly
  if (/^[a-zA-Z0-9_-]{11}$/.test(url)) return url;

  return null;
}
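A quick way to convince yourself the parser handles the common URL shapes is to run it against a few samples. The function is repeated here so the snippet runs standalone:

```typescript
// Same parser as above, duplicated so this snippet is self-contained.
function extractVideoId(url: string): string | null {
  const patterns = [
    /(?:youtube\.com\/watch\?v=)([a-zA-Z0-9_-]{11})/,
    /(?:youtu\.be\/)([a-zA-Z0-9_-]{11})/,
    /(?:youtube\.com\/embed\/)([a-zA-Z0-9_-]{11})/,
    /(?:youtube\.com\/v\/)([a-zA-Z0-9_-]{11})/,
  ];
  for (const pattern of patterns) {
    const match = url.match(pattern);
    if (match) return match[1];
  }
  // Bare 11-character ID pasted directly
  if (/^[a-zA-Z0-9_-]{11}$/.test(url)) return url;
  return null;
}

// Each of these should resolve to the same 11-character ID:
console.log(extractVideoId('https://www.youtube.com/watch?v=dQw4w9WgXcQ&t=42s')); // dQw4w9WgXcQ
console.log(extractVideoId('https://youtu.be/dQw4w9WgXcQ'));                      // dQw4w9WgXcQ
console.log(extractVideoId('dQw4w9WgXcQ'));                                       // dQw4w9WgXcQ
console.log(extractVideoId('https://example.com/not-youtube'));                   // null
```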
Step 3: Rate Limiting
Without rate limiting, your API will get hammered. I use Upstash Redis with a sliding window:
import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(10, '60 s'),
  analytics: true,
});

// Inside your route handler:
// x-forwarded-for can be a comma-separated chain of proxies;
// the client's address is the first entry.
const ip = req.headers.get('x-forwarded-for')?.split(',')[0].trim() ?? '127.0.0.1';
const { success } = await ratelimit.limit(ip);

if (!success) {
  return NextResponse.json(
    { error: 'Rate limit exceeded. Try again in a minute.' },
    { status: 429 }
  );
}
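For local development you may not want to stand up Redis at all. A minimal in-memory fixed-window limiter works as a stand-in — this is a sketch, not the Upstash API, and it only counts per process, which is exactly why the production version needs Redis across serverless instances:

```typescript
// Hedged sketch: per-process fixed-window rate limiter for local dev.
// Does NOT share state across serverless instances -- production
// should use the Upstash version above.
const hits = new Map<string, { count: number; windowStart: number }>();

function localRatelimit(ip: string, limit = 10, windowMs = 60_000): boolean {
  const now = Date.now();
  const entry = hits.get(ip);
  // New IP, or the previous window has expired: start a fresh window.
  if (!entry || now - entry.windowStart >= windowMs) {
    hits.set(ip, { count: 1, windowStart: now });
    return true;
  }
  entry.count += 1;
  return entry.count <= limit; // false once the window's quota is spent
}
```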
Step 4: Formatting the Output
Raw transcript data comes back as an array of segments, each with text, offset, and duration. Most users want clean, readable text; power users want timestamps. I serve both:
const plainText = transcript
  .map((s) => s.text)
  .join(' ')
  .replace(/\s+/g, ' ')
  .trim();

const withTimestamps = transcript.map((s) => ({
  time: formatTimestamp(s.offset),
  text: s.text,
}));

function formatTimestamp(ms: number): string {
  const totalSeconds = Math.floor(ms / 1000);
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return `${minutes}:${seconds.toString().padStart(2, '0')}`;
}
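One wrinkle: for a 3-hour video, minutes-only output reads as "185:23". If that bothers you, a variant that adds an hours field past the one-hour mark looks like this — `formatTimestampLong` is my own name for it, and it assumes the offset is in milliseconds, same as formatTimestamp above:

```typescript
// Hedged variant: hour-aware timestamps for long videos.
// Assumes `ms` is the segment offset in milliseconds.
function formatTimestampLong(ms: number): string {
  const totalSeconds = Math.floor(ms / 1000);
  const hours = Math.floor(totalSeconds / 3600);
  const minutes = Math.floor((totalSeconds % 3600) / 60);
  const seconds = totalSeconds % 60;
  const mm = minutes.toString().padStart(2, '0');
  const ss = seconds.toString().padStart(2, '0');
  // Only show the hours field when it's non-zero.
  return hours > 0 ? `${hours}:${mm}:${ss}` : `${minutes}:${ss}`;
}

console.log(formatTimestampLong(83_000));    // 1:23
console.log(formatTimestampLong(7_283_000)); // 2:01:23
```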
Step 5: The Frontend
The frontend is dead simple — one input, one button, one output area:
'use client';

import { useState } from 'react';

export default function TranscriptExtractor() {
  const [url, setUrl] = useState('');
  const [transcript, setTranscript] = useState('');
  const [loading, setLoading] = useState(false);

  const fetchTranscript = async () => {
    setLoading(true);
    try {
      const res = await fetch('/api/transcript', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ url }),
      });
      const data = await res.json();
      setTranscript(data.transcript || data.error);
    } catch {
      setTranscript('Network error. Please try again.');
    } finally {
      // Reset the spinner even if the request throws.
      setLoading(false);
    }
  };

  return (
    <div>
      <input
        value={url}
        onChange={(e) => setUrl(e.target.value)}
        placeholder="Paste YouTube URL..."
      />
      <button onClick={fetchTranscript} disabled={loading}>
        {loading ? 'Extracting...' : 'Get Transcript'}
      </button>
      {transcript && <pre>{transcript}</pre>}
    </div>
  );
}
Gotchas I Hit
Not all videos have transcripts. Some creators disable captions entirely. Auto-generation isn't available for every language. Handle this gracefully.
Auto-generated captions have errors. Especially with technical terms, proper nouns, and heavy accents. There's not much you can do about this — it's YouTube's ASR quality.
YouTube occasionally changes internal endpoints. The youtube-transcript package has broken before when YouTube updated their frontend. Pin your dependency versions and monitor for issues.
Long videos = large payloads. A 3-hour video transcript can be 50K+ words. Consider pagination or streaming for very long content.
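The pagination idea from that last gotcha can be sketched in a few lines. `paginateSegments` and the `Segment` interface are my own illustrative names; the shape mirrors the segment objects used earlier:

```typescript
// Hedged sketch: chunk a long transcript into pages so a 3-hour video
// doesn't ship as one huge JSON payload. `Segment` mirrors the
// { text, offset, duration } shape used earlier in the post.
interface Segment {
  text: string;
  offset: number;
  duration: number;
}

function paginateSegments(segments: Segment[], page: number, perPage = 500) {
  const start = (page - 1) * perPage; // pages are 1-indexed
  return {
    page,
    totalPages: Math.max(1, Math.ceil(segments.length / perPage)),
    segments: segments.slice(start, start + perPage),
  };
}
```

The route handler could then accept a `page` field in the request body and return one chunk at a time, with `totalPages` letting the client know when to stop.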
Deployment
Deploy to Vercel with vercel --prod. The API routes become serverless functions automatically. Set your Upstash environment variables in the Vercel dashboard.
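One deployment detail worth knowing: serverless functions have an execution time cap, and a slow transcript fetch for a long video can bump into the default. Next.js lets a route opt into a longer limit via route segment config (the ceiling depends on your Vercel plan):

```typescript
// In app/api/transcript/route.ts -- raise the function's max execution
// time for slow transcript fetches (subject to your Vercel plan limits).
export const maxDuration = 30; // seconds
```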
This is essentially what powers ScripTube (scriptube.me). The architecture is simple — the complexity is in the edge cases: URL parsing, error handling, rate limiting, and formatting.
If you're building something similar, start simple. One input. One button. Ship it. Iterate based on what users actually need.