Hammer Nexon

Building a RAG Pipeline from YouTube Transcripts

YouTube is the world's largest repository of unstructured expert knowledge. Conference talks, technical tutorials, research paper walkthroughs, engineering deep-dives — millions of hours of valuable content locked inside video format.

In this article, I'll walk you through building a RAG (Retrieval-Augmented Generation) pipeline that transforms YouTube transcripts into a queryable, AI-powered knowledge base.

Why YouTube Transcripts?

Before we build, let's talk about why this data source is uniquely valuable:

  • Expert knowledge density: A 1-hour conference talk might contain insights that took years of experience to develop
  • Conversational format: Unlike papers, talks include practical examples, Q&A nuance, and real-world context
  • Volume: Major conferences alone produce hundreds of hours of technical content yearly
  • Freshness: YouTube content is often more current than published papers or textbooks

The problem? This knowledge is trapped in video. You can't search inside it, query it, or combine insights across talks. That's what we're fixing.

Architecture Overview

YouTube Videos
    ↓
Transcript Extraction (scriptube.me API)
    ↓
Text Preprocessing & Chunking
    ↓
Embedding Generation
    ↓
Vector Database (Pinecone/Chroma/Weaviate)
    ↓
RAG Query Pipeline
    ↓
LLM Response (with source citations)

Step 1: Transcript Extraction

The foundation of our pipeline is clean, reliable transcript extraction. I use ScripTube for this because it:

  • Handles various transcript sources (auto-generated, manual, multi-language)
  • Preserves timestamps (critical for citation)
  • Provides clean text output ready for processing
  • Works reliably at scale

import requests

def extract_transcript(video_url):
    """Extract transcript using ScripTube"""
    # Use scriptube.me to get the transcript
    # The clean output saves significant preprocessing time
    # Passing the video URL as a query parameter lets requests handle the encoding
    response = requests.get("https://scriptube.me/api/transcript", params={"url": video_url})
    response.raise_for_status()
    return response.json()

# Extract from a conference talk
transcript = extract_transcript("https://youtube.com/watch?v=example")

For batch extraction (like processing all talks from a conference), scriptube.me handles multiple videos efficiently — which matters when you're building a knowledge base from hundreds of talks.
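As a rough illustration of what that batch step could look like, here's a minimal sketch that fans the `extract_transcript()` helper above out over a thread pool. The endpoint is the same one used earlier; the worker count and error handling are my assumptions, not part of the ScripTube API.

from concurrent.futures import ThreadPoolExecutor, as_completed

def extract_transcripts_batch(video_urls, max_workers=4):
    """Fetch transcripts for several videos concurrently"""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract_transcript, url): url for url in video_urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Keep going if one video fails; log it for a retry pass later
                print(f"Failed to fetch {url}: {exc}")
    return results

Tune `max_workers` to whatever rate limits the API enforces for your plan.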

Step 2: Preprocessing & Chunking

YouTube transcripts have unique characteristics that affect chunking strategy:

from langchain.text_splitter import RecursiveCharacterTextSplitter
import re

def preprocess_transcript(raw_transcript):
    """Clean and prepare transcript for chunking"""
    # Remove filler words common in spoken content
    text = re.sub(r'\b(um|uh|like|you know)\b', '', raw_transcript['text'])

    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

def chunk_transcript(text, metadata):
    """Chunk with overlap, preserving context"""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=50,
        separators=["\n\n", "\n", ". ", " "]
    )

    chunks = splitter.create_documents(
        [text],
        metadatas=[{
            "source": metadata["video_url"],
            "title": metadata["title"],
            "speaker": metadata["speaker"],
            "channel": metadata["channel"],
            "date": metadata["upload_date"]
        }]
    )

    return chunks

Chunking Strategies That Work for Transcripts

  1. Fixed-size with overlap: 512 characters with a 50-character overlap (note that RecursiveCharacterTextSplitter counts characters by default, not tokens). Simple and effective for most use cases.

  2. Timestamp-based: If scriptube.me provides timestamps, use natural pauses as chunk boundaries:

def chunk_by_timestamp(timestamped_transcript, max_chunk_seconds=120):
    """Group transcript segments by time windows"""
    chunks = []
    current_chunk = []
    chunk_start = 0

    for segment in timestamped_transcript:
        if segment['start'] - chunk_start > max_chunk_seconds and current_chunk:
            chunks.append({
                'text': ' '.join([s['text'] for s in current_chunk]),
                'start_time': chunk_start,
                'end_time': current_chunk[-1]['start']
            })
            current_chunk = [segment]
            chunk_start = segment['start']
        else:
            current_chunk.append(segment)

    # Don't drop the trailing segments left over after the loop ends
    if current_chunk:
        chunks.append({
            'text': ' '.join([s['text'] for s in current_chunk]),
            'start_time': chunk_start,
            'end_time': current_chunk[-1]['start']
        })

    return chunks
  3. Topic-based: Use a lightweight classifier to detect topic shifts — most effective for long talks. A rough sketch follows below.
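The "classifier" can be as simple as comparing embeddings of adjacent sentences and splitting where similarity drops. A minimal sketch, reusing the same OpenAI embedding model used later in the pipeline; the 0.75 threshold is an arbitrary starting point, not a tuned value.

import numpy as np
from langchain.embeddings import OpenAIEmbeddings

def chunk_by_topic(sentences, similarity_threshold=0.75):
    """Start a new chunk wherever adjacent sentences drift apart semantically"""
    if not sentences:
        return []

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vectors = np.array(embeddings.embed_documents(sentences))
    # Normalize so the dot product of two rows is their cosine similarity
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if float(vectors[i - 1] @ vectors[i]) < similarity_threshold:
            chunks.append(' '.join(current))
            current = []
        current.append(sentences[i])
    chunks.append(' '.join(current))
    return chunks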

Step 3: Embedding & Storage

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

def build_vector_store(chunks, collection_name="youtube_knowledge"):
    """Create vector store from transcript chunks"""
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        collection_name=collection_name,
        persist_directory="./chroma_db"
    )

    return vectorstore

Pro tip: Embed the metadata as part of the text for better retrieval:

def enrich_chunk_text(chunk):
    """Prepend metadata for better semantic search"""
    prefix = f"Speaker: {chunk.metadata['speaker']}. "
    prefix += f"Topic: {chunk.metadata['title']}. "
    chunk.page_content = prefix + chunk.page_content
    return chunk

Step 4: RAG Query Pipeline

from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

template = """You are a knowledgeable assistant with access to expert talks and lectures.
Use the following transcript excerpts to answer the question. 
Always cite which speaker/talk your information comes from.
If the transcripts don't contain relevant information, say so.

Context from transcripts:
{context}

Question: {question}

Answer (with citations):"""

def build_rag_chain(vectorstore):
    """Build the RAG query chain"""
    llm = ChatOpenAI(model="gpt-4", temperature=0)

    prompt = PromptTemplate(
        template=template,
        input_variables=["context", "question"]
    )

    chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vectorstore.as_retriever(
            search_type="mmr",  # Maximum Marginal Relevance for diversity
            search_kwargs={"k": 5}
        ),
        chain_type_kwargs={"prompt": prompt},
        return_source_documents=True
    )

    return chain

# Query your knowledge base
chain = build_rag_chain(vectorstore)
result = chain({"query": "What are the best practices for fine-tuning LLMs discussed at recent conferences?"})
print(result["result"])

Step 5: Putting It All Together

Here's the complete pipeline for building a knowledge base from a YouTube playlist or channel:

class YouTubeKnowledgeBase:
    def __init__(self, collection_name="youtube_kb"):
        self.collection_name = collection_name
        self.vectorstore = None
        self.chain = None

    def ingest_videos(self, video_urls):
        """Ingest a list of YouTube videos"""
        all_chunks = []

        for url in video_urls:
            # Extract transcript via scriptube.me
            transcript = extract_transcript(url)

            # Preprocess
            clean_text = preprocess_transcript(transcript)

            # Chunk
            chunks = chunk_transcript(clean_text, transcript['metadata'])

            # Enrich
            chunks = [enrich_chunk_text(c) for c in chunks]
            all_chunks.extend(chunks)

        # Build vector store
        self.vectorstore = build_vector_store(all_chunks, self.collection_name)
        self.chain = build_rag_chain(self.vectorstore)

        print(f"Ingested {len(video_urls)} videos, {len(all_chunks)} chunks")

    def query(self, question):
        """Query the knowledge base"""
        result = self.chain({"query": question})
        return {
            "answer": result["result"],
            "sources": [
                {
                    "title": doc.metadata["title"],
                    "speaker": doc.metadata["speaker"],
                    "excerpt": doc.page_content[:200]
                }
                for doc in result["source_documents"]
            ]
        }

# Usage
kb = YouTubeKnowledgeBase("ml_conferences_2024")

# Ingest all talks from a conference
conference_urls = [
    "https://youtube.com/watch?v=talk1",
    "https://youtube.com/watch?v=talk2",
    # ... hundreds of talks
]

kb.ingest_videos(conference_urls)

# Query with natural language
result = kb.query("What novel approaches to context windows were discussed?")
print(result["answer"])

Optimization Tips

  1. Re-rank before LLM: Use a cross-encoder to re-rank retrieved chunks before feeding them to the LLM. This dramatically improves answer quality; a sketch follows after this list.

  2. Hybrid search: Combine dense (embedding) and sparse (BM25) retrieval for better recall.

  3. Transcript quality matters: Garbage in, garbage out. This is why I use scriptube.me — clean transcripts mean clean embeddings mean better retrieval.

  4. Metadata filtering: Store rich metadata so you can filter by speaker, date, conference, or topic before semantic search.

  5. Incremental updates: Design your pipeline to add new transcripts without rebuilding the entire index.
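Here's what the re-ranking tip could look like in practice: a minimal sketch using the sentence-transformers CrossEncoder on top of the Chroma store built earlier. The model name, the candidate count of 20, and the final cut of 5 are reasonable defaults I'm assuming, not values from the pipeline above.

from sentence_transformers import CrossEncoder

def rerank_chunks(query, documents, top_n=5):
    """Re-score retrieved chunks with a cross-encoder and keep the strongest matches"""
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, doc.page_content) for doc in documents])
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

# Retrieve a wide candidate set first, then re-rank down to the best few
query = "What are the best practices for fine-tuning LLMs?"
candidates = vectorstore.as_retriever(search_kwargs={"k": 20}).get_relevant_documents(query)
best_chunks = rerank_chunks(query, candidates)

The idea is to let the cheap embedding search cast a wide net and let the slower, more accurate cross-encoder pick what actually reaches the prompt.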

What You Can Build With This

  • Personal research assistant grounded in expert knowledge (not hallucinations)
  • Conference digest tool — "summarize everything about [topic] from [conference]"
  • Technical learning system — query across hundreds of tutorials
  • Competitive intelligence — analyze industry talks systematically

The knowledge is on YouTube. Transcripts (via scriptube.me) make it extractable. RAG makes it queryable.

Start building. The world's experts are waiting to be your knowledge base.

