Hammer Nexon

Building a RAG Pipeline from YouTube Transcripts

YouTube is the world's largest repository of unstructured expert knowledge. Conference talks, technical tutorials, research paper walkthroughs, engineering deep-dives — millions of hours of valuable content locked inside video format.

In this article, I'll walk you through building a RAG (Retrieval-Augmented Generation) pipeline that transforms YouTube transcripts into a queryable, AI-powered knowledge base.

Why YouTube Transcripts?

Before we build, let's talk about why this data source is uniquely valuable:

  • Expert knowledge density: A 1-hour conference talk might contain insights that took years of experience to develop
  • Conversational format: Unlike papers, talks include practical examples, Q&A nuance, and real-world context
  • Volume: Major conferences alone produce hundreds of hours of technical content yearly
  • Freshness: YouTube content is often more current than published papers or textbooks

The problem? This knowledge is trapped in video. You can't search inside it, query it, or combine insights across talks. That's what we're fixing.

Architecture Overview

YouTube Videos
    ↓
Transcript Extraction (scriptube.me API)
    ↓
Text Preprocessing & Chunking
    ↓
Embedding Generation
    ↓
Vector Database (Pinecone/Chroma/Weaviate)
    ↓
RAG Query Pipeline
    ↓
LLM Response (with source citations)

Step 1: Transcript Extraction

The foundation of our pipeline is clean, reliable transcript extraction. I use ScripTube for this because it:

  • Handles various transcript sources (auto-generated, manual, multi-language)
  • Preserves timestamps (critical for citation)
  • Provides clean text output ready for processing
  • Works reliably at scale

import requests

def extract_transcript(video_url):
    """Extract transcript using ScripTube"""
    # Use scriptube.me to get the transcript
    # The clean output saves significant preprocessing time
    # Passing the video URL as a query parameter lets requests handle the encoding
    response = requests.get("https://scriptube.me/api/transcript", params={"url": video_url})
    response.raise_for_status()
    return response.json()

# Extract from a conference talk
transcript = extract_transcript("https://youtube.com/watch?v=example")

For batch extraction (like processing all talks from a conference), scriptube.me handles multiple videos efficiently — which matters when you're building a knowledge base from hundreds of talks.
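As a rough illustration of what that batch step could look like, here's a minimal sketch that fans the `extract_transcript()` helper above out over a thread pool. The endpoint is the same one used earlier; the worker count and error handling are my assumptions, not part of the ScripTube API.

from concurrent.futures import ThreadPoolExecutor, as_completed

def extract_transcripts_batch(video_urls, max_workers=4):
    """Fetch transcripts for several videos concurrently"""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract_transcript, url): url for url in video_urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Keep going if one video fails; log it for a retry pass later
                print(f"Failed to fetch {url}: {exc}")
    return results

Tune `max_workers` to whatever rate limits the API enforces for your plan.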

Step 2: Preprocessing & Chunking

YouTube transcripts have unique characteristics that affect chunking strategy:

from langchain.text_splitter import RecursiveCharacterTextSplitter
import re

def preprocess_transcript(raw_transcript):
    """Clean and prepare transcript for chunking"""
    # Remove filler words common in spoken content
    text = re.sub(r'\b(um|uh|like|you know)\b', '', raw_transcript['text'])

    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

def chunk_transcript(text, metadata):
    """Chunk with overlap, preserving context"""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=50,
        separators=["\n\n", "\n", ". ", " "]
    )

    chunks = splitter.create_documents(
        [text],
        metadatas=[{
            "source": metadata["video_url"],
            "title": metadata["title"],
            "speaker": metadata["speaker"],
            "channel": metadata["channel"],
            "date": metadata["upload_date"]
        }]
    )

    return chunks

Chunking Strategies That Work for Transcripts

  1. Fixed-size with overlap: 512 characters with a 50-character overlap (note that RecursiveCharacterTextSplitter counts characters by default, not tokens). Simple and effective for most use cases.

  2. Timestamp-based: If scriptube.me provides timestamps, use natural pauses as chunk boundaries:

def chunk_by_timestamp(timestamped_transcript, max_chunk_seconds=120):
    """Group transcript segments by time windows"""
    chunks = []
    current_chunk = []
    chunk_start = 0

    for segment in timestamped_transcript:
        if segment['start'] - chunk_start > max_chunk_seconds and current_chunk:
            chunks.append({
                'text': ' '.join([s['text'] for s in current_chunk]),
                'start_time': chunk_start,
                'end_time': current_chunk[-1]['start']
            })
            current_chunk = [segment]
            chunk_start = segment['start']
        else:
            current_chunk.append(segment)

    # Don't drop the trailing segments left over after the loop ends
    if current_chunk:
        chunks.append({
            'text': ' '.join([s['text'] for s in current_chunk]),
            'start_time': chunk_start,
            'end_time': current_chunk[-1]['start']
        })

    return chunks
  3. Topic-based: Use a lightweight classifier to detect topic shifts — most effective for long talks. A rough sketch follows below.
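The "classifier" can be as simple as comparing embeddings of adjacent sentences and splitting where similarity drops. A minimal sketch, reusing the same OpenAI embedding model used later in the pipeline; the 0.75 threshold is an arbitrary starting point, not a tuned value.

import numpy as np
from langchain.embeddings import OpenAIEmbeddings

def chunk_by_topic(sentences, similarity_threshold=0.75):
    """Start a new chunk wherever adjacent sentences drift apart semantically"""
    if not sentences:
        return []

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vectors = np.array(embeddings.embed_documents(sentences))
    # Normalize so the dot product of two rows is their cosine similarity
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if float(vectors[i - 1] @ vectors[i]) < similarity_threshold:
            chunks.append(' '.join(current))
            current = []
        current.append(sentences[i])
    chunks.append(' '.join(current))
    return chunks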

Step 3: Embedding & Storage

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

def build_vector_store(chunks, collection_name="youtube_knowledge"):
    """Create vector store from transcript chunks"""
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        collection_name=collection_name,
        persist_directory="./chroma_db"
    )

    return vectorstore

Pro tip: Embed the metadata as part of the text for better retrieval:

def enrich_chunk_text(chunk):
    """Prepend metadata for better semantic search"""
    prefix = f"Speaker: {chunk.metadata['speaker']}. "
    prefix += f"Topic: {chunk.metadata['title']}. "
    chunk.page_content = prefix + chunk.page_content
    return chunk

Step 4: RAG Query Pipeline

from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

template = """You are a knowledgeable assistant with access to expert talks and lectures.
Use the following transcript excerpts to answer the question. 
Always cite which speaker/talk your information comes from.
If the transcripts don't contain relevant information, say so.

Context from transcripts:
{context}

Question: {question}

Answer (with citations):"""

def build_rag_chain(vectorstore):
    """Build the RAG query chain"""
    llm = ChatOpenAI(model="gpt-4", temperature=0)

    prompt = PromptTemplate(
        template=template,
        input_variables=["context", "question"]
    )

    chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vectorstore.as_retriever(
            search_type="mmr",  # Maximum Marginal Relevance for diversity
            search_kwargs={"k": 5}
        ),
        chain_type_kwargs={"prompt": prompt},
        return_source_documents=True
    )

    return chain

# Query your knowledge base
chain = build_rag_chain(vectorstore)
result = chain({"query": "What are the best practices for fine-tuning LLMs discussed at recent conferences?"})
print(result["result"])

Step 5: Putting It All Together

Here's the complete pipeline for building a knowledge base from a YouTube playlist or channel:

class YouTubeKnowledgeBase:
    def __init__(self, collection_name="youtube_kb"):
        self.collection_name = collection_name
        self.vectorstore = None
        self.chain = None

    def ingest_videos(self, video_urls):
        """Ingest a list of YouTube videos"""
        all_chunks = []

        for url in video_urls:
            # Extract transcript via scriptube.me
            transcript = extract_transcript(url)

            # Preprocess
            clean_text = preprocess_transcript(transcript)

            # Chunk
            chunks = chunk_transcript(clean_text, transcript['metadata'])

            # Enrich
            chunks = [enrich_chunk_text(c) for c in chunks]
            all_chunks.extend(chunks)

        # Build vector store
        self.vectorstore = build_vector_store(all_chunks, self.collection_name)
        self.chain = build_rag_chain(self.vectorstore)

        print(f"Ingested {len(video_urls)} videos, {len(all_chunks)} chunks")

    def query(self, question):
        """Query the knowledge base"""
        result = self.chain({"query": question})
        return {
            "answer": result["result"],
            "sources": [
                {
                    "title": doc.metadata["title"],
                    "speaker": doc.metadata["speaker"],
                    "excerpt": doc.page_content[:200]
                }
                for doc in result["source_documents"]
            ]
        }

# Usage
kb = YouTubeKnowledgeBase("ml_conferences_2024")

# Ingest all talks from a conference
conference_urls = [
    "https://youtube.com/watch?v=talk1",
    "https://youtube.com/watch?v=talk2",
    # ... hundreds of talks
]

kb.ingest_videos(conference_urls)

# Query with natural language
result = kb.query("What novel approaches to context windows were discussed?")
print(result["answer"])

Optimization Tips

  1. Re-rank before LLM: Use a cross-encoder to re-rank retrieved chunks before feeding them to the LLM. This dramatically improves answer quality; a sketch follows after this list.

  2. Hybrid search: Combine dense (embedding) and sparse (BM25) retrieval for better recall.

  3. Transcript quality matters: Garbage in, garbage out. This is why I use scriptube.me — clean transcripts mean clean embeddings mean better retrieval.

  4. Metadata filtering: Store rich metadata so you can filter by speaker, date, conference, or topic before semantic search.

  5. Incremental updates: Design your pipeline to add new transcripts without rebuilding the entire index.
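Here's what the re-ranking tip could look like in practice: a minimal sketch using the sentence-transformers CrossEncoder on top of the Chroma store built earlier. The model name, the candidate count of 20, and the final cut of 5 are reasonable defaults I'm assuming, not values from the pipeline above.

from sentence_transformers import CrossEncoder

def rerank_chunks(query, documents, top_n=5):
    """Re-score retrieved chunks with a cross-encoder and keep the strongest matches"""
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, doc.page_content) for doc in documents])
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

# Retrieve a wide candidate set first, then re-rank down to the best few
query = "What are the best practices for fine-tuning LLMs?"
candidates = vectorstore.as_retriever(search_kwargs={"k": 20}).get_relevant_documents(query)
best_chunks = rerank_chunks(query, candidates)

The idea is to let the cheap embedding search cast a wide net and let the slower, more accurate cross-encoder pick what actually reaches the prompt.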

What You Can Build With This

  • Personal research assistant grounded in expert knowledge (not hallucinations)
  • Conference digest tool — "summarize everything about [topic] from [conference]"
  • Technical learning system — query across hundreds of tutorials
  • Competitive intelligence — analyze industry talks systematically

The knowledge is on YouTube. Transcripts (via scriptube.me) make it extractable. RAG makes it queryable.

Start building. The world's experts are waiting to be your knowledge base.

