Hammer Nexon
How to Create a Custom AI Tutor from Any YouTube Channel

What if you could take any YouTube expert — a professor, a technical instructor, a thought leader — and turn their entire channel into an interactive AI tutor?

Not a generic chatbot. A tutor that knows everything that specific person has taught, can answer questions with their perspective, and can quiz you on their material.

That's what we're building today.

The Concept

YouTube channels are essentially free courses. Many professors, engineers, and educators have hundreds of hours of content on their channels. The problem:

  • It's video — linear and passive
  • You can't ask questions
  • You can't search across videos
  • No quizzes, no interaction

We're going to fix all of that.

What You'll Need

  • Python 3.8+
  • An OpenAI API key (or Anthropic for Claude)
  • scriptube.me for transcript extraction
  • About 30 minutes

Step 1: Choose Your "Instructor"

Find a YouTube channel with substantial educational content. Good candidates:

  • University professors with lecture series
  • Technical tutorial creators
  • Industry experts with deep-dive content

For this tutorial, let's say we're building a tutor from a machine learning educator's channel with 100+ videos.

Step 2: Extract All Transcripts

import json
import time

def extract_channel_transcripts(video_urls):
    """
    Extract transcripts from all videos in a channel.
    Uses scriptube.me for clean, reliable extraction.
    """
    transcripts = []

    for url in video_urls:
        try:
            transcript = extract_transcript(url)  # via scriptube.me
            transcripts.append({
                'url': url,
                'title': transcript['title'],
                'text': transcript['text'],
                'duration': transcript.get('duration', 0)
            })
            print(f"✓ Extracted: {transcript['title']}")
            time.sleep(1)  # Be respectful with rate limiting
        except Exception as e:
            print(f"✗ Failed: {url} ({e})")

    # Save for later use
    with open('channel_transcripts.json', 'w') as f:
        json.dump(transcripts, f)

    return transcripts

# Your list of video URLs from the channel
video_urls = get_channel_video_urls("CHANNEL_ID")  # Use YouTube API
transcripts = extract_channel_transcripts(video_urls)
print(f"Extracted {len(transcripts)} transcripts")

scriptube.me handles the heavy lifting here — auto-generated captions, manual subs, different languages. Clean text output every time.
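Step 2 calls an `extract_transcript` helper that the snippet leaves undefined; it stands in for whatever scriptube.me exposes. One piece any implementation needs is pulling the video ID out of the URL. A minimal sketch (the helper name here is illustrative, not part of scriptube.me's interface):

```python
import re

def video_id_from_url(url):
    """Extract the 11-character YouTube video ID from common URL shapes.

    Handles watch URLs, youtu.be short links, /embed/ and /shorts/ paths.
    Returns None when no ID is present.
    """
    match = re.search(r"(?:v=|youtu\.be/|/embed/|/shorts/)([A-Za-z0-9_-]{11})", url)
    return match.group(1) if match else None

print(video_id_from_url("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))  # dQw4w9WgXcQ
print(video_id_from_url("https://youtu.be/dQw4w9WgXcQ"))                 # dQw4w9WgXcQ
```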

Step 3: Build the Tutor's Knowledge Base

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

def build_tutor_knowledge_base(transcripts):
    """Build a searchable knowledge base from channel transcripts"""

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=100
    )

    all_docs = []
    for t in transcripts:
        docs = splitter.create_documents(
            texts=[t['text']],
            metadatas=[{
                'title': t['title'],
                'url': t['url'],
                'source': 'youtube_channel'
            }]
        )
        all_docs.extend(docs)

    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_documents(all_docs, embeddings)
    vectorstore.save_local("tutor_index")

    print(f"Knowledge base built: {len(all_docs)} chunks from {len(transcripts)} videos")
    return vectorstore

Step 4: Create the AI Tutor

Here's where it gets exciting. We're building a tutor that can:

  • Explain concepts from the channel's content
  • Answer questions with context from actual lectures
  • Quiz you on the material
  • Suggest what to study next

from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import ConversationalRetrievalChain

TUTOR_SYSTEM_PROMPT = """You are an AI tutor based on the teachings of a YouTube educator.
Your knowledge comes from their actual video transcripts.

Your responsibilities:
1. Explain concepts clearly, using examples from the lectures
2. When answering, reference which video/lecture the information comes from
3. If asked to quiz, create questions based on the actual content
4. Adapt your explanations to the student's level
5. If something wasn't covered in the transcripts, say so honestly

Always be encouraging and patient, like the best tutors are.

Context from lectures:
{context}

Student's question: {question}

Your response:"""

class AITutor:
    def __init__(self, vectorstore_path="tutor_index"):
        embeddings = OpenAIEmbeddings()
        self.vectorstore = FAISS.load_local(vectorstore_path, embeddings)

        self.memory = ConversationBufferWindowMemory(
            memory_key="chat_history",
            return_messages=True,
            k=10  # Remember the last 10 exchanges
        )

        self.llm = ChatOpenAI(model="gpt-4", temperature=0.3)

        # The combine-docs prompt only receives {context} and {question};
        # chat history is handled by the chain's question-condensing step.
        self.chain = ConversationalRetrievalChain.from_llm(
            llm=self.llm,
            retriever=self.vectorstore.as_retriever(search_kwargs={"k": 4}),
            memory=self.memory,
            combine_docs_chain_kwargs={
                "prompt": PromptTemplate.from_template(TUTOR_SYSTEM_PROMPT)
            }
        )

    def ask(self, question):
        """Ask your AI tutor a question"""
        result = self.chain({"question": question})
        return result["answer"]

    def quiz(self, topic=None):
        """Get quizzed on the material"""
        if topic:
            question = f"Create a 5-question quiz about {topic} based on the lecture material. Include answers."
        else:
            question = "Create a 5-question quiz covering the most important concepts from the lectures. Include answers."
        return self.ask(question)

    def study_plan(self, goal):
        """Get a personalized study plan"""
        question = f"Based on all the lecture content available, create a structured study plan for someone who wants to: {goal}"
        return self.ask(question)

# Use your tutor
tutor = AITutor()

# Ask questions
print(tutor.ask("Explain backpropagation in simple terms"))
print(tutor.ask("How did the instructor explain gradient descent?"))

# Get quizzed
print(tutor.quiz("neural networks"))

# Get a study plan
print(tutor.study_plan("understand transformers from scratch"))
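`tutor.quiz()` comes back as free text. If you want structured question/answer pairs (say, to track wrong answers for the spaced-repetition idea in Next Steps), a small parser works, assuming you prompt the model to label lines `Q:`/`A:` (a formatting convention you'd have to request, not something GPT-4 guarantees by default):

```python
import re

def parse_quiz(text):
    """Parse 'Q: ... / A: ...' formatted quiz text into (question, answer) pairs.

    Assumes each question line starts with 'Q' and each answer with 'A',
    optionally numbered (e.g. 'Q1:'). Lines matching neither are ignored.
    """
    pairs = []
    question = None
    for line in text.splitlines():
        line = line.strip()
        q = re.match(r"Q\d*[:.)]\s*(.+)", line)
        a = re.match(r"A\d*[:.)]\s*(.+)", line)
        if q:
            question = q.group(1)
        elif a and question:
            pairs.append((question, a.group(1)))
            question = None
    return pairs

quiz_text = """Q1: What does backpropagation compute?
A1: Gradients of the loss with respect to each weight.
Q2: Why use a learning rate?
A2: To control the step size of each update."""
print(parse_quiz(quiz_text))  # two (question, answer) tuples
```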

Step 5: Add a Web Interface (Optional)

import gradio as gr

tutor = AITutor()

def chat(message, history):
    response = tutor.ask(message)
    return response

demo = gr.ChatInterface(
    fn=chat,
    title="🎓 AI Tutor — Powered by YouTube Knowledge",
    description="Ask me anything about the course material. I answer from the instructor's actual lectures.",
    examples=[
        "Explain the key concepts from the first lecture",
        "Quiz me on what we've covered",
        "What should I study next?",
        "Compare the approaches discussed in lectures 3 and 7"
    ]
)

demo.launch()

The Result

You now have a personal AI tutor that:

  • ✅ Knows everything a specific YouTube educator has taught
  • ✅ Can answer unlimited questions about the material
  • ✅ Generates quizzes and study plans
  • ✅ References specific videos as sources
  • ✅ Maintains conversation context for follow-up questions

The key insight: YouTube already has the world's best educators sharing knowledge for free. scriptube.me extracts that knowledge into text. AI makes it interactive.

You can build this for ANY YouTube channel — coding tutorials, science lectures, language lessons, business education.

Every expert on YouTube can become your personal AI tutor.

Next Steps

  • Add spaced repetition (track what you got wrong, re-quiz later)
  • Support multiple channels for cross-referencing
  • Add citation links that deep-link to the specific video timestamp
  • Build a progress tracker
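The timestamp deep-link bullet is mostly string formatting, assuming your stored chunks carry a start-time offset in seconds (scriptube.me's exact output fields may differ). A sketch:

```python
def timestamp_link(video_url, seconds):
    """Build a YouTube deep link that starts playback at a given second.

    YouTube accepts a t parameter on watch URLs and youtu.be links
    (e.g. &t=95s). Fractional offsets are truncated to whole seconds.
    """
    separator = "&" if "?" in video_url else "?"
    return f"{video_url}{separator}t={int(seconds)}s"

print(timestamp_link("https://www.youtube.com/watch?v=abc123def45", 95.4))
# https://www.youtube.com/watch?v=abc123def45&t=95s
```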

The $0 university is real. The tools are here. Start building.

Transcripts: scriptube.me


Article 3: Automating Knowledge Extraction: YouTube → Transcripts → Vector DB → AI

Tags: #ai #automation #vectordatabase #knowledge


Knowledge is everywhere. On YouTube alone, there are tens of millions of hours of expert content — conference talks, university lectures, technical deep-dives, industry interviews. The problem isn't access. It's extraction and organization.

In this article, I'll show you how to build an automated pipeline that continuously extracts knowledge from YouTube, processes it, stores it in a vector database, and makes it queryable through AI.

Think of it as building your own personal knowledge engine that gets smarter every day.

System Architecture

┌─────────────────────────────────────────────┐
│                 Input Layer                   │
│  YouTube Channels / Playlists / Search       │
│  (RSS feeds, YouTube API, manual curation)   │
└──────────────────┬──────────────────────────┘
                   ↓
┌─────────────────────────────────────────────┐
│            Extraction Layer                   │
│  scriptube.me — Transcript extraction         │
│  (handles auto-captions, manual subs, i18n)  │
└──────────────────┬──────────────────────────┘
                   ↓
┌─────────────────────────────────────────────┐
│           Processing Layer                    │
│  Cleaning → Chunking → Metadata enrichment   │
│  (NLP preprocessing, speaker detection)      │
└──────────────────┬──────────────────────────┘
                   ↓
┌─────────────────────────────────────────────┐
│            Storage Layer                      │
│  Vector DB (embeddings) + Document DB (raw)  │
│  (Pinecone/Chroma + PostgreSQL/MongoDB)      │
└──────────────────┬──────────────────────────┘
                   ↓
┌─────────────────────────────────────────────┐
│             Query Layer                       │
│  RAG Pipeline → LLM → Cited Responses       │
│  (API, CLI, or Web Interface)                │
└─────────────────────────────────────────────┘

Part 1: Automated Ingestion

The first challenge is knowing WHAT to ingest. We want our system to automatically process new content from sources we care about.

import feedparser
from datetime import datetime, timedelta
import schedule

class KnowledgeIngester:
    def __init__(self, db, vectorstore):
        self.db = db  # Document database for raw transcripts
        self.vectorstore = vectorstore
        self.sources = []

    def add_channel(self, channel_id, category=None):
        """Subscribe to a YouTube channel for automatic ingestion"""
        self.sources.append({
            'type': 'channel',
            'id': channel_id,
            'category': category,
            'rss_url': f'https://www.youtube.com/feeds/videos.xml?channel_id={channel_id}'
        })

    def add_playlist(self, playlist_id, category=None):
        """Subscribe to a playlist"""
        self.sources.append({
            'type': 'playlist',
            'id': playlist_id,
            'category': category
        })

    def check_new_content(self):
        """Check all sources for new videos"""
        new_videos = []

        for source in self.sources:
            if source['type'] == 'channel':
                feed = feedparser.parse(source['rss_url'])
                for entry in feed.entries:
                    video_url = entry.link
                    if not self.db.is_processed(video_url):
                        new_videos.append({
                            'url': video_url,
                            'title': entry.title,
                            'published': entry.published,
                            'category': source.get('category'),
                            'channel_id': source['id']
                        })

        return new_videos

    def ingest_video(self, video_info):
        """Full pipeline for a single video"""
        # Step 1: Extract transcript via scriptube.me
        transcript = extract_transcript(video_info['url'])

        # Step 2: Preprocess
        cleaned = preprocess_transcript(transcript)

        # Step 3: Chunk with metadata
        chunks = smart_chunk(cleaned, {
            **video_info,
            'speaker': transcript.get('speaker', 'Unknown'),
            'duration': transcript.get('duration', 0)
        })

        # Step 4: Store raw in document DB
        self.db.store_transcript(video_info['url'], transcript)

        # Step 5: Embed and store in vector DB
        self.vectorstore.add_documents(chunks)

        print(f"✓ Ingested: {video_info['title']} ({len(chunks)} chunks)")

    def run_ingestion_cycle(self):
        """Check for and process new content"""
        new_videos = self.check_new_content()
        print(f"Found {len(new_videos)} new videos")

        for video in new_videos:
            try:
                self.ingest_video(video)
            except Exception as e:
                print(f"✗ Error processing {video['url']}: {e}")

# Set up the ingester
ingester = KnowledgeIngester(db, vectorstore)

# Subscribe to channels
ingester.add_channel("UC_CHANNEL_1", category="machine_learning")
ingester.add_channel("UC_CHANNEL_2", category="software_engineering")
ingester.add_channel("UC_CHANNEL_3", category="business")

# Run on a schedule. Note: schedule only fires jobs inside a run_pending loop.
import time

schedule.every(6).hours.do(ingester.run_ingestion_cycle)
while True:
    schedule.run_pending()
    time.sleep(60)
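The `db` object above is left abstract. As a minimal stand-in for `is_processed`, here's a file-backed registry so ingestion cycles survive restarts; a real deployment would use PostgreSQL or MongoDB as the architecture diagram suggests. (`ProcessedRegistry` is an illustrative name, not an existing library class.)

```python
import json
import os
import tempfile
from pathlib import Path

class ProcessedRegistry:
    """Tracks which video URLs have already been ingested.

    Persists the set of processed URLs as a JSON list on disk, so
    repeated ingestion cycles skip videos they have already handled.
    """
    def __init__(self, path="processed.json"):
        self.path = Path(path)
        self.seen = set(json.loads(self.path.read_text())) if self.path.exists() else set()

    def is_processed(self, url):
        return url in self.seen

    def mark_processed(self, url):
        self.seen.add(url)
        self.path.write_text(json.dumps(sorted(self.seen)))

# Demo with a throwaway path so repeated runs start clean
path = os.path.join(tempfile.mkdtemp(), "processed.json")
registry = ProcessedRegistry(path)
registry.mark_processed("https://youtu.be/abc12345678")
print(registry.is_processed("https://youtu.be/abc12345678"))  # True
```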

Part 2: Smart Chunking for Transcripts

Spoken content requires different chunking strategies than written text:

import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def smart_chunk(text, metadata, max_chunk_size=1000):
    """
    Intelligent chunking that respects topic boundaries.
    Spoken content shifts topics more fluidly than written text,
    so we use sentence-level analysis.
    """
    doc = nlp(text)
    sentences = list(doc.sents)

    chunks = []
    current_chunk = []
    current_size = 0

    for sent in sentences:
        sent_text = sent.text.strip()
        sent_size = len(sent_text)

        if current_size + sent_size > max_chunk_size and current_chunk:
            chunk_text = ' '.join(current_chunk)
            chunks.append({
                'text': chunk_text,
                'metadata': {
                    **metadata,
                    'chunk_index': len(chunks),
                    'chunk_type': 'topic_segment'
                }
            })
            # Keep last sentence for overlap
            current_chunk = [current_chunk[-1], sent_text]
            current_size = len(current_chunk[0]) + sent_size
        else:
            current_chunk.append(sent_text)
            current_size += sent_size

    # Don't forget the last chunk
    if current_chunk:
        chunks.append({
            'text': ' '.join(current_chunk),
            'metadata': {
                **metadata,
                'chunk_index': len(chunks),
                'chunk_type': 'topic_segment'
            }
        })

    return chunks
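If you'd rather avoid the spacy dependency, the same sliding one-sentence overlap can be sketched with a naive regex splitter (worse on abbreviations, fine for rough chunking):

```python
import re

def naive_chunk(text, max_chunk_size=70):
    """Sentence-based chunking with one-sentence overlap, no NLP dependency.

    Splits on sentence-ending punctuation, so abbreviations like "e.g."
    will be over-split; spacy's sentencizer handles those cases better.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current, size = [], [], 0
    for sent in sentences:
        if size + len(sent) > max_chunk_size and current:
            chunks.append(" ".join(current))
            current = [current[-1]]  # carry the last sentence over as overlap
            size = len(current[0])
        current.append(sent)
        size += len(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks

text = ("Gradient descent updates weights. The learning rate scales each step. "
        "Momentum smooths noisy gradients. Batch size trades speed for stability.")
chunks = naive_chunk(text)
print(len(chunks))  # 3 overlapping chunks
```

Each chunk begins with the sentence that ended the previous one, so a retrieval hit near a chunk boundary still carries its lead-in context.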

Part 3: The Query Interface

from anthropic import Anthropic

class KnowledgeEngine:
    def __init__(self, vectorstore):
        self.vectorstore = vectorstore
        self.client = Anthropic()

    def query(self, question, filters=None, k=5):
        """
        Query the knowledge base with optional filters.
        Returns an AI response grounded in expert transcripts.
        """
        # Retrieve relevant chunks
        search_kwargs = {"k": k}
        if filters:
            search_kwargs["filter"] = filters

        docs = self.vectorstore.similarity_search(question, **search_kwargs)

        # Build context
        context = "\n\n---\n\n".join([
            f"Source: {doc.metadata['title']} (by {doc.metadata.get('speaker', 'Unknown')})\n"
            f"Content: {doc.page_content}"
            for doc in docs
        ])

        # Generate response
        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=2000,
            messages=[{
                "role": "user",
                "content": f"""Based on the following expert transcript excerpts, answer the question.
                Always cite which source you're drawing from.

                SOURCES:
                {context}

                QUESTION: {question}

                Provide a thorough, well-cited answer:"""
            }]
        )

        return {
            "answer": response.content[0].text,
            "sources": [
                {"title": d.metadata["title"], "url": d.metadata.get("url", "")}
                for d in docs
            ]
        }

    def compare(self, topic, speakers=None):
        """Compare perspectives on a topic across different experts"""
        question = f"What are the different perspectives on {topic}?"
        if speakers:
            question += f" Focus on views from: {', '.join(speakers)}"
        return self.query(question, k=10)

    def summarize_recent(self, category=None, days=7):
        """Summarize recently ingested content"""
        filters = {}
        if category:
            filters["category"] = category

        return self.query(
            f"Summarize the key themes and insights from content added in the last {days} days",
            filters=filters,
            k=15
        )

# Usage
engine = KnowledgeEngine(vectorstore)

# Ask anything
result = engine.query("What are the latest approaches to prompt engineering?")
print(result["answer"])

# Compare expert views
result = engine.compare("scaling laws", speakers=["Ilya Sutskever", "Andrej Karpathy"])
print(result["answer"])

# Weekly digest
result = engine.summarize_recent(category="machine_learning", days=7)
print(result["answer"])

Part 4: Deployment

For a production-ready system, wrap it in a FastAPI service:

from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Knowledge Engine API")

class QueryRequest(BaseModel):
    question: str
    category: Optional[str] = None
    max_results: int = 5

class QueryResponse(BaseModel):
    answer: str
    sources: list

@app.post("/query", response_model=QueryResponse)
async def query_knowledge(request: QueryRequest):
    filters = {"category": request.category} if request.category else None
    result = engine.query(request.question, filters=filters, k=request.max_results)
    return result

@app.post("/ingest")
async def trigger_ingestion():
    ingester.run_ingestion_cycle()
    return {"status": "ingestion complete"}

@app.get("/stats")
async def get_stats():
    return {
        "total_videos": db.count_videos(),
        "total_chunks": vectorstore.count(),
        "sources": len(ingester.sources)
    }

The Big Picture

What we've built is a knowledge extraction pipeline that:

  1. Monitors YouTube channels you care about
  2. Extracts transcripts automatically (via scriptube.me)
  3. Processes them into searchable chunks
  4. Stores them in a vector database
  5. Answers questions grounded in real expert knowledge

This is how knowledge work changes in the AI era. YouTube has unlimited expert knowledge — millions of hours of it. The bottleneck was never access. It was processing and retrieval.

With this pipeline, you've eliminated that bottleneck.

Tools used:

  • ScripTube — Transcript extraction
  • LangChain / Anthropic SDK — LLM integration
  • Chroma / Pinecone — Vector storage
  • FastAPI — API layer

The world's experts are sharing their knowledge on YouTube every day. Now you can actually USE all of it.

Start building your knowledge engine.
