From Video to Chat: How I Built a YouTube-Powered RAG Chatbot

Introduction

As I’ve been diving deeper into GenAI tooling, I wanted to go beyond tutorials and actually build something. Over the weekend, I set out to create a chatbot that could answer questions about any YouTube video, even if it was in another language. The idea was simple: fetch the transcript, translate it if needed, and use retrieval-augmented generation (RAG) to let users ask questions about the content.

The project itself is small, but it helped me understand the core building blocks of a RAG pipeline and how tools like LangChain, OpenAI, and Streamlit fit together into a functional app.

Problem Statement

YouTube videos are packed with useful information, but finding specific answers often means scrubbing through timestamps and captions. I wanted to build a simple app that:

  • Lets users paste a YouTube link
  • Fetches the transcript and translates it to English if needed
  • Breaks it into chunks and indexes it in a vector store
  • Allows the user to ask questions about the video content

Architecture Overview

Here’s a high-level look at how the system works:

  1. User inputs a YouTube URL via Streamlit UI
  2. Transcript is fetched using the youtube_transcript_api
  3. Language is detected and translated to English (if necessary) using OpenAI’s LLM
  4. Transcript is split into chunks using RecursiveCharacterTextSplitter
  5. Embeddings are generated via OpenAIEmbeddings
  6. FAISS is used to store the vector embeddings
  7. A LangChain pipeline is built: a parallel chain gathers the prompt inputs (retrieved context plus the user’s question), which the main chain then passes through the prompt, model, and output parser to produce the final answer
  8. User can chat with the video via Streamlit
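One small piece of glue between steps 1 and 2: youtube_transcript_api wants the bare video ID, not the full URL the user pastes. A minimal sketch of that extraction, using only the standard library (the `extract_video_id` helper is my own illustration, not code from the app):

```python
# Pull the 11-character video ID out of the common YouTube URL shapes.
# extract_video_id is a hypothetical helper, not part of the original app.
from urllib.parse import urlparse, parse_qs

def extract_video_id(url: str) -> str:
    """Return the video ID from a pasted YouTube URL."""
    parsed = urlparse(url)
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/")                   # short links: youtu.be/<id>
    return parse_qs(parsed.query).get("v", [""])[0]      # long links: watch?v=<id>
```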

Tools I Used and Why

As a GenAI beginner, I chose tools that were both powerful and beginner-friendly, with lots of community resources and good documentation:

  • LangChain: It abstracts away much of the boilerplate needed to build RAG pipelines. Its modular design allowed me to plug together retrievers, prompts, and models with minimal code.
  • OpenAI (Chat & Embedding APIs): OpenAI’s models are reliable, high-quality, and easy to integrate. I used gpt-4o-mini for translation and answering, and text-embedding-3-small for creating vector representations of transcript chunks.
  • youtube_transcript_api: This Python library made it really simple to fetch transcripts from YouTube without needing a YouTube Data API key, which helped keep setup minimal.
  • langdetect: I needed a way to auto-detect the transcript language. This library was lightweight and perfect for the job.
  • FAISS: Meta’s widely used library for vector similarity search. LangChain integrates with it directly, so I didn’t need to manage any infrastructure.
  • Streamlit: For a frontend, Streamlit was ideal. As a Python-based UI framework, it let me build a functional interface quickly without needing to dive into HTML/CSS or JavaScript.
  • dotenv: For environment management, especially to store and access API keys securely.

These tools together gave me everything I needed to go from an idea to a working prototype over the weekend.

Step-by-Step Breakdown

1. Fetching and Translating the Transcript

I used youtube_transcript_api to fetch captions and langdetect to determine the language. If the transcript wasn’t in English, I used OpenAI’s LLM to translate the text.

def get_translated_transcript(video_id, model):
    # fetch transcript and translate if needed
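The post only shows the signature, so here is a sketch of what the body could look like, assuming the classic `YouTubeTranscriptApi.get_transcript` API and a LangChain chat model; the prompt wording is my own:

```python
# Sketch of the fetch-and-translate step; the exact prompt and error handling
# in the real app may differ.

def join_segments(segments) -> str:
    """Join transcript segments (dicts with 'text', 'start', 'duration') into one string."""
    return " ".join(seg["text"] for seg in segments)

def get_translated_transcript(video_id, model):
    # Third-party imports kept local so join_segments stays dependency-free.
    from youtube_transcript_api import YouTubeTranscriptApi
    from langdetect import detect

    segments = YouTubeTranscriptApi.get_transcript(video_id)
    text = join_segments(segments)
    if detect(text) != "en":
        # model is a LangChain chat model such as ChatOpenAI(model="gpt-4o-mini")
        text = model.invoke(f"Translate this transcript into English:\n\n{text}").content
    return text
```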

2. Splitting and Indexing the Transcript

LangChain’s RecursiveCharacterTextSplitter helped split long transcripts into smaller chunks. These were converted into embeddings using OpenAI’s text-embedding-3-small model and stored in a FAISS index.

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.create_documents([transcript_text])
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = FAISS.from_documents(chunks, embeddings)

3. Building the LangChain Chat Pipeline

I created a prompt template that restricts the LLM to answering only from the transcript context. The chain combines:

  • Retriever
  • Prompt
  • LLM (OpenAI)
  • Output parser

def format_docs(docs):
    # concatenate the retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

parallel_chain = RunnableParallel({
  'context': retriever | RunnableLambda(format_docs),
  'question': RunnablePassthrough()
})
main_chain = parallel_chain | prompt | model | parser
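The exact prompt isn’t shown in the post; in the app it would be a LangChain PromptTemplate, but the context-restriction idea can be illustrated with a plain string template (the wording below is my own, not the app’s):

```python
# Hypothetical wording for a context-restricted prompt; in the app this would
# be a LangChain PromptTemplate with {context} and {question} input variables.
GROUNDED_PROMPT = (
    "Answer the question using ONLY the transcript context below. "
    "If the context does not contain the answer, say you don't know.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def build_prompt(context: str, question: str) -> str:
    """Fill the template the way the chain does before the model sees it."""
    return GROUNDED_PROMPT.format(context=context, question=question)
```

At query time the parallel chain fills `{context}` with the formatted retrieved chunks and `{question}` with the user’s input, so the model never sees a question without its supporting transcript excerpts.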

4. Streamlit Frontend

I wanted to focus on the chatbot’s actual logic, so the user-facing part is a simple Streamlit app. Users paste the YouTube URL, click a button, and once the transcript is processed, they can ask questions. The chain is stored in st.session_state so it persists across interactions.
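Streamlit reruns the whole script on every interaction, and st.session_state is a dict-like store that survives those reruns. The build-once pattern can therefore be sketched with a plain mapping (`get_or_build` and `build_chain` are my names, not the app’s):

```python
# st.session_state behaves like a dictionary that persists across Streamlit
# reruns. build_chain is hypothetical; it stands in for the pipeline setup above.
def get_or_build(state, key, factory):
    """Build an expensive object once, then reuse it on every later rerun."""
    if key not in state:
        state[key] = factory()
    return state[key]

# In the app this would be:
# chain = get_or_build(st.session_state, "chain", lambda: build_chain(url))
```

Without this guard, every question the user asks would re-fetch the transcript and rebuild the FAISS index from scratch.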

Demo & Observations

I recorded a short demo of the chatbot answering questions about the video “The Real Reason Why You Have Allergies” by Kurzgesagt. It correctly summarized the video, identified the core topic, and responded to follow-up questions.

While the app is small, building it taught me a lot about the importance of prompt design, how retrievers affect context relevance, and how session management works in Streamlit.

What I Learned

  • How to build a full RAG pipeline using LangChain
  • Integrating transcript APIs and language detection
  • Using LangChain’s composable chains (like RunnableParallel)
  • Managing state in Streamlit and chaining components together

Conclusion

This project isn’t overly complex, but it’s a complete, working application that helped me understand the RAG pipeline in practice. If you’re new to LangChain or GenAI workflows, I highly recommend building something similar.

You can find the code repo for this mini project here – YoutubeChatbot

Would love to hear your thoughts or ideas for what I should build next!

Until next time,

Adiba 😊
