Multimodal Video Search with LlamaIndex & VideoDB: A RAG Approach to Finding and Streaming Key Moments
Retrieval-Augmented Generation (RAG) has revolutionized text-based search by enabling efficient information retrieval and synthesis. However, adapting RAG for video content presents new challenges due to its multimodal nature, which includes visual, auditory, and textual components. Unlike text-based retrieval, video search requires processing spoken dialogue, scene segmentation, and object recognition to extract meaningful insights.
Traditional methods rely on metadata or manual annotations, limiting the scope of video search. VideoDB, a serverless video database, simplifies this process by enabling direct indexing, searching, and streaming of video content. It provides random access to sequential video data, allowing for seamless retrieval of relevant video segments.
This article explores how to build a multimodal RAG pipeline using:
- VideoDB — to store, index, and retrieve spoken and visual content, and generate video streams for search results.
- LlamaIndex — to retrieve, structure, and process indexed video data, enabling semantic search on spoken content.
- OpenAI models — to synthesize insights from retrieved content, generating concise and relevant summaries.
With these tools, we can analyze and stream relevant video clips based on user queries, unlocking new possibilities for video-based search applications.
Prerequisites
For the full project implementation, including all the code examples and configurations discussed in this series, please visit my GitHub repository:
git clone https://github.com/menendes/video-search-rag.git
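If you prefer to follow along step by step instead of running the repository, install the Python packages used throughout this article. The package names below are my assumption of a minimal setup; adjust them to your environment:
pip install videodb llama-index llama-index-retrievers-videodb openai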
Understanding RAG in Video Search
What is RAG?
Retrieval-Augmented Generation (RAG) is an AI approach that combines information retrieval and text generation to produce contextually relevant responses. Instead of relying solely on a model’s internal knowledge, RAG retrieves relevant information from external sources before generating a response. This enhances accuracy and relevance, especially when dealing with large or dynamic datasets.
In the context of video search, RAG is particularly useful because it allows us to:
- Retrieve spoken content from video transcripts.
- Retrieve visual scenes matching a query.
- Use a language model to summarize and enhance results.
- Stream relevant video segments instead of returning static text.
Architecture of the Video Search RAG Pipeline
The diagram illustrates the multimodal RAG workflow for video search, highlighting how queries are processed to return text-based insights and playable video segments.
Step 1: User Query Processing
- The user submits a text query.
- The query is sent to VideoDBRetriever, which handles spoken word search and scene-based retrieval.
Step 2: Retrieval from VideoDB
- VideoDB stores indexed spoken words, scene metadata, and semantic representations of the video.
- The retriever fetches relevant nodes containing timestamps and spoken content matching the query.
Step 3: Response Synthesis with LLM
- Retrieved results are structured and processed by LlamaIndex.
- A language model (LLM) augments the retrieved text, generating a more readable response.
- The synthesized text response is returned to the user.
Step 4: Video Streaming Generation
- VideoDB extracts start and end timestamps from the retrieved content.
- A programmable video stream is generated, allowing users to watch only the relevant parts.
- Additional media overlays like subtitles, branding, and audio enhancements can be applied.
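Before walking through each step in code, here is a compact sketch of the whole flow. It is only a high-level outline, assuming the retriever, response synthesizer, and video objects are configured exactly as shown later in this article:
# High-level sketch of the four steps above (each piece is built later in the article)
query = "How do I chop onions?"

# Steps 1-2: retrieve relevant spoken/scene nodes from VideoDB
nodes = spoken_retriever.retrieve(query)

# Step 3: synthesize a readable text answer with an LLM
response = response_synthesizer.synthesize(query, nodes=nodes)

# Step 4: build a stream that plays only the matching segments
timestamps = [(node.metadata["start"], node.metadata["end"]) for node in nodes]
stream_link = video.generate_stream(timestamps)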
Project Purpose
The goal of this project is to create an efficient video search pipeline that enables users to search for relevant video segments based on spoken content and visual scenes. Instead of relying on traditional metadata-based search, our system leverages semantic retrieval and AI-powered response synthesis to deliver accurate and context-aware results.
Sample Video Search Scenario
To demonstrate the search pipeline, we use a cooking tutorial video as an example. Users can perform queries such as:
- “How do I chop onions?” → Retrieves relevant spoken instructions.
- “Show me the chopping part.” → Retrieves visual scenes where onions are chopped.
- “Summarize the cooking process.” → Uses LLM to generate a step-by-step summary.
Setting Up VideoDB and API Keys
Before we begin indexing and retrieving video content, we need to set up VideoDB, obtain an API key, create a collection, and upload videos.
1. Obtaining a VideoDB API Key
To use VideoDB, you need to register on the VideoDB website and obtain an API key.
2. Setting Up Environment Variables
To securely use the API keys in your project, set them as environment variables:
import os
os.environ["VIDEO_DB_API_KEY"] = "your-api-key-here"
os.environ["OPENAI_API_KEY"] = "your-openai-api-key-here"
3. Creating a Video Collection
A collection in VideoDB is used to organize videos for efficient indexing and retrieval.
from videodb import connect
conn = connect()
collection = conn.create_collection(name="cooking_tutorials", description="VideoDB Retrievers based on Cooking!")
print(f"Collection ID: {collection.id}")
4. Uploading Videos to VideoDB
Once the collection is created, we can upload a video file or import it from a URL:
video = collection.upload(file_path="path/to/video.mp4")
# Alternatively, for YouTube videos:
video = collection.upload(url="https://www.youtube.com/watch?....")
print(f"Video ID: {video.id}")
Once the videos are uploaded, we are ready to move on to indexing and retrieving spoken and visual content.
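As a quick sanity check before indexing, you can list what the collection now contains. The loop below assumes each video object exposes id and name attributes:
# Verify the upload by listing all videos in the collection
for v in collection.get_videos():
    print(v.id, v.name)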
Indexing and Retrieving Video Content
To enable effective search and retrieval, we need to index spoken and visual content in VideoDB. Below, we demonstrate how to:
- Index spoken words — Converting speech into searchable text.
- Index visual scenes — Identifying distinct segments for visual search.
- Retrieve relevant content — Fetching results based on user queries.
- Enhance results with LlamaIndex — Using an LLM for refined responses.
Indexing Spoken Content
Before retrieving spoken words, we must ensure that the spoken word index is created. VideoDB automatically generates transcripts, but indexing needs to be triggered:
from videodb import connect
conn = connect()
coll = conn.get_collection('your-collection-id')  # Replace with the actual collection ID
video = coll.get_video('your-video-id')  # Replace with the actual video ID. For multi-video search, use the get_videos() method.
# Index spoken words (only needed once per video)
video.index_spoken_words()
Once indexed, we can search spoken content using semantic retrieval.
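If you want to see what was actually transcribed before wiring up the retriever, you can pull the transcript back out of VideoDB. This is an optional check; I am assuming here that the spoken word index has finished building:
# Optional: inspect the auto-generated transcript after indexing
transcript_text = video.get_transcript_text()
print(transcript_text[:500])  # first 500 characters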
Indexing Visual Content (Scene Detection)
To enable visual content search, we need to index scenes in the video. The index_scenes method in VideoDB extracts key frames based on different scene segmentation techniques and can generate textual descriptions of them using a vision model. The function supports multiple parameters, but we focus on the most relevant ones:
1. extraction_type: defines how scenes are segmented. Since a video is essentially a sequence of images, different scene extraction methods help select the most meaningful frames.
- SceneExtractionType.shot_based identifies scene boundaries based on camera cuts.
- SceneExtractionType.time_based segments the video at fixed time intervals.
2. extraction_config: configures the scene extraction settings.
- Example: {"time": 5, "frame_count": 3} segments a new scene every 5 seconds ("time") and extracts three representative frames per scene ("frame_count").
3. prompt: lets the vision model generate detailed scene descriptions.
- Example: "Describe the scene in detail" produces a rich textual summary.
- Custom prompts can refine descriptions, e.g., "Identify if a person is running and tag it as running_detected."
4. callback_url: if provided, a notification is sent when scene indexing is completed.
from videodb import SceneExtractionType

index_id = video.index_scenes(
    extraction_type=SceneExtractionType.shot_based,
    extraction_config={"frame_count": 3},
    prompt="Describe the cooking process in detail, mentioning ingredients, actions, and utensils used.",
)
scene_index = video.get_scene_index(index_id)
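Before searching over it, it can be useful to inspect what the vision model produced. The loop below is a small sketch that assumes get_scene_index returns a list of dictionaries with start, end, and description fields:
# Inspect the generated scene descriptions (assumed fields: start, end, description)
for scene in scene_index:
    print(scene["start"], scene["end"], scene["description"][:80])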
Once the scene index is created, we can retrieve specific cooking steps using semantic search.
Retrieving Spoken Content
Now that we have indexed the spoken content of our cooking tutorial video, we can retrieve relevant information using semantic search. This allows users to ask natural language queries, such as:
- “How do I chop onions?” → Retrieves spoken instructions related to chopping onions.
- “What are the ingredients for the soup?” → Retrieves the section where ingredients are listed.
To search for spoken words, we use VideoDBRetriever and specify a semantic search on the spoken_word index.
from videodb import SearchType, IndexType
from llama_index.retrievers.videodb import VideoDBRetriever

spoken_retriever = VideoDBRetriever(
    collection=coll.id,
    video=video.id,
    search_type=SearchType.semantic,  # Enables semantic search
    index_type=IndexType.spoken_word,  # Searches within spoken words
    score_threshold=0.1,  # Minimum relevance threshold
)
Now, we can query the indexed spoken words:
spoken_query = "How do I chop onions?"
nodes_spoken_index = spoken_retriever.retrieve(spoken_query)
This retrieves the most relevant timestamps where the phrase or a semantically similar phrase is spoken.
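Each retrieved node carries the matched text together with metadata, including the start and end timestamps we will later use for streaming. A quick way to eyeball the results:
# Print each match with its relevance score, timestamps, and a snippet of the text
for node in nodes_spoken_index:
    print(node.score, node.metadata["start"], node.metadata["end"])
    print(node.text[:120])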
Generating a Summarized Response
Instead of displaying raw transcript text, we can use LlamaIndex’s response synthesizer to generate a more structured response:
from llama_index.core import get_response_synthesizer
response_synthesizer = get_response_synthesizer()
response = response_synthesizer.synthesize(spoken_query, nodes=nodes_spoken_index)
print(response)
The output looks like this: “To chop onions, start by cutting off the stem end of the onion, then cut the onion in half from root to stem. Peel off the outer skin and place the flat side of the onion on the cutting board. Make vertical cuts towards the root end without cutting all the way through, then make horizontal cuts parallel to the cutting board. Finally, make downward cuts to dice the onion into small pieces.”
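get_response_synthesizer uses a compact response mode by default; for longer retrieved passages you can ask it to summarize hierarchically instead. A small variation, assuming the same OpenAI LLM configured earlier:
from llama_index.core import get_response_synthesizer
from llama_index.core.response_synthesizers import ResponseMode

# Tree-summarize builds the answer bottom-up, which can help with longer transcripts
summarizer = get_response_synthesizer(response_mode=ResponseMode.TREE_SUMMARIZE)
summary = summarizer.synthesize("Summarize the cooking process.", nodes=nodes_spoken_index)
print(summary)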
Retrieving Visual Content
In addition to spoken content search, we can perform visual content search using scene-based retrieval. This allows users to search for specific visual moments in the video, such as:
- “Show me the part where onions are being chopped.”
- “Find the scene where the chef adds spices.”
To search for specific visual moments, we use VideoDBRetriever again, this time specifying scene-based retrieval.
scene_retriever = VideoDBRetriever(
    collection=coll.id,
    video=video.id,
    search_type=SearchType.semantic,  # Enables semantic search for scenes
    index_type=IndexType.scene,  # Searches within the visual scene index
    scene_index_id=index_id,  # Uses the indexed scene data
    score_threshold=0.1,  # Minimum relevance threshold
)
Now, we can query the indexed scene-based data:
scene_query = "Show me the part where onions are being chopped."
nodes_scene_index = scene_retriever.retrieve(scene_query)
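If you also want a textual description of what was found, the scene nodes can be passed through the same response synthesizer we used for spoken content:
# Reuse the response synthesizer to describe the retrieved scenes in text form
scene_response = response_synthesizer.synthesize(scene_query, nodes=nodes_scene_index)
print(scene_response)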
Streaming Retrieved Video Segments (Spoken & Visual Search)
Since we now have retrieved results for both spoken and visual content, we need to extract the timestamps and generate a streamable video segment for each.
We collect timestamps from both spoken and visual search results:
spoken_results = [
    (node.metadata["start"], node.metadata["end"])
    for node in nodes_spoken_index
]

scene_results = [
    (node.metadata["start"], node.metadata["end"])
    for node in nodes_scene_index
]
# Merge timestamps from both retrievals
all_results = spoken_results + scene_results
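Since spoken and scene matches can overlap, it is worth sorting and merging the timestamp ranges before generating the stream so the same moment is not played twice. This is an optional refinement, not something VideoDB requires:
# Optional: sort and merge overlapping (start, end) ranges before streaming
def merge_ranges(ranges):
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Extend the previous range instead of adding an overlapping segment
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

all_results = merge_ranges(all_results)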
Now, we create a streaming link that plays only the relevant segments:
stream_link = video.generate_stream(all_results)
Finally, we stream the relevant parts of the video:
from videodb import play_stream
play_stream(stream_link)
This ensures that users can search based on both spoken and visual content and instantly watch only the relevant parts of the video, as shown below.
Now, we have fully integrated multimodal search using spoken and visual content retrieval :)
For further related articles, don't forget to follow me on Medium and LinkedIn :) If you have any questions, feel free to contact me ;)