AIC Multimodal Search Engine: Unifying Video, Audio & Image Retrieval
When text-based search engines fail: Architecting a sub-second retrieval pipeline blending vector embeddings (Milvus) and lexical search (Elasticsearch) for massive unstructured media archives.
The Chaos of Archiving Media Content
A significant problem facing media agencies, production houses, and large content publishers is the inability to locate specific moments within petabytes of video footage. Tagging video and audio files manually is labor-intensive and prone to human error. Standard relational databases (SQL) or text-based search engines (Solr) are fundamentally incapable of searching for a "red sports car driving in the rain" unless a human has explicitly typed those words into the video's metadata.
The system required a pipeline capable of accepting natural language queries and returning the exact timestamp in a video where the described moment occurs, in under a second and without relying on human data entry.
Architecting a Hybrid Retrieval Engine
To solve this, I designed the AIC Multimodal Search Engine. A purely vector-based semantic search struggles with exact keyword matching (e.g., specific names, serial numbers), while a purely lexical search fails at semantic understanding. The solution required a "hybrid" retrieval architecture.
1. The Asynchronous Media Ingestion Pipeline (Python + Docker)
The heaviest computational burden in this architecture is the ingestion process. Processing an uploaded 2-hour 4K video must not block or crash the primary API gateway, so I moved that work into asynchronous background workers written in Python.
When a media file hit the data/ volume mounted in Docker, the processing script (process.py) engaged. It employed FFmpeg to extract the audio track and segment the video into granular temporal chunks (e.g., 5-second intervals). Concurrently, OpenCV sampled keyframes from the video stream to avoid processing redundant, identical frames.
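The chunking and keyframe-sampling steps above can be sketched as follows. This is a minimal illustration, not the code in process.py: the function names, the 16 kHz mono audio settings, and the difference threshold are all assumptions, and frames are represented as flat lists of pixel intensities rather than OpenCV arrays so the sketch runs without extra dependencies.

```python
from typing import List, Sequence, Tuple


def chunk_boundaries(duration_s: float, chunk_s: float = 5.0) -> List[Tuple[float, float]]:
    """Return (start, end) pairs that cover the media in chunk_s-second steps."""
    if duration_s <= 0 or chunk_s <= 0:
        return []
    bounds = []
    index = 0
    while index * chunk_s < duration_s:
        start = index * chunk_s
        # The final chunk is clipped to the end of the file.
        bounds.append((start, min(start + chunk_s, duration_s)))
        index += 1
    return bounds


def audio_extract_cmd(src: str, dst_wav: str) -> List[str]:
    """Build FFmpeg arguments that drop the video stream (-vn) and write
    16 kHz mono PCM audio. The sample rate is an assumed ASR-friendly value."""
    return ["ffmpeg", "-y", "-i", src, "-vn",
            "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", dst_wav]


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Average absolute per-pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_keyframes(frames: Sequence[Sequence[float]], threshold: float = 10.0) -> List[int]:
    """Keep the index of each frame that differs enough from the last kept one."""
    kept: List[int] = []
    last = None
    for i, frame in enumerate(frames):
        if last is None or mean_abs_diff(frame, last) > threshold:
            kept.append(i)
            last = frame
    return kept


if __name__ == "__main__":
    print(chunk_boundaries(12.0))
    print(select_keyframes([[0, 0], [1, 1], [50, 50], [52, 52]]))
```

Comparing each frame against the last kept frame, rather than the immediately preceding one, prevents a slow pan from slipping through as a long run of "almost identical" frames.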
2. Vectorizing Unstructured Reality (Milvus)
These sampled frames were pushed through a Vision-Language Model (VLM, like CLIP or an equivalent multimodal model) to generate semantic vector embeddings. Simultaneously, the audio track was passed through an Automatic Speech Recognition (ASR) model (Whisper) to generate text transcripts.
The resulting 512-dimensional vectors (visual representations) and transcripts were pushed to a highly scalable, containerized instance of Milvus, running inside the docker-compose cluster.
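To make the role of the vector store concrete, here is a conceptual sketch of the lookup it performs. It is a brute-force cosine-similarity search in plain Python, not the Milvus API; Milvus would normally serve this with an approximate nearest-neighbor index. The FrameRecord fields and the 3-dimensional toy vectors are assumptions chosen for readability (the real embeddings are 512-dimensional).

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class FrameRecord:
    """Hypothetical record shape: one embedded keyframe plus its location."""
    video_id: str
    timestamp_s: float
    embedding: List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], records: Sequence[FrameRecord],
          k: int = 3) -> List[Tuple[float, FrameRecord]]:
    """Score every record against the query and return the k best matches."""
    scored = [(cosine(query, r.embedding), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    records = [
        FrameRecord("vid_a", 5.0, [1.0, 0.0, 0.0]),
        FrameRecord("vid_b", 40.0, [0.0, 1.0, 0.0]),
        FrameRecord("vid_c", 125.0, [0.9, 0.1, 0.0]),
    ]
    for score, rec in top_k([1.0, 0.0, 0.0], records, k=2):
        print(rec.video_id, rec.timestamp_s, round(score, 3))
```

Storing the video ID and timestamp next to each embedding is what lets a match be returned as a playable position rather than just a file name.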
3. Hybrid Retrieval & Reciprocal Rank Fusion
I integrated Elasticsearch to handle the lexical side (metadata, exact keywords, OCR text extracted from the video frames). When a user fired a query to the FastAPI server (main.py), the backend embedded the user's text query. It then executed a semantic search against Milvus to find conceptually similar video frames, while simultaneously running a BM25 keyword search against Elasticsearch.
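The "simultaneous" fan-out described above can be sketched with asyncio. The two search functions here are hypothetical stubs standing in for the Milvus and Elasticsearch clients (they only sleep and return fixed IDs), and the clip IDs are made up; the point is the concurrency pattern, not the client calls in main.py.

```python
import asyncio
from typing import List, Tuple


async def search_vectors(query: str, limit: int) -> List[str]:
    """Hypothetical stand-in for the semantic search against Milvus."""
    await asyncio.sleep(0.01)  # simulate network latency
    return ["clip_17", "clip_03", "clip_42"][:limit]


async def search_keywords(query: str, limit: int) -> List[str]:
    """Hypothetical stand-in for the BM25 search against Elasticsearch."""
    await asyncio.sleep(0.01)  # simulate network latency
    return ["clip_03", "clip_99"][:limit]


async def hybrid_candidates(query: str, limit: int = 10) -> Tuple[List[str], List[str]]:
    """Run both searches concurrently and return their ranked ID lists."""
    vector_hits, keyword_hits = await asyncio.gather(
        search_vectors(query, limit),
        search_keywords(query, limit),
    )
    return vector_hits, keyword_hits


if __name__ == "__main__":
    print(asyncio.run(hybrid_candidates("red sports car in the rain", limit=2)))
```

Because both requests are awaited together, total latency is roughly that of the slower backend rather than the sum of the two.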
The results were unified and ranked using Reciprocal Rank Fusion (RRF), ensuring that a search for a specific entity name didn't get lost in the vector space math.
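For reference, RRF scores each document as the sum of 1 / (k + rank) over every result list it appears in. The sketch below is a minimal version under assumed inputs: k = 60 is a commonly used default rather than a value confirmed for this system, and the clip IDs are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]],
                           k: int = 60) -> List[Tuple[str, float]]:
    """Fuse several ranked ID lists; ranks are 1-based, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    vector_hits = ["clip_a", "clip_b", "clip_c"]   # semantic ranking
    keyword_hits = ["clip_b", "clip_d", "clip_a"]  # BM25 ranking
    for doc_id, score in reciprocal_rank_fusion([vector_hits, keyword_hits]):
        print(doc_id, round(score, 5))
```

RRF uses only rank positions, so cosine similarities and BM25 scores never need to be normalized onto a common scale. A clip that ranks well in both lists (clip_b here) rises above one that tops a single list.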
Business Impact
By automating the indexing of unstructured media through computer vision and NLP models, the system eliminated the need for human tagging. The hybrid architecture (Milvus + Elasticsearch) achieved sub-second query response times across the entire media dataset. Users could instantly locate specific 5-second clips buried deep within thousands of hours of unlabelled historical footage.
Struggling with unsearchable, unstructured media data?
I build high-performance, hybrid retrieval pipelines for video, audio, and images using production-grade open-source tools.
Let's discuss your search bottleneck: mythonggg@gmail.com