Data Infrastructure & RAG

Enterprise Data Ingestion: Eliminating AI Hallucinations for a UK Startup

Architecting an automated, multi-stage RAG (Retrieval-Augmented Generation) ingestion pipeline to process unstructured legal and financial documents safely across multi-tenant boundaries.

Core Technologies: Next.js, Python, FastAPI, Qdrant, Prisma, Directus
Role: Independent AI & Data Engineer

The Chaos of Unstructured Enterprise Data

A fast-growing UK startup (Visual Hive) recognized that their enterprise clients were drowning in unstructured PDFs, financial statements, and complex compliance reports. To build an intelligent assistant capable of querying these documents, they initially experimented with standard, off-the-shelf RAG wrappers (like basic LangChain document loaders).

The result was disastrous for a B2B setting. Naive text splitters cut complex legal tables in half, separating figures from the rows and headers that gave them meaning. The AI models consistently hallucinated critical financial figures because they retrieved the wrong chunks of text. In enterprise software, an AI hallucinating a liability clause or misreading a multi-column financial statement is a dealbreaker.

Architecting a Zero-Hallucination Pipeline

They needed an engineer who could move past simple API calls and build a robust, deterministic data engineering infrastructure. I designed a multi-stage, asynchronous pipeline:

1. The Control Plane (Next.js & Directus)

To manage ingestion jobs at scale, I built an administrative control plane using Next.js on the frontend, backed by a PostgreSQL database via Prisma ORM and managed through the Directus headless CMS. This allowed operations staff to securely upload batches of documents and monitor OCR status, vectorization progress, and ingestion errors in real time. Uploaded documents landed in an S3-compatible bucket, which triggered the heavy-lifting backend.
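The job tracking described above can be pictured as a small state machine per document. The sketch below is illustrative only: the stage names, the `IngestionJob` class, and its fields are assumptions made for this example, not the production schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Stage(str, Enum):
    # Hypothetical stage names, chosen to mirror the statuses staff monitored.
    UPLOADED = "uploaded"
    OCR = "ocr"
    CHUNKING = "chunking"
    EMBEDDING = "embedding"
    INDEXED = "indexed"
    FAILED = "failed"


# Allowed forward transitions. INDEXED and FAILED are terminal.
_NEXT = {
    Stage.UPLOADED: Stage.OCR,
    Stage.OCR: Stage.CHUNKING,
    Stage.CHUNKING: Stage.EMBEDDING,
    Stage.EMBEDDING: Stage.INDEXED,
}


@dataclass
class IngestionJob:
    document_key: str  # object key in the S3-compatible bucket
    tenant_id: str
    stage: Stage = Stage.UPLOADED
    error: Optional[str] = None
    history: List[Stage] = field(default_factory=lambda: [Stage.UPLOADED])

    def advance(self) -> Stage:
        """Move to the next pipeline stage, refusing to leave a terminal one."""
        if self.stage not in _NEXT:
            raise ValueError(f"cannot advance from terminal stage {self.stage.value}")
        self.stage = _NEXT[self.stage]
        self.history.append(self.stage)
        return self.stage

    def fail(self, message: str) -> None:
        """Record an ingestion error so it can be surfaced in the dashboard."""
        if self.stage not in _NEXT:
            raise ValueError(f"cannot fail from terminal stage {self.stage.value}")
        self.stage = Stage.FAILED
        self.error = message
        self.history.append(Stage.FAILED)
```

Keeping the transitions in one table means a worker cannot skip a stage silently, and the `history` list gives the control plane something concrete to display for each document.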

2. Intelligent Parsing & Semantic Chunking (Python + FastAPI)

The backend was a highly concurrent Python worker pool exposed via FastAPI. Instead of blindly splitting text by a fixed character count (which destroys table structures), I implemented layout-aware semantic chunking.

The Python workers used advanced OCR and computer vision libraries to detect page layouts, recognizing headers, nested lists, and multi-column tables. Text was grouped logically so that a single "chunk" contained a complete, coherent thought or a full financial table, ensuring the embedding models captured the true semantic meaning.
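As a rough illustration of the grouping step, the sketch below assumes layout detection has already produced a list of typed blocks. The `Block` type, the `kind` labels, and the `max_chars` budget are hypothetical; the real pipeline's parser output and limits are not described here. The key idea is that a block is never split, so a table always stays in one chunk.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    kind: str  # assumed labels: "header", "paragraph", "list", "table"
    text: str


def chunk_blocks(blocks: List[Block], max_chars: int = 1200) -> List[str]:
    """Group layout blocks into chunks without ever splitting a block."""
    chunks: List[str] = []
    header: Optional[str] = None  # most recent section header
    body: List[str] = []          # block texts collected for the current chunk

    def flush() -> None:
        # Emit the current chunk, prefixed with its section header for context.
        if body:
            parts = ([header] if header else []) + body
            chunks.append("\n".join(parts))
            body.clear()

    for block in blocks:
        if block.kind == "header":
            # A header opens a new section, so close the previous chunk first.
            flush()
            header = block.text
            continue
        # Start a new chunk if this block would push the current one over budget.
        if body and sum(len(t) for t in body) + len(block.text) > max_chars:
            flush()
        # Blocks are atomic: an oversized table is kept whole rather than cut.
        body.append(block.text)

    flush()
    return chunks
```

Because each chunk repeats its section header, a table that spills into its own chunk still carries the heading it belongs to, which tends to help both embedding quality and later citation.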

3. Multi-Tenant Vector Search (Qdrant)

The context chunks were passed through specialized legal/financial embedding models and indexed into Qdrant. In B2B SaaS, data isolation is critical. I engineered a strict multi-tenant namespace strategy using Qdrant’s payload metadata filtering: every stored vector carried its tenant identifier, and every search was scoped by a tenant filter. This ensured that Client A's vectorized data could not leak into the retrieval context window of Client B's AI agent.
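One way to enforce that scoping is to build every search request through a single helper that always injects the tenant condition. The sketch below produces a request body in the shape of Qdrant's filter JSON (`must`, `key`, `match`); the payload field name `tenant_id` and the helper `tenant_scoped_search` are assumptions for illustration, not the project's actual code.

```python
from typing import Any, Dict, List, Optional

TENANT_KEY = "tenant_id"  # assumed payload field name


def tenant_scoped_search(
    tenant_id: str,
    query_vector: List[float],
    extra_must: Optional[List[Dict[str, Any]]] = None,
    limit: int = 5,
) -> Dict[str, Any]:
    """Build a search body whose filter always pins the caller's tenant."""
    if not tenant_id:
        raise ValueError("tenant_id is required for every search")

    # Discard any caller-supplied condition on the tenant key so the
    # tenant scope can only come from the authenticated tenant_id argument.
    conditions = [c for c in (extra_must or []) if c.get("key") != TENANT_KEY]
    conditions.insert(0, {"key": TENANT_KEY, "match": {"value": tenant_id}})

    return {
        "vector": query_vector,
        "filter": {"must": conditions},
        "limit": limit,
        "with_payload": True,
    }
```

Routing all retrieval through one function like this makes the isolation rule easy to audit: there is a single place where a search can be constructed, and it cannot be called without a tenant.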

Advanced Retrieval: Query Rewriting

To further reduce hallucinations during user querying, I implemented a Query Rewriting layer. When a user asked a vague question (e.g., "What are the risks?"), a fast, lightweight LLM step analyzed their conversation history and rewrote the query into an optimized semantic search string (e.g., "Specific liability risks mentioned in the 2024 Q3 compliance audit.") before querying Qdrant. This boosted the retrieval hit rate by over 40%.
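A minimal sketch of such a rewriting step is shown below. The LLM is passed in as a plain callable so the example stays independent of any provider; the prompt wording, the `max_turns` window, and the function names are assumptions, not the prompts used in the project.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (role, text), e.g. ("user", "Show me the Q3 audit")


def build_rewrite_prompt(history: List[Turn], question: str, max_turns: int = 4) -> str:
    """Assemble a prompt from the most recent turns and the vague question."""
    recent = history[-max_turns:] if max_turns > 0 else []
    conversation = "\n".join(f"{role}: {text}" for role, text in recent)
    return (
        "Rewrite the final question as one self-contained search query. "
        "Resolve vague references using the conversation. "
        "Return only the query.\n\n"
        f"Conversation:\n{conversation}\n\n"
        f"Final question: {question}\n"
        "Search query:"
    )


def rewrite_query(history: List[Turn], question: str, llm: Callable[[str], str]) -> str:
    """Return the rewritten query, or the original question as a fallback."""
    rewritten = llm(build_rewrite_prompt(history, question)).strip()
    # If the model returns nothing usable, search with the user's own words.
    return rewritten or question
```

The rewritten string, rather than the raw user message, is what gets embedded and sent to the vector store. The fallback keeps retrieval working even when the rewriting model returns an empty response.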

Business Impact

By treating RAG as a rigorous data engineering problem rather than a simple prompt-engineering trick, the startup deployed a secure, multi-tenant AI search engine. Automation of document processing increased by 90%, and the system achieved zero hallucinations on ground-truth retrieval, allowing the founding team to confidently secure enterprise pilot contracts.

Are your AI agents returning unreliable data?

Let's fix your vector retrieval pipeline and build a RAG architecture that enterprise clients trust.

Schedule an architecture review: mythonggg@gmail.com