Enterprise Data Platform

Data Ingestion Pipeline

Contract work for Visual Hive, a UK-based startup. Built the core data ingestion system that processes, validates, and indexes unstructured business documents for their RAG pipeline. Paid via Wise Business under a B2B contract.

Next.js · Prisma · Qdrant · Directus · Python · FastAPI

The Problem

Visual Hive's analytics product needed to answer questions across thousands of heterogeneous business documents — spreadsheets, reports, financial data in CSV and XLSX format, and unstructured text. The existing workflow was entirely manual: someone would open each file, extract relevant data, clean formatting inconsistencies, and manually enter it into the system.

At the scale of thousands of documents, this approach was unsustainable. Errors from manual data entry were compounding, and the time-to-insight for new data sets was measured in days rather than minutes. The client needed an automated pipeline that could handle the full lifecycle: ingest, parse, validate, clean, embed, and index — then make it all queryable through their existing RAG infrastructure.

The Approach

Ingestion Platform (Next.js)

Built a Next.js application serving as the primary ingestion interface. It provides upload workflows for multi-format documents (CSV, XLSX, PDF), tracks processing status for each file, and surfaces validation results to operators. Prisma handles the relational data model — document metadata, processing states, validation flags, and audit logs.
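To make the status tracking and audit logging concrete, here is a minimal Python sketch of a per-document state machine. It is illustrative only: the production model lives in Prisma, and the status names, the transition table, and the `DocumentRecord` class are assumptions rather than the actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Status(str, Enum):
    # Hypothetical status names; the real ones are not documented here.
    UPLOADED = "uploaded"
    PARSED = "parsed"
    NEEDS_REVIEW = "needs_review"
    VALIDATED = "validated"
    INDEXED = "indexed"
    FAILED = "failed"


# Assumed transition table: each status maps to the statuses it may move to.
ALLOWED = {
    Status.UPLOADED: {Status.PARSED, Status.FAILED},
    Status.PARSED: {Status.VALIDATED, Status.NEEDS_REVIEW, Status.FAILED},
    Status.NEEDS_REVIEW: {Status.VALIDATED, Status.FAILED},
    Status.VALIDATED: {Status.INDEXED, Status.FAILED},
    Status.INDEXED: set(),
    Status.FAILED: set(),
}


@dataclass
class DocumentRecord:
    filename: str
    status: Status = Status.UPLOADED
    audit_log: list = field(default_factory=list)

    def advance(self, new_status: Status) -> None:
        """Move to new_status if the transition is allowed, and log it."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(
                f"illegal transition {self.status.value} -> {new_status.value}"
            )
        # Each audit entry records when the change happened and both states.
        self.audit_log.append((datetime.now(timezone.utc), self.status, new_status))
        self.status = new_status
```

Rejecting illegal transitions in one place keeps the status shown to operators consistent with the audit log, whatever component triggers the change.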

Data Processing Pipeline

The core extraction engine handles format-specific parsing (CSV/XLSX cell-level extraction, PDF text extraction), followed by a validation and cleanup stage that normalizes data types, detects anomalies (missing fields, type mismatches, duplicate rows), and flags issues for human review rather than silently discarding data.
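The flag-rather-than-discard idea can be sketched in a few lines of Python. This is a simplified illustration, not the production validator: the `Issue` type, the `validate_csv` function, and the choice of `amount` as a numeric column are all hypothetical.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Issue:
    row: int     # 1-based data row number (header excluded)
    kind: str    # "missing_field", "type_mismatch", or "duplicate_row"
    detail: str


def validate_csv(text, numeric_fields=("amount",)):
    """Return (rows, issues). Every row is kept; problems are only flagged."""
    reader = csv.reader(io.StringIO(text))
    header = [h.strip() for h in next(reader)]
    rows, issues, seen = [], [], set()
    for n, cells in enumerate(reader, start=1):
        cells = [c.strip() for c in cells]
        # Pad short rows so absent trailing cells are reported as missing.
        cells += [""] * (len(header) - len(cells))
        row = dict(zip(header, cells))
        for name, value in row.items():
            if value == "":
                issues.append(Issue(n, "missing_field", name))
        for name in numeric_fields:
            value = row.get(name, "")
            if value:
                try:
                    float(value)
                except ValueError:
                    issues.append(Issue(n, "type_mismatch", f"{name}={value!r}"))
        key = tuple(row.values())
        if key in seen:
            issues.append(Issue(n, "duplicate_row", "same values as an earlier row"))
        seen.add(key)
        rows.append(row)  # flagged rows are retained for human review
    return rows, issues
```

Returning the issues alongside the untouched rows lets the interface show operators exactly which cells need attention, while nothing is dropped before a person has looked at it.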

Vector Indexing & RAG Integration

Cleaned and validated data is chunked, embedded using sentence-transformers, and stored in Qdrant for high-performance vector similarity search. This integrates directly with the client's RAG pipeline, enabling semantic queries across all indexed documents without additional configuration.
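The chunking step that precedes embedding might look like the sketch below. The word-based window, the default sizes, and the `chunk_text` and `build_records` helpers are assumptions made for illustration; the embedding call and the Qdrant upsert are deliberately left out. Deterministic UUIDs are used as IDs because Qdrant accepts unsigned integers or UUIDs as point IDs.

```python
import uuid


def chunk_text(text, max_words=200, overlap=40):
    """Split text into overlapping word windows of at most max_words words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def build_records(doc_id, text, **chunk_kwargs):
    """Pair each chunk with a stable ID and metadata, ready to embed and index."""
    records = []
    for i, chunk in enumerate(chunk_text(text, **chunk_kwargs)):
        # uuid5 is deterministic, so re-ingesting a document yields the same IDs
        # and an upsert would overwrite existing points instead of duplicating them.
        point_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{doc_id}#{i}"))
        records.append(
            {"id": point_id, "text": chunk, "payload": {"doc_id": doc_id, "chunk_index": i}}
        )
    return records
```

Each record's `text` would then be passed to the embedding model, and the resulting vector stored with the same `id` and `payload`.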

CMS Integration

Connected to Directus to give the client's content team a headless CMS interface for managing document templates, validation rules, and metadata schemas — without requiring engineering involvement for each new document type.

Dual Codebase Strategy

In parallel with the production Next.js system, I built a personal SvelteKit version of the ingestion engine. This served as a rapid prototyping environment where I could test new parsing strategies and validation logic before porting them to the production codebase — significantly accelerating the iteration cycle.

The Outcome

  • Automated ingestion of multi-format documents: replaced manual data entry with a self-service upload pipeline that handles CSV, XLSX, and PDF, routing only flagged anomalies to operators for review.
  • RAG-ready vector pipeline: Qdrant-backed semantic search across all indexed documents, enabling the client's analytics product to query ingested data in real time.
  • Dual codebase strategy: a production Next.js system plus a SvelteKit prototype allowed new extraction logic to be tested quickly before production deployment.
  • B2B contract delivered: scoped, built, and invoiced via Wise Business with immediate IP assignment, serving as a working reference for the engagement model I use with all clients.