AI & Machine Learning · 2025

AI Document Crawler

A retrieval-augmented pipeline that turns sprawling enterprise PDF libraries into a knowledge base you can actually ask questions of.

Overview

At Concept Plus I built an end-to-end document-intelligence pipeline. A crawler discovers and extracts documents, parses their structure, and splits the text into chunks for embedding, normalizing messy, inconsistent PDF layouts into clean, searchable content along the way.
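As a rough illustration of the chunking step, here is a minimal sketch of a fixed-size word window with overlap. The function name chunk_text and the parameters chunk_words and overlap_words are hypothetical, and the sliding-window strategy is an assumption for illustration; the actual pipeline may split on document structure instead.

```python
def chunk_text(text: str, chunk_words: int = 200, overlap_words: int = 40) -> list[str]:
    """Split text into overlapping word windows (illustrative strategy, not the project's)."""
    if not 0 <= overlap_words < chunk_words:
        raise ValueError("overlap_words must be in [0, chunk_words)")

    # str.split() with no arguments collapses runs of whitespace and newlines,
    # which also normalizes the ragged spacing typical of PDF extraction.
    words = text.split()
    step = chunk_words - overlap_words

    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        # Stop once this window has reached the end of the text,
        # so no trailing chunk consists only of overlap.
        if start + chunk_words >= len(words):
            break
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.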

Each chunk is embedded and stored in an Oracle 23ai vector database. A Flask chat interface sits on top, running retrieval-augmented generation with similarity thresholds and confidence-aware fallbacks, so the system either answers from source material or says plainly that it can't.
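The threshold-and-fallback behavior can be sketched in a few lines. This is an in-memory stand-in, not the Oracle 23ai query path: the index is a plain list of (text, vector) pairs, the 0.75 threshold is an arbitrary example value, and retrieve, answer, FALLBACK, and the generate callable (a placeholder for the LLM call) are all hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence

FALLBACK = "I couldn't find an answer to that in the indexed documents."


@dataclass
class Hit:
    text: str
    score: float


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, threshold: float = 0.75, top_k: int = 3) -> list[Hit]:
    """Return the top_k chunks whose similarity to the query meets the threshold."""
    hits = [Hit(text, cosine_similarity(query_vec, vec)) for text, vec in index]
    hits = [h for h in hits if h.score >= threshold]
    hits.sort(key=lambda h: h.score, reverse=True)
    return hits[:top_k]


def answer(question: str, query_vec, index, generate: Callable[[str], str],
           threshold: float = 0.75) -> str:
    """Answer from retrieved context, or fall back when nothing is similar enough."""
    hits = retrieve(query_vec, index, threshold=threshold)
    if not hits:
        # Low-confidence path: skip the LLM entirely rather than let it guess.
        return FALLBACK
    context = "\n\n".join(h.text for h in hits)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```

Returning the fallback before the model is ever called is one simple way to curb hallucination: if no chunk clears the threshold, there is no source material to ground an answer in.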

Highlights

Crawled and parsed enterprise PDFs end-to-end with crawl4ai, turning varied layouts into clean, chunkable text.
Embedded and indexed content in Oracle 23ai's native vector store for low-latency semantic search.
Wired retrieval to the LLM through RAG with similarity thresholds and low-confidence fallbacks to curb hallucination.
Shipped a Flask chat UI so non-technical users could query the corpus in plain language.