AI Document Crawler
A retrieval-augmented pipeline that turns sprawling enterprise PDF libraries into a knowledge base you can query in plain language.
Overview
At Concept Plus I built an end-to-end document-intelligence pipeline. A crawler discovers and extracts documents, parses their structure, normalizes messy, inconsistent PDF layouts into clean text, and splits that text into chunks ready for embedding.
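The normalize-then-chunk step can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function names `normalize` and `chunk_text`, the character-based sizing, and the default `max_chars` and `overlap` values are all assumptions made for the example.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace runs (line breaks, tabs, padding) left by PDF extraction."""
    return re.sub(r"\s+", " ", text).strip()


def chunk_text(text: str, max_chars: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks of at most max_chars characters.

    Sizes are hypothetical defaults; a real pipeline might size chunks by
    tokens or by document structure instead.
    """
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be non-negative and smaller than max_chars")

    clean = normalize(text)
    chunks: list[str] = []
    start = 0
    while start < len(clean):
        end = min(start + max_chars, len(clean))
        if end < len(clean):
            # Prefer to break on a space so words are not split in half,
            # but only if the chunk still advances past the overlap region.
            cut = clean.rfind(" ", start, end)
            if cut > start + overlap:
                end = cut
        chunks.append(clean[start:end].strip())
        if end == len(clean):
            break
        # Step back by `overlap` so neighbouring chunks share some context.
        start = end - overlap
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which tends to help retrieval at the cost of some duplicated storage.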
Each chunk is embedded and stored in an Oracle 23ai vector database. A Flask chat interface sits on top, running retrieval-augmented generation with similarity thresholds and confidence-aware fallbacks, so the system either answers from source material or says plainly that it can't.
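The threshold-and-fallback idea can be illustrated with a small sketch. It is an assumption-laden stand-in rather than the project's implementation: an in-memory list replaces the Oracle 23ai vector store, and the names `Hit`, `retrieve`, `build_prompt_or_fallback`, the `min_score` value of 0.75, and the fallback wording are all hypothetical.

```python
import math
from dataclasses import dataclass

# Hypothetical fallback message returned when nothing relevant is retrieved.
FALLBACK = "I could not find an answer to that in the indexed documents."


@dataclass
class Hit:
    text: str
    score: float


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, index, top_k: int = 3, min_score: float = 0.75) -> list[Hit]:
    """Return up to top_k chunks whose similarity to the query is at least min_score.

    `index` is a list of (chunk_text, embedding) pairs standing in for the
    vector database; the threshold value is an assumed example, not a tuned one.
    """
    hits = [Hit(text, cosine(query_vec, vec)) for text, vec in index]
    hits = [h for h in hits if h.score >= min_score]
    hits.sort(key=lambda h: h.score, reverse=True)
    return hits[:top_k]


def build_prompt_or_fallback(question: str, hits: list[Hit]):
    """Return (prompt, None) when context was found, else (None, FALLBACK)."""
    if not hits:
        # Low confidence: skip the LLM call rather than let it guess.
        return None, FALLBACK
    context = "\n\n".join(h.text for h in hits)
    prompt = (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return prompt, None
```

Filtering before the model is called is the design choice that matters here: when no chunk clears the threshold, the user gets the fallback message instead of an answer generated without supporting context.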
Highlights
- Crawled and parsed enterprise PDFs end-to-end with crawl4ai, turning varied layouts into clean, chunkable text.
- Embedded and indexed content in Oracle 23ai's native vector store for low-latency semantic search.
- Wired retrieval to the LLM through RAG with similarity thresholds and low-confidence fallbacks to curb hallucination.
- Shipped a Flask chat UI so non-technical users could query the corpus in plain language.