Technical Expansion

HybridRAG Case-File Analysis Pipeline

This page describes the planned architecture for processing large case-file corpora and generating the four core visualizations: conversation state, executive-function gaps, procedural bottlenecks, and Observer/Operator reasoning splits.

Core Architecture

A three-stage pipeline combining document processing, knowledge graph construction, and hybrid retrieval:

01
PDF Processing
Extract structured text from case files using unstructured (standard PDFs) or Tesseract OCR (scanned documents). Chunk and clean for LLM ingestion.
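The chunk-and-clean step can be sketched in pure Python (extraction itself would use unstructured or Tesseract; the heading heuristic and the `chunk_by_headings` helper below are illustrative, not the planned implementation):

```python
import re

def chunk_by_headings(text: str, max_chars: int = 2000) -> list[str]:
    """Split extracted case-file text at ALL-CAPS headings or blank-line
    breaks, merging small pieces so each chunk stays under max_chars."""
    # Split before lines that look like headings, or at paragraph breaks.
    pieces = re.split(r"\n(?=[A-Z][A-Z .]{4,}\n)|\n\s*\n", text)
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        piece = re.sub(r"\s+", " ", piece).strip()  # normalize whitespace
        if not piece:
            continue
        if current and len(current) + len(piece) + 1 > max_chars:
            chunks.append(current)
            current = piece
        else:
            current = f"{current} {piece}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Splitting at natural breaks rather than fixed character windows keeps each chunk semantically coherent for the LLM extraction step.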
02
Knowledge Graph Construction
LLM extracts entities (actors, obligations, deadlines) and relationships (dependencies, contradictions). Store in Neo4j as nodes and edges for relational reasoning.
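One way to load the extracted entities and relationships into Neo4j is to emit Cypher MERGE statements from (subject, label, relation, object, label) tuples. The `triples_to_cypher` helper, the tuple shape, and the Actor/Obligation labels are assumptions for illustration; a production version should use parameterized queries rather than string interpolation:

```python
def triples_to_cypher(triples: list[tuple[str, str, str, str, str]]) -> list[str]:
    """Turn (subj, subj_label, relation, obj, obj_label) tuples from the
    LLM extraction step into Cypher MERGE statements for Neo4j."""
    statements = []
    for subj, subj_label, rel, obj, obj_label in triples:
        # MERGE is idempotent: re-running extraction does not duplicate nodes/edges.
        statements.append(
            f"MERGE (a:{subj_label} {{name: '{subj}'}}) "
            f"MERGE (b:{obj_label} {{name: '{obj}'}}) "
            f"MERGE (a)-[:{rel}]->(b)"
        )
    return statements
```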
03
HybridRAG Retrieval
Combine vector embeddings (Qdrant/FAISS) for semantic search with graph queries for structural analysis.
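A minimal sketch of the hybrid idea, assuming a weighted blend of cosine similarity with a binary graph-hit signal; the `hybrid_rank` helper and the `alpha` weight are illustrative, not the planned scoring function:

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def hybrid_rank(query_vec: list[float],
                docs: dict[str, list[float]],
                graph_hits: set[str],
                alpha: float = 0.7) -> list[str]:
    """Rank doc_ids by blending semantic similarity (vector store) with
    structural relevance (doc_ids returned by a graph query).
    alpha trades off the two signals (assumed value, tunable)."""
    scores = {
        doc_id: alpha * cosine(query_vec, vec)
                + (1 - alpha) * (1.0 if doc_id in graph_hits else 0.0)
        for doc_id, vec in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

In the actual pipeline the embeddings would come from Qdrant/FAISS and `graph_hits` from a Neo4j query; the blend itself is the HybridRAG step.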

Core Visualizations

Four primary outputs mapped to the pipeline:

| Visualization | Data Source | Technical Implementation | Purpose |
| --- | --- | --- | --- |
| Conversation State | Knowledge Graph + Timeline | React Force Graph (D3.js), date sliders | Near real-time view after initial indexing; dynamic filters query the cached graph view |
| Executive Function Gaps | LLM Pattern Detection | Linguistic markers of avoidance/paralysis | Identify decision friction and deferral patterns |
| Procedural Bottlenecks | Graph Analysis | Cycle detection, dead-end identification | Surface circular dependencies and unresolved paths |
| Observer vs Operator | Text Classification | Explicit vs. inferred reasoning contrast | Expose ground-truth vs. belief divergence (e.g., Operator infers "delay due to avoidance"; Observer sees "no action recorded"; the system highlights the gap) |

Key Challenges Addressed

  • Scale: Semantic chunking (by headings or natural breaks) to preserve context, plus graph query optimization for 2400+ pages.
  • Accuracy: LLM verifiers + human-in-loop validation.
  • Traceability: Full provenance from source text to visualization.
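The traceability requirement can be modeled as a small provenance record carried from source text through the graph to each visualization element; the `Provenance` fields below are assumed for illustration, not a fixed schema:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Provenance:
    """Links one visualization element back to its source text."""
    source_pdf: str    # original case-file document
    page: int          # page in that document
    chunk_id: str      # chunk produced by the PDF-processing stage
    excerpt: str       # verbatim supporting text
    graph_node_id: str # node derived from this excerpt

def provenance_chain(records: list[Provenance]) -> list[dict]:
    """Serialize the source -> chunk -> graph-node chain for display
    alongside a visualization element."""
    return [asdict(r) for r in records]
```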

Current Status

Phase 1 (PoC): 2-3 sample PDFs → basic pipeline → conversation state visualization.
Phase 2: Full corpus processing + bottleneck detection.
Phase 3: Executive-function and Observer/Operator analysis.

Tech Stack

PDF Processing
unstructured, Tesseract OCR
LLM
OpenAI GPT-4o / Claude 3.5 Sonnet
Vector Store / Graph DB
Qdrant / Neo4j
Frontend / Backend
React + D3.js (Force Graph) / FastAPI + LangChain

Validation Approach

  • Internal consistency: Graph queries reproduce expected relationships.
  • Face validity: Human review of flagged bottlenecks/gaps.
  • Convergent validity: Compare LLM outputs to manual analysis.
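The internal-consistency check above can be as simple as a set comparison between expected (subject, relation, object) edges and those actually present in the constructed graph; `check_expected_edges` is an illustrative helper:

```python
def check_expected_edges(graph_edges: set[tuple[str, str, str]],
                         expected: set[tuple[str, str, str]]) -> dict:
    """Report which expected (subj, rel, obj) edges the constructed
    graph reproduces, and which are missing."""
    return {
        "present": sorted(expected & graph_edges),
        "missing": sorted(expected - graph_edges),
    }
```

Missing edges flag either an extraction failure or a wrong expectation, and feed the human-in-loop review.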

Known Limitations

  • LLM entity extraction may miss implicit relationships (e.g., “as discussed” without an explicit link).
  • Real-time analysis requires pre-indexing; dynamic filters operate on cached graph.
  • Not a substitute for legal review; intended for pattern discovery and decision support, with human review recommended.

This architecture enables computational analysis of case-file corpora while preserving transparency and reproducibility.