Core Architecture
A three-stage pipeline combining document processing, knowledge graph construction, and hybrid retrieval:
01. PDF Processing
Extract structured text from case files using the unstructured library (standard PDFs) or Tesseract OCR (scanned documents). Clean the text and split it into chunks for LLM ingestion.
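The cleaning and chunking step can be sketched in plain Python. This is a minimal sketch, not the project's implementation: it assumes text has already been extracted, and the heading heuristic (`HEADING_RE`), the `Chunk` type, and the `max_chars` limit are all hypothetical choices made for illustration.

```python
import re
from dataclasses import dataclass

# Assumption: headings are short lines that are either numbered ("3.2 Findings")
# or written in capitals. Real case files may need a different heuristic.
HEADING_RE = re.compile(r"^(?:\d+(?:\.\d+)*\s+\S.*|[A-Z][A-Z0-9 ,&/-]{3,})$")


@dataclass
class Chunk:
    heading: str  # heading the chunk falls under ("" if none was seen)
    text: str


def clean(raw: str) -> str:
    """Rejoin words hyphenated across line breaks and collapse stray whitespace."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    text = re.sub(r"[ \t]+", " ", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()


def chunk_by_heading(text: str, max_chars: int = 2000) -> list[Chunk]:
    """Split at heading lines, then split oversized sections at paragraph breaks."""
    sections: list[tuple[str, list[str]]] = [("", [])]
    for line in text.split("\n"):
        stripped = line.strip()
        if len(stripped) <= 80 and HEADING_RE.match(stripped):
            sections.append((stripped, []))
        else:
            sections[-1][1].append(line)

    chunks: list[Chunk] = []
    for heading, lines in sections:
        body = "\n".join(lines).strip()
        if not body:
            continue  # a heading with no text under it produces no chunk
        buffer = ""
        for para in body.split("\n\n"):
            if buffer and len(buffer) + len(para) + 2 > max_chars:
                chunks.append(Chunk(heading, buffer))
                buffer = para
            else:
                buffer = f"{buffer}\n\n{para}" if buffer else para
        chunks.append(Chunk(heading, buffer))
    return chunks
```

Keeping the heading on every chunk means a passage can still be read in context after it has been separated from its section.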
02. Knowledge Graph Construction
An LLM extracts entities (actors, obligations, deadlines) and relationships (dependencies, contradictions), which are stored in Neo4j as nodes and edges for relational reasoning.
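One way to turn extracted entities and relationships into graph writes is to build parameterized Cypher statements. The sketch below is illustrative only: the `Entity` and `Relation` types, the label set, and the relationship names are hypothetical, derived from the examples in the text rather than from a confirmed schema. It only builds the statements; sending them to the database is left to a Neo4j driver session.

```python
from dataclasses import dataclass

# Hypothetical schema for illustration: labels and relationship types mirror
# the examples in the text, not a confirmed data model.
NODE_LABELS = {"Actor", "Obligation", "Deadline"}
REL_TYPES = {"DEPENDS_ON", "CONTRADICTS"}


@dataclass(frozen=True)
class Entity:
    id: str
    label: str
    name: str


@dataclass(frozen=True)
class Relation:
    source_id: str
    target_id: str
    type: str


def entity_to_cypher(entity: Entity) -> tuple[str, dict]:
    """Build an idempotent MERGE statement and its parameters for one node."""
    # Cypher cannot take a label as a parameter, so the label is checked
    # against an allow-list before being interpolated into the query.
    if entity.label not in NODE_LABELS:
        raise ValueError(f"unknown label: {entity.label}")
    query = f"MERGE (n:{entity.label} {{id: $id}}) SET n.name = $name"
    return query, {"id": entity.id, "name": entity.name}


def relation_to_cypher(rel: Relation) -> tuple[str, dict]:
    """Build a MERGE statement and its parameters for one edge."""
    # Relationship types cannot be parameterized either, hence the same check.
    if rel.type not in REL_TYPES:
        raise ValueError(f"unknown relationship type: {rel.type}")
    query = (
        "MATCH (a {id: $source_id}), (b {id: $target_id}) "
        f"MERGE (a)-[:{rel.type}]->(b)"
    )
    return query, {"source_id": rel.source_id, "target_id": rel.target_id}
```

Using MERGE rather than CREATE keeps re-runs of the extraction step from duplicating nodes and edges.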
03. HybridRAG Retrieval
Combine vector embeddings (Qdrant/FAISS) for semantic search with graph queries for structural analysis.
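The combination can be illustrated with in-memory stand-ins for both stores. This is a toy sketch under stated assumptions: plain dictionaries replace the vector store and the graph database, and the function name `hybrid_retrieve` and its expansion rule (pull in chunks that mention a graph neighbor of a top hit) are hypothetical, chosen to show the idea rather than to describe the project's retrieval logic.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_retrieve(
    query_vec: list[float],
    chunk_vecs: dict[str, list[float]],
    chunk_entities: dict[str, set[str]],
    graph: dict[str, list[str]],
    top_k: int = 2,
) -> list[str]:
    """Return the top_k chunks by similarity, followed by any remaining chunks
    that mention an entity adjacent (in the graph) to an entity in those hits."""
    ranked = sorted(
        chunk_vecs,
        key=lambda cid: cosine(query_vec, chunk_vecs[cid]),
        reverse=True,
    )
    seeds = ranked[:top_k]

    # Entities mentioned in the semantic hits, and their graph neighbors.
    seed_entities = {e for cid in seeds for e in chunk_entities.get(cid, ())}
    neighbors = {n for e in seed_entities for n in graph.get(e, ())} - seed_entities

    # Structural expansion: chunks the vector search ranked low but that the
    # graph connects to the hits.
    expanded = [
        cid for cid in ranked[top_k:]
        if neighbors & set(chunk_entities.get(cid, ()))
    ]
    return seeds + expanded
```

The point of the second step is that a chunk with low textual similarity to the query can still be returned when the graph links it to a relevant entity.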
Core Visualizations
Four primary outputs mapped to the pipeline:
| Visualization | Data Source | Technical Implementation | Purpose |
|---|---|---|---|
| Conversation State | Knowledge Graph + Timeline | React Force Graph (D3.js), date sliders | Interactive view of the conversation state over time. Near real-time after initial indexing: dynamic filters query the cached graph view. |
| Executive Function Gaps | LLM Pattern Detection | Linguistic markers of avoidance/paralysis | Identify decision friction, deferral patterns |
| Procedural Bottlenecks | Graph Analysis | Cycle detection, dead-end identification | Circular dependencies, unresolved paths |
| Observer vs Operator | Text Classification | Explicit vs inferred reasoning contrast | Ground truth vs belief divergence. E.g., Operator infers 'delay due to avoidance'; Observer sees 'no action recorded'. System highlights the gap. |
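
The graph analysis behind the Procedural Bottlenecks row can be sketched with a depth-first search. This is a minimal sketch under two assumptions not stated in the source: dependencies are given as directed `(source, target)` pairs, and a "dead end" is any node with no outgoing edge. The function name is hypothetical, and the search reports one cycle per back edge rather than enumerating every simple cycle.

```python
def find_cycles_and_dead_ends(
    edges: list[tuple[str, str]],
) -> tuple[list[list[str]], list[str]]:
    """Return (cycles, dead_ends) for a directed dependency graph."""
    graph: dict[str, list[str]] = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    # Assumption: a dead end is a node that nothing leads on from.
    dead_ends = sorted(node for node, out in graph.items() if not out)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}
    cycles: list[list[str]] = []

    def visit(node: str, path: list[str]) -> None:
        color[node] = GRAY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GRAY:
                # Back edge: nxt is already on the current path, so the path
                # from nxt to node plus this edge closes a cycle.
                cycles.append(path[path.index(nxt):] + [nxt])
            elif color[nxt] == WHITE:
                visit(nxt, path)
        path.pop()
        color[node] = BLACK

    # Recursive for brevity; a very deep graph would need an explicit stack.
    for node in graph:
        if color[node] == WHITE:
            visit(node, [])
    return cycles, dead_ends
```

For example, the edges A→B, B→C, C→A, C→D yield the cycle A, B, C, A and the dead end D.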
Key Challenges Addressed
- Scale: Semantic chunking (by headings or natural breaks) to preserve context, plus graph query optimization for 2400+ pages.
- Accuracy: LLM verifiers + human-in-the-loop validation.
- Traceability: Full provenance from source text to visualization.
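Provenance of this kind is often carried as a small record attached to each graph element. The sketch below shows one possible shape, assuming page text is available keyed by document and page number; the `Provenance` type, the `locate` and `resolve` helpers, and the file name in the usage example are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    document: str
    page: int
    start: int  # character offset into the page text
    end: int


def locate(
    quote: str, document: str, page: int, pages: dict[tuple[str, int], str]
) -> Optional[Provenance]:
    """Build a provenance record for a quote, or return None if the quote
    does not appear verbatim on that page."""
    start = pages[(document, page)].find(quote)
    if start == -1:
        return None
    return Provenance(document, page, start, start + len(quote))


def resolve(prov: Provenance, pages: dict[tuple[str, int], str]) -> str:
    """Return the source span a graph element points back to."""
    return pages[(prov.document, prov.page)][prov.start:prov.end]


# Usage with a made-up page of text.
pages = {("case.pdf", 3): "Notice was served on 4 March."}
record = locate("served on 4 March", "case.pdf", 3, pages)
```

A `None` result from `locate` is also a cheap accuracy signal: evidence an LLM quotes that cannot be found verbatim in the source can be routed to human review.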
Current Status
- Phase 1 (PoC): 2-3 sample PDFs → basic pipeline → conversation state visualization.
- Phase 2: Full corpus processing + bottleneck detection.
- Phase 3: Executive function and Observer/Operator analysis.
Tech Stack
- PDF Processing: unstructured, Tesseract OCR
- LLM: OpenAI GPT-4o / Claude 3.5 Sonnet
- Vector Store / Graph DB: Qdrant / Neo4j
- Frontend / Backend: React + D3.js (Force Graph) / FastAPI + LangChain
Validation Approach
- Internal consistency: Graph queries reproduce expected relationships.
- Face validity: Human review of flagged bottlenecks/gaps.
- Convergent validity: Compare LLM outputs to manual analysis.
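The comparison against manual analysis could be quantified with precision and recall over extracted relationships. The source does not name a metric, so this choice is an assumption, as are the function name and the triple format `(source, relation, target)` used here.

```python
def precision_recall(predicted: set, reference: set) -> tuple[float, float]:
    """Compare LLM-extracted items against a manually built reference set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    # Precision: share of LLM outputs that the manual analysis also contains.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of the manual analysis that the LLM recovered.
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Made-up triples for illustration.
llm_triples = {
    ("tenant", "DEPENDS_ON", "notice"),
    ("notice", "DEPENDS_ON", "hearing"),
    ("tenant", "CONTRADICTS", "landlord"),
}
manual_triples = {
    ("tenant", "DEPENDS_ON", "notice"),
    ("notice", "DEPENDS_ON", "hearing"),
    ("landlord", "DEPENDS_ON", "hearing"),
    ("hearing", "DEPENDS_ON", "ruling"),
}
precision, recall = precision_recall(llm_triples, manual_triples)
```

Exact matching on triples is strict; a real comparison would likely need entity names normalized first so that spelling variants do not count as misses.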
Known Limitations
- LLM entity extraction may miss implicit relationships (e.g., “as discussed” without an explicit link).
- Real-time analysis requires pre-indexing; dynamic filters operate on the cached graph, not on the source documents.
- Not a substitute for legal review – intended for pattern discovery and decision support; human review recommended.
This architecture enables computational analysis of case-file corpora while preserving transparency and reproducibility.