DocuClaw

YOUR DOCUMENTS. YOUR RULES.

Open-source, local-first, AI-powered document intelligence. Extract, organize, and archive invoices, receipts, and contracts — 100% on your machine.

Quick Start

# Clone & install
$ git clone https://github.com/astonysh/DocuClaw.git
$ cd DocuClaw && pip install -e .

# Process a document
$ docuclaw process \
    --entity-id "org_mycompany_01" \
    --country DE \
    --input ./scans/invoice.png

What It Does

🛡️

100% Sovereign

All data stays on YOUR machine. Zero cloud dependency. Zero telemetry. Your privacy is non-negotiable.

🏢

Multi-Entity

Manage personal docs, company invoices, and team files — all in one install. Separate or combine as you wish.

🔌

Plugin Architecture

Country-specific parsers snap in like LEGO bricks. Germany, US, China — extend DocuClaw for any locale.

📝

Markdown-Native

Every document becomes a searchable .md file with structured YAML frontmatter. Human-readable, version-controllable.

🤖

AI-Powered Extraction

A multimodal LLM extracts structured data from scans, photos, and emails. Works with Ollama, OpenAI, or another model of your choice.

Compliance-Ready

Designed with GoBD (Germany), GDPR, and audit-trail principles baked in. Enterprise-grade from day one.

GDPR & Compliance

DocuClaw is designed from the ground up with EU GDPR compliance in mind. By keeping data processing local and giving you full control, DocuClaw avoids the most common compliance risks of cloud-based document management.

🏠

Local-First by Design

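With a local model such as Ollama, no data leaves your machine: no third-party servers, no cross-border data transfers, no sub-processors. Because nothing is transferred, the GDPR rules on international data transfers (Articles 44–49) never come into play. If you opt for a hosted model such as OpenAI, document content is sent to that provider, and those rules apply to that transfer.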

🎯

Data Minimization

DocuClaw only extracts and stores the structured fields you define. No hidden telemetry, no usage analytics, no behavioral tracking. Aligned with GDPR Article 5(1)(c) — data minimization principle.

🗑️

Right to Erasure

Since all data is stored as plain Markdown files on your local filesystem, exercising the right to erasure (GDPR Article 17) is as simple as deleting a file. No vendor lock-in, no deletion request tickets.

📋

Audit Trail & Accountability

Built-in audit logging and hash-chain integrity verification support GDPR Article 5(2) accountability requirements and GoBD (Germany) compliant archival standards.
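To make the hash-chain idea concrete, here is a minimal sketch of an append-only audit log in which every entry commits to the hash of the previous one. It is not DocuClaw's actual implementation: the function names (`append_entry`, `verify_chain`), the entry layout, and the choice of SHA-256 over canonical JSON are all assumptions made for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def _entry_hash(prev_hash: str, event: dict) -> str:
    """Hash the previous hash together with a canonical JSON form of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def append_entry(log: list, event: dict) -> dict:
    """Append an event to the log, linking it to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {
        "event": event,
        "prev_hash": prev_hash,
        "hash": _entry_hash(prev_hash, event),
    }
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited, removed, or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


# Hypothetical audit events for one document.
audit_log = []
append_entry(audit_log, {"action": "archived", "doc_id": "doc_20260215_a1b2c3d4"})
append_entry(audit_log, {"action": "status_changed", "doc_id": "doc_20260215_a1b2c3d4"})
```

Changing any stored event after the fact makes `verify_chain` return `False`, which is the property an auditor relies on. A chain like this detects tampering but does not prevent it, so a real deployment would likely also store the latest hash somewhere outside the log.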

Architecture

┌─────────────────────────────────────────────┐
│                   CLI / API                  │
├─────────────────────────────────────────────┤
│               Core Engine                    │
│  ┌──────────┐  ┌──────────┐  ┌───────────┐ │
│  │  Schema   │  │ Storage  │  │  Registry │ │
│  │(Pydantic) │  │  Layer   │  │  (Plugin) │ │
│  └──────────┘  └──────────┘  └───────────┘ │
├─────────────────────────────────────────────┤
│             Parser Plugins                   │
│  ┌────────┐  ┌────────┐  ┌──────────────┐  │
│  │ DE 🇩🇪  │  │ US 🇺🇸  │  │ Custom ...  │  │
│  │Invoice │  │Invoice │  │  Your Parser │  │
│  └────────┘  └────────┘  └──────────────┘  │
├─────────────────────────────────────────────┤
│        Input Adapters (Future)               │
│  📷 Scanner │ 📧 Email │ 🔗 Webhook │ 🔌 API │
└─────────────────────────────────────────────┘
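
The Registry and Parser Plugins layers in the diagram can be pictured as a lookup table from locale to parser. The sketch below shows one common way to build such a registry with a decorator. The names `register_parser` and `get_parser`, the `(country, document_type)` key, and the parser signature are hypothetical and may differ from DocuClaw's real plugin API.

```python
from typing import Callable, Dict, Tuple

# Hypothetical parser signature: raw extracted text in, structured fields out.
ParserFn = Callable[[str], dict]

_PARSERS: Dict[Tuple[str, str], ParserFn] = {}


def register_parser(country: str, document_type: str):
    """Decorator that registers a parser under a (country, document_type) key."""
    def decorator(fn: ParserFn) -> ParserFn:
        _PARSERS[(country.upper(), document_type)] = fn
        return fn
    return decorator


def get_parser(country: str, document_type: str) -> ParserFn:
    """Look up a registered parser, failing loudly for unsupported locales."""
    key = (country.upper(), document_type)
    try:
        return _PARSERS[key]
    except KeyError:
        raise LookupError(f"no parser registered for {key}") from None


@register_parser("DE", "invoice")
def parse_de_invoice(text: str) -> dict:
    # Placeholder logic; a real parser would extract locale-specific fields.
    return {"country": "DE", "document_type": "invoice", "raw_length": len(text)}


@register_parser("US", "invoice")
def parse_us_invoice(text: str) -> dict:
    # Placeholder logic, as above.
    return {"country": "US", "document_type": "invoice", "raw_length": len(text)}


parser = get_parser("de", "invoice")
fields = parser("Rechnung Nr. 4521")
```

With this shape, adding a locale means writing one decorated function; the core engine never needs to import a parser by name.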

The Data Contract

Every document, whether a €10K enterprise invoice or a personal electricity bill, is normalized into a universal Markdown schema with structured YAML frontmatter.

---
id: doc_20260215_a1b2c3d4
entity_id: "org_acme_01"
entity_type: "company"
source_type: physical_mail
country: DE
document_type: b2b_invoice
date_received: "2026-02-15"
sender_name: "AWS EMEA SARL"
amount_total: 125.50
currency: EUR
status: pending
tags: [IT_Infrastructure, Q1_Expense]
---
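
As a rough illustration of how such a record could be validated and written out, the sketch below models the frontmatter fields and renders them as a Markdown header. DocuClaw itself uses Pydantic for its schema; this example swaps in a standard-library dataclass so it runs without dependencies, and the class name `DocumentRecord`, the validation rules, and `to_markdown` are assumptions rather than the project's real code.

```python
import json
import re
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class DocumentRecord:
    """Stand-in for the frontmatter schema (field names taken from the example above)."""
    id: str
    entity_id: str
    entity_type: str
    source_type: str
    country: str
    document_type: str
    date_received: str
    sender_name: str
    amount_total: float
    currency: str
    status: str = "pending"
    tags: list = field(default_factory=list)

    def __post_init__(self):
        # Assumed rules: two-letter country code, three-letter currency code.
        if not re.fullmatch(r"[A-Z]{2}", self.country):
            raise ValueError(f"country must be a two-letter code, got {self.country!r}")
        if not re.fullmatch(r"[A-Z]{3}", self.currency):
            raise ValueError(f"currency must be a three-letter code, got {self.currency!r}")
        date.fromisoformat(self.date_received)  # raises ValueError on a malformed date
        if self.amount_total < 0:
            raise ValueError("amount_total must not be negative")


def to_markdown(record: DocumentRecord, body: str = "") -> str:
    """Render the record as YAML frontmatter followed by a Markdown body.

    Values are emitted as JSON literals, which YAML parsers accept as
    flow-style scalars and sequences.
    """
    lines = ["---"]
    for key, value in asdict(record).items():
        lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + body


record = DocumentRecord(
    id="doc_20260215_a1b2c3d4",
    entity_id="org_acme_01",
    entity_type="company",
    source_type="physical_mail",
    country="DE",
    document_type="b2b_invoice",
    date_received="2026-02-15",
    sender_name="AWS EMEA SARL",
    amount_total=125.50,
    currency="EUR",
    tags=["IT_Infrastructure", "Q1_Expense"],
)
markdown = to_markdown(record, "# Invoice from AWS EMEA SARL\n")
```

Rejecting bad values at construction time means a malformed extraction never reaches the archive, which is the role the Pydantic schema check plays in the pipeline.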

How It Works

📄 Document Input: Scan, email, or API
🤖 AI Extraction: LLM-powered parsing
🔍 Validation: Pydantic schema check
📁 Local Archive: Structured Markdown

AI-Powered Output

DocuClaw doesn't just archive your documents — it makes them actionable. Through AI agent integration, your structured data becomes a living knowledge base that can answer questions, automate workflows, and feed directly into the tools you already use.

💬

Ask Your Documents

Chat with your document archive through an AI agent. "How much did I spend on AWS last quarter?" "When does my lease expire?" — Get instant answers from your own data.
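
Behind a question like the AWS example, an agent ultimately has to run a filter-and-sum over the structured frontmatter. The sketch below shows that step only, on records already loaded into dictionaries. The function `total_spent` and its parameters are hypothetical; how DocuClaw's agent integration translates a question into such a query is not specified here.

```python
from datetime import date


def total_spent(records, sender_contains, start, end, currency="EUR"):
    """Sum amount_total for matching senders within an inclusive date range."""
    total = 0.0
    for record in records:
        received = date.fromisoformat(record["date_received"])
        if (
            sender_contains.lower() in record["sender_name"].lower()
            and record["currency"] == currency
            and start <= received <= end
        ):
            total += record["amount_total"]
    return round(total, 2)


# Hypothetical archive contents, using the frontmatter field names.
archive = [
    {"sender_name": "AWS EMEA SARL", "date_received": "2026-02-15", "amount_total": 125.50, "currency": "EUR"},
    {"sender_name": "AWS EMEA SARL", "date_received": "2026-03-01", "amount_total": 80.00, "currency": "EUR"},
    {"sender_name": "Stadtwerke", "date_received": "2026-03-05", "amount_total": 40.00, "currency": "EUR"},
    {"sender_name": "AWS EMEA SARL", "date_received": "2026-04-02", "amount_total": 99.00, "currency": "EUR"},
]

q1_aws = total_spent(archive, "aws", date(2026, 1, 1), date(2026, 3, 31))
```

Keeping the currency as an explicit filter avoids silently adding amounts in different currencies.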

📅

Calendar & Reminders

Auto-extract payment due dates, contract renewals, and deadlines from your documents and sync them to your calendar. Never miss a deadline again.

🧾

Tax Filing & Reports

Generate tax-ready summaries, expense reports, and financial overviews directly from your archived invoices and receipts. Export in formats your accountant or tax software expects.

To-Do & Task Lists

Automatically create action items from documents — "Pay invoice #4521 by March 15", "Renew insurance policy", "Submit quarterly VAT return" — and push them to your task manager.

🔗

Third-Party Systems

Generate and submit data in the exact format required by accounting software (DATEV, Xero, QuickBooks), ERP systems, government portals, and banking platforms — all from your local archive.

📊

Custom Analytics

Build custom dashboards and reports from your document data. Track spending trends, vendor relationships, contract status, and compliance metrics — all processed locally.

Roadmap

Core schema, storage engine, parser framework, CLI
Email ingestion adapter (IMAP / POP3)
Real multimodal LLM integration (Ollama, OpenAI Vision)
Web UI dashboard (local-only, no cloud)
GoBD-compliant audit trail with hash chains
Multi-entity permission model & team collaboration
Webhook & API ingestion endpoints