Project Overview
What We're Working With
Document Collection
- 33,928 document images from congressional records
- Multiple formats including TIF, JPG, and other image types
- Organized structure with volumes and subdirectories
- 32,203 images with OCR (94.9% complete)
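The completion figure above follows directly from the two counts. A minimal sketch of that arithmetic, using hypothetical variable names:

```python
# Counts taken from the collection summary above.
total_images = 33_928
images_with_ocr = 32_203

# Share of images that already have OCR text, as a percentage.
ocr_coverage = images_with_ocr / total_images * 100
remaining = total_images - images_with_ocr

print(f"{ocr_coverage:.1f}% complete, {remaining:,} images remaining")
```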
Processing Status
OCR processing continues in the background to extract searchable text from all documents.
Project Description
The Epstein Documents Browser is a document management system designed to make congressional records accessible and searchable. This reference implementation demonstrates the full capabilities of the open-source project.
Key Objectives
- Accessibility - Make government documents easily browsable and searchable
- Transparency - Provide open access to congressional records and documents
- Searchability - Enable full-text search through OCR-processed documents
- Usability - Create an intuitive interface for document discovery and navigation
System Architecture
Backend Components
- Flask Web Server - Python web framework for API and web interface
- SQLite Database - Document metadata and OCR tracking
- Image Indexer - Scans and catalogs all document images
- OCR Processor - Tesseract-based text extraction
- Screen Management - GNU Screen sessions for production process management
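To illustrate what the Image Indexer component might do, here is a minimal sketch that walks a directory tree and collects the metadata fields the database tracks. It is not the project's actual indexer: the function names, the `IMAGE_EXTENSIONS` set, and the record layout are assumptions made for this example, and it uses only the Python standard library.

```python
import hashlib
from pathlib import Path

# Assumed set of extensions; the source only names TIF and JPG explicitly.
IMAGE_EXTENSIONS = {".tif", ".tiff", ".jpg", ".jpeg", ".png"}


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_images(root: Path) -> list[dict]:
    """Walk root recursively and build one metadata record per image file."""
    records = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in IMAGE_EXTENSIONS:
            continue
        records.append({
            "file_path": path.relative_to(root).as_posix(),  # relative to root
            "file_name": path.name,
            "file_size": path.stat().st_size,                 # bytes
            "file_type": path.suffix.lower().lstrip("."),
            "file_hash": md5_of_file(path),                   # change detection
        })
    return records
```

Comparing a stored `file_hash` with a freshly computed one is one way to detect that a file changed and needs to be re-indexed or re-processed by OCR.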
Frontend Components
- Bootstrap 5 - Responsive UI framework
- Font Awesome - Icon library
- Custom JavaScript - Navigation and interaction
- PIL/Pillow - Image processing and format conversion
Database Schema
Images Table
- id - Unique identifier
- file_path - Relative path to image file
- file_name - Original filename
- file_size - File size in bytes
- file_type - File extension
- has_ocr_text - Boolean flag for OCR completion
- ocr_text_path - Path to extracted text file
- file_hash - MD5 hash for change detection
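The Images table fields listed above could be expressed in SQLite roughly as follows. This is a sketch, not the project's actual DDL: the column names come from the field list, but the table name `images`, the column types, the constraints, and the sample row are assumptions.

```python
import sqlite3

# Column names come from the field list above; the types and constraints
# are assumptions, not the project's actual schema definition.
CREATE_IMAGES_TABLE = """
CREATE TABLE IF NOT EXISTS images (
    id            INTEGER PRIMARY KEY,
    file_path     TEXT NOT NULL UNIQUE,
    file_name     TEXT NOT NULL,
    file_size     INTEGER,
    file_type     TEXT,
    has_ocr_text  INTEGER NOT NULL DEFAULT 0,  -- SQLite stores booleans as 0/1
    ocr_text_path TEXT,
    file_hash     TEXT
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(CREATE_IMAGES_TABLE)

# Hypothetical sample row, for illustration only.
conn.execute(
    "INSERT INTO images (file_path, file_name, file_size, file_type, file_hash) "
    "VALUES (?, ?, ?, ?, ?)",
    ("VOL001/page_0001.tif", "page_0001.tif", 48213, "tif",
     "d41d8cd98f00b204e9800998ecf8427e"),
)

# Images still waiting for OCR are those with the flag unset.
pending = conn.execute(
    "SELECT file_path FROM images WHERE has_ocr_text = 0"
).fetchall()
print(pending)
```

A query like the last one is one plausible way for a background OCR process to find its remaining work.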
Analytics Tables
- analytics - Request tracking and user analytics
- search_queries - Search query tracking and popular searches
Open Source Repository
The complete source code, documentation, and setup instructions are available at:
What's Available
- Complete source code - All Python, HTML, CSS, and JavaScript files
- Setup instructions - Step-by-step installation and configuration guide
- Documentation - API documentation, database schema, and usage guides
- Production scripts - Screen-based process management for deployment
- Contributing guidelines - How to contribute to the project