Project Overview

What We're Working With

Document Collection
  • 33,928 document images from congressional records
  • Multiple formats including TIF, JPG, and other image types
  • Organized structure with volumes and subdirectories
  • 32,203 images with OCR (94.9% complete)
Processing Status
94.9% Complete

OCR processing continues in the background to extract searchable text from all documents.
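The completion figure follows directly from the two counts in the Document Collection list. A quick check in Python (the variable names are illustrative and not taken from the project's code):

```python
# Counts as stated in the Document Collection list.
total_images = 33_928
images_with_ocr = 32_203

# Share of images that already have OCR text, as a percentage.
percent_complete = round(images_with_ocr / total_images * 100, 1)

# Images still waiting for OCR.
remaining = total_images - images_with_ocr

print(f"{percent_complete}% complete, {remaining} images remaining")
# prints "94.9% complete, 1725 images remaining"
```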

Project Description

The Epstein Documents Browser is a document management system that makes congressional records easy to browse and search. This reference implementation demonstrates the full capabilities of the open-source project.

Key Objectives
  • Accessibility - Make government documents easily browsable and searchable
  • Transparency - Provide open access to congressional records and documents
  • Searchability - Enable full-text search through OCR-processed documents
  • Usability - Create an intuitive interface for document discovery and navigation

System Architecture

Backend Components
  • Flask Web Server - Python web framework for API and web interface
  • SQLite Database - Document metadata and OCR tracking
  • Image Indexer - Scans and catalogs all document images
  • OCR Processor - Tesseract-based text extraction
  • Screen Management - Production process management using screen sessions
Frontend Components
  • Bootstrap 5 - Responsive UI framework
  • Font Awesome - Icon library
  • Custom JavaScript - Navigation and interaction
  • PIL/Pillow - Image processing and format conversion (a Python library, so this step runs on the server)
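As a rough illustration of what the Image Indexer listed under Backend Components might do, the sketch below walks a directory tree and collects the per-file metadata that the database tracks. It is a minimal sketch under stated assumptions, not the project's actual indexer: the function name `scan_images` and the extension set `IMAGE_EXTENSIONS` are hypothetical, and only the standard library is used.

```python
from pathlib import Path

# Assumed set of extensions; the overview only names TIF and JPG explicitly.
IMAGE_EXTENSIONS = {".tif", ".tiff", ".jpg", ".jpeg", ".png"}


def scan_images(root):
    """Walk `root` recursively and return one metadata dict per image file."""
    root = Path(root)
    records = []
    for path in sorted(root.rglob("*")):
        # Match on the lower-cased suffix so "PAGE1.TIF" and "page1.tif" both count.
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            records.append({
                "file_path": path.relative_to(root).as_posix(),  # relative to the collection root
                "file_name": path.name,
                "file_size": path.stat().st_size,                # size in bytes
                "file_type": path.suffix.lower().lstrip("."),
            })
    return records
```

Storing paths relative to the collection root keeps the catalog valid if the collection directory is moved to another location.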

Database Schema

Images Table
  • id - Unique identifier
  • file_path - Relative path to image file
  • file_name - Original filename
  • file_size - File size in bytes
  • file_type - File extension
  • has_ocr_text - Boolean flag for OCR completion
  • ocr_text_path - Path to extracted text file
  • file_hash - MD5 hash for change detection
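The Images Table fields could be expressed in SQLite roughly as follows. The column names come from the list above; the SQL types, the constraints, and the helper functions `md5_of_bytes`, `upsert_image`, and `ocr_progress` are assumptions made for this sketch. Resetting the OCR flag when a file's hash changes is likewise an illustrative design choice, not confirmed project behaviour.

```python
import hashlib
import sqlite3

# Column names follow the Images Table list; types and constraints are assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS images (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    file_path     TEXT NOT NULL UNIQUE,
    file_name     TEXT NOT NULL,
    file_size     INTEGER,
    file_type     TEXT,
    has_ocr_text  INTEGER NOT NULL DEFAULT 0,  -- SQLite stores booleans as 0/1
    ocr_text_path TEXT,
    file_hash     TEXT
);
"""


def md5_of_bytes(data):
    """Return the hex MD5 digest used here for change detection."""
    return hashlib.md5(data).hexdigest()


def upsert_image(conn, file_path, file_name, data, file_type):
    """Insert or refresh a row; return True if the file is new or changed."""
    new_hash = md5_of_bytes(data)
    row = conn.execute(
        "SELECT file_hash FROM images WHERE file_path = ?", (file_path,)
    ).fetchone()
    if row is not None and row[0] == new_hash:
        return False  # unchanged, nothing to do
    if row is None:
        conn.execute(
            "INSERT INTO images (file_path, file_name, file_size, file_type, file_hash)"
            " VALUES (?, ?, ?, ?, ?)",
            (file_path, file_name, len(data), file_type, new_hash),
        )
    else:
        # Content changed: store the new hash and mark the OCR text as stale.
        conn.execute(
            "UPDATE images SET file_size = ?, file_hash = ?,"
            " has_ocr_text = 0, ocr_text_path = NULL WHERE file_path = ?",
            (len(data), new_hash, file_path),
        )
    return True


def ocr_progress(conn):
    """Return (images_with_ocr, total_images)."""
    done, total = conn.execute(
        "SELECT COALESCE(SUM(has_ocr_text), 0), COUNT(*) FROM images"
    ).fetchone()
    return done, total


# Usage: an in-memory database is enough to try the schema out.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

Comparing the stored hash with a freshly computed one lets a re-scan skip files whose content has not changed, which matters for a collection of this size.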
Analytics Tables
  • analytics - Request tracking and user analytics
  • search_queries - Search query tracking and popular searches
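The overview does not list the columns of the analytics tables, so the following is only a hypothetical sketch of how search query tracking and a "popular searches" lookup might work. The table layout (`query`, `searched_at`) and the functions `record_search` and `popular_searches` are assumptions introduced for illustration.

```python
import sqlite3

search_db = sqlite3.connect(":memory:")

# Hypothetical columns: the overview does not specify the fields of this table.
search_db.execute(
    "CREATE TABLE search_queries ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " query TEXT NOT NULL,"
    " searched_at TEXT DEFAULT CURRENT_TIMESTAMP)"
)


def record_search(conn, query):
    """Store one search, normalised so case and padding do not split the counts."""
    conn.execute(
        "INSERT INTO search_queries (query) VALUES (?)",
        (query.strip().lower(),),
    )


def popular_searches(conn, limit=10):
    """Return (query, count) pairs, most frequent first, ties in alphabetical order."""
    return conn.execute(
        "SELECT query, COUNT(*) AS n FROM search_queries"
        " GROUP BY query ORDER BY n DESC, query ASC LIMIT ?",
        (limit,),
    ).fetchall()
```

Normalising the query text before storing it is one simple way to keep "Hearing Transcript" and "hearing transcript" from being counted as separate searches.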

Open Source Repository

The complete source code, documentation, and setup instructions are available in the project's open-source repository.

What's Available
  • Complete source code - All Python, HTML, CSS, and JavaScript files
  • Setup instructions - Step-by-step installation and configuration guide
  • Documentation - API documentation, database schema, and usage guides
  • Production scripts - Screen-based process management for deployment
  • Contributing guidelines - How to contribute to the project