Installation Guide

Set up your own mirror of the Epstein Documents Browser

Why run your own instance? A private mirror supports decentralization, lets you run custom queries, and keeps the data available if the main site goes down.

System Requirements

Minimum Requirements
  • CPU: 2 cores, 1.5+ GHz (tested on dual-core systems)
  • RAM: 4 GB
  • Storage: 71+ GB free space (for document data)
  • OS: Windows 10+, macOS 10.14+, or Linux
Recommended
  • CPU: 4+ cores, 2.0+ GHz (for faster OCR processing)
  • RAM: 8+ GB (for better performance with large datasets)
  • Storage: 100+ GB SSD (71 GB for data + 29 GB for system/processing)
  • Network: Stable internet connection (for initial download)
Storage Warning: The congressional documents require at least 71 GB of free space, which is the actual size of the released document collection. Plan accordingly!
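To confirm that the target drive has room before downloading, a minimal Python sketch like the one below may help. The 71 GB threshold comes from the requirement above; the function name has_enough_space and the default path are illustrative assumptions, not part of the application.

```python
import shutil

REQUIRED_GB = 71  # size of the document collection stated above


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """Return True if the filesystem holding `path` has at least `required_gb` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


if __name__ == "__main__":
    free_gb = shutil.disk_usage(".").free / 1024**3
    status = "OK" if has_enough_space(".") else "not enough space"
    print(f"Free space: {free_gb:.1f} GB ({status})")
```

Run it from the directory where you plan to clone the repository so that the check applies to the right drive.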

Prerequisites

Before installing, ensure you have the following software installed:

Python 3.8+

# Check Python version (on macOS/Linux the command may be python3)
python --version

# If not installed, download from:
# https://www.python.org/downloads/
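The same version check can be done from inside Python, which is handy in a setup script. This is a small sketch; the helper name python_is_supported is hypothetical, and the 3.8 minimum is taken from the prerequisite above.

```python
import sys

MIN_VERSION = (3, 8)  # minimum version listed in the prerequisites


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) part of version_info meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    verdict = "supported" if python_is_supported() else "too old"
    print(f"Python {current}: {verdict}")
```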

Git

# Check Git installation
git --version

# If not installed, download from:
# https://git-scm.com/downloads

Installation Steps

Step 1: Clone the Repository

# Clone the repository
git clone https://github.com/gopoversight/epstein-documents-browser.git
cd epstein-documents-browser

Step 2: Set Up Python Environment

# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Step 3: Configure the Application

Create a configuration file for your instance:

# Create config.py (Unix shell heredoc; on Windows, create the file in a text editor)
cat > config.py << 'EOF'
import os

# Data directory (where documents will be stored)
DATA_DIR = os.path.join(os.path.dirname(__file__), 'data')

# Database file
DATABASE = 'images.db'

# Application settings
DEBUG = False
HOST = '0.0.0.0'
PORT = 8080
EOF

Step 4: Index the Documents

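This step scans all document images and creates the database. It needs the documents to be in place, so complete Document Setup below first if you have not downloaded them yet: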

# Index all document images into the database
python index_images.py

# This will:
# - Scan the data directory for image files
# - Create images.db SQLite database
# - Process each image and extract metadata
# - Show progress as it works through the files
Note: The indexing process may take some time depending on the number of images. The script will show progress as it processes each file.
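For readers curious about what an indexing pass of this kind involves, here is a simplified sketch. It is not the project's index_images.py: the table name, columns, and list of file extensions are assumptions made for illustration only.

```python
import os
import sqlite3

# Assumed set of image extensions; the real indexer may use a different list
IMAGE_EXTENSIONS = {".tif", ".tiff", ".jpg", ".jpeg", ".png"}


def index_directory(data_dir, db_path):
    """Walk data_dir, record each image's relative path and size, return the count."""
    conn = sqlite3.connect(db_path)
    try:
        # Hypothetical schema: one row per image file
        conn.execute(
            "CREATE TABLE IF NOT EXISTS images ("
            "path TEXT PRIMARY KEY, size_bytes INTEGER)"
        )
        count = 0
        for root, _dirs, files in os.walk(data_dir):
            for name in files:
                if os.path.splitext(name)[1].lower() not in IMAGE_EXTENSIONS:
                    continue
                full_path = os.path.join(root, name)
                rel_path = os.path.relpath(full_path, data_dir)
                # INSERT OR REPLACE keeps re-runs from creating duplicate rows
                conn.execute(
                    "INSERT OR REPLACE INTO images (path, size_bytes) VALUES (?, ?)",
                    (rel_path, os.path.getsize(full_path)),
                )
                count += 1
        conn.commit()
        return count
    finally:
        conn.close()
```

Keying rows on the relative path means the sketch can be re-run safely after adding files, which mirrors the indexed/updated/skipped behaviour the real script reports.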

Document Setup

This application works with local document files. You'll need to download the documents from the official sources and place them in the correct directory structure.

Download Size: The complete document collection is approximately 71 GB. Ensure you have sufficient storage space and a stable internet connection for the download.

Download Documents

# Create data directory
mkdir -p data

# Download from Google Drive (primary source)
# https://drive.google.com/drive/folders/1TrGxDGQLDLZu1vvvZDBAh-e7wN3y6Hoz
# Size: ~71 GB

# Or use Dropbox backup
# https://www.dropbox.com/scl/fo/98fthv8otekjk28lcrnc5/AIn3egnE58MYe4Bn4fliVBw
# Size: ~71 GB

Directory Structure

When you download and extract the official documents, they will be organized in this structure:

data/
└── Prod 01_20250822/
    └── VOL00001/
        ├── IMAGES/
        │   ├── IMAGES001/
        │   ├── IMAGES002/
        │   └── ... (IMAGES003 through IMAGES012)
        └── NATIVES/
            ├── NATIVE006/
            ├── NATIVE008/
            └── ... (other NATIVE directories)
Important: This is the official directory structure as released by the House Oversight Committee. Do not try to reorganize or "fix" this structure - the application is designed to work around their unconventional organization. Just extract the files as-is and let the browser handle the rest.
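After extracting, you may want to confirm that the layout matches the tree above before indexing. The sketch below lists the IMAGES subfolders it finds; the function name list_image_volumes and the glob pattern are assumptions derived from the structure shown, not something the application provides.

```python
from pathlib import Path


def list_image_volumes(data_dir="data"):
    """Return sorted names of IMAGES* folders found under <production>/VOL*/IMAGES/."""
    root = Path(data_dir)
    # Pattern mirrors data/<production>/VOL00001/IMAGES/IMAGES001 from the tree above
    matches = root.glob("*/VOL*/IMAGES/IMAGES*")
    return sorted(path.name for path in matches if path.is_dir())


if __name__ == "__main__":
    volumes = list_image_volumes("data")
    print(f"Found {len(volumes)} image folders: {volumes}")
```

An empty result usually means the archive was extracted one level too deep or too shallow relative to data/.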

Running the Application

Step 1: Index the Documents

Before running the web application, index the document images into the database (skip this if you already ran the indexer in Installation Step 4 after downloading the documents):

# Index all document images into the database
python index_images.py

# This will scan the data directory and create images.db
# Progress will be shown as it processes each image

Step 2: Start the Web Application

# Start the Flask application
python app.py

# The application will start and show:
# 🚀 Starting Epstein Documents Browser (Development)...
# 📖 Browse: http://localhost:8080
# 📊 Stats: http://localhost:8080/api/stats
# 🌐 Accessible from: http://0.0.0.0:8080
# 
# Press Ctrl+C to stop the server

Environment Configuration

The application uses environment variables for configuration. You can create a .env file:

# Create .env file for configuration
cat > .env << EOF
# Data directory (where documents are stored)
DATA_DIR=data

# Database file path
DATABASE_PATH=images.db

# Flask configuration
FLASK_ENV=development
HOST=0.0.0.0
PORT=8080
DEBUG=True
EOF
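This guide does not say how the application reads the .env file. If your shell or launcher does not load it for you, a small loader along the following lines can copy the values into the process environment. The function names are hypothetical; the python-dotenv package is a more complete alternative.

```python
import os


def parse_env_lines(lines):
    """Parse KEY=VALUE lines into a dict, skipping blank lines and # comments."""
    values = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_env_file(path=".env"):
    """Load settings from `path` into os.environ without overriding existing variables."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as handle:
        values = parse_env_lines(handle)
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values
```

Using setdefault means variables already exported in the shell take precedence over the file, which keeps one-off overrides such as a different PORT easy.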

Production Deployment

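For production, disable debug mode, bind to localhost (typically behind a reverse proxy), and run the app under a proper WSGI server instead of Flask's built-in development server: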

# Set production environment
export FLASK_ENV=production
export HOST=127.0.0.1
export PORT=8080
export DEBUG=False

# Install production WSGI server (Gunicorn runs on Unix-like systems only)
pip install gunicorn

# Run with Gunicorn (4 worker processes, bound to localhost)
gunicorn -w 4 -b 127.0.0.1:8080 app:app

Configuration Options

Environment Variables

# Application Configuration
export DATA_DIR=data                    # Directory containing document images
export DATABASE_PATH=images.db          # SQLite database file path

# Flask Configuration
export FLASK_ENV=development            # or 'production'
export HOST=0.0.0.0                    # or '127.0.0.1' for production
export PORT=8080                       # Port to run on
export DEBUG=True                      # or 'False' for production

# Optional: Secret key for sessions
export SECRET_KEY=your-secret-key-here
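The placeholder value above should be replaced with a random string. One way to generate one is Python's standard secrets module, sketched below; the helper name generate_secret_key is illustrative.

```python
import secrets


def generate_secret_key(num_bytes=32):
    """Return a random hex string (two characters per byte) for use as SECRET_KEY."""
    return secrets.token_hex(num_bytes)


if __name__ == "__main__":
    # Prints a line you can paste into your shell or .env file
    print(f"export SECRET_KEY={generate_secret_key()}")
```

Generate the key once and store it. Changing it later would invalidate existing sessions, if the application uses any.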

Custom Styling

You can customize the appearance by modifying the CSS files in the static/css/ directory or by overriding templates in the templates/ directory.

Security Considerations

  • Firewall: Configure your firewall to only allow necessary ports
  • HTTPS: Use SSL certificates for production deployments
  • Updates: Keep the application and dependencies updated
  • Backups: Regularly backup your database and configuration
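  • Access Control: Consider implementing authentication if needed
  • Debug Mode: Never run with DEBUG=True on a publicly reachable host; Flask's interactive debugger can execute arbitrary code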

Troubleshooting

Common Issues

Missing Dependencies

# Install missing packages
pip install -r requirements.txt

# Common missing packages:
# - markdown (for blog posts)
# - pytesseract (for OCR functionality)
# - Pillow (for image processing)
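To see which of the packages listed above are missing from the active virtual environment, a quick check such as this sketch can be used. The mapping of import names to pip package names covers only the three packages named above, and the helper name find_missing is hypothetical.

```python
import importlib.util

# import name -> pip package name, for the packages named above
REQUIRED_MODULES = {
    "markdown": "markdown",
    "pytesseract": "pytesseract",
    "PIL": "Pillow",  # Pillow is imported as PIL
}


def find_missing(modules=REQUIRED_MODULES):
    """Return the sorted pip names of modules that cannot be found in this environment."""
    return sorted(
        package
        for module, package in modules.items()
        if importlib.util.find_spec(module) is None
    )


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All listed packages are installed")
```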

Data Directory Not Found

# Check if data directory exists
ls -la data/

# Create data directory if missing
mkdir -p data

# Set correct data directory
export DATA_DIR=data

Database Issues

# Check if database exists
ls -la images.db

# Recreate database by re-indexing
rm images.db
python index_images.py

# Reset database permissions (owner read/write, everyone else read-only)
chmod 644 images.db
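Before deleting and rebuilding the database, it can be worth checking whether the file is actually damaged. The sketch below opens it read-only and runs SQLite's built-in integrity check; the function name check_database is an assumption, and no particular table layout is assumed.

```python
import sqlite3
from pathlib import Path


def check_database(db_path="images.db"):
    """Return (ok, tables): ok is True if the file opens and passes integrity_check."""
    # Read-only URI mode makes a missing file an error instead of creating it
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True)
    except sqlite3.OperationalError:
        return False, []
    try:
        result = conn.execute("PRAGMA integrity_check").fetchone()[0]
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        )
        return result == "ok", [row[0] for row in rows]
    except sqlite3.DatabaseError:
        # Raised when the file exists but is not a valid SQLite database
        return False, []
    finally:
        conn.close()


if __name__ == "__main__":
    ok, tables = check_database()
    print(f"Database OK: {ok}, tables: {tables}")
```

If the check passes and the expected tables are listed, re-indexing is probably unnecessary and the problem lies elsewhere.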

Port Already in Use

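# Windows - Check what's using port 8080
netstat -ano | findstr :8080

# Windows - Kill process using the port (replace <PID> with the ID shown by netstat)
taskkill /PID <PID> /F

# macOS/Linux - Find and stop the process using port 8080
lsof -i :8080
kill <PID>

# Use a different port (Windows: set, macOS/Linux: export)
set PORT=8081
python app.py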
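A platform-independent way to find out whether a port is taken is to try connecting to it. The following sketch does that with the standard socket module; the function names and the range of ports scanned are illustrative choices.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1)
        # connect_ex returns 0 only when a listener accepted the connection
        return sock.connect_ex((host, port)) != 0


def first_free_port(start=8080, attempts=20):
    """Return the first free port in [start, start + attempts), or None."""
    for port in range(start, start + attempts):
        if port_is_free(port):
            return port
    return None


if __name__ == "__main__":
    print(f"Suggested PORT: {first_free_port()}")
```

This only detects listeners reachable on the given host, so treat the result as a hint rather than a guarantee.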

Image Processing Errors

# Check if Tesseract is installed (for OCR)
tesseract --version

# Install Tesseract OCR
# Windows: Download from GitHub releases
# macOS: brew install tesseract
# Linux: sudo apt-get install tesseract-ocr
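The Tesseract check can also be scripted, for example as part of a pre-flight test before indexing. This sketch looks for the tesseract executable on PATH and reports its version line; the function name tesseract_version is hypothetical.

```python
import shutil
import subprocess


def tesseract_version():
    """Return the first line of `tesseract --version`, or None if it is not on PATH."""
    executable = shutil.which("tesseract")
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True
    )
    # Some Tesseract releases print the version to stderr instead of stdout
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    version = tesseract_version()
    print(version if version else "Tesseract not found on PATH")
```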

Indexing Problems

# Check indexing progress and errors
python index_images.py

# The script will show:
# - Progress as it processes images
# - Error count if any files fail
# - Summary of indexed/updated/skipped files

Error Messages

Common error messages and solutions:

  • "Data directory does not exist" - Create the data/ directory and download documents
  • "No module named 'markdown'" - Run pip install -r requirements.txt
  • "Address already in use" - Port 8080 is busy, use a different port or kill the existing process
  • "TIF conversion failed" - Image file is corrupted or unsupported format
  • "Error reading OCR file" - OCR text file is missing or corrupted

Debugging

# Enable debug mode (Windows syntax; on macOS/Linux use export instead of set)
set FLASK_ENV=development
set DEBUG=True
python app.py

# Check application output for errors
# The app will print error messages to console

Performance Tips

  • Database: Keep images.db on fast storage (SSD recommended)
  • Memory: 8 GB+ RAM recommended for smooth operation (4 GB minimum)
  • Storage: Ensure 71GB+ free space for document data
  • Indexing: Run indexing during off-peak hours as it's resource-intensive

Getting Help

If you encounter issues during installation or setup:

  • Check the Usage Guide for common questions
  • Review the API Documentation for integration details
  • Check the GitHub repository for issues and discussions
  • Ensure all prerequisites are properly installed
Success! Once installed, your instance will provide search and browsing capabilities for the local document collection.

Data Sources

  • Official Release: House Oversight
  • Document Sources: Google Drive
Note: This browser currently works with local document files. API synchronization features are planned for future development.