Installation Guide
Set up your own mirror of the Epstein Documents Browser
System Requirements
Minimum Requirements
- CPU: 2 cores, 1.5+ GHz (tested on dual-core systems)
- RAM: 4 GB
- Storage: 71+ GB free space (for document data)
- OS: Windows 10+, macOS 10.14+, or Linux
Recommended
- CPU: 4+ cores, 2.0+ GHz (for faster OCR processing)
- RAM: 8+ GB (for better performance with large datasets)
- Storage: 100+ GB SSD (71GB for data + 29GB for system/processing)
- Network: Stable internet connection (for initial download)
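The CPU and storage minimums above can be checked before starting the download. The following is a minimal sketch using only the Python standard library; the function name check_host and its defaults are illustrative and not part of the project. RAM is not checked because the standard library has no portable way to read it.

```python
import os
import shutil

MIN_FREE_GB = 71  # storage figure quoted in the minimum requirements
MIN_CORES = 2     # CPU figure quoted in the minimum requirements


def check_host(path=".", min_free_gb=MIN_FREE_GB, min_cores=MIN_CORES):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []

    # Free space on the filesystem that contains `path`, in GiB.
    free_gb = shutil.disk_usage(path).free / 1024**3
    if free_gb < min_free_gb:
        problems.append(
            f"only {free_gb:.1f} GB free at {path!r}, need {min_free_gb}+ GB"
        )

    # os.cpu_count() may return None on unusual platforms; treat that as 1.
    cores = os.cpu_count() or 1
    if cores < min_cores:
        problems.append(f"only {cores} CPU core(s), need {min_cores}+")

    return problems


if __name__ == "__main__":
    for line in check_host() or ["Host meets the minimum CPU and storage requirements."]:
        print(line)
```

Run it from the directory (or drive) where you plan to keep the data/ folder so the free-space figure refers to the right filesystem.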
Prerequisites
Before installing, ensure you have the following software installed:
Python 3.8+
# Check Python version (on some macOS/Linux systems the command is python3)
python --version
# If not installed, download from:
# https://www.python.org/downloads/
Git
# Check Git installation
git --version
# If not installed, download from:
# https://git-scm.com/downloads
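Both prerequisite checks can also be done in one pass from Python. This is a small sketch, not a project script; missing_prerequisites is a hypothetical helper name, and it only confirms that git is on PATH, not which version is installed.

```python
import shutil
import sys


def missing_prerequisites(min_python=(3, 8), tools=("git",)):
    """Return a list of unmet prerequisites (empty when everything is present)."""
    missing = []

    # Compare the running interpreter against the required minimum version.
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        found = ".".join(str(part) for part in sys.version_info[:3])
        missing.append(f"Python {wanted}+ required, found {found}")

    # shutil.which returns None when the executable is not on PATH.
    for tool in tools:
        if shutil.which(tool) is None:
            missing.append(f"{tool} not found on PATH")

    return missing


if __name__ == "__main__":
    for line in missing_prerequisites() or ["All prerequisites found."]:
        print(line)
```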
Installation Steps
Step 1: Clone the Repository
# Clone the repository
git clone https://github.com/gopoversight/epstein-documents-browser.git
cd epstein-documents-browser
Step 2: Set Up Python Environment
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
Step 3: Configure the Application
Create a configuration file for your instance:
# Create config.py (macOS/Linux shell; on Windows, create the file in a text editor)
cat > config.py << 'EOF'
import os
# Data directory (where documents will be stored)
DATA_DIR = os.path.join(os.path.dirname(__file__), 'data')
# Database file
DATABASE = 'images.db'
# Application settings
DEBUG = False
HOST = '0.0.0.0'
PORT = 8080
EOF
Step 4: Index the Documents
This step scans all document images and creates the database. It needs the documents to be in place under data/ first (see Document Setup below):
# Index all document images into the database
python index_images.py
# This will:
# - Scan the data directory for image files
# - Create images.db SQLite database
# - Process each image and extract metadata
# - Show progress as it works through the files
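To make the indexing step concrete, here is a minimal sketch of the general technique: walk a directory, record one row per image file in SQLite. It is not the project's index_images.py. The table layout, the IMAGE_EXTENSIONS set, and the function name index_directory are all assumptions for illustration, and the real script also handles OCR text and progress reporting.

```python
import os
import sqlite3

# Assumed set of image extensions; the real indexer may use a different list.
IMAGE_EXTENSIONS = {".tif", ".tiff", ".jpg", ".jpeg", ".png"}


def index_directory(data_dir, db_path):
    """Record every image file under data_dir in db_path; return the file count."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # one transaction: commit on success, roll back on error
            # Hypothetical schema, chosen only for this example.
            conn.execute(
                "CREATE TABLE IF NOT EXISTS images ("
                " path TEXT PRIMARY KEY,"
                " size_bytes INTEGER NOT NULL,"
                " modified REAL NOT NULL)"
            )
            count = 0
            for root, _dirs, files in os.walk(data_dir):
                for name in sorted(files):
                    if os.path.splitext(name)[1].lower() not in IMAGE_EXTENSIONS:
                        continue
                    full_path = os.path.join(root, name)
                    info = os.stat(full_path)
                    # INSERT OR REPLACE keeps re-runs idempotent.
                    conn.execute(
                        "INSERT OR REPLACE INTO images VALUES (?, ?, ?)",
                        (os.path.relpath(full_path, data_dir), info.st_size, info.st_mtime),
                    )
                    count += 1
        return count
    finally:
        conn.close()
```

Storing paths relative to the data directory keeps the database valid if the whole folder is moved.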
Document Setup
This application works with local document files. You'll need to download the documents from the official sources and place them in the correct directory structure.
Download Documents
# Create data directory
mkdir -p data
# Download from Google Drive (primary source)
# https://drive.google.com/drive/folders/1TrGxDGQLDLZu1vvvZDBAh-e7wN3y6Hoz
# Size: ~71 GB
# Or use Dropbox backup
# https://www.dropbox.com/scl/fo/98fthv8otekjk28lcrnc5/AIn3egnE58MYe4Bn4fliVBw
# Size: ~71 GB
Directory Structure
When you download and extract the official documents, they will be organized in this structure:
data/
└── Prod 01_20250822/
    └── VOL00001/
        ├── IMAGES/
        │   ├── IMAGES001/
        │   ├── IMAGES002/
        │   └── ... (IMAGES003 through IMAGES012)
        └── NATIVES/
            ├── NATIVE006/
            ├── NATIVE008/
            └── ... (other NATIVE directories)
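After extracting the download, it can help to confirm that the folders landed where the tree above shows them. The sketch below counts files per IMAGES*/NATIVE* subfolder. The folder names come from the tree; the function name summarize_volume and the glob pattern are assumptions for illustration.

```python
from pathlib import Path


def summarize_volume(data_dir="data"):
    """Map each IMAGES/NATIVES subfolder (relative to data_dir) to its file count."""
    summary = {}
    for group in ("IMAGES", "NATIVES"):
        # Matches e.g. data/Prod 01_20250822/VOL00001/IMAGES
        for parent in sorted(Path(data_dir).glob(f"*/VOL*/{group}")):
            for sub in sorted(p for p in parent.iterdir() if p.is_dir()):
                key = sub.relative_to(data_dir).as_posix()
                summary[key] = sum(1 for f in sub.rglob("*") if f.is_file())
    return summary


if __name__ == "__main__":
    for folder, count in summarize_volume().items():
        print(f"{count:8d}  {folder}")
```

An empty result usually means the archive was extracted one level too deep or too shallow relative to data/.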
Running the Application
Step 1: Index the Documents
If you have not already indexed the documents in Installation Step 4, index the document images into the database before running the web application:
# Index all document images into the database
python index_images.py
# This will scan the data directory and create images.db
# Progress will be shown as it processes each image
Step 2: Start the Web Application
# Start the Flask application
python app.py
# The application will start and show:
# 🚀 Starting Epstein Documents Browser (Development)...
# 📖 Browse: http://localhost:8080
# 📊 Stats: http://localhost:8080/api/stats
# 🌐 Accessible from: http://0.0.0.0:8080
#
# Press Ctrl+C to stop the server
Environment Configuration
The application uses environment variables for configuration. You can create a .env file:
# Create .env file for configuration
cat > .env << EOF
# Data directory (where documents are stored)
DATA_DIR=data
# Database file path
DATABASE_PATH=images.db
# Flask configuration
FLASK_ENV=development
HOST=0.0.0.0
PORT=8080
DEBUG=True
EOF
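For readers who want to see how such a file maps onto settings, here is a minimal sketch of reading KEY=VALUE pairs and letting real environment variables override them. This guide does not say how app.py loads its configuration, so parse_env_file and load_settings are hypothetical helpers; only the variable names and defaults are taken from the example above.

```python
import os


def parse_env_file(path=".env"):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip("'\"")
    return values


def load_settings(path=".env", environ=None):
    """Merge .env values with the process environment (the environment wins)."""
    environ = os.environ if environ is None else environ
    file_values = parse_env_file(path) if os.path.exists(path) else {}

    def get(name, default):
        return environ.get(name, file_values.get(name, default))

    return {
        "DATA_DIR": get("DATA_DIR", "data"),
        "DATABASE_PATH": get("DATABASE_PATH", "images.db"),
        "HOST": get("HOST", "0.0.0.0"),
        "PORT": int(get("PORT", "8080")),
        # Environment values are strings, so convert DEBUG explicitly.
        "DEBUG": str(get("DEBUG", "False")).lower() in ("1", "true", "yes"),
    }
```

The explicit DEBUG conversion matters because the string "False" is truthy in Python; passing it straight to a boolean setting would silently enable debug mode.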
Production Deployment
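For production, set environment variables and run the app under a production WSGI server such as Gunicorn instead of the built-in development server (Gunicorn itself runs on macOS/Linux, not natively on Windows):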
# Set production environment
export FLASK_ENV=production
export HOST=127.0.0.1
export PORT=8080
export DEBUG=False
# Install production WSGI server
pip install gunicorn
# Run with Gunicorn
gunicorn -w 4 -b 127.0.0.1:8080 app:app
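The same Gunicorn options can be kept in a configuration file, which is itself a Python module. The sketch below assumes a file named gunicorn.conf.py in the project root, started with gunicorn -c gunicorn.conf.py app:app. The bind, workers, and accesslog settings are standard Gunicorn settings; reading HOST and PORT from the environment is an assumption made to match the variables used in this guide.

```python
# gunicorn.conf.py (assumed file name)
import multiprocessing
import os

# Address to listen on; falls back to the values used in the command above.
bind = f"{os.environ.get('HOST', '127.0.0.1')}:{os.environ.get('PORT', '8080')}"

# Cap at the 4 workers used in the command above, fewer on very small hosts.
workers = min(4, multiprocessing.cpu_count() * 2 + 1)

# "-" sends the access log to stdout.
accesslog = "-"
```

Binding to 127.0.0.1 means the server is reachable only from the same machine, so a reverse proxy is needed in front of it for outside access.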
Configuration Options
Environment Variables
# Application Configuration
export DATA_DIR=data # Directory containing document images
export DATABASE_PATH=images.db # SQLite database file path
# Flask Configuration
export FLASK_ENV=development # or 'production'
export HOST=0.0.0.0 # or '127.0.0.1' for production
export PORT=8080 # Port to run on
export DEBUG=True # or 'False' for production
# Optional: Secret key for sessions
export SECRET_KEY=your-secret-key-here
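The placeholder your-secret-key-here should be replaced with a long random value. One way to generate it is Python's secrets module; the helper name new_secret_key is illustrative.

```python
import secrets


def new_secret_key(num_bytes=32):
    """Return a random hex string suitable for use as SECRET_KEY."""
    return secrets.token_hex(num_bytes)


if __name__ == "__main__":
    # Prints a ready-to-paste line for a macOS/Linux shell.
    print(f"export SECRET_KEY={new_secret_key()}")
```

Generate the key once and store it with your configuration; a key that changes on every restart invalidates existing sessions.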
Custom Styling
You can customize the appearance by modifying the CSS files in the static/css/ directory or by overriding templates in the templates/ directory.
Security Considerations
- Firewall: Configure your firewall to only allow necessary ports
- HTTPS: Use SSL certificates for production deployments
- Updates: Keep the application and dependencies updated
- Backups: Regularly backup your database and configuration
- Access Control: Consider implementing authentication if needed
Troubleshooting
Common Issues
Missing Dependencies
# Install missing packages
pip install -r requirements.txt
# Common missing packages:
# - markdown (for blog posts)
# - pytesseract (for OCR functionality)
# - Pillow (for image processing)
Data Directory Not Found
# Check if data directory exists
ls -la data/
# Create data directory if missing
mkdir -p data
# Set correct data directory
export DATA_DIR=data
Database Issues
# Check if database exists
ls -la images.db
# Recreate database by re-indexing
rm images.db
python index_images.py
# Check database permissions
chmod 644 images.db
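Before deleting and rebuilding the database, it can be worth checking whether the file is actually damaged. The sketch below uses only the standard sqlite3 module; inspect_database is a hypothetical helper, and it makes no assumption about table names because it lists whatever tables it finds.

```python
import sqlite3
from pathlib import Path


def inspect_database(db_path="images.db"):
    """Return (integrity_check result, {table name: row count})."""
    # Open read-only so a mistyped path raises an error instead of
    # silently creating a new empty database file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        integrity = conn.execute("PRAGMA integrity_check").fetchone()[0]
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
            )
        ]
        counts = {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in tables
        }
    finally:
        conn.close()
    return integrity, counts


if __name__ == "__main__":
    status, row_counts = inspect_database()
    print("integrity:", status)
    for table, rows in row_counts.items():
        print(f"{table}: {rows} rows")
```

An integrity result of "ok" with zero rows points to an indexing problem (for example an empty data directory) rather than a corrupt file.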
Port Already in Use
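# Windows - Check what's using port 8080
netstat -ano | findstr :8080
# Windows - Kill the process using the port (replace <PID> with the PID from the last column of the netstat output)
taskkill /PID <PID> /F
# Windows - Use a different port
set PORT=8081
python app.py
# macOS/Linux - Check what's using port 8080, then use a different port
lsof -i :8080
PORT=8081 python app.py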
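A platform-independent way to find out whether a port is taken is to try binding to it. The following sketch uses the standard socket module; port_is_free and first_free_port are illustrative names, and the check covers IPv4 on a single host address only.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(start=8080, attempts=20, host="127.0.0.1"):
    """Return the first free port in [start, start + attempts), or None."""
    for port in range(start, start + attempts):
        if port_is_free(port, host):
            return port
    return None


if __name__ == "__main__":
    print("Suggested PORT:", first_free_port())
```

The result is only a snapshot: another process can claim the port between the check and the moment the application starts.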
Image Processing Errors
# Check if Tesseract is installed (for OCR)
tesseract --version
# Install Tesseract OCR
# Windows: Download from GitHub releases
# macOS: brew install tesseract
# Linux: sudo apt-get install tesseract-ocr
Indexing Problems
# Check indexing progress and errors
python index_images.py
# The script will show:
# - Progress as it processes images
# - Error count if any files fail
# - Summary of indexed/updated/skipped files
Error Messages
Common error messages and solutions:
- "Data directory does not exist" - Create the data/ directory and download documents
- "No module named 'markdown'" - Run pip install -r requirements.txt
- "Address already in use" - Port 8080 is busy, use a different port or kill the existing process
- "TIF conversion failed" - Image file is corrupted or unsupported format
- "Error reading OCR file" - OCR text file is missing or corrupted
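When "TIF conversion failed" appears for many files, a quick way to find truncated or mislabeled downloads is to look at each file's first four bytes. Valid TIFF files begin with a fixed signature (II or MM for byte order, followed by the version number 42, or 43 for BigTIFF). The sketch below is a standalone check, not part of the application, and assumes the images use the .tif or .tiff extension.

```python
from pathlib import Path

# Little-endian and big-endian signatures for classic TIFF and BigTIFF.
TIFF_MAGIC = (b"II*\x00", b"MM\x00*", b"II+\x00", b"MM\x00+")


def suspect_tiffs(root):
    """Return paths of .tif/.tiff files under root that lack a TIFF signature."""
    bad = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in (".tif", ".tiff"):
            with path.open("rb") as fh:
                if fh.read(4) not in TIFF_MAGIC:
                    bad.append(path)
    return bad


if __name__ == "__main__":
    for path in suspect_tiffs("data"):
        print("suspect:", path)
```

A file that passes this check can still be damaged further in; the check only catches empty, truncated, or wrongly named files cheaply.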
Debugging
# Enable debug mode (Windows syntax; on macOS/Linux use export instead of set)
set FLASK_ENV=development
set DEBUG=True
python app.py
# Check application output for errors
# The app will print error messages to console
Performance Tips
- Database: Keep images.db on fast storage (SSD recommended)
- Memory: 4 GB RAM minimum, 8+ GB recommended for smooth operation with large datasets
- Storage: Ensure 71GB+ free space for document data
- Indexing: Run indexing during off-peak hours as it's resource-intensive
Getting Help
If you encounter issues during installation or setup:
- Check the Usage Guide for common questions
- Review the API Documentation for integration details
- Check the GitHub repository for issues and discussions
- Ensure all prerequisites are properly installed
Data Sources
Document Sources: the Google Drive folder and Dropbox backup listed under Download Documents.
Note: This browser currently works with local document files. API synchronization features are planned for future development.
View complete official context including congressional sources, subpoena details, and document access information.