System Architecture
Detailed technical architecture and design principles of ZaroPGx.
Quick Reference: For a high-level overview of components and port mappings, see the Architecture Overview.
High-Level Architecture
ZaroPGx is built as a microservices architecture using both reference and API wrapper Docker containers, orchestrated with Docker Compose. The system is designed for extensibility and maintainability, and it keeps protected health information (PHI) private when run locally ("on premises").
Core Components
graph TB
subgraph "Client Layer"
UI[Web UI]
API[REST API]
CLI[CLI Tools]
end
subgraph "Application Layer"
APP[FastAPI App]
AUTH[Authentication]
WORKFLOW[Workflow Engine]
end
subgraph "Processing Layer"
PHARMCAT[PharmCAT Service]
PYPGX[PyPGx Service]
GATK[GATK API]
HLA[HLA Typing]
end
subgraph "Data Layer"
DB[(PostgreSQL)]
FHIR[HAPI FHIR]
STORAGE[File Storage]
end
subgraph "Infrastructure Layer"
DOCKER[Docker Engine]
NETWORK[Network Bridge]
VOLUMES[Data Volumes]
end
UI --> APP
API --> APP
CLI --> APP
APP --> WORKFLOW
APP --> AUTH
APP --> DB
WORKFLOW --> PHARMCAT
WORKFLOW --> PYPGX
WORKFLOW --> GATK
WORKFLOW --> HLA
APP --> FHIR
APP --> STORAGE
DOCKER --> NETWORK
DOCKER --> VOLUMES
Service Architecture
Core FastAPI Application (app)
Purpose: Main orchestrator and web user interface
Technology: Python 3.12, FastAPI, SQLAlchemy, psycopg, samtools, bcftools
Port: 8765 → 8000
Key Responsibilities:
Web UI and API endpoints
Workflow orchestration
Database management
Report generation
Authentication and authorization
Key Modules:
app/api/: API routes and models
app/services/: Background processing
app/reports/: Report generation
app/pharmcat/: PharmCAT integration
app/core/: Core utilities
PostgreSQL Database (db)
Purpose: Primary data storage
Technology: PostgreSQL 17 (latest stable revision)
Port: 5444 → 5432
Schemas:
public: Core application data
cpic: CPIC guidelines and data
fhir: FHIR R5 genomic IG resources
user_data: User and patient data
reports: Generated reports metadata
phenopackets: In progress
Key Tables:
patients: Patient information
genetic_data: Genomic file metadata
workflows: Analysis workflows
workflow_steps: Individual processing steps
reports: Generated report metadata
PharmCAT Service (pharmcat)
Purpose: Pharmacogenomic analysis engine
Technology: Java 17, FastAPI wrapper
Port: 5001 → 5000
Key Features:
Star allele calling for 23 core pharmacogenes
CPIC, DPWG, and FDA guideline integration
HTML report generation
Outside call integration for uncallable genes
API Endpoints:
POST /analyze: Analyze VCF file
GET /status/{job_id}: Check analysis status
GET /results/{job_id}: Get analysis results
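The status and results endpoints listed above suggest a submit-then-poll pattern. The following is a minimal sketch of the polling half only, using the standard library. The JSON field name `status` and its values `completed` and `failed` are assumptions, as are the helper names; the actual wrapper may respond differently. Submitting a file through `POST /analyze` is left out because the upload format is not specified on this page.

```python
import json
import time
import urllib.request
from typing import Callable

# Host-side address, taken from the port mapping above (5001 → 5000).
BASE_URL = "http://localhost:5001"


def status_url(job_id: str, base: str = BASE_URL) -> str:
    return f"{base}/status/{job_id}"


def results_url(job_id: str, base: str = BASE_URL) -> str:
    return f"{base}/results/{job_id}"


def http_get_json(url: str) -> dict:
    """Fetch a URL and decode its JSON body (performs real network I/O)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def wait_for_job(
    job_id: str,
    get_json: Callable[[str], dict] = http_get_json,
    interval_s: float = 2.0,
    max_polls: int = 150,
) -> dict:
    """Poll GET /status/{job_id}, then return GET /results/{job_id}.

    ASSUMPTION: the status payload carries a "status" field whose terminal
    values are "completed" and "failed". These names are hypothetical.
    """
    for _ in range(max_polls):
        state = get_json(status_url(job_id)).get("status")
        if state == "completed":
            return get_json(results_url(job_id))
        if state == "failed":
            raise RuntimeError(f"PharmCAT job {job_id} failed")
        time.sleep(interval_s)
    raise TimeoutError(f"PharmCAT job {job_id} did not finish in time")
```

Passing the fetch function as a parameter keeps the polling logic testable without a running container.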
PyPGx Service (pypgx)
Purpose: Comprehensive allele calling
Technology: Python, PyPGx affordances
Port: 5053 → 5000
Key Features:
Star allele calling for 87 pharmacogenes
Handles difficult-to-type genes such as CYP2D6
Considers SVs and CNVs
Diplotype and phenotype prediction
Supported Genes:
see config/genes.json
GATK API (gatk-api)
Purpose: Multiple functions (alignment-to-VCF conversion, variant calling, quality control)
Technology: Java, GATK affordances
Port: 5002 → 5000
Key Features:
BAM/SAM/CRAM to VCF conversion
Variant calling and filtering
Quality control metrics
Reference genome processing
Processing Pipeline:
Input validation
Reference genome preparation
Variant calling
Quality filtering
VCF output generation
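As an illustration of the first pipeline stage, input validation could begin by classifying the uploaded file. This sketch assumes classification by file extension alone, which is a simplification; a real check would likely inspect file headers as well. The function name and the suffix sets are hypothetical.

```python
from pathlib import Path

# Alignment formats the service is described as converting to VCF.
ALIGNMENT_SUFFIXES = (".bam", ".sam", ".cram")
# ASSUMPTION: both plain and gzip-compressed VCF files are accepted.
VCF_SUFFIXES = (".vcf", ".vcf.gz")


def classify_input(path: str) -> str:
    """Return "vcf" or "alignment"; raise ValueError for anything else."""
    name = Path(path).name.lower()
    if name.endswith(VCF_SUFFIXES):
        return "vcf"
    if name.endswith(ALIGNMENT_SUFFIXES):
        return "alignment"
    raise ValueError(f"unsupported input file: {path}")
```

A file classified as `alignment` would continue through the remaining stages, while a `vcf` input would not need conversion.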
ZaroHLA Typing Service (zarohla)
Purpose: HLA allele calling
Technology: Nextflow, OptiType
Port: 5055 → 5055
Key Features:
HLA-A, HLA-B, HLA-C typing
OptiType core
HAPI FHIR Server (fhir-server)
Purpose: Healthcare data interoperability
Technology: Java, HAPI FHIR
Port: 8090 → 8080
Key Features:
FHIR compliance
Groundwork laid for enterprise expansion
Observation resource storage
Structured semantic FHIR query capability
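The port mappings listed for each service above can be collected in one place. The sketch below is illustrative only: the `Service` dataclass and helper names are hypothetical and not part of ZaroPGx, while the service names and port numbers are copied from this page. It assumes the left-hand number is the host port and the right-hand number is the container port, following the usual Docker `host:container` convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str            # Docker Compose service name as listed on this page
    host_port: int       # port published on the Docker host
    container_port: int  # port the service listens on inside its container


# Values copied from the service descriptions above.
# ASSUMPTION: "A → B" means host port A maps to container port B.
SERVICES = {
    s.name: s
    for s in (
        Service("app", 8765, 8000),
        Service("db", 5444, 5432),
        Service("pharmcat", 5001, 5000),
        Service("pypgx", 5053, 5000),
        Service("gatk-api", 5002, 5000),
        Service("zarohla", 5055, 5055),
        Service("fhir-server", 8090, 8080),
    )
}


def internal_address(name: str) -> str:
    """Address another container on the same Compose network would use."""
    s = SERVICES[name]
    return f"{s.name}:{s.container_port}"


def host_address(name: str) -> str:
    """Address a browser or script running on the Docker host would use."""
    return f"localhost:{SERVICES[name].host_port}"
```

Several services share container port 5000, which is harmless because each container has its own network namespace; only the host ports need to be unique.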
Data Flow Architecture
Upload and Processing Flow
sequenceDiagram
participant U as User
participant A as FastAPI App
participant F as File Processor
participant W as Workflow Engine
participant P as PharmCAT
participant Py as PyPGx
participant G as GATK
participant R as Report Generator
U->>A: Upload file
A->>F: Process file
F->>A: File analysis
A->>W: Create workflow
W->>G: Preprocess (if needed)
G->>W: VCF output
W->>Py: PyPGx analysis
Py->>W: PyPGx results
W->>P: PharmCAT analysis
P->>W: PharmCAT results
W->>R: Generate reports
R->>A: Report URLs
A->>U: Analysis complete
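The ordering in the sequence diagram above can be expressed as a small planning function: GATK preprocessing runs only when needed, followed by PyPGx, PharmCAT, and report generation. This is a sketch under assumptions; the `Step` class, the step names, and the `pending` status value are hypothetical, chosen to resemble the `step_name`, `step_order`, and `status` columns of the workflow steps table.

```python
from dataclasses import dataclass


@dataclass
class Step:
    step_order: int
    step_name: str
    status: str = "pending"  # hypothetical initial status value


def plan_workflow(needs_preprocessing: bool) -> list[Step]:
    """Return the ordered steps for one uploaded file.

    needs_preprocessing is True for inputs that must first be converted
    to VCF by the GATK service (the "if needed" branch in the diagram).
    """
    names = []
    if needs_preprocessing:
        names.append("gatk")
    # PyPGx runs before PharmCAT, matching the diagram's order.
    names += ["pypgx", "pharmcat", "report"]
    return [Step(order, name) for order, name in enumerate(names, start=1)]
```

For a VCF upload, `plan_workflow(False)` yields three steps starting with `pypgx`; for an alignment file, `plan_workflow(True)` puts `gatk` first.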
Database Schema Design
erDiagram
PATIENTS ||--o{ GENETIC_DATA : has
PATIENTS ||--o{ WORKFLOWS : creates
WORKFLOWS ||--o{ WORKFLOW_STEPS : contains
WORKFLOWS ||--o{ REPORTS : generates
GENETIC_DATA ||--o{ WORKFLOWS : processes
PATIENTS {
int id PK
string identifier
string name
datetime created_at
datetime updated_at
}
GENETIC_DATA {
int id PK
int patient_id FK
string file_type
string file_path
json metadata
boolean is_supplementary
datetime created_at
}
WORKFLOWS {
string id PK
int patient_id FK
string status
json workflow_metadata
datetime created_at
datetime updated_at
}
WORKFLOW_STEPS {
int id PK
string workflow_id FK
string step_name
string status
int step_order
json output_data
datetime created_at
datetime updated_at
}
REPORTS {
int id PK
string workflow_id FK
string report_type
string file_path
json metadata
datetime created_at
}
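To experiment with the relationships in the diagram above without a running PostgreSQL instance, the entities can be approximated in an in-memory SQLite database. This is a rough stand-in, not the project's actual DDL: column types are simplified (JSON and timestamps stored as text, booleans as integers), and only the foreign keys that appear as columns in the diagram are declared.

```python
import sqlite3

# Simplified rendering of the five entities; types are SQLite approximations.
SCHEMA = """
CREATE TABLE patients (
    id INTEGER PRIMARY KEY,
    identifier TEXT,
    name TEXT,
    created_at TEXT,
    updated_at TEXT
);
CREATE TABLE genetic_data (
    id INTEGER PRIMARY KEY,
    patient_id INTEGER REFERENCES patients(id),
    file_type TEXT,
    file_path TEXT,
    metadata TEXT,             -- JSON stored as text
    is_supplementary INTEGER,  -- boolean stored as 0/1
    created_at TEXT
);
CREATE TABLE workflows (
    id TEXT PRIMARY KEY,
    patient_id INTEGER REFERENCES patients(id),
    status TEXT,
    workflow_metadata TEXT,
    created_at TEXT,
    updated_at TEXT
);
CREATE TABLE workflow_steps (
    id INTEGER PRIMARY KEY,
    workflow_id TEXT REFERENCES workflows(id),
    step_name TEXT,
    status TEXT,
    step_order INTEGER,
    output_data TEXT,
    created_at TEXT,
    updated_at TEXT
);
CREATE TABLE reports (
    id INTEGER PRIMARY KEY,
    workflow_id TEXT REFERENCES workflows(id),
    report_type TEXT,
    file_path TEXT,
    metadata TEXT,
    created_at TEXT
);
"""


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with foreign-key enforcement enabled."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn
```

With enforcement on, inserting a workflow step that references a nonexistent workflow raises `sqlite3.IntegrityError`.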
Container Architecture
Docker Compose Structure
see `docker-compose.yml.example`
Network Architecture
Bridge Network: pgx-network
Subnet: 172.28.0.0/16
Gateway: 172.28.0.1
DNS: 172.28.0.1
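The subnet and gateway values above can be sanity-checked with Python's `ipaddress` module. The helper name `is_internal` is illustrative only; the addresses are the ones listed for `pgx-network`.

```python
import ipaddress

# Values listed above for the pgx-network bridge.
NETWORK = ipaddress.ip_network("172.28.0.0/16")
GATEWAY = ipaddress.ip_address("172.28.0.1")


def is_internal(addr: str) -> bool:
    """True if addr falls inside the pgx-network subnet."""
    return ipaddress.ip_address(addr) in NETWORK


# The gateway must itself lie inside the subnet it serves.
assert GATEWAY in NETWORK
```

A /16 subnet leaves far more addresses than the handful of services described here require, so address exhaustion is not a practical concern at this scale.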
Service Communication:
All services communicate via internal network
External access only through exposed ports
No direct internet access for processing services
Volume Management
Data Volumes:
./data: Shared data directory
./reference: Reference genome data
postgres_data: Database persistence
pharmcat_data: PharmCAT reference data
Volume Mounts:
Host directories mounted into containers
Persistent data across container restarts
Shared access between services
Security Architecture
Data Privacy
Local Processing:
All analysis happens locally
No external data transmission
Complete data control
Offline capability
Data Encryption:
Data at rest encryption (configurable)
TLS for API communication
Secure file storage
Encrypted database connections
Network Security
Internal Communication:
Services communicate via internal network
No external network access for processing
Firewall rules for port access
VPN support for remote access
Scalability Architecture
Horizontal Scaling
Application Layer:
Multiple FastAPI instances
Load balancer distribution
Session affinity for workflows
Shared database backend
Processing Layer:
Multiple PharmCAT instances
Queue-based job distribution
Resource-aware scheduling
Auto-scaling based on load
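Queue-based job distribution, as listed above, can be illustrated in miniature with the standard library. In this sketch each thread stands in for one processing instance and an in-process `queue.Queue` stands in for whatever broker a real deployment would use; both substitutions are assumptions made for illustration, and the function names are hypothetical.

```python
import queue
import threading
from typing import Callable


def run_jobs(
    jobs: list[str],
    handler: Callable[[str], str],
    workers: int = 2,
) -> dict[str, str]:
    """Distribute jobs across worker threads and collect their results."""
    pending: queue.Queue = queue.Queue()
    for job in jobs:
        pending.put(job)

    results: dict[str, str] = {}
    lock = threading.Lock()

    def worker() -> None:
        # Each worker pulls the next job until the queue is drained.
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return
            outcome = handler(job)
            with lock:
                results[job] = outcome

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because workers pull jobs rather than being assigned them, a slow job delays only the worker that picked it up.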
Vertical Scaling
Resource Allocation:
Configurable CPU/memory limits
Dynamic resource adjustment
Priority-based scheduling
Resource monitoring
Storage Scaling
Database Scaling:
Read replicas for queries
Connection pooling
Query optimization
Indexing strategies
File Storage:
Distributed file systems
Object storage integration
Backup and replication
Data lifecycle management
Monitoring and Observability
Logging Architecture
Centralized Logging:
Structured JSON logs
Log aggregation and analysis
Error tracking and alerting
Performance monitoring
Log Levels:
DEBUG: Detailed debugging information
INFO: General information
WARNING: Warning messages
ERROR: Error conditions
CRITICAL: Critical errors
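Structured JSON logs at the levels listed above can be produced with a custom formatter on Python's standard `logging` module. This is a minimal sketch: the JSON field names and the logger name `zaropgx` are assumptions, not the project's actual log schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging(level: str = "INFO") -> logging.Logger:
    """Attach a JSON stream handler to a (hypothetically named) app logger."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("zaropgx")
    logger.handlers = [handler]
    logger.setLevel(level)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger
```

Calling `configure_logging("DEBUG")` would match the debug logging described for development environments, while `INFO` or higher suits production.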
Metrics and Monitoring
Application Metrics:
Request/response times
Error rates
Throughput metrics
Resource utilization
System Metrics:
CPU and memory usage
Disk I/O performance
Network traffic
Container health
Health Checks
Service Health:
HTTP health endpoints
Database connectivity
External service availability
Resource availability
Workflow Health:
Processing status
Queue depth
Error rates
Performance metrics
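The service health checks described above can be combined into a single summary. The sketch below is one possible shape, not the project's implementation: the probe interface, the `up`/`down` and `healthy`/`degraded` labels, and any health endpoint path are assumptions.

```python
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 3.0) -> Callable[[], bool]:
    """Build a probe that reports True when the URL answers with HTTP 200."""

    def probe() -> bool:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200

    return probe


def aggregate_health(probes: dict[str, Callable[[], bool]]) -> dict:
    """Run every probe; a probe that raises counts as down."""
    services = {}
    for name, probe in probes.items():
        try:
            services[name] = "up" if probe() else "down"
        except Exception:
            services[name] = "down"
    all_up = all(state == "up" for state in services.values())
    return {"status": "healthy" if all_up else "degraded", "services": services}
```

A probe for a given service might be built as `http_probe("http://pharmcat:5000/health")`, where the `/health` path is hypothetical.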
Development Architecture
Code Organization
Module Structure:
app/
├── api/ # API routes and models
├── core/ # Core utilities
├── pharmcat/ # PharmCAT integration
├── reports/ # Report generation
├── services/ # Background services
├── utils/ # Utility functions
└── visualizations/ # Workflow diagrams
Design Patterns:
Dependency injection
Service layer pattern
Repository pattern
Factory pattern
Observer pattern
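To show how three of the listed patterns fit together, here is a small sketch combining a repository, a service layer, and constructor-based dependency injection. All class and method names are hypothetical and do not reflect the actual ZaroPGx codebase; the in-memory repository stands in for a database-backed one.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class Patient:
    id: int
    identifier: str


class PatientRepository(Protocol):
    """Repository pattern: storage access behind a narrow interface."""

    def get(self, patient_id: int) -> Optional[Patient]: ...

    def add(self, patient: Patient) -> None: ...


class InMemoryPatientRepository:
    """Test double; a real implementation would talk to the database."""

    def __init__(self) -> None:
        self._rows: dict[int, Patient] = {}

    def get(self, patient_id: int) -> Optional[Patient]:
        return self._rows.get(patient_id)

    def add(self, patient: Patient) -> None:
        self._rows[patient.id] = patient


class PatientService:
    """Service layer: business rules live here, not in routes or storage."""

    def __init__(self, repo: PatientRepository) -> None:
        self._repo = repo  # dependency injected through the constructor

    def register(self, patient_id: int, identifier: str) -> Patient:
        if self._repo.get(patient_id) is not None:
            raise ValueError(f"patient {patient_id} already exists")
        patient = Patient(patient_id, identifier)
        self._repo.add(patient)
        return patient
```

Because the service depends only on the `PatientRepository` protocol, tests can pass the in-memory version while the running application passes a database-backed one.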
Testing Architecture
Test Types:
TO DO
Test Infrastructure:
TO DO
Deployment Architecture
Environment Management
Development:
Local Docker Compose
Debug logging enabled
Hot reloading
Test data included
Staging:
Production-like environment
Real data testing
Performance validation
Security testing
Production:
Optimized configuration
Security hardening
Monitoring and alerting
Backup and recovery
CI/CD Pipeline
Build Process:
Docker image building
Dependency scanning
Security scanning
Image optimization
Next Steps
API Reference: API Reference
Development Setup: Development Setup
Contributing: [CURATION NEEDED]
Deployment: Deployment Guide