This page was composed with the aid of generative artificial intelligence; it is partially curated.

System Architecture

Detailed technical architecture and design principles of ZaroPGx.

Quick Reference: For a high-level overview of components and port mappings, see the Architecture Overview.

High-Level Architecture

ZaroPGx is built as a microservices architecture using both reference and API wrapper Docker containers, orchestrated with Docker Compose. The system is designed for extensibility, maintainability, and ensures PHI data privacy when run locally “on premises”.

Core Components

graph TB
    subgraph "Client Layer"
        UI[Web UI]
        API[REST API]
        CLI[CLI Tools]
    end
    
    subgraph "Application Layer"
        APP[FastAPI App]
        AUTH[Authentication]
        WORKFLOW[Workflow Engine]
    end
    
    subgraph "Processing Layer"
        PHARMCAT[PharmCAT Service]
        PYPGX[PyPGx Service]
        GATK[GATK API]
        HLA[HLA Typing]
    end
    
    subgraph "Data Layer"
        DB[(PostgreSQL)]
        FHIR[HAPI FHIR]
        STORAGE[File Storage]
    end
    
    subgraph "Infrastructure Layer"
        DOCKER[Docker Engine]
        NETWORK[Network Bridge]
        VOLUMES[Data Volumes]
    end
    
    UI --> APP
    API --> APP
    CLI --> APP
    
    APP --> WORKFLOW
    APP --> AUTH
    APP --> DB
    
    WORKFLOW --> PHARMCAT
    WORKFLOW --> PYPGX
    WORKFLOW --> GATK
    WORKFLOW --> HLA
    
    APP --> FHIR
    APP --> STORAGE
    
    DOCKER --> NETWORK
    DOCKER --> VOLUMES

Service Architecture

Core FastAPI Application (app)

Purpose: Main orchestrator and web user interface Technology: Python 3.12, FastAPI, SQLAlchemy, psycopg, sam,bcftools Port: 8765 → 8000

Key Responsibilities:

  • Web UI and API endpoints

  • Workflow orchestration

  • Database management

  • Report generation

  • Authentication and authorization

Key Modules:

  • app/api/: API routes and models

  • app/services/: Background processing

  • app/reports/: Report generation

  • app/pharmcat/: PharmCAT integration

  • app/core/: Core utilities

PostgreSQL Database (db)

Purpose: Primary data storage Technology: PostgreSQL 17 (latest stable revision of) Port: 5444 → 5432

Schemas:

  • public: Core application data

  • cpic: CPIC guidelines and data

  • fhir: FHIR r5 genomic IG resources

  • user_data: User and patient data

  • reports: Generated reports metadata

  • phenopackets: In progress

Key Tables:

  • patients: Patient information

  • genetic_data: Genomic file metadata

  • workflows: Analysis workflows

  • workflow_steps: Individual processing steps

  • reports: Generated report metadata

PharmCAT Service (pharmcat)

Purpose: Pharmacogenomic analysis engine Technology: Java 17, FastAPI wrapper Port: 5001 → 5000

Key Features:

  • Star allele calling for 23 core pharmacogenes

  • CPIC, DWPG, FDA guidelines integration

  • HTML Report generation

  • Outside call integration for uncallable genes

API Endpoints:

  • POST /analyze: Analyze VCF file

  • GET /status/{job_id}: Check analysis status

  • GET /results/{job_id}: Get analysis results

PyPGx Service (pypgx)

Purpose: Comprehensive allele calling Technology: Python, PyPGx affordances Port: 5053 → 5000

Key Features:

  • Star allele calling for 87 pharmacogenes

  • Difficult to type genes such as CYP2D6

  • Considers SVs and CNVs

  • Diplotype and phenotype prediction

Supported Genes:

  • see config/genes.json

GATK API (gatk-api)

Purpose: Multiple functions Technology: Java, GATK affordances Port: 5002 → 5000

Key Features:

  • BAM/SAM/CRAM to VCF conversion

  • Variant calling and filtering

  • Quality control metrics

  • Reference genome processing

Processing Pipeline:

  1. Input validation

  2. Reference genome preparation

  3. Variant calling

  4. Quality filtering

  5. VCF output generation

ZaroHLA Typing Service (zarohla)

Purpose: HLA allele calling Technology: Nextflow, OptiType Port: 5055 → 5055

Key Features:

  • HLA-A, HLA-B, HLA-C typing

  • OptiType core

HAPI FHIR Server (fhir-server)

Purpose: Healthcare data interoperability Technology: Java, HAPI FHIR Port: 8090 → 8080

Key Features:

  • FHIR compliance

  • Groundwork laid for enterprise expansion

  • Observation resource storage

  • Structured semantic FHIR query capability

Data Flow Architecture

Upload and Processing Flow

sequenceDiagram
    participant U as User
    participant A as FastAPI App
    participant F as File Processor
    participant W as Workflow Engine
    participant P as PharmCAT
    participant Py as PyPGx
    participant G as GATK
    participant R as Report Generator
    
    U->>A: Upload file
    A->>F: Process file
    F->>A: File analysis
    A->>W: Create workflow
    W->>G: Preprocess (if needed)
    G->>W: VCF output
    W->>Py: PyPGx analysis
    Py->>W: PyPGx results
    W->>P: PharmCAT analysis
    P->>W: PharmCAT results
    W->>R: Generate reports
    R->>A: Report URLs
    A->>U: Analysis complete

Database Schema Design

erDiagram
    PATIENTS ||--o{ GENETIC_DATA : has
    PATIENTS ||--o{ WORKFLOWS : creates
    WORKFLOWS ||--o{ WORKFLOW_STEPS : contains
    WORKFLOWS ||--o{ REPORTS : generates
    GENETIC_DATA ||--o{ WORKFLOWS : processes
    
    PATIENTS {
        int id PK
        string identifier
        string name
        datetime created_at
        datetime updated_at
    }
    
    GENETIC_DATA {
        int id PK
        int patient_id FK
        string file_type
        string file_path
        json metadata
        boolean is_supplementary
        datetime created_at
    }
    
    WORKFLOWS {
        string id PK
        int patient_id FK
        string status
        json workflow_metadata
        datetime created_at
        datetime updated_at
    }
    
    WORKFLOW_STEPS {
        int id PK
        string workflow_id FK
        string step_name
        string status
        int step_order
        json output_data
        datetime created_at
        datetime updated_at
    }
    
    REPORTS {
        int id PK
        string workflow_id FK
        string report_type
        string file_path
        json metadata
        datetime created_at
    }

Container Architecture

Docker Compose Structure

see `docker-compose.yml.example`

Network Architecture

Bridge Network: pgx-network

  • Subnet: 172.28.0.0/16

  • Gateway: 172.28.0.1

  • DNS: 172.28.0.1

Service Communication:

  • All services communicate via internal network

  • External access only through exposed ports

  • No direct internet access for processing services

Volume Management

Data Volumes:

  • ./data: Shared data directory

  • ./reference: Reference genome data

  • postgres_data: Database persistence

  • pharmcat_data: PharmCAT reference data

Volume Mounts:

  • Host directories mounted into containers

  • Persistent data across container restarts

  • Shared access between services

Security Architecture

Authentication and Authorization

Development Mode:

  • Authentication disabled by default

  • All endpoints publicly accessible

  • Debug logging enabled

Production Mode:

  • JWT-based authentication

  • Role-based access control

  • Secure session management

  • Audit logging

Data Privacy

Local Processing:

  • All analysis happens locally

  • No external data transmission

  • Complete data control

  • Offline capability

Data Encryption:

  • Data at rest encryption (configurable)

  • TLS for API communication

  • Secure file storage

  • Encrypted database connections

Network Security

Internal Communication:

  • Services communicate via internal network

  • No external network access for processing

  • Firewall rules for port access

  • VPN support for remote access

Scalability Architecture

Horizontal Scaling

Application Layer:

  • Multiple FastAPI instances

  • Load balancer distribution

  • Session affinity for workflows

  • Shared database backend

Processing Layer:

  • Multiple PharmCAT instances

  • Queue-based job distribution

  • Resource-aware scheduling

  • Auto-scaling based on load

Vertical Scaling

Resource Allocation:

  • Configurable CPU/memory limits

  • Dynamic resource adjustment

  • Priority-based scheduling

  • Resource monitoring

Storage Scaling

Database Scaling:

  • Read replicas for queries

  • Connection pooling

  • Query optimization

  • Indexing strategies

File Storage:

  • Distributed file systems

  • Object storage integration

  • Backup and replication

  • Data lifecycle management

Monitoring and Observability

Logging Architecture

Centralized Logging:

  • Structured JSON logs

  • Log aggregation and analysis

  • Error tracking and alerting

  • Performance monitoring

Log Levels:

  • DEBUG: Detailed debugging information

  • INFO: General information

  • WARNING: Warning messages

  • ERROR: Error conditions

  • CRITICAL: Critical errors

Metrics and Monitoring

Application Metrics:

  • Request/response times

  • Error rates

  • Throughput metrics

  • Resource utilization

System Metrics:

  • CPU and memory usage

  • Disk I/O performance

  • Network traffic

  • Container health

Health Checks

Service Health:

  • HTTP health endpoints

  • Database connectivity

  • External service availability

  • Resource availability

Workflow Health:

  • Processing status

  • Queue depth

  • Error rates

  • Performance metrics

Development Architecture

Code Organization

Module Structure:

app/
├── api/           # API routes and models
├── core/          # Core utilities
├── pharmcat/      # PharmCAT integration
├── reports/       # Report generation
├── services/      # Background services
├── utils/         # Utility functions
└── visualizations/ # Workflow diagrams

Design Patterns:

  • Dependency injection

  • Service layer pattern

  • Repository pattern

  • Factory pattern

  • Observer pattern

Testing Architecture

Test Types:

  • TO DO

Test Infrastructure:

  • TO DO

Deployment Architecture

Environment Management

Development:

  • Local Docker Compose

  • Debug logging enabled

  • Hot reloading

  • Test data included

Staging:

  • Production-like environment

  • Real data testing

  • Performance validation

  • Security testing

Production:

  • Optimized configuration

  • Security hardening

  • Monitoring and alerting

  • Backup and recovery

CI/CD Pipeline

Build Process:

  • Docker image building

  • Dependency scanning

  • Security scanning

  • Image optimization

Next Steps