This page was composed with the aid of generative artificial intelligence; it is fully curated.

Input Format Support Priorities

  • Priority 0 (Supported): VCF, GRCh38, NGS-derived

  • Priority 1 (Development): VCF, GRCh37, NGS-derived (liftover via bcftools)

  • Priority 1.5 (Development): BAM

  • Priority 2 (Development): CRAM, SAM, FASTQ, BCF (NGS-derived)

  • Priority 3 (Research): Other sequencing/genotyping formats

  • Priority 4 (Research): BED, gVCF, 23andMe, AncestryDNA, TXT formats

  • Priority 5 (Early Research): T2T and other emerging formats
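
The tiers above could be encoded as a small lookup so the pipeline can report a support level at upload time. This is a minimal sketch, not the project's implementation: the names PRIORITY_TIERS and support_tier are hypothetical, the "NGS-derived" qualifier is not modeled, and the fallbacks (unlisted formats or unknown builds map to Priority 3, any T2T build maps to Priority 5) are assumptions.

```python
from typing import Optional, Tuple

# Tiers copied from the priority list; keys are (format, genome build).
# A build of None means the tier in the list does not name a build.
PRIORITY_TIERS = {
    ("vcf", "GRCh38"): (0, "Supported"),
    ("vcf", "GRCh37"): (1, "Development"),
    ("bam", None): (1.5, "Development"),
    ("cram", None): (2, "Development"),
    ("sam", None): (2, "Development"),
    ("fastq", None): (2, "Development"),
    ("bcf", None): (2, "Development"),
    ("bed", None): (4, "Research"),
    ("gvcf", None): (4, "Research"),
    ("23andme", None): (4, "Research"),
    ("ancestrydna", None): (4, "Research"),
    ("txt", None): (4, "Research"),
}


def support_tier(fmt: str, build: Optional[str] = None) -> Tuple[float, str]:
    """Return (priority, status) for an input format and optional genome build."""
    key = fmt.strip().lower()
    if build == "T2T":
        # Assumption: any T2T-based input falls under Priority 5.
        return (5, "Early Research")
    if (key, build) in PRIORITY_TIERS:
        return PRIORITY_TIERS[(key, build)]
    if (key, None) in PRIORITY_TIERS:
        return PRIORITY_TIERS[(key, None)]
    # Assumption: anything unlisted counts as "Other sequencing/genotyping formats".
    return (3, "Research")
```

For example, support_tier("VCF", "GRCh37") falls in the Development tier, while an unrecognized format falls back to Priority 3.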

Pipeline Function

  • Add GRCh37 support via bcftools liftover; surface accuracy caveats after conversion

  • Clarify workflow vs job IDs; define single source for workflow definition and per-run job state

  • Represent workflows as a finite state matrix; each unique, deterministic workflow should have an assigned ID that can be quickly spot-checked

  • Nextflow orchestration

    • Dynamic resource allocation (CPU/memory by attempt, file type+size, etc.)

    • Track active tool/stage; reflect in UI icons and progress

  • Improve progress calculation by normalizing step/substep points to 100%

  • Accept uploads by URL (streamed) and multi-file selects (main + index) with proper pairing

  • Recognize and/or regenerate index files as needed; map unaligned reads to the appropriate reference (currently GRCh38.p14)

  • Consider preprocessing to complement PyPGx-led VCF generation (evaluate necessity)

  • Add mtdna-server-2

  • Finish wiring in ZaroHLA

  • Improve analysis by making better use of samtools and bcftools
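
One way to give each unique, deterministic workflow a spot-checkable ID is to hash a canonical serialization of its definition. The sketch below is an assumption about how that could work, not a description of the current code: the function name workflow_id, the "wf-" prefix, the 12-character length, and the example field names (input, build, stages) are all hypothetical.

```python
import hashlib
import json


def workflow_id(definition: dict, length: int = 12) -> str:
    """Derive a short, stable ID from a workflow definition.

    Keys are sorted before hashing, so two definitions that differ only in
    key order get the same ID. List order (e.g. stage order) is preserved
    and therefore changes the ID.
    """
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"wf-{digest[:length]}"


# Hypothetical definition; the field names are illustrative only.
example = {"input": "vcf", "build": "GRCh38", "stages": ["pypgx", "pharmcat"]}
example_id = workflow_id(example)
```

Because the ID depends only on the definition, it can serve as the single key for the workflow definition, while per-run job state is stored under separate job IDs.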

Calling & Tools

PharmCAT

  • Implement translation layer (lexicon) to translate outside calls to recognized nomenclature

  • Implement an optional, intelligent switch that toggles whether reference is assumed where calls are missing
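
The translation layer could start as a plain lookup keyed by caller, gene, and the caller's own spelling. This is only a sketch under stated assumptions: LEXICON and translate_call are hypothetical names, and the entries are placeholders rather than real nomenclature data. Unknown calls return None so they can be flagged for review instead of guessed.

```python
from typing import Optional

# Placeholder entries only; real mappings would come from curated data.
LEXICON = {
    ("example_caller", "GENE1", "variant-a"): "<recognized name A>",
    ("example_caller", "GENE1", "variant-b"): "<recognized name B>",
}


def translate_call(caller: str, gene: str, raw_call: str) -> Optional[str]:
    """Translate an outside call to the recognized spelling, or None if unknown."""
    key = (caller.strip().lower(), gene.strip().upper(), raw_call.strip().lower())
    return LEXICON.get(key)
```

Normalizing case and whitespace before the lookup keeps the table small; anything beyond that (synonyms, legacy names) would need explicit entries.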

PyPGx

  • Batch execution (done) and advanced parallelization controls (CPU/RAM/storage)

  • BAM-to-VCF preprocessing check

  • Evaluate imputation options; expose via advanced settings

HLA Typing

  • Use ZaroHLA (OptiType) for HLA-A/B/C when the input is FASTQ; confirm BAM pathway

  • Align to GRCh38 as part of HLA path

Ancillary and Future tools

  • Now included in Zaromics suite

Reporting

  • Unified report generation combining PharmCAT clinical recommendations with PyPGx gene coverage

  • Add demographics mini-section: mitochondrial lineage/haplogroup and variant rarity context

  • Standardize folder naming of generated reports (timestamp-based) and place logs under data/logs/

  • Display the workflow-ID-specific Kroki/Mermaid workflow diagram in both HTML and PDF outputs

  • Add clear wording: sample vs patient terminology; avoid assumptions of medical context

  • Abstract report theme so cross-pipeline outputs remain stylistically consistent

  • Custom reports: add a QR code containing the raw data
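
Timestamp-based folder naming could look like the sketch below. The naming pattern (sample ID, underscore, UTC timestamp) and the names report_dir and LOG_DIR are assumptions for illustration; only the data/logs/ location comes from the list above.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

# Log location taken from the roadmap item.
LOG_DIR = Path("data") / "logs"


def report_dir(base: Path, sample_id: str, when: Optional[datetime] = None) -> Path:
    """Build a report folder path such as <base>/<sample_id>_<UTC timestamp>.

    Pass a timezone-aware datetime; a naive one is interpreted as local time
    by astimezone().
    """
    when = when or datetime.now(timezone.utc)
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return base / f"{sample_id}_{stamp}"
```

A fixed-width UTC timestamp sorts chronologically as plain text, which keeps report folders ordered in a directory listing.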

UI/UX

  • Responsive glyphs: wrapping on small screens; grey-out non-applicable steps; size/flex adjustments

  • Add a preprocessing glyph (e.g., Liftover) where applicable, plus an mtDNA glyph

  • Unify/clean redundant text

Data & Database

  • PostgreSQL 17: add extensions; implement schemas

  • Adopt JSONB where appropriate; ensure escaping for special characters (done)

  • Begin persisting normalized results; build a lexicon layer that translates between caller-specific spellings

  • Consolidate reference and sample material (FASTA/CPIC dumps) into a single references/ area

FHIR & Exporting

  • HAPI FHIR server integration; adjust ddl-auto appropriately for prod vs dev

  • Implement export per HL7 Genomics Reporting IG v3 via FHIR R4

  • Explore Fasten as a bridge for import/export to HAPI FHIR

Security & Privacy

  • Ensure self-hosted deployments never transmit genomic data externally

  • Add cookie/consent footer for public deployments with per-user access gating (configurable via .env)

  • Add Privacy Policy and legal page

Docker & CI/CD

  • Clean compose stack; prefer compose.yml naming and remove legacy docker-compose.yml if redundant

  • Implement a CI/CD GitHub Action that builds the image and publishes it to Docker Hub

  • Clean up deprecated flags

Documentation

  • Achieve complete docs curation

  • Provide example .env guidance; clarify build/run expectations for local Docker

Engineering

  • Modularize large Python modules into smaller, focused files to improve readability and maintainability

Open Questions

  • Where should indexing responsibility live (always regenerate vs recognize existing)?

  • How to unify pipeline progress across heterogeneous inputs (FASTQ/BAM/VCF)?

  • Which schema to implement, ultimately?

  • Visualizations: what would be useful?

  • Should we integrate ClinPGx datasets directly for annotations, instead of (or alongside) a lexicon layer?
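
On the progress question, the roadmap item about normalizing step/substep points to 100% suggests one possible direction: score each run only against the steps it actually contains. The sketch below is an assumption, not a decision; normalize_progress and the step names and point values are hypothetical.

```python
def normalize_progress(step_points: dict, completed: dict) -> float:
    """Return progress as a percentage of the points this run can earn.

    step_points maps each step in this run to its total points;
    completed maps step names to points earned so far.
    """
    total = sum(step_points.values())
    if total <= 0:
        return 0.0
    # Cap each step at its own total so over-reporting cannot exceed 100%.
    done = sum(min(completed.get(name, 0), pts) for name, pts in step_points.items())
    return round(100.0 * done / total, 1)


# Hypothetical step sets: a FASTQ run includes alignment, a VCF run does not.
fastq_steps = {"align": 30, "call": 50, "report": 20}
vcf_steps = {"call": 50, "report": 20}
```

With this approach a VCF run and a FASTQ run both end at 100% even though the VCF run skips alignment, because each is normalized against its own step set.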