--- title: Project To‑Do & Roadmap curation: full --- ## Input Format Support Priorities - **Priority 0 (Supported)**: VCF, GRCh38, NGS-derived - **Priority 1 (Development)**: VCF, GRCh37, NGS-derived (liftover via bcftools) - **Priority 1.5 (Development)**: BAM - **Priority 2 (Development)**: CRAM, SAM, FASTQ, BCF (NGS-derived) - **Priority 3 (Research)**: Other sequencing/genotyping formats - **Priority 4 (Research)**: BED, gVCF, 23andMe, AncestryDNA, TXT formats - **Priority 5 (Early Research)**: T2T and other emerging formats ## Pipeline Function - Add GRCh37 support via `bcftools` liftover; surface accuracy caveats after conversion - Clarify workflow vs job IDs; define single source for workflow definition and per-run job state - Represent workflows as finite state matrix, each unique and deterministic workflow should have an assigned ID which can be quickly spot checked - Nextflow orchestration - Dynamic resource allocation (CPU/memory by attempt, file type+size, etc.) - Track active tool/stage; reflect in UI icons and progress - Improve progress calculation by normalizing step/substep points to 100% - Accept uploads by URL (streamed) and multi-file selects (main + index) with proper pairing - Recognize and/or regenerate index files as needed; map unaligned to appropriate reference: currently GRCh38.p14 - Consider preprocessing complementing PyPGx-led VCF generation (evaluate necessity) - Add mtdna-server-2 - Finish wiring in ZaroHLA - Improve analysis, make better use of samtools and bcftools ## Calling & Tools ### PharmCAT - Implement translation layer (lexicon) to translate outside calls to recognized nomenclature - Implement optional and intelligent switch to toggle assume reference when missing ### PyPGx - Batch execution (done) and advanced parallelization controls (CPU/RAM/storage) - BAM-to-VCF preprocessing check - Evaluate imputation options; expose via advanced settings ### HLA Typing - Use ZaroHLA (OptiType) for HLA-A/B/C when FASTQ; confirm BAM pathway - Align to GRCh38 as part of HLA path ### Ancillary and Future tools - Now included in Zaromics suite ## Reporting - Unified report generation combining PharmCAT clinical recommendations with PyPGx gene coverage - Add demographics mini-section: mitochondrial lineage/haplogroup and variant rarity context - Standardize folder naming of generated reports (timestamp-based) and place logs under `data/logs/` - Display workflow ID specific Kroki/Mermaid workflow diagram in both HTML and PDF outputs - Add clear wording: sample vs patient terminology; avoid assumptions of medical context - Abstract report theme so cross-pipeline outputs remain stylistically consistent - Custom reports: add a QR code containing the raw data ## UI/UX - Responsive glyphs: wrapping on small screens; grey-out non-applicable steps; size/flex adjustments - Add preprocessing glyph (e.g., Liftover) where applicable & mtDNA glyph - Unify/clean redundant text ## Data & Database - PostgreSQL 17: add extensions; implement schemas - Adopt JSONB where appropriate; ensure escaping for special characters (done) - Begin persisting normalized results; build lexicon layer translating between caller spelling - Consolidating reference and sample material (FASTA/CPIC dumps) into a single `references/` area ## FHIR & Exporting - HAPI FHIR server integration; adjust `ddl-auto` appropriately for prod vs dev - Implement export per HL7 Genomics Reporting IG v3 via FHIr r4 - Explore Fasten as a bridge for import/export to HAPI FHIR ## Security & Privacy - Ensure self-hosted deployments never transmit genomic data externally - Add cookie/consent footer for public deployments with per-user access gating (configurable via `.env`) - Add Privacy Policy and legal page ## Docker & CI/CD - Clean compose stack; prefer `compose.yml` naming and remove legacy `docker-compose.yml` if redundant - Implement CI/CD github action to dockerhub image build - Clean up deprecated flags ## Documentation - Achieve complete docs curation - Provide example `.env` guidance; clarify build/run expectations for local Docker ## Engineering - Modularize large Python modules into smaller, focused files to improve readability and maintainability ## Open Questions - Where should indexing responsibility live (always regenerate vs recognize existing)? - How to unify pipeline progress across heterogeneous inputs (FASTQ/BAM/VCF)? - Which schema to implement, ultimately? - Visualizations: what would be useful? - Should we integrate ClinPGx datasets directly for annotations, instead of (or alongside) a lexicon layer?