Dataset¶
Original LOHHLA test data¶
The lohhlamod repository includes all the inputs, including BAMs and the HLA
reference, needed to validate it against the example data provided in the
original LOHHLA package.
These inputs were generated by running mhcflow according to the instructions detailed here.
You can find these files under the lohhla_example_og directory within the repository.
As a validation baseline, lohhlamod should yield results consistent with the
original expectations on this dataset.
Simulation data¶
The repository includes four in silico datasets designed to simulate
real-world scenarios. These are idealized datasets intended for demonstration
and user guidance; they should be used to familiarize yourself with the
tool rather than as benchmarks for complex biological noise. Each simulated
case is contained within its own subdirectory under the
simulation/ folder in the repository.
Data generation
All inputs for lohhlamod, including BAM and FASTA files, were generated
using mhcflow.
Understanding the results
For each simulated case, refer to the provided .loh.res.tsv files to
compare the copy numbers estimated by lohhlamod against the simulated
values listed in the ground-truth tables below.
Installation verification
You can also use these pre-generated .loh.res.tsv files to verify your installation. After running lohhlamod on your own system, your output should match the results provided in the repository.
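
A minimal sketch of that check, assuming you have one result table from your own run and the matching reference table from the repository. The function name `check_loh_result` and both file paths are placeholders, not part of lohhlamod. The comparison is a plain text `diff`, so it reports any difference, including formatting-only ones.

```bash
#!/usr/bin/env bash
# Compare a freshly produced result table with the copy shipped in the repository.
# Usage: check_loh_result EXPECTED_TSV OBSERVED_TSV   (both paths are placeholders)
check_loh_result() {
  local expected=$1 observed=$2
  if diff -q "$expected" "$observed" >/dev/null; then
    echo "MATCH"
  else
    echo "DIFFER"
    # Print the differing lines to help locate the discrepancy.
    diff "$expected" "$observed" || true
    return 1
  fi
}
```

For example, `check_loh_result path/to/repo_copy.loh.res.tsv path/to/my_run.loh.res.tsv` prints `MATCH` when the two files are identical and returns a non-zero status otherwise.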
Case 1: Subject s6¶
This case simulates an LOH event in the HLA-A gene, alongside an amplification at the HLA-C locus.
| HLA Gene | A1_CN | A2_CN | Normal Genotype | Tumor Genotype |
|---|---|---|---|---|
| HLA-A | 1 | 0 | Heterozygous | LOH |
| HLA-B | 1 | 1 | Heterozygous | Neutral |
| HLA-C | 3 | 1 | Heterozygous | Amplification |
Case 2: Subject s7¶
This case simulates a homozygous deletion of the HLA-C gene (loss of both alleles) and a single-copy amplification at the HLA-B locus.
| HLA Gene | A1_CN | A2_CN | Normal Genotype | Tumor Genotype |
|---|---|---|---|---|
| HLA-A | 1 | 1 | Heterozygous | Neutral |
| HLA-B | 2 | 1 | Heterozygous | Amplification |
| HLA-C | 0 | 0 | Heterozygous | Homozygous deletion |
Case 3: Subject s1¶
This scenario demonstrates a copy-neutral LOH event at the HLA-A gene.
| HLA Gene | A1_CN | A2_CN | Normal Genotype | Tumor Genotype |
|---|---|---|---|---|
| HLA-A | 2 | 0 | Heterozygous | copy-neutral LOH |
| HLA-B | 1 | 1 | Homozygous | -- |
| HLA-C | 1 | 1 | Homozygous | -- |
Note
lohhlamod can only detect LOH events at heterozygous loci.
Case 4: Subject s2¶
This is a non-LOH baseline case used to verify that the tool correctly identifies balanced, heterozygous states.
| HLA Gene | A1_CN | A2_CN | Normal Genotype | Tumor Genotype |
|---|---|---|---|---|
| HLA-A | 1 | 1 | Homozygous | -- |
| HLA-B | 1 | 1 | Heterozygous | Neutral |
| HLA-C | 1 | 1 | Homozygous | -- |
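
Across the four cases above, the Tumor Genotype label follows from the two allele copy numbers and the normal zygosity. The helper below is only one reading of those tables: the function name `label_cn` and its rules are illustrative assumptions, and it is not lohhlamod's calling logic.

```bash
#!/usr/bin/env bash
# Label an allele-specific copy-number pair the way the ground-truth tables do.
# Usage: label_cn A1_CN A2_CN NORMAL_GENOTYPE
label_cn() {
  local a1=$1 a2=$2 zyg=$3
  local total=$((a1 + a2))
  if [ "$zyg" != "Heterozygous" ]; then
    printf '%s\n' "--"                 # homozygous loci carry no label in the tables
  elif [ "$a1" -eq 0 ] && [ "$a2" -eq 0 ]; then
    printf '%s\n' "Homozygous deletion"
  elif [ "$a1" -eq 0 ] || [ "$a2" -eq 0 ]; then
    # One allele lost: copy-neutral if the remaining allele restores two copies.
    if [ "$total" -eq 2 ]; then
      printf '%s\n' "copy-neutral LOH"
    else
      printf '%s\n' "LOH"
    fi
  elif [ "$total" -gt 2 ]; then
    printf '%s\n' "Amplification"
  else
    printf '%s\n' "Neutral"
  fi
}

label_cn 1 0 Heterozygous   # Case 1, HLA-A: prints "LOH"
label_cn 3 1 Heterozygous   # Case 1, HLA-C: prints "Amplification"
label_cn 0 0 Heterozygous   # Case 2, HLA-C: prints "Homozygous deletion"
label_cn 2 0 Heterozygous   # Case 3, HLA-A: prints "copy-neutral LOH"
label_cn 1 1 Homozygous     # Case 3, HLA-B: prints "--"
```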
Simulation generation¶
This section outlines the process used to generate the in silico datasets. I use a "normal-first" approach, where the tumor profile is derived by modifying the baseline normal genotype.
Normal sample¶
1. Select specific HLA alleles for the in silico individual at each locus.
2. Create an HLA reference file in FASTA format containing the selected alleles.
3. Generate paired-end reads from the HLA reference using dwgsim.

   ```bash
   # Simulate 10,000 read pairs (2 x 150 bp) from the normal HLA reference.
   dwgsim -d 300 -s 20 \
     -N 10000 -1 150 -2 150 \
     -R 0.02 -S 2 -H -o 1 \
     -e 0.00109 -E 0.00109 \
     "$HLA_REF" "$OUT_PREFIX"
   ```

   Tip
   The example above generates 10k paired-end reads (150bp) with a 300bp insert size. Refer to the dwgsim documentation for details.

4. Align the simulated reads to the human genome, then sort and index the resulting BAM file.
5. Run mhcflow to obtain the necessary inputs for lohhlamod.
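
The `-N` option of dwgsim sets the number of read pairs, so the depth a given value yields depends on the total length of the HLA reference. As a rough guide (ignoring edge effects and reads that fail to map), mean depth is about N × (len1 + len2) / reference length. The helpers below sketch that arithmetic; their names are illustrative and not part of dwgsim or mhcflow.

```bash
#!/usr/bin/env bash
# Total number of bases in a FASTA file (header lines are skipped).
fasta_length() {
  awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' "$1"
}

# Expected mean depth for N read pairs of lengths L1 and L2 over REF_LEN bases.
# Usage: mean_depth N L1 L2 REF_LEN
mean_depth() {
  awk -v n="$1" -v l1="$2" -v l2="$3" -v len="$4" \
    'BEGIN { printf "%.1f\n", n * (l1 + l2) / len }'
}

# Number of read pairs needed to reach a target mean depth (rounded up).
# Usage: pairs_for_depth DEPTH L1 L2 REF_LEN
pairs_for_depth() {
  awk -v d="$1" -v l1="$2" -v l2="$3" -v len="$4" \
    'BEGIN { n = d * len / (l1 + l2); p = int(n); if (p < n) p++; print p }'
}
```

For instance, `mean_depth 10000 150 150 "$(fasta_length "$HLA_REF")"` estimates the depth produced by the command in step 3 for your own reference.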
Tumor sample¶
The tumor sample is simulated by introducing copy number variations (CNVs) into the established normal profile.
1. Decide the desired copy number for each allele at each HLA gene locus.
2. Create a tumor-specific HLA FASTA reference.
   - For amplification: Include multiple copies of the amplified alleles.
   - For LOH: Separate the deleted alleles from the "primary" reference.
3. Generate sequencing reads from the modified tumor reference.

   ```bash
   # Simulate 20,000 read pairs from the tumor reference, with a higher
   # error rate and a shorter fragment length than the normal sample.
   dwgsim -d 280 -s 20 \
     -N 20000 -1 150 -2 150 \
     -R 0.02 -S 2 -H -o 1 \
     -e 0.002 -E 0.002 \
     "$Tumor_HLA_REF" "$OUT_PREFIX"
   ```

   Tip
   The command uses a higher error rate (-e and -E flags) and shorter fragment templates (280 vs. 300) to mimic typical real-world tumor samples.

4. For LOH events, simulate the "deleted" alleles separately at a significantly lower coverage (using the -N option).
5. Combine all simulated tumor reads, then align, sort and index as performed in the normal workflow.
6. Run mhcflow using the --realn_only flag.

   Note
   You must use the HLA reference generated during the Normal Sample step (Step 5). Please refer to here for details.
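
The reference-editing step above (repeating amplified alleles and setting deleted alleles aside) can be sketched with awk. This is an illustration under stated assumptions, not the script used to build the shipped datasets: the function name `make_tumor_ref`, the two-column copy-number table, the `_copyN` suffix added to keep sequence names unique, and the default of one copy for unlisted alleles are all hypothetical choices.

```bash
#!/usr/bin/env bash
# Usage: make_tumor_ref NORMAL_FASTA CN_TABLE PRIMARY_OUT DELETED_OUT
# CN_TABLE is tab-separated: allele_name<TAB>copy_number, one allele per line.
make_tumor_ref() {
  local fasta=$1 cn_table=$2 primary=$3 deleted=$4
  # Start from empty output files so both exist even if nothing is written.
  : > "$primary"
  : > "$deleted"
  awk -v primary="$primary" -v deleted="$deleted" '
    # Write the record collected so far to the appropriate output file.
    function emit(    i) {
      if (name == "") return
      copies = (name in cn) ? cn[name] : 1   # unlisted alleles keep one copy
      if (copies == 0) {
        # Deleted allele: kept apart so it can be simulated at low coverage.
        printf ">%s\n%s", name, seq >> deleted
      }
      for (i = 1; i <= copies; i++) {
        # Retained allele: repeated once per copy in the primary reference.
        printf ">%s_copy%d\n%s", name, i, seq >> primary
      }
    }
    /^[ \t]*$/ { next }                      # skip blank lines
    # First file: remember the requested copy number of every allele.
    FILENAME == ARGV[1] { split($0, f, "\t"); cn[f[1]] = f[2] + 0; next }
    # Second file: a header closes the previous record and opens a new one.
    /^>/ { emit(); split(substr($0, 2), h, /[ \t]/); name = h[1]; seq = ""; next }
    { seq = seq $0 "\n" }
    END { emit() }
  ' "$cn_table" "$fasta"
}
```

The primary file would then serve as the tumor reference for the main dwgsim run, and the deleted-allele file as the reference for the separate low-coverage run.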