Dataset¶

Original LOHHLA test data¶

The lohhlamod repository includes all the inputs, including the BAMs and the HLA reference, needed to validate it against the example data provided in the original LOHHLA package.

These inputs were generated by running mhcflow according to the instructions detailed here.

You can find these files under the lohhla_example_og directory within the repository. As a validation baseline, lohhlamod should yield results consistent with the original expectations on this dataset.

Simulation data¶

The repository includes four in silico datasets designed to simulate real-world scenarios. These are idealized datasets intended for demonstration and user guidance; they should be used to familiarize yourself with the tool rather than as benchmarks for complex biological noise. Each simulated case is contained within its own subdirectory under the simulation/ folder in the repository.

Data generation

All inputs for lohhlamod, including the BAM and FASTA files, were generated using mhcflow.

Understanding the results

For each simulated case, refer to the provided .loh.res.tsv files to compare the copy numbers estimated by lohhlamod against the simulated values in the ground-truth tables below.

Installation verification

You can also use these pre-generated .loh.res.tsv files to verify your installation. After running lohhlamod on your own system, your output should match the results provided in the repository.
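
The comparison described above can be scripted. The following is a minimal sketch, not part of lohhlamod: it makes no assumption about the column names in .loh.res.tsv and simply compares two tab-separated tables cell by cell, treating numeric cells as equal within a small tolerance. The function names `read_tsv`, `cells_match`, and `diff_tsv` are hypothetical helpers introduced for illustration.

```python
import csv
import math


def read_tsv(lines):
    """Parse an iterable of tab-separated lines into (header, rows)."""
    rows = list(csv.reader(lines, delimiter="\t"))
    if not rows:
        return [], []
    return rows[0], rows[1:]


def cells_match(a, b, rel_tol=1e-6):
    """Compare numerically when both cells parse as floats, else as text."""
    try:
        x, y = float(a), float(b)
    except ValueError:
        return a == b
    if math.isnan(x) and math.isnan(y):
        return True
    return math.isclose(x, y, rel_tol=rel_tol, abs_tol=1e-9)


def diff_tsv(expected_lines, observed_lines, rel_tol=1e-6):
    """Return a list of differences; an empty list means the tables agree."""
    exp_header, exp_rows = read_tsv(expected_lines)
    obs_header, obs_rows = read_tsv(observed_lines)
    if exp_header != obs_header:
        return [f"header differs: {exp_header} != {obs_header}"]
    problems = []
    if len(exp_rows) != len(obs_rows):
        problems.append(f"row count differs: {len(exp_rows)} != {len(obs_rows)}")
    for i, (exp, obs) in enumerate(zip(exp_rows, obs_rows), start=1):
        if len(exp) != len(obs):
            problems.append(f"row {i}: column count differs")
            continue
        for name, a, b in zip(exp_header, exp, obs):
            if not cells_match(a, b, rel_tol):
                problems.append(f"row {i}, column {name!r}: {a!r} != {b!r}")
    return problems
```

To use it on files, open the repository copy and your own output and pass both handles, for example `diff_tsv(open(expected_path), open(observed_path))`, where both paths are placeholders for your local files. The tolerance is an arbitrary choice; tighten or loosen it as needed.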

Case 1: Subject s6¶

This case simulates an LOH event in the HLA-A gene, alongside an amplification at the HLA-C locus.

HLA Gene	A1_CN	A2_CN	Normal Genotype	Tumor Genotype
HLA-A	1	0	Heterozygous	LOH
HLA-B	1	1	Heterozygous	Neutral
HLA-C	3	1	Heterozygous	Amplification

Case 2: Subject s7¶

This case simulates a homozygous deletion of the HLA-C gene (loss of both alleles) and a single-copy amplification at the HLA-B locus.

HLA Gene	A1_CN	A2_CN	Normal Genotype	Tumor Genotype
HLA-A	1	1	Heterozygous	Neutral
HLA-B	2	1	Heterozygous	Amplification
HLA-C	0	0	Heterozygous	Homozygous deletion

Case 3: Subject s1¶

This scenario demonstrates a copy-neutral LOH event at the HLA-A gene.

HLA Gene	A1_CN	A2_CN	Normal Genotype	Tumor Genotype
HLA-A	2	0	Heterozygous	copy-neutral LOH
HLA-B	1	1	Homozygous	--
HLA-C	1	1	Homozygous	--

Note

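lohhlamod can only detect LOH events at loci that are heterozygous in the normal sample; homozygous loci are therefore marked -- in the Tumor Genotype column.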

Case 4: Subject s2¶

This is a non-LOH baseline case used to verify that the tool correctly identifies balanced, heterozygous states.

HLA Gene	A1_CN	A2_CN	Normal Genotype	Tumor Genotype
HLA-A	1	1	Homozygous	--
HLA-B	1	1	Heterozygous	Neutral
HLA-C	1	1	Homozygous	--
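
The Tumor Genotype labels in the four ground-truth tables above follow a simple rule on the two allele copy numbers. The sketch below encodes that rule and checks it against the tables. The rule is inferred from the tables, not taken from lohhlamod itself, and the names `tumor_label`, `GROUND_TRUTH`, and `check_tables` are hypothetical.

```python
def tumor_label(a1_cn, a2_cn, normal_heterozygous=True):
    """Label a locus from its simulated allele copy numbers."""
    if not normal_heterozygous:
        return "--"  # LOH cannot be assessed at homozygous loci
    low, high = sorted((a1_cn, a2_cn))
    if high == 0:
        return "Homozygous deletion"  # both alleles lost
    if low == 0:
        # One allele lost; copy-neutral if the total is still two copies.
        return "copy-neutral LOH" if high == 2 else "LOH"
    if high > 1:
        return "Amplification"
    return "Neutral"


# (A1_CN, A2_CN, heterozygous in normal, label) copied from the tables.
GROUND_TRUTH = {
    "s6": {"HLA-A": (1, 0, True, "LOH"),
           "HLA-B": (1, 1, True, "Neutral"),
           "HLA-C": (3, 1, True, "Amplification")},
    "s7": {"HLA-A": (1, 1, True, "Neutral"),
           "HLA-B": (2, 1, True, "Amplification"),
           "HLA-C": (0, 0, True, "Homozygous deletion")},
    "s1": {"HLA-A": (2, 0, True, "copy-neutral LOH"),
           "HLA-B": (1, 1, False, "--"),
           "HLA-C": (1, 1, False, "--")},
    "s2": {"HLA-A": (1, 1, False, "--"),
           "HLA-B": (1, 1, True, "Neutral"),
           "HLA-C": (1, 1, False, "--")},
}


def check_tables(truth=GROUND_TRUTH):
    """Return (subject, gene) pairs whose table label disagrees with the rule."""
    mismatches = []
    for subject, loci in truth.items():
        for gene, (a1, a2, het, label) in loci.items():
            if tumor_label(a1, a2, het) != label:
                mismatches.append((subject, gene))
    return mismatches
```

The same `tumor_label` function could be applied to rounded copy-number estimates from your own run, but how to round and which output columns to read are decisions this sketch leaves open.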

Simulation generation¶

This section outlines the process used to generate the in silico datasets. I use a "normal-first" approach, in which the tumor profile is derived by modifying the baseline normal genotype.

Normal sample¶

1. Select specific HLA alleles for the in silico individual at each locus.
2. Create an HLA reference file in FASTA format containing the selected alleles.
3. Generate paired-end reads from the HLA reference using dwgsim.
```
dwgsim -d 300 -s 20 \
    -N 10000 -1 150 -2 150 \
    -R 0.02 -S 2 -H -o 1 \
    -e 0.00109 -E 0.00109 \
    "$HLA_REF" "$OUT_PREFIX"
```
Tip

The example above generates 10,000 read pairs (2 x 150 bp) with a 300 bp insert size (standard deviation 20 bp). Refer to the dwgsim documentation for details.

4. Align the simulated reads to the human genome, then sort and index the resulting BAM file.
5. Run mhcflow to obtain the necessary inputs for lohhlamod.
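
When adapting the dwgsim command above, it helps to relate the number of read pairs (-N) to the mean depth over the HLA reference. The following is a back-of-the-envelope sketch that assumes reads are spread evenly across the reference; the reference length used in the usage note is a made-up value, and `expected_depth` and `pairs_for_depth` are hypothetical helper names.

```python
import math


def expected_depth(n_pairs, read1_len, read2_len, ref_len):
    """Approximate mean depth: total sequenced bases / reference length."""
    return n_pairs * (read1_len + read2_len) / ref_len


def pairs_for_depth(target_depth, read1_len, read2_len, ref_len):
    """Number of read pairs needed to reach a target mean depth."""
    return math.ceil(target_depth * ref_len / (read1_len + read2_len))
```

For instance, with a hypothetical 30,000 bp reference, `expected_depth(10000, 150, 150, 30000)` gives a mean depth of 100, and `pairs_for_depth(100, 150, 150, 30000)` returns the matching value for -N. Substitute the total length of your own FASTA reference.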

Tumor sample¶

The tumor sample is simulated by introducing copy number variations (CNVs) into the established normal profile.

1. Decide the desired copy number for each allele at each HLA gene locus.
2. Create a tumor-specific HLA FASTA reference.
   - For amplification: Include multiple copies of the amplified alleles.
   - For LOH: Separate the deleted alleles from the "primary" reference.
3. Generate sequencing reads from the modified tumor reference.
```
dwgsim -d 280 -s 20 \
    -N 20000 -1 150 -2 150 \
    -R 0.02 -S 2 -H -o 1 \
    -e 0.002 -E 0.002 \
    "$Tumor_HLA_REF" "$OUT_PREFIX"
```
Tip

The command uses a higher error rate (-e and -E flags) and a shorter insert size (-d 280 vs. 300) to mimic typical real-world tumor samples.

4. For LOH events, simulate the "deleted" alleles separately at a significantly lower coverage (using the -N option).
5. Combine all simulated tumor reads, then align, sort, and index them as in the normal workflow.
6. Run mhcflow using the --realn_only flag.

Note

You must use the HLA reference generated in the Normal sample workflow (Step 5) when running mhcflow on the tumor sample. Refer to here for details.
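
The reference-building part of the tumor workflow (deciding copy numbers, then duplicating amplified alleles and setting deleted alleles aside) can be sketched as follows. This is an illustrative sketch, not the script used to build the repository datasets. The function names, the `_copyN` suffix for duplicated records, and the allele names in the usage note are all assumptions made for the example.

```python
def split_tumor_reference(alleles, copy_numbers):
    """Build records for the primary and the deleted-allele references.

    alleles: dict mapping allele name to sequence (the normal reference).
    copy_numbers: dict mapping allele name to the desired tumor copy number.
    Returns (primary, deleted), each a list of (record_name, sequence).
    """
    primary, deleted = [], []
    for name, seq in alleles.items():
        cn = copy_numbers[name]
        if cn < 0:
            raise ValueError(f"negative copy number for {name}: {cn}")
        if cn == 0:
            # Deleted alleles go to a separate reference so they can be
            # simulated on their own at much lower coverage.
            deleted.append((name, seq))
            continue
        for i in range(1, cn + 1):
            # Assumption: extra copies get a unique suffix so that record
            # names stay distinct within the FASTA file.
            record_name = name if i == 1 else f"{name}_copy{i}"
            primary.append((record_name, seq))
    return primary, deleted


def to_fasta(records, width=60):
    """Format (name, sequence) records as FASTA text."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n" if lines else ""
```

For a profile like Case 1 (HLA-A at 1 and 0 copies, HLA-C at 3 and 1), you would pass copy numbers such as `{"A_1": 1, "A_2": 0, "C_1": 3, "C_2": 1}` with placeholder allele names, write `to_fasta(primary)` and `to_fasta(deleted)` to two files, and run dwgsim on each, using a much smaller -N for the deleted-allele file.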