Behind the scene¶

mhcflow features a modular design and is composed of four components: fisher, realigner, typer, and finalizer. The purpose of this page is to address the following questions:

What does each component do?
Which filter(s) are used (both explicitly and implicitly)?
What are the limitations of the current implementation?

Refer to the diagram for an overview of mhcflow workflow.

Fisher¶

The fisher component extracts reads sequenced from DNA molecules originating from MHC class I and II genes. It employs two primary strategies to identify HLA-derived reads:

Reads with Kmer sequence matches
Reads mapped to the HLA regions specified in a BED file

Kmer-matching reads¶

mhcflow utilizes the Aho-Corasick algorithm (via the pyahocorasick package) to efficiently detect reads containing matching Kmer sequences. In this process, only reads that are mapped to chromosome 6 or are unplaced in the genomic alignments are analyzed. This targeted (greedy) approach is based on experiments showing that the vast majority of Kmer-matching reads originate from chromosome 6, while reads from other chromosomes contribute negligibly. Due to the polymorphic nature of MHC sequences, mhcflow also examines unplaced reads that have Kmer matches.

Note

Unplaced reads refer to those that do not have a determined placement anywhere on a given reference-they are not considered unmapped by definition.

Note

A limitation of this Kmer fishing strategy is that it only captures reads with exact matches. Consequently, any sequencing errors reduces its efficiency. To compensate for this limitation, mhcflow also extracts all reads from a predefined list of HLA regions (regardless of the presence of Kmer sequence pattern), as described in the next section.

Reads from HLA regions¶

Extracting all reads from a predefined list of HLA regions enables mhcflow to capture HLA-derived reads that might be missed by the Kmer-fishing process due to sequencing errors.

Note

In its simplest form, you can specify only the regions to the HLA genes you intend to type, which generally produces good results. Alternatively, you may specify additional regions to potentially capture more HLA-derived reads. However, based on experiments with 1000 genome samples, obtaining more fished reads does not necessarily translate to improved typing accuracy.

Other strategies¶

mhcflow implements the original fishing strategy from the polysolver algorithm, which extracts reads from genomic alignments. There are alternative approaches to capturing HLA-derived reads. For example, Optitype directly aligns all reads from a sequencing sample against the HLA reference.

Advantages
- Captures HLA-derived reads with improved sensitivity. It not only detects a greater number of reads but also identify reads that are more likely to be HLA-derived.
- Does not rely on genomic alignment and can be initiated at the begining of a workflow after reads trimming.
Disadvantages
- HLA-derived reads predominantly originate from chromosome 6, as mentioned earlier.
- The razorS3 aligner used by Optitype is not memory-efficient. You must find a balance betten the number of threads and per-thread memory usage, which can be challenging with varying sequencing library sizes.

Realigner¶

The realigner component takes the HLA-derived reads and re-aligns them against the HLA reference using novoalign. The realignment process can run in parallel when mhcflow is executed with a --nproc value greater than 1.

Note

The following options are used in the novoalign command line for realignment: -R 0 -r All -o FullNW

Note

The realigner component consumes most of the mhcflow runtime compared to the other components. The number of reads processed by each realignment is determined by dividing the total number of HLA-derived reads by the value of --nproc.

You can profile your workflow to determine the optimal number of reads per process based on your assay and computation infrastructure.

For example, in a well-profile assay, the run-to-run variation in the number of DNA molecules originating from HLA genes is expected to be minimal. As a result, samples from a study targeting the similar coverage are likely to yield a consistent number of HLA-derived reads. Due to variation in pull-down efficiency among different HLA alleles in targeted assay, it is recommended to determine the optimal number of reads by profiling those samples with high counts of HLA-derived reads.

Typer¶

Please refer to the mhctyper documentaton for further details on the typer component.

Note

Filters applied silently by mchtyper can be found here

Finalizer¶

The finalizer component generates data ready for downstream analyses:

Sample-level HLA reference: HLA sequences corresponding to the typed allele for each sample.
Sample-level HLA realignment: Realignment of HLA-derived reads against the sample-level HLA reference.

Note

In case of presence of homozygous genotypes, the number of reference sequences in the sample-level HLA reference may not match the number of rows in the typing result table.

Note

The realignment output is coordinate-sorted and indexed but does not have duplicates marked.