Skip to content

Work with BAM-related thingy

BAMetadata(fspath) dataclass

A BAMetadata object holds metadata about a given BAM file.

Examples:

Let us use a hypothetical coordinate-sorted BAM file with one read group and two references, as an example:

>>> bametadata = BAMetadata("test.bam")
>>> print(bametadata.sort_by)
coordinate
>>> print(bametadata.read_groups)
[{"ID": "test", "SM": "test"}]
>>> print(bametadata.references)
[{"r1": 1000}, {"r2": 2000}]
>>> print(bametadata)
BAM file: test.bam
Sort by: coordinate
# references: 2
# read groups: 1

Attributes:

Name Type Description
fspath str

path to the BAM file

sort_by str

sort state, e.g. unknown, unsorted, queryname, and coordinate.

references list[dict[str, int]]

list of mappings of reference name to its length.

read_groups list[dict[str, str]]

list of read group dictionaries

count_indel_bases(cigar)

Count the length of indels in the given CIGAR string.

The function counts the number of bases, rather than events.

Examples:

>>> cigar_str = "45M9D27M1I20M5S"
>>> assert count_indel_bases(cigar_str) == 10
true
>>> cigar_list = [("89", "M"), ("1", "I"), ("11", "M")]
>>> assert count_indel_bases(cigar_list) == 1
true

Parameters:

Name Type Description Default
cigar Union[str, Sequence[tuple[str, str]]]

a CIGAR string or a list of tuple items parsed by parse_cigar() function.

required

Returns:

Type Description
int

The number of inserted and deleted bases.

count_indel_events(cigar)

Count the number of Is and Ds event in a given CIGAR string.

Indel events include: insertion(I) and deletion(D).

The function counts the number of events, rather than inserted and deleted bases.

Examples:

>>> cigar_str = "45H35M1D23M1I30S"
>>> assert count_indel_events(cigar_str) == 2
true
>>> cigar_list = [("89", "M"), ("1", "I"), ("11", "M")]
>>> assert count_indel_events(cigar_list) == 1
true

Parameters:

Name Type Description Default
cigar Union[str, Sequence[tuple[str, str]]]

a CIGAR string or a list of tuple items parsed by parse_cigar() function.

required

Returns:

Type Description
int

The number of insertion and deletion events.

count_mismatch_events(md)

Count the number of mismatch events in a given MD string.

Mismatches are represented by any non-digit characters, e.g. A, C, G, ^T.

Examples:

>>> md_str = "10A3T0T10"
>>> assert count_mismatch_events(md_str) == 3
true
>>> md_list = ["85", "^A", "16"] # an inserted base A on the read
>>> assert count_mismatch_events(md_list) == 0
true

Parameters:

Name Type Description Default
md Union[str, Sequence[str]]

a CIGAR string or a list strings parsed from parse_md() function.

required

Returns:

Type Description
int

The number of mismatch events.

count_soft_clip_bases(cigar)

Count the number of soft-clipped bases from a given CIGAR string.

Soft-clipped bases are represented as "S" in the CIGAR string.

Examples:

>>> cigar_str = "27S89M1I11M"
>>> assert count_soft_clip_bases(cigar_str) == 27
true
>>> cigar_list = [("89", "M"), ("1", "I"), ("11", "M")]
>>> assert count_soft_clip_bases(cigar_list) == 0
true
>>> cigar_str = "27S89M1I11M"
>>> assert count_soft_clip_bases(parse_cigar(cigar_str)) == 27 # use result from parse_cigar function
true

Parameters:

Name Type Description Default
cigar Union[str, Sequence[tuple[str, str]]]

a CIGAR string or a list of tuple items parsed by parse_cigar() function.

required

Returns:

Type Description
int

The number of soft-clipped bases.

count_unaligned_events(cigar)

Count the number of unaligned events in a given CIGAR string.

Unaligned events in this context include: insertion(I), deletion(D), soft-clipping(S), and hard-clipping(H).

The function counts the number of events, rather than unaligned bases.

Examples:

>>> cigar_str = "45H35M1D23M1I30S"
>>> assert count_unaligned_events(cigar_str) == 4
true
>>> cigar_list = [("89", "M"), ("1", "I"), ("11", "M")]
>>> assert count_unaligned_events(cigar_list) == 1
true
>>> cigar_str = "45H35M1D23M1I30S"
>>> assert count_unaligned_events(parse_cigar(cigar_str)) == 4
true

Parameters:

Name Type Description Default
cigar Union[str, Sequence[tuple[str, str]]]

a CIGAR string or a list of tuple items parsed by parse_cigar() function.

required

Returns:

Type Description
int

The number of unaligned events.

parse_cigar(cigar)

Parse a given CIGAR string into a list of tuple items of (length, operation).

Valid operations in a CIGAR string are: M, I, D, N, S, H, P, =, and X.

Examples:

>>> cigar_str = "27S89M1I11M"
>>> assert parse_cigar(cigar_str) == [
    ("27", "S"),
    ("89", "M"),
    ("1", "I"),
    ("11", "M")
]
true

Parameters:

Name Type Description Default
cigar str

a CIGAR string

required

Returns:

Type Description
list[tuple[str, str]]

A list of tuple items, each of which consists of length and operation.

Raises:

Type Description
ValueError

when the given CIGAR string is empty, or when the CIGAR string parsed into an empty list, or when failure of reconstructing parsed list back to the original CIGAR string.

parse_md(md)

Parse a given MD string into a list.

Examples:

>>> md_str = "10A3T0T10"
>>> assert parse_cigar(cigar_str) == [
        "10", "A", "3", "T", "0", "T", "10"
    ]
true

Parameters:

Name Type Description Default
md str

a MD string

required

Returns:

Type Description
list[str]

A list of strings.

Raises:

Type Description
ValueError

when the given MD string is empty, or when the MD string parsed into an empty list, or when failure of reconstructing parsed list back to the original MD string.

parse_region(region_string, one_based=True)

Parse a given samtools-style region string into a tuple of rname, start, and end.

Examples:

>>> from tinyscibio import parse_region
>>> region_str = "chr1"
>>> assert parse_region(region_str) == ("chr1", None, None)
true
>>> region_str = "chr1:100"
>>> assert parse_region(region_str) == ("chr1", 100, None)
true
>>> region_str = "chr1:100-1000"
>>> assert parse_region(region_str) == ("chr1", 100, 1000)
true
>>> region_str = "chr1:100-1000"
>>> assert parse_region(region_str, one_based=True) == ("chr1", 99, 1000)
true

Parameters:

Name Type Description Default
region_string str

user-provide string following samtools-style region definition.

required
one_based bool

whether or not coordinate in the given string is one-based.

True

Returns:

Type Description
tuple[str, int | None, int | None]

A tuple of (rname, start, end).

Raises:

Type Description
ValueError

When parser finds no matching pattern, or when start is larger than end coordinate in the given region string.

walk_bam(fspath, region, exclude=3840, chunk_size=100000, read_groups=None, return_ecnt=False, return_bq=False, return_md=False, return_qname=False)

Traverse BAM file across given region and collect read-level alignment summary into a dataframe.

Input BAM file must be sorted and indexed to make the walk feasible. Value to region parameter follows the samtools format, and therefore positions are 1-based.

By default, walk_bam returns a dataframe with 7 columns:

- rnames: reference index (easy to map back to sequence name)
- rstarts: 0-based start position on mapped reference
- rends: same as above but marking the end of an alignment
- mqs: mapping qualities
- propers: whether or not a given alignment is properly aligned
- primarys: whether or not a given alignment is primary
- sc_bps: # of soft-clipped base pairs

When return_ecnt is set,

- mm_ecnt: # of mismatch events per alignment
- mm_ecnt: # of indel events per alignment

When return_qname is set,

- qnames: read name

When return_bq is set,

- bqs: base qualities in np.ndarray per alignment

When return_md is set,

- mds: parsed MD in np.ndarray per alignment

Examples:

>>> from tinyscibio import BAMetadata, walk_bam
>>> bam_fspath = "example.bam"
>>> bametadata = BAMetadata(bam_fspath)

Traverse entire region of HLA_A_01_01_01_01 allele:

>>> region_str = "hla_a_01_01_01_01"
>>> df = walk_bam(bam_fspath, region_str)
>>> df = df.with_columns(
    pl.col("rnames").replace_strict(bametadata.idx2seqnames())
)
>>> df.head()
shape: (5, 7)
┌───────────────────┬─────────┬───────┬─────┬─────────┬──────────┬────────┐
│ rnames            ┆ rstarts ┆ rends ┆ mqs ┆ propers ┆ primarys ┆ sc_bps │
│ ---               ┆ ---     ┆ ---   ┆ --- ┆ ---     ┆ ---      ┆ ---    │
│ str               ┆ i32     ┆ i32   ┆ u8  ┆ bool    ┆ bool     ┆ i16    │
╞═══════════════════╪═════════╪═══════╪═════╪═════════╪══════════╪════════╡
│ hla_a_01_01_01_01 ┆ 4       ┆ 104   ┆ 0   ┆ true    ┆ false    ┆ 0      │
│ hla_a_01_01_01_01 ┆ 4       ┆ 102   ┆ 0   ┆ true    ┆ false    ┆ 0      │
│ hla_a_01_01_01_01 ┆ 128     ┆ 229   ┆ 0   ┆ true    ┆ false    ┆ 0      │
│ hla_a_01_01_01_01 ┆ 287     ┆ 388   ┆ 0   ┆ true    ┆ false    ┆ 0      │
│ hla_a_01_01_01_01 ┆ 294     ┆ 395   ┆ 0   ┆ true    ┆ false    ┆ 0      │
└───────────────────┴─────────┴───────┴─────┴─────────┴──────────┴────────┘

Traverse the same region and collect more alignment summaries:

>>> df = walk_bam(bam_fspath, region_str, return_ecnt=True, return_qname=True)
>>> df = df.with_columns(
    pl.col("rnames").replace_strict(bametadata.idx2seqnames())
)
>>> df.head()
shape: (5, 10)
┌───────────────────┬─────────┬───────┬─────┬───┬────────┬────────────────────┬─────────┬────────────┐
│ rnames            ┆ rstarts ┆ rends ┆ mqs ┆ … ┆ sc_bps ┆ qnames             ┆ mm_ecnt ┆ indel_ecnt │
│ ---               ┆ ---     ┆ ---   ┆ --- ┆   ┆ ---    ┆ ---                ┆ ---     ┆ ---        │
│ str               ┆ i32     ┆ i32   ┆ u8  ┆   ┆ i16    ┆ str                ┆ i16     ┆ i16        │
╞═══════════════════╪═════════╪═══════╪═════╪═══╪════════╪════════════════════╪═════════╪════════════╡
│ hla_a_01_01_01_01 ┆ 4       ┆ 104   ┆ 0   ┆ … ┆ 0      ┆ SRR702076.16980922 ┆ 0       ┆ 1          │
│ hla_a_01_01_01_01 ┆ 4       ┆ 102   ┆ 0   ┆ … ┆ 0      ┆ SRR702076.16980922 ┆ 0       ┆ 1          │
│ hla_a_01_01_01_01 ┆ 128     ┆ 229   ┆ 0   ┆ … ┆ 0      ┆ SRR702076.10444434 ┆ 6       ┆ 0          │
│ hla_a_01_01_01_01 ┆ 287     ┆ 388   ┆ 0   ┆ … ┆ 0      ┆ SRR702076.10444434 ┆ 10      ┆ 0          │
│ hla_a_01_01_01_01 ┆ 294     ┆ 395   ┆ 0   ┆ … ┆ 0      ┆ SRR702076.2819225  ┆ 0       ┆ 0          │
└───────────────────┴─────────┴───────┴─────┴───┴────────┴────────────────────┴─────────┴────────────┘

Traverse the HLA_A_01_01_01_01 allele starting at position 100. Note that walk_bam has different behavior from pysam.AlignmentFile.fetch() function. It only grabs alignments from postion 100 and onwards.

>>> region_str = "hla_a_01_01_01_01:100"
>>> df = walk_bam(bam_fspath, region_str, return_ecnt=True)
>>> df = df.with_columns(
    pl.col("rnames").replace_strict(bametadata.idx2seqnames())
)
>>> df.head()
shape: (5, 9)
┌───────────────────┬─────────┬───────┬─────┬───┬──────────┬────────┬─────────┬────────────┐
│ rnames            ┆ rstarts ┆ rends ┆ mqs ┆ … ┆ primarys ┆ sc_bps ┆ mm_ecnt ┆ indel_ecnt │
│ ---               ┆ ---     ┆ ---   ┆ --- ┆   ┆ ---      ┆ ---    ┆ ---     ┆ ---        │
│ str               ┆ i32     ┆ i32   ┆ u8  ┆   ┆ bool     ┆ i16    ┆ i16     ┆ i16        │
╞═══════════════════╪═════════╪═══════╪═════╪═══╪══════════╪════════╪═════════╪════════════╡
│ hla_a_01_01_01_01 ┆ 128     ┆ 229   ┆ 0   ┆ … ┆ false    ┆ 0      ┆ 6       ┆ 0          │
│ hla_a_01_01_01_01 ┆ 287     ┆ 388   ┆ 0   ┆ … ┆ false    ┆ 0      ┆ 10      ┆ 0          │
│ hla_a_01_01_01_01 ┆ 294     ┆ 395   ┆ 0   ┆ … ┆ false    ┆ 0      ┆ 0       ┆ 0          │
│ hla_a_01_01_01_01 ┆ 341     ┆ 442   ┆ 0   ┆ … ┆ false    ┆ 0      ┆ 0       ┆ 0          │
│ hla_a_01_01_01_01 ┆ 638     ┆ 724   ┆ 0   ┆ … ┆ false    ┆ 0      ┆ 9       ┆ 1          │
└───────────────────┴─────────┴───────┴─────┴───┴──────────┴────────┴─────────┴────────────┘

Parameters:

Name Type Description Default
fspath str

Path pointing to the BAM file.

required
region str

Region string in samtools style, e.g. chr1:100-1000.

required
exclude int

Skip alignments whose flags are set with any bits defined.

3840
chunk_size int

Maximum size of arrays to be held in memory at one time.

100000
read_groups Optional[Set[str]]

Set of read groups whose alignments are retrieved.

None
return_ecnt bool

Whether or not returning number of mismatch+indel events.

False
return_bq bool

Whether or not returning base qualities.

False
return_md bool

Whether or not returning parsed MD tuples.

False
return_qname bool

Whether or not returning query/read names.

False

Returns:

Type Description
DataFrame

Collection of summaries of read alignments in dataframe.