Work with BAM-related thingy¶

`BAMetadata(fspath)` `dataclass` ¶

A BAMetadata object holds metadata about a given BAM file.

Examples:

Let us use a hypothetical coordinate-sorted BAM file with one read group and two references, as an example:

>>> bametadata = BAMetadata("test.bam")
>>> print(bametadata.sort_by)
coordinate
>>> print(bametadata.read_groups)
[{"ID": "test", "SM": "test"}]
>>> print(bametadata.references)
[{"r1": 1000}, {"r2": 2000}]
>>> print(bametadata)
BAM file: test.bam
Sort by: coordinate
# references: 2
# read groups: 1

Attributes:

Name	Type	Description
`fspath`	`str`	path to the BAM file
`sort_by`	`str`	sort state, e.g. unknown, unsorted, queryname, and coordinate.
`references`	`list[dict[str, int]]`	list of mappings of reference name to its length.
`read_groups`	`list[dict[str, str]]`	list of read group dictionaries

`count_indel_bases(cigar)` ¶

Count the length of indels in the given CIGAR string.

The function counts the number of bases, rather than events.

Examples:

>>> cigar_str = "45M9D27M1I20M5S"
>>> assert count_indel_bases(cigar_str) == 10
true

>>> cigar_list = [("89", "M"), ("1", "I"), ("11", "M")]
>>> assert count_indel_bases(cigar_list) == 1
true

Parameters:

Name	Type	Description	Default
`cigar`	`Union[str, Sequence[tuple[str, str]]]`	a CIGAR string or a list of tuple items parsed by parse_cigar() function.	required

Returns:

Type	Description
`int`	The number of inserted and deleted bases.

`count_indel_events(cigar)` ¶

Count the number of Is and Ds event in a given CIGAR string.

Indel events include: insertion(I) and deletion(D).

The function counts the number of events, rather than inserted and deleted bases.

Examples:

>>> cigar_str = "45H35M1D23M1I30S"
>>> assert count_indel_events(cigar_str) == 2
true

>>> cigar_list = [("89", "M"), ("1", "I"), ("11", "M")]
>>> assert count_indel_events(cigar_list) == 1
true

Parameters:

Name	Type	Description	Default
`cigar`	`Union[str, Sequence[tuple[str, str]]]`	a CIGAR string or a list of tuple items parsed by parse_cigar() function.	required

Returns:

Type	Description
`int`	The number of insertion and deletion events.

`count_mismatch_events(md)` ¶

Count the number of mismatch events in a given MD string.

Mismatches are represented by any non-digit characters, e.g. A, C, G, ^T.

Examples:

>>> md_str = "10A3T0T10"
>>> assert count_mismatch_events(md_str) == 3
true

>>> md_list = ["85", "^A", "16"] # an inserted base A on the read
>>> assert count_mismatch_events(md_list) == 0
true

Parameters:

Name	Type	Description	Default
`md`	`Union[str, Sequence[str]]`	a CIGAR string or a list strings parsed from parse_md() function.	required

Returns:

Type	Description
`int`	The number of mismatch events.

`count_soft_clip_bases(cigar)` ¶

Count the number of soft-clipped bases from a given CIGAR string.

Soft-clipped bases are represented as "S" in the CIGAR string.

Examples:

>>> cigar_str = "27S89M1I11M"
>>> assert count_soft_clip_bases(cigar_str) == 27
true

>>> cigar_list = [("89", "M"), ("1", "I"), ("11", "M")]
>>> assert count_soft_clip_bases(cigar_list) == 0
true

>>> cigar_str = "27S89M1I11M"
>>> assert count_soft_clip_bases(parse_cigar(cigar_str)) == 27 # use result from parse_cigar function
true

Parameters:

Name	Type	Description	Default
`cigar`	`Union[str, Sequence[tuple[str, str]]]`	a CIGAR string or a list of tuple items parsed by parse_cigar() function.	required

Returns:

Type	Description
`int`	The number of soft-clipped bases.

`count_unaligned_events(cigar)` ¶

Count the number of unaligned events in a given CIGAR string.

Unaligned events in this context include: insertion(I), deletion(D), soft-clipping(S), and hard-clipping(H).

The function counts the number of events, rather than unaligned bases.

Examples:

>>> cigar_str = "45H35M1D23M1I30S"
>>> assert count_unaligned_events(cigar_str) == 4
true

>>> cigar_list = [("89", "M"), ("1", "I"), ("11", "M")]
>>> assert count_unaligned_events(cigar_list) == 1
true

>>> cigar_str = "45H35M1D23M1I30S"
>>> assert count_unaligned_events(parse_cigar(cigar_str)) == 4
true

Parameters:

Name	Type	Description	Default
`cigar`	`Union[str, Sequence[tuple[str, str]]]`	a CIGAR string or a list of tuple items parsed by parse_cigar() function.	required

Returns:

Type	Description
`int`	The number of unaligned events.

`parse_cigar(cigar)` ¶

Parse a given CIGAR string into a list of tuple items of (length, operation).

Valid operations in a CIGAR string are: M, I, D, N, S, H, P, =, and X.

Examples:

>>> cigar_str = "27S89M1I11M"
>>> assert parse_cigar(cigar_str) == [
    ("27", "S"),
    ("89", "M"),
    ("1", "I"),
    ("11", "M")
]
true

Parameters:

Name	Type	Description	Default
`cigar`	`str`	a CIGAR string	required

Returns:

Type	Description
`list[tuple[str, str]]`	A list of tuple items, each of which consists of length and operation.

Raises:

Type	Description
`ValueError`	when the given CIGAR string is empty, or when the CIGAR string parsed into an empty list, or when failure of reconstructing parsed list back to the original CIGAR string.

`parse_md(md)` ¶

Parse a given MD string into a list.

Examples:

>>> md_str = "10A3T0T10"
>>> assert parse_cigar(cigar_str) == [
        "10", "A", "3", "T", "0", "T", "10"
    ]
true

Parameters:

Name	Type	Description	Default
`md`	`str`	a MD string	required

Returns:

Type	Description
`list[str]`	A list of strings.

Raises:

Type	Description
`ValueError`	when the given MD string is empty, or when the MD string parsed into an empty list, or when failure of reconstructing parsed list back to the original MD string.

`parse_region(region_string, one_based=True)` ¶

Parse a given samtools-style region string into a tuple of rname, start, and end.

Examples:

>>> from tinyscibio import parse_region
>>> region_str = "chr1"
>>> assert parse_region(region_str) == ("chr1", None, None)
true

>>> region_str = "chr1:100"
>>> assert parse_region(region_str) == ("chr1", 100, None)
true

>>> region_str = "chr1:100-1000"
>>> assert parse_region(region_str) == ("chr1", 100, 1000)
true

>>> region_str = "chr1:100-1000"
>>> assert parse_region(region_str, one_based=True) == ("chr1", 99, 1000)
true

Parameters:

Name	Type	Description	Default
`region_string`	`str`	user-provide string following samtools-style region definition.	required
`one_based`	`bool`	whether or not coordinate in the given string is one-based.	`True`

Returns:

Type	Description
`tuple[str, int \| None, int \| None]`	A tuple of (rname, start, end).

Raises:

Type	Description
`ValueError`	When parser finds no matching pattern, or when start is larger than end coordinate in the given region string.

`walk_bam(fspath, region, exclude=3840, chunk_size=100000, read_groups=None, return_ecnt=False, return_bq=False, return_md=False, return_qname=False)` ¶

Traverse BAM file across given region and collect read-level alignment summary into a dataframe.

Input BAM file must be sorted and indexed to make the walk feasible. Value to region parameter follows the samtools format, and therefore positions are 1-based.

By default, walk_bam returns a dataframe with 7 columns:

- rnames: reference index (easy to map back to sequence name)
- rstarts: 0-based start position on mapped reference
- rends: same as above but marking the end of an alignment
- mqs: mapping qualities
- propers: whether or not a given alignment is properly aligned
- primarys: whether or not a given alignment is primary
- sc_bps: # of soft-clipped base pairs

When return_ecnt is set,

- mm_ecnt: # of mismatch events per alignment
- mm_ecnt: # of indel events per alignment

When return_qname is set,

- qnames: read name

When return_bq is set,

- bqs: base qualities in np.ndarray per alignment

When return_md is set,

- mds: parsed MD in np.ndarray per alignment

Examples:

>>> from tinyscibio import BAMetadata, walk_bam
>>> bam_fspath = "example.bam"
>>> bametadata = BAMetadata(bam_fspath)

Traverse entire region of HLA_A_01_01_01_01 allele:

>>> region_str = "hla_a_01_01_01_01"
>>> df = walk_bam(bam_fspath, region_str)
>>> df = df.with_columns(
    pl.col("rnames").replace_strict(bametadata.idx2seqnames())
)
>>> df.head()
shape: (5, 7)
┌───────────────────┬─────────┬───────┬─────┬─────────┬──────────┬────────┐
│ rnames            ┆ rstarts ┆ rends ┆ mqs ┆ propers ┆ primarys ┆ sc_bps │
│ ---               ┆ ---     ┆ ---   ┆ --- ┆ ---     ┆ ---      ┆ ---    │
│ str               ┆ i32     ┆ i32   ┆ u8  ┆ bool    ┆ bool     ┆ i16    │
╞═══════════════════╪═════════╪═══════╪═════╪═════════╪══════════╪════════╡
│ hla_a_01_01_01_01 ┆ 4       ┆ 104   ┆ 0   ┆ true    ┆ false    ┆ 0      │
│ hla_a_01_01_01_01 ┆ 4       ┆ 102   ┆ 0   ┆ true    ┆ false    ┆ 0      │
│ hla_a_01_01_01_01 ┆ 128     ┆ 229   ┆ 0   ┆ true    ┆ false    ┆ 0      │
│ hla_a_01_01_01_01 ┆ 287     ┆ 388   ┆ 0   ┆ true    ┆ false    ┆ 0      │
│ hla_a_01_01_01_01 ┆ 294     ┆ 395   ┆ 0   ┆ true    ┆ false    ┆ 0      │
└───────────────────┴─────────┴───────┴─────┴─────────┴──────────┴────────┘

Traverse the same region and collect more alignment summaries:

>>> df = walk_bam(bam_fspath, region_str, return_ecnt=True, return_qname=True)
>>> df = df.with_columns(
    pl.col("rnames").replace_strict(bametadata.idx2seqnames())
)
>>> df.head()
shape: (5, 10)
┌───────────────────┬─────────┬───────┬─────┬───┬────────┬────────────────────┬─────────┬────────────┐
│ rnames            ┆ rstarts ┆ rends ┆ mqs ┆ … ┆ sc_bps ┆ qnames             ┆ mm_ecnt ┆ indel_ecnt │
│ ---               ┆ ---     ┆ ---   ┆ --- ┆   ┆ ---    ┆ ---                ┆ ---     ┆ ---        │
│ str               ┆ i32     ┆ i32   ┆ u8  ┆   ┆ i16    ┆ str                ┆ i16     ┆ i16        │
╞═══════════════════╪═════════╪═══════╪═════╪═══╪════════╪════════════════════╪═════════╪════════════╡
│ hla_a_01_01_01_01 ┆ 4       ┆ 104   ┆ 0   ┆ … ┆ 0      ┆ SRR702076.16980922 ┆ 0       ┆ 1          │
│ hla_a_01_01_01_01 ┆ 4       ┆ 102   ┆ 0   ┆ … ┆ 0      ┆ SRR702076.16980922 ┆ 0       ┆ 1          │
│ hla_a_01_01_01_01 ┆ 128     ┆ 229   ┆ 0   ┆ … ┆ 0      ┆ SRR702076.10444434 ┆ 6       ┆ 0          │
│ hla_a_01_01_01_01 ┆ 287     ┆ 388   ┆ 0   ┆ … ┆ 0      ┆ SRR702076.10444434 ┆ 10      ┆ 0          │
│ hla_a_01_01_01_01 ┆ 294     ┆ 395   ┆ 0   ┆ … ┆ 0      ┆ SRR702076.2819225  ┆ 0       ┆ 0          │
└───────────────────┴─────────┴───────┴─────┴───┴────────┴────────────────────┴─────────┴────────────┘

Traverse the HLA_A_01_01_01_01 allele starting at position 100. Note that walk_bam has different behavior from pysam.AlignmentFile.fetch() function. It only grabs alignments from postion 100 and onwards.

>>> region_str = "hla_a_01_01_01_01:100"
>>> df = walk_bam(bam_fspath, region_str, return_ecnt=True)
>>> df = df.with_columns(
    pl.col("rnames").replace_strict(bametadata.idx2seqnames())
)
>>> df.head()
shape: (5, 9)
┌───────────────────┬─────────┬───────┬─────┬───┬──────────┬────────┬─────────┬────────────┐
│ rnames            ┆ rstarts ┆ rends ┆ mqs ┆ … ┆ primarys ┆ sc_bps ┆ mm_ecnt ┆ indel_ecnt │
│ ---               ┆ ---     ┆ ---   ┆ --- ┆   ┆ ---      ┆ ---    ┆ ---     ┆ ---        │
│ str               ┆ i32     ┆ i32   ┆ u8  ┆   ┆ bool     ┆ i16    ┆ i16     ┆ i16        │
╞═══════════════════╪═════════╪═══════╪═════╪═══╪══════════╪════════╪═════════╪════════════╡
│ hla_a_01_01_01_01 ┆ 128     ┆ 229   ┆ 0   ┆ … ┆ false    ┆ 0      ┆ 6       ┆ 0          │
│ hla_a_01_01_01_01 ┆ 287     ┆ 388   ┆ 0   ┆ … ┆ false    ┆ 0      ┆ 10      ┆ 0          │
│ hla_a_01_01_01_01 ┆ 294     ┆ 395   ┆ 0   ┆ … ┆ false    ┆ 0      ┆ 0       ┆ 0          │
│ hla_a_01_01_01_01 ┆ 341     ┆ 442   ┆ 0   ┆ … ┆ false    ┆ 0      ┆ 0       ┆ 0          │
│ hla_a_01_01_01_01 ┆ 638     ┆ 724   ┆ 0   ┆ … ┆ false    ┆ 0      ┆ 9       ┆ 1          │
└───────────────────┴─────────┴───────┴─────┴───┴──────────┴────────┴─────────┴────────────┘

Parameters:

Name	Type	Description	Default
`fspath`	`str`	Path pointing to the BAM file.	required
`region`	`str`	Region string in samtools style, e.g. chr1:100-1000.	required
`exclude`	`int`	Skip alignments whose flags are set with any bits defined.	`3840`
`chunk_size`	`int`	Maximum size of arrays to be held in memory at one time.	`100000`
`read_groups`	`Optional[Set[str]]`	Set of read groups whose alignments are retrieved.	`None`
`return_ecnt`	`bool`	Whether or not returning number of mismatch+indel events.	`False`
`return_bq`	`bool`	Whether or not returning base qualities.	`False`
`return_md`	`bool`	Whether or not returning parsed MD tuples.	`False`
`return_qname`	`bool`	Whether or not returning query/read names.	`False`

Returns:

Type	Description
`DataFrame`	Collection of summaries of read alignments in dataframe.

Work with BAM-related thingy¶

BAMetadata(fspath) dataclass ¶

count_indel_bases(cigar) ¶

count_indel_events(cigar) ¶

count_mismatch_events(md) ¶

count_soft_clip_bases(cigar) ¶

count_unaligned_events(cigar) ¶

parse_cigar(cigar) ¶

parse_md(md) ¶

parse_region(region_string, one_based=True) ¶

walk_bam(fspath, region, exclude=3840, chunk_size=100000, read_groups=None, return_ecnt=False, return_bq=False, return_md=False, return_qname=False) ¶

`BAMetadata(fspath)` `dataclass` ¶

`count_indel_bases(cigar)` ¶

`count_indel_events(cigar)` ¶

`count_mismatch_events(md)` ¶

`count_soft_clip_bases(cigar)` ¶

`count_unaligned_events(cigar)` ¶

`parse_cigar(cigar)` ¶

`parse_md(md)` ¶

`parse_region(region_string, one_based=True)` ¶

`walk_bam(fspath, region, exclude=3840, chunk_size=100000, read_groups=None, return_ecnt=False, return_bq=False, return_md=False, return_qname=False)` ¶