Work with BAM-related thingy¶
BAMetadata(fspath)
dataclass
¶
A BAMetadata object holds metadata about a given BAM file.
Examples:
Let us use a hypothetical coordinate-sorted BAM file with one read group and two references, as an example:
>>> bametadata = BAMetadata("test.bam")
>>> print(bametadata.sort_by)
coordinate
>>> print(bametadata.read_groups)
[{"ID": "test", "SM": "test"}]
>>> print(bametadata.references)
[{"r1": 1000}, {"r2": 2000}]
>>> print(bametadata)
BAM file: test.bam
Sort by: coordinate
# references: 2
# read groups: 1
Attributes:
Name | Type | Description |
---|---|---|
fspath |
str
|
path to the BAM file |
sort_by |
str
|
sort state, e.g. unknown, unsorted, queryname, and coordinate. |
references |
list[dict[str, int]]
|
list of mappings of reference name to its length. |
read_groups |
list[dict[str, str]]
|
list of read group dictionaries |
count_indel_bases(cigar)
¶
Count the length of indels in the given CIGAR string.
The function counts the number of bases, rather than events.
Examples:
>>> cigar_list = [("89", "M"), ("1", "I"), ("11", "M")]
>>> assert count_indel_bases(cigar_list) == 1
true
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cigar
|
Union[str, Sequence[tuple[str, str]]]
|
a CIGAR string or a list of tuple items parsed by parse_cigar() function. |
required |
Returns:
Type | Description |
---|---|
int
|
The number of inserted and deleted bases. |
count_indel_events(cigar)
¶
Count the number of Is and Ds event in a given CIGAR string.
Indel events include: insertion(I) and deletion(D).
The function counts the number of events, rather than inserted and deleted bases.
Examples:
>>> cigar_list = [("89", "M"), ("1", "I"), ("11", "M")]
>>> assert count_indel_events(cigar_list) == 1
true
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cigar
|
Union[str, Sequence[tuple[str, str]]]
|
a CIGAR string or a list of tuple items parsed by parse_cigar() function. |
required |
Returns:
Type | Description |
---|---|
int
|
The number of insertion and deletion events. |
count_mismatch_events(md)
¶
Count the number of mismatch events in a given MD string.
Mismatches are represented by any non-digit characters, e.g. A, C, G, ^T.
Examples:
>>> md_list = ["85", "^A", "16"] # an inserted base A on the read
>>> assert count_mismatch_events(md_list) == 0
true
Parameters:
Name | Type | Description | Default |
---|---|---|---|
md
|
Union[str, Sequence[str]]
|
a CIGAR string or a list strings parsed from parse_md() function. |
required |
Returns:
Type | Description |
---|---|
int
|
The number of mismatch events. |
count_soft_clip_bases(cigar)
¶
Count the number of soft-clipped bases from a given CIGAR string.
Soft-clipped bases are represented as "S" in the CIGAR string.
Examples:
>>> cigar_list = [("89", "M"), ("1", "I"), ("11", "M")]
>>> assert count_soft_clip_bases(cigar_list) == 0
true
>>> cigar_str = "27S89M1I11M"
>>> assert count_soft_clip_bases(parse_cigar(cigar_str)) == 27 # use result from parse_cigar function
true
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cigar
|
Union[str, Sequence[tuple[str, str]]]
|
a CIGAR string or a list of tuple items parsed by parse_cigar() function. |
required |
Returns:
Type | Description |
---|---|
int
|
The number of soft-clipped bases. |
count_unaligned_events(cigar)
¶
Count the number of unaligned events in a given CIGAR string.
Unaligned events in this context include: insertion(I), deletion(D), soft-clipping(S), and hard-clipping(H).
The function counts the number of events, rather than unaligned bases.
Examples:
>>> cigar_list = [("89", "M"), ("1", "I"), ("11", "M")]
>>> assert count_unaligned_events(cigar_list) == 1
true
>>> cigar_str = "45H35M1D23M1I30S"
>>> assert count_unaligned_events(parse_cigar(cigar_str)) == 4
true
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cigar
|
Union[str, Sequence[tuple[str, str]]]
|
a CIGAR string or a list of tuple items parsed by parse_cigar() function. |
required |
Returns:
Type | Description |
---|---|
int
|
The number of unaligned events. |
parse_cigar(cigar)
¶
Parse a given CIGAR string into a list of tuple items of (length, operation).
Valid operations in a CIGAR string are: M, I, D, N, S, H, P, =, and X.
Examples:
>>> cigar_str = "27S89M1I11M"
>>> assert parse_cigar(cigar_str) == [
("27", "S"),
("89", "M"),
("1", "I"),
("11", "M")
]
true
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cigar
|
str
|
a CIGAR string |
required |
Returns:
Type | Description |
---|---|
list[tuple[str, str]]
|
A list of tuple items, each of which consists of length and operation. |
Raises:
Type | Description |
---|---|
ValueError
|
when the given CIGAR string is empty, or when the CIGAR string parsed into an empty list, or when failure of reconstructing parsed list back to the original CIGAR string. |
parse_md(md)
¶
Parse a given MD string into a list.
Examples:
>>> md_str = "10A3T0T10"
>>> assert parse_cigar(cigar_str) == [
"10", "A", "3", "T", "0", "T", "10"
]
true
Parameters:
Name | Type | Description | Default |
---|---|---|---|
md
|
str
|
a MD string |
required |
Returns:
Type | Description |
---|---|
list[str]
|
A list of strings. |
Raises:
Type | Description |
---|---|
ValueError
|
when the given MD string is empty, or when the MD string parsed into an empty list, or when failure of reconstructing parsed list back to the original MD string. |
parse_region(region_string, one_based=True)
¶
Parse a given samtools-style region string into a tuple of rname, start, and end.
Examples:
>>> from tinyscibio import parse_region
>>> region_str = "chr1"
>>> assert parse_region(region_str) == ("chr1", None, None)
true
>>> region_str = "chr1:100-1000"
>>> assert parse_region(region_str, one_based=True) == ("chr1", 99, 1000)
true
Parameters:
Name | Type | Description | Default |
---|---|---|---|
region_string
|
str
|
user-provide string following samtools-style region definition. |
required |
one_based
|
bool
|
whether or not coordinate in the given string is one-based. |
True
|
Returns:
Type | Description |
---|---|
tuple[str, int | None, int | None]
|
A tuple of (rname, start, end). |
Raises:
Type | Description |
---|---|
ValueError
|
When parser finds no matching pattern, or when start is larger than end coordinate in the given region string. |
walk_bam(fspath, region, exclude=3840, chunk_size=100000, read_groups=None, return_ecnt=False, return_bq=False, return_md=False, return_qname=False)
¶
Traverse BAM file across given region and collect read-level alignment summary into a dataframe.
Input BAM file must be sorted and indexed to make the walk feasible. Value to region parameter follows the samtools format, and therefore positions are 1-based.
By default, walk_bam returns a dataframe with 7 columns:
- rnames: reference index (easy to map back to sequence name)
- rstarts: 0-based start position on mapped reference
- rends: same as above but marking the end of an alignment
- mqs: mapping qualities
- propers: whether or not a given alignment is properly aligned
- primarys: whether or not a given alignment is primary
- sc_bps: # of soft-clipped base pairs
When return_ecnt is set,
- mm_ecnt: # of mismatch events per alignment
- mm_ecnt: # of indel events per alignment
When return_qname is set,
- qnames: read name
When return_bq is set,
- bqs: base qualities in np.ndarray per alignment
When return_md is set,
- mds: parsed MD in np.ndarray per alignment
Examples:
>>> from tinyscibio import BAMetadata, walk_bam
>>> bam_fspath = "example.bam"
>>> bametadata = BAMetadata(bam_fspath)
Traverse entire region of HLA_A_01_01_01_01 allele:
>>> region_str = "hla_a_01_01_01_01"
>>> df = walk_bam(bam_fspath, region_str)
>>> df = df.with_columns(
pl.col("rnames").replace_strict(bametadata.idx2seqnames())
)
>>> df.head()
shape: (5, 7)
┌───────────────────┬─────────┬───────┬─────┬─────────┬──────────┬────────┐
│ rnames ┆ rstarts ┆ rends ┆ mqs ┆ propers ┆ primarys ┆ sc_bps │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i32 ┆ i32 ┆ u8 ┆ bool ┆ bool ┆ i16 │
╞═══════════════════╪═════════╪═══════╪═════╪═════════╪══════════╪════════╡
│ hla_a_01_01_01_01 ┆ 4 ┆ 104 ┆ 0 ┆ true ┆ false ┆ 0 │
│ hla_a_01_01_01_01 ┆ 4 ┆ 102 ┆ 0 ┆ true ┆ false ┆ 0 │
│ hla_a_01_01_01_01 ┆ 128 ┆ 229 ┆ 0 ┆ true ┆ false ┆ 0 │
│ hla_a_01_01_01_01 ┆ 287 ┆ 388 ┆ 0 ┆ true ┆ false ┆ 0 │
│ hla_a_01_01_01_01 ┆ 294 ┆ 395 ┆ 0 ┆ true ┆ false ┆ 0 │
└───────────────────┴─────────┴───────┴─────┴─────────┴──────────┴────────┘
Traverse the same region and collect more alignment summaries:
>>> df = walk_bam(bam_fspath, region_str, return_ecnt=True, return_qname=True)
>>> df = df.with_columns(
pl.col("rnames").replace_strict(bametadata.idx2seqnames())
)
>>> df.head()
shape: (5, 10)
┌───────────────────┬─────────┬───────┬─────┬───┬────────┬────────────────────┬─────────┬────────────┐
│ rnames ┆ rstarts ┆ rends ┆ mqs ┆ … ┆ sc_bps ┆ qnames ┆ mm_ecnt ┆ indel_ecnt │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i32 ┆ i32 ┆ u8 ┆ ┆ i16 ┆ str ┆ i16 ┆ i16 │
╞═══════════════════╪═════════╪═══════╪═════╪═══╪════════╪════════════════════╪═════════╪════════════╡
│ hla_a_01_01_01_01 ┆ 4 ┆ 104 ┆ 0 ┆ … ┆ 0 ┆ SRR702076.16980922 ┆ 0 ┆ 1 │
│ hla_a_01_01_01_01 ┆ 4 ┆ 102 ┆ 0 ┆ … ┆ 0 ┆ SRR702076.16980922 ┆ 0 ┆ 1 │
│ hla_a_01_01_01_01 ┆ 128 ┆ 229 ┆ 0 ┆ … ┆ 0 ┆ SRR702076.10444434 ┆ 6 ┆ 0 │
│ hla_a_01_01_01_01 ┆ 287 ┆ 388 ┆ 0 ┆ … ┆ 0 ┆ SRR702076.10444434 ┆ 10 ┆ 0 │
│ hla_a_01_01_01_01 ┆ 294 ┆ 395 ┆ 0 ┆ … ┆ 0 ┆ SRR702076.2819225 ┆ 0 ┆ 0 │
└───────────────────┴─────────┴───────┴─────┴───┴────────┴────────────────────┴─────────┴────────────┘
Traverse the HLA_A_01_01_01_01 allele starting at position 100. Note that walk_bam has different behavior from pysam.AlignmentFile.fetch() function. It only grabs alignments from postion 100 and onwards.
>>> region_str = "hla_a_01_01_01_01:100"
>>> df = walk_bam(bam_fspath, region_str, return_ecnt=True)
>>> df = df.with_columns(
pl.col("rnames").replace_strict(bametadata.idx2seqnames())
)
>>> df.head()
shape: (5, 9)
┌───────────────────┬─────────┬───────┬─────┬───┬──────────┬────────┬─────────┬────────────┐
│ rnames ┆ rstarts ┆ rends ┆ mqs ┆ … ┆ primarys ┆ sc_bps ┆ mm_ecnt ┆ indel_ecnt │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i32 ┆ i32 ┆ u8 ┆ ┆ bool ┆ i16 ┆ i16 ┆ i16 │
╞═══════════════════╪═════════╪═══════╪═════╪═══╪══════════╪════════╪═════════╪════════════╡
│ hla_a_01_01_01_01 ┆ 128 ┆ 229 ┆ 0 ┆ … ┆ false ┆ 0 ┆ 6 ┆ 0 │
│ hla_a_01_01_01_01 ┆ 287 ┆ 388 ┆ 0 ┆ … ┆ false ┆ 0 ┆ 10 ┆ 0 │
│ hla_a_01_01_01_01 ┆ 294 ┆ 395 ┆ 0 ┆ … ┆ false ┆ 0 ┆ 0 ┆ 0 │
│ hla_a_01_01_01_01 ┆ 341 ┆ 442 ┆ 0 ┆ … ┆ false ┆ 0 ┆ 0 ┆ 0 │
│ hla_a_01_01_01_01 ┆ 638 ┆ 724 ┆ 0 ┆ … ┆ false ┆ 0 ┆ 9 ┆ 1 │
└───────────────────┴─────────┴───────┴─────┴───┴──────────┴────────┴─────────┴────────────┘
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fspath
|
str
|
Path pointing to the BAM file. |
required |
region
|
str
|
Region string in samtools style, e.g. chr1:100-1000. |
required |
exclude
|
int
|
Skip alignments whose flags are set with any bits defined. |
3840
|
chunk_size
|
int
|
Maximum size of arrays to be held in memory at one time. |
100000
|
read_groups
|
Optional[Set[str]]
|
Set of read groups whose alignments are retrieved. |
None
|
return_ecnt
|
bool
|
Whether or not returning number of mismatch+indel events. |
False
|
return_bq
|
bool
|
Whether or not returning base qualities. |
False
|
return_md
|
bool
|
Whether or not returning parsed MD tuples. |
False
|
return_qname
|
bool
|
Whether or not returning query/read names. |
False
|
Returns:
Type | Description |
---|---|
DataFrame
|
Collection of summaries of read alignments in dataframe. |