Work with BAM-related thingy¶
              BAMetadata(fspath)
  
      dataclass
  
¶
    A BAMetadata object holds metadata about a given BAM file.
Examples:
Let us use a hypothetical coordinate-sorted BAM file with one read group and two references, as an example:
>>> bametadata = BAMetadata("test.bam")
>>> print(bametadata.sort_by)
coordinate
>>> print(bametadata.read_groups)
[{"ID": "test", "SM": "test"}]
>>> print(bametadata.references)
[{"r1": 1000}, {"r2": 2000}]
>>> print(bametadata)
BAM file: test.bam
Sort by: coordinate
# references: 2
# read groups: 1
Attributes:
| Name | Type | Description | 
|---|---|---|
fspath | 
            
                  str
             | 
            
               path to the BAM file  | 
          
sort_by | 
            
                  str
             | 
            
               sort state, e.g. unknown, unsorted, queryname, and coordinate.  | 
          
references | 
            
                  list[dict[str, int]]
             | 
            
               list of mappings of reference name to its length.  | 
          
read_groups | 
            
                  list[dict[str, str]]
             | 
            
               list of read group dictionaries  | 
          
            count_indel_bases(cigar)
¶
    Count the length of indels in the given CIGAR string.
The function counts the number of bases, rather than events.
Examples:
>>> cigar_list = [("89", "M"), ("1", "I"), ("11", "M")]
>>> assert count_indel_bases(cigar_list) == 1
true
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
                cigar
             | 
            
                  Union[str, Sequence[tuple[str, str]]]
             | 
            
               a CIGAR string or a list of tuple items parsed by parse_cigar() function.  | 
            required | 
Returns:
| Type | Description | 
|---|---|
                  int
             | 
            
               The number of inserted and deleted bases.  | 
          
            count_indel_events(cigar)
¶
    Count the number of Is and Ds event in a given CIGAR string.
Indel events include: insertion(I) and deletion(D).
The function counts the number of events, rather than inserted and deleted bases.
Examples:
>>> cigar_list = [("89", "M"), ("1", "I"), ("11", "M")]
>>> assert count_indel_events(cigar_list) == 1
true
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
                cigar
             | 
            
                  Union[str, Sequence[tuple[str, str]]]
             | 
            
               a CIGAR string or a list of tuple items parsed by parse_cigar() function.  | 
            required | 
Returns:
| Type | Description | 
|---|---|
                  int
             | 
            
               The number of insertion and deletion events.  | 
          
            count_mismatch_events(md)
¶
    Count the number of mismatch events in a given MD string.
Mismatches are represented by any non-digit characters, e.g. A, C, G, ^T.
Examples:
>>> md_list = ["85", "^A", "16"] # an inserted base A on the read
>>> assert count_mismatch_events(md_list) == 0
true
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
                md
             | 
            
                  Union[str, Sequence[str]]
             | 
            
               a CIGAR string or a list strings parsed from parse_md() function.  | 
            required | 
Returns:
| Type | Description | 
|---|---|
                  int
             | 
            
               The number of mismatch events.  | 
          
            count_soft_clip_bases(cigar)
¶
    Count the number of soft-clipped bases from a given CIGAR string.
Soft-clipped bases are represented as "S" in the CIGAR string.
Examples:
>>> cigar_list = [("89", "M"), ("1", "I"), ("11", "M")]
>>> assert count_soft_clip_bases(cigar_list) == 0
true
>>> cigar_str = "27S89M1I11M"
>>> assert count_soft_clip_bases(parse_cigar(cigar_str)) == 27 # use result from parse_cigar function
true
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
                cigar
             | 
            
                  Union[str, Sequence[tuple[str, str]]]
             | 
            
               a CIGAR string or a list of tuple items parsed by parse_cigar() function.  | 
            required | 
Returns:
| Type | Description | 
|---|---|
                  int
             | 
            
               The number of soft-clipped bases.  | 
          
            count_unaligned_events(cigar)
¶
    Count the number of unaligned events in a given CIGAR string.
Unaligned events in this context include: insertion(I), deletion(D), soft-clipping(S), and hard-clipping(H).
The function counts the number of events, rather than unaligned bases.
Examples:
>>> cigar_list = [("89", "M"), ("1", "I"), ("11", "M")]
>>> assert count_unaligned_events(cigar_list) == 1
true
>>> cigar_str = "45H35M1D23M1I30S"
>>> assert count_unaligned_events(parse_cigar(cigar_str)) == 4
true
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
                cigar
             | 
            
                  Union[str, Sequence[tuple[str, str]]]
             | 
            
               a CIGAR string or a list of tuple items parsed by parse_cigar() function.  | 
            required | 
Returns:
| Type | Description | 
|---|---|
                  int
             | 
            
               The number of unaligned events.  | 
          
            parse_cigar(cigar)
¶
    Parse a given CIGAR string into a list of tuple items of (length, operation).
Valid operations in a CIGAR string are: M, I, D, N, S, H, P, =, and X.
Examples:
>>> cigar_str = "27S89M1I11M"
>>> assert parse_cigar(cigar_str) == [
    ("27", "S"),
    ("89", "M"),
    ("1", "I"),
    ("11", "M")
]
true
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
                cigar
             | 
            
                  str
             | 
            
               a CIGAR string  | 
            required | 
Returns:
| Type | Description | 
|---|---|
                  list[tuple[str, str]]
             | 
            
               A list of tuple items, each of which consists of length and operation.  | 
          
Raises:
| Type | Description | 
|---|---|
                  ValueError
             | 
            
               when the given CIGAR string is empty, or when the CIGAR string parsed into an empty list, or when failure of reconstructing parsed list back to the original CIGAR string.  | 
          
            parse_md(md)
¶
    Parse a given MD string into a list.
Examples:
>>> md_str = "10A3T0T10"
>>> assert parse_cigar(cigar_str) == [
        "10", "A", "3", "T", "0", "T", "10"
    ]
true
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
                md
             | 
            
                  str
             | 
            
               a MD string  | 
            required | 
Returns:
| Type | Description | 
|---|---|
                  list[str]
             | 
            
               A list of strings.  | 
          
Raises:
| Type | Description | 
|---|---|
                  ValueError
             | 
            
               when the given MD string is empty, or when the MD string parsed into an empty list, or when failure of reconstructing parsed list back to the original MD string.  | 
          
            parse_region(region_string, one_based=True)
¶
    Parse a given samtools-style region string into a tuple of rname, start, and end.
Examples:
>>> from tinyscibio import parse_region
>>> region_str = "chr1"
>>> assert parse_region(region_str) == ("chr1", None, None)
true
>>> region_str = "chr1:100-1000"
>>> assert parse_region(region_str, one_based=True) == ("chr1", 99, 1000)
true
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
                region_string
             | 
            
                  str
             | 
            
               user-provide string following samtools-style region definition.  | 
            required | 
                one_based
             | 
            
                  bool
             | 
            
               whether or not coordinate in the given string is one-based.  | 
            
                  True
             | 
          
Returns:
| Type | Description | 
|---|---|
                  tuple[str, int | None, int | None]
             | 
            
               A tuple of (rname, start, end).  | 
          
Raises:
| Type | Description | 
|---|---|
                  ValueError
             | 
            
               When parser finds no matching pattern, or when start is larger than end coordinate in the given region string.  | 
          
            walk_bam(fspath, region, exclude=3840, chunk_size=100000, read_groups=None, return_ecnt=False, return_bq=False, return_md=False, return_qname=False, return_qseq=False)
¶
    Traverse BAM file across given region and collect read-level alignment summary into a dataframe.
Input BAM file must be sorted and indexed to make the walk feasible. Value to region parameter follows the samtools format, and therefore positions are 1-based.
By default, walk_bam returns a dataframe with 7 columns:
- rnames: reference index (easy to map back to sequence name)
- rstarts: 0-based start position on mapped reference
- rends: same as above but marking the end of an alignment
- mqs: mapping qualities
- propers: whether or not a given alignment is properly aligned
- primarys: whether or not a given alignment is primary
- sc_bps: # of soft-clipped base pairs
When return_ecnt is set,
- mm_ecnt: # of mismatch events per alignment
- mm_ecnt: # of indel events per alignment
When return_qname is set,
- qnames: read name
When return_qseq is set,
- qseqs: read sequence
When return_bq is set,
- bqs: base qualities in np.ndarray per alignment
When return_md is set,
- mds: parsed MD in np.ndarray per alignment
Examples:
>>> from tinyscibio import BAMetadata, walk_bam
>>> bam_fspath = "example.bam"
>>> bametadata = BAMetadata(bam_fspath)
Traverse entire region of HLA_A_01_01_01_01 allele:
>>> region_str = "hla_a_01_01_01_01"
>>> df = walk_bam(bam_fspath, region_str)
>>> df = df.with_columns(
    pl.col("rnames").replace_strict(bametadata.idx2seqnames())
)
>>> df.head()
shape: (5, 7)
┌───────────────────┬─────────┬───────┬─────┬─────────┬──────────┬────────┐
│ rnames            ┆ rstarts ┆ rends ┆ mqs ┆ propers ┆ primarys ┆ sc_bps │
│ ---               ┆ ---     ┆ ---   ┆ --- ┆ ---     ┆ ---      ┆ ---    │
│ str               ┆ i32     ┆ i32   ┆ u8  ┆ bool    ┆ bool     ┆ i16    │
╞═══════════════════╪═════════╪═══════╪═════╪═════════╪══════════╪════════╡
│ hla_a_01_01_01_01 ┆ 4       ┆ 104   ┆ 0   ┆ true    ┆ false    ┆ 0      │
│ hla_a_01_01_01_01 ┆ 4       ┆ 102   ┆ 0   ┆ true    ┆ false    ┆ 0      │
│ hla_a_01_01_01_01 ┆ 128     ┆ 229   ┆ 0   ┆ true    ┆ false    ┆ 0      │
│ hla_a_01_01_01_01 ┆ 287     ┆ 388   ┆ 0   ┆ true    ┆ false    ┆ 0      │
│ hla_a_01_01_01_01 ┆ 294     ┆ 395   ┆ 0   ┆ true    ┆ false    ┆ 0      │
└───────────────────┴─────────┴───────┴─────┴─────────┴──────────┴────────┘
Traverse the same region and collect more alignment summaries:
>>> df = walk_bam(bam_fspath, region_str, return_ecnt=True, return_qname=True)
>>> df = df.with_columns(
    pl.col("rnames").replace_strict(bametadata.idx2seqnames())
)
>>> df.head()
shape: (5, 10)
┌───────────────────┬─────────┬───────┬─────┬───┬────────┬────────────────────┬─────────┬────────────┐
│ rnames            ┆ rstarts ┆ rends ┆ mqs ┆ … ┆ sc_bps ┆ qnames             ┆ mm_ecnt ┆ indel_ecnt │
│ ---               ┆ ---     ┆ ---   ┆ --- ┆   ┆ ---    ┆ ---                ┆ ---     ┆ ---        │
│ str               ┆ i32     ┆ i32   ┆ u8  ┆   ┆ i16    ┆ str                ┆ i16     ┆ i16        │
╞═══════════════════╪═════════╪═══════╪═════╪═══╪════════╪════════════════════╪═════════╪════════════╡
│ hla_a_01_01_01_01 ┆ 4       ┆ 104   ┆ 0   ┆ … ┆ 0      ┆ SRR702076.16980922 ┆ 0       ┆ 1          │
│ hla_a_01_01_01_01 ┆ 4       ┆ 102   ┆ 0   ┆ … ┆ 0      ┆ SRR702076.16980922 ┆ 0       ┆ 1          │
│ hla_a_01_01_01_01 ┆ 128     ┆ 229   ┆ 0   ┆ … ┆ 0      ┆ SRR702076.10444434 ┆ 6       ┆ 0          │
│ hla_a_01_01_01_01 ┆ 287     ┆ 388   ┆ 0   ┆ … ┆ 0      ┆ SRR702076.10444434 ┆ 10      ┆ 0          │
│ hla_a_01_01_01_01 ┆ 294     ┆ 395   ┆ 0   ┆ … ┆ 0      ┆ SRR702076.2819225  ┆ 0       ┆ 0          │
└───────────────────┴─────────┴───────┴─────┴───┴────────┴────────────────────┴─────────┴────────────┘
Traverse the HLA_A_01_01_01_01 allele starting at position 100. Note that walk_bam has different behavior from pysam.AlignmentFile.fetch() function. It only grabs alignments from postion 100 and onwards.
>>> region_str = "hla_a_01_01_01_01:100"
>>> df = walk_bam(bam_fspath, region_str, return_ecnt=True)
>>> df = df.with_columns(
    pl.col("rnames").replace_strict(bametadata.idx2seqnames())
)
>>> df.head()
shape: (5, 9)
┌───────────────────┬─────────┬───────┬─────┬───┬──────────┬────────┬─────────┬────────────┐
│ rnames            ┆ rstarts ┆ rends ┆ mqs ┆ … ┆ primarys ┆ sc_bps ┆ mm_ecnt ┆ indel_ecnt │
│ ---               ┆ ---     ┆ ---   ┆ --- ┆   ┆ ---      ┆ ---    ┆ ---     ┆ ---        │
│ str               ┆ i32     ┆ i32   ┆ u8  ┆   ┆ bool     ┆ i16    ┆ i16     ┆ i16        │
╞═══════════════════╪═════════╪═══════╪═════╪═══╪══════════╪════════╪═════════╪════════════╡
│ hla_a_01_01_01_01 ┆ 128     ┆ 229   ┆ 0   ┆ … ┆ false    ┆ 0      ┆ 6       ┆ 0          │
│ hla_a_01_01_01_01 ┆ 287     ┆ 388   ┆ 0   ┆ … ┆ false    ┆ 0      ┆ 10      ┆ 0          │
│ hla_a_01_01_01_01 ┆ 294     ┆ 395   ┆ 0   ┆ … ┆ false    ┆ 0      ┆ 0       ┆ 0          │
│ hla_a_01_01_01_01 ┆ 341     ┆ 442   ┆ 0   ┆ … ┆ false    ┆ 0      ┆ 0       ┆ 0          │
│ hla_a_01_01_01_01 ┆ 638     ┆ 724   ┆ 0   ┆ … ┆ false    ┆ 0      ┆ 9       ┆ 1          │
└───────────────────┴─────────┴───────┴─────┴───┴──────────┴────────┴─────────┴────────────┘
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
                fspath
             | 
            
                  str
             | 
            
               Path pointing to the BAM file.  | 
            required | 
                region
             | 
            
                  str
             | 
            
               Region string in samtools style, e.g. chr1:100-1000.  | 
            required | 
                exclude
             | 
            
                  int
             | 
            
               Skip alignments whose flags are set with any bits defined.  | 
            
                  3840
             | 
          
                chunk_size
             | 
            
                  int
             | 
            
               Maximum size of arrays to be held in memory at one time.  | 
            
                  100000
             | 
          
                read_groups
             | 
            
                  Optional[Set[str]]
             | 
            
               Set of read groups whose alignments are retrieved.  | 
            
                  None
             | 
          
                return_ecnt
             | 
            
                  bool
             | 
            
               Whether or not returning number of mismatch+indel events.  | 
            
                  False
             | 
          
                return_bq
             | 
            
                  bool
             | 
            
               Whether or not returning base qualities.  | 
            
                  False
             | 
          
                return_md
             | 
            
                  bool
             | 
            
               Whether or not returning parsed MD tuples.  | 
            
                  False
             | 
          
                return_qname
             | 
            
                  bool
             | 
            
               Whether or not returning query/read names.  | 
            
                  False
             | 
          
                return_qseq
             | 
            
                  bool
             | 
            
               Whether or not returning query/read sequences.  | 
            
                  False
             | 
          
Returns:
| Type | Description | 
|---|---|
                  DataFrame
             | 
            
               Collection of summaries of read alignments in dataframe.  |