Intervals

NoMatchingChr

Bases: Exception

When no matching chromosome is found between query and subject

NotInt64StartEnd

Bases: Exception

When columns for start and end coords are not of type Int64

NotSatisfyMinColReq

Bases: Exception

When input does not meet the requirement of minimum 3 columns
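
The conditions behind these three exceptions can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: the classes below are local stand-ins that reuse the documented names, `check_intervals` and `error_name` are hypothetical helpers, rows are plain tuples rather than polars dataframes, and the order of the checks is assumed.

```python
# Hypothetical stand-ins that reuse the documented exception names.
class NotSatisfyMinColReq(Exception):
    """Input has fewer than 3 columns."""


class NotInt64StartEnd(Exception):
    """Start/end columns are not integers."""


class NoMatchingChr(Exception):
    """Query and subject share no chromosome."""


def check_intervals(qry, subj):
    """Validate two lists of row tuples shaped (chrom, start, end, ...)."""
    for name, rows in (("qry", qry), ("subj", subj)):
        for row in rows:
            if len(row) < 3:
                raise NotSatisfyMinColReq(f"{name} row has fewer than 3 columns: {row}")
            # Plain Python ints stand in for the Int64 dtype here.
            if not (isinstance(row[1], int) and isinstance(row[2], int)):
                raise NotInt64StartEnd(f"{name} start/end must be integers: {row}")
    # Chromosome names live in the first column.
    if not {row[0] for row in qry} & {row[0] for row in subj}:
        raise NoMatchingChr("no chromosome shared by qry and subj")


def error_name(qry, subj):
    """Return the name of the exception raised, or None if the input is valid."""
    try:
        check_intervals(qry, subj)
    except (NotSatisfyMinColReq, NotInt64StartEnd, NoMatchingChr) as exc:
        return type(exc).__name__
    return None
```

Calling `error_name([("chr3", 800, 850)], [("chr1", 38, 95)])` reports `NoMatchingChr` in this sketch, because the two inputs share no value in the first column.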

bed_to_df(bed_file, one_based=False)

Retrieve intervals from given BED file.

Examples:

Assuming we have a BED file with the following intervals:

│ chr6 ┆ 29909037 ┆ 29913661 │

│ chr6 ┆ 31236526 ┆ 31239869 │

│ chr6 ┆ 31321649 ┆ 31324964 │

>>> import polars as pl
>>> from tinyscibio import bed_to_df
>>> bed_file = "hla.bed"
>>> bed_to_df(bed_file)
shape: (3, 3)
┌──────────┬──────────┬──────────┐
│ column_1 ┆ column_2 ┆ column_3 │
│ ---      ┆ ---      ┆ ---      │
│ str      ┆ i64      ┆ i64      │
╞══════════╪══════════╪══════════╡
│ chr6     ┆ 29909037 ┆ 29913661 │
│ chr6     ┆ 31236526 ┆ 31239869 │
│ chr6     ┆ 31321649 ┆ 31324964 │
└──────────┴──────────┴──────────┘

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| bed_file | _PathLike | string or path object pointing to a BED file. | required |
| one_based | bool | whether or not the given BED file is 1-based. | False |
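
The one_based flag refers to the two common coordinate conventions for intervals. The sketch below illustrates how they relate; it does not describe what bed_to_df does internally, which this page does not state. The helper names are hypothetical, and the conventions assumed are 0-based half-open (standard BED) and 1-based closed.

```python
def to_zero_based_half_open(start: int, end: int) -> tuple:
    """Convert a 1-based closed interval [start, end] to 0-based half-open [start - 1, end)."""
    return start - 1, end


def to_one_based_closed(start: int, end: int) -> tuple:
    """Convert a 0-based half-open interval [start, end) to 1-based closed [start + 1, end]."""
    return start + 1, end


def length_zero_based(start: int, end: int) -> int:
    """Number of bases covered by a 0-based half-open interval."""
    return end - start


def length_one_based(start: int, end: int) -> int:
    """Number of bases covered by a 1-based closed interval."""
    return end - start + 1
```

Only the start coordinate shifts between the two conventions, so the interval length is the same either way.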

Returns:

| Type | Description |
| --- | --- |
| DataFrame | polars dataframe with input columns. |

Raises:

| Type | Description |
| --- | --- |
| FileNotFoundError | when the given path to the BED file does not exist. |
| NotSatisfyMinColReq | when the BED file has fewer than 3 columns. |
| NotInt64StartEnd | when the 2nd and 3rd columns of the BED file are not of type int64. |

find_overlaps(qry, subj)

Find overlaps between query and subject intervals. Strand is not taken into consideration at the moment.

The function returns an empty (not None) polars dataframe if no overlapping intervals are found.

Also, the suffixes "_q" and "_s" are automatically appended to the column names of the query and subject dataframes, respectively, to avoid column name conflicts when overlapping intervals from both inputs are combined.
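
The behaviour described above can be sketched with a naive pairwise comparison in plain Python. This is an illustration, not how find_overlaps is implemented. Assumptions: intervals are half-open [start, end), two intervals overlap when they share a chromosome and each starts before the other ends, and `Interval`, `overlaps` and `naive_find_overlaps` are hypothetical names.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int


def overlaps(q: Interval, s: Interval) -> bool:
    # Assumption: half-open [start, end) coordinates; strand is ignored.
    return q.chrom == s.chrom and q.start < s.end and s.start < q.end


def naive_find_overlaps(qry, subj):
    """Return one dict per overlapping (query, subject) pair, with suffixed keys."""
    rows = []
    for q in qry:
        for s in subj:
            if overlaps(q, s):
                # "_q" / "_s" suffixes keep the two sets of fields apart.
                row = {f"{key}_q": value for key, value in q._asdict().items()}
                row.update({f"{key}_s": value for key, value in s._asdict().items()})
                rows.append(row)
    return rows


qry = [Interval("chr1", 1, 51), Interval("chr2", 200, 240), Interval("chr3", 800, 850)]
subj = [Interval("chr1", 38, 95), Interval("chr2", 120, 330)]
hits = naive_find_overlaps(qry, subj)
```

With these inputs, `hits` holds two rows (the chr1 and chr2 pairs). A query on a shared chromosome that touches no subject interval yields an empty list, mirroring the empty (not None) result described above.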

Examples:

>>> import polars as pl
>>> from tinyscibio import find_overlaps
>>> qry = pl.DataFrame(
...    {
...        "column_1": ["chr1", "chr2", "chr3"],
...        "column_2": [1, 200, 800],
...        "column_3": [51, 240, 850]
...    }
... )
>>> subj = pl.DataFrame(
...    {
...        "column_1": ["chr1", "chr2"],
...        "column_2": [38, 120],
...        "column_3": [95, 330]
...    }
... )
>>> find_overlaps(qry, subj)
shape: (2, 6)
┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
│ column_1_q ┆ column_2_q ┆ column_3_q ┆ column_1_s ┆ column_2_s ┆ column_3_s │
│ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        │
│ str        ┆ i64        ┆ i64        ┆ str        ┆ i64        ┆ i64        │
╞════════════╪════════════╪════════════╪════════════╪════════════╪════════════╡
│ chr1       ┆ 1          ┆ 51         ┆ chr1       ┆ 38         ┆ 95         │
│ chr2       ┆ 200        ┆ 240        ┆ chr2       ┆ 120        ┆ 330        │
└────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| qry | DataFrame | query intervals in a dataframe. | required |
| subj | DataFrame | subject intervals against which qry is queried. | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | polars dataframe with overlapping intervals between query and subject inputs. |

Raises:

| Type | Description |
| --- | --- |
| NotSatisfyMinColReq | when input qry and/or subj dataframes fail to have a minimum of 3 columns. |
| NotInt64StartEnd | when 2nd and 3rd columns of qry and/or subj dataframes have non-int64 types. |
| NoMatchingChr | when no matching chromosomes (1st column) are found between qry and subj dataframes. |