Intervals
NoMatchingChr
Bases: Exception
Raised when no matching chromosome is found between the query and subject.
NotInt64StartEnd
Bases: Exception
Raised when the columns holding start and end coordinates are not of type Int64.
NotSatisfyMinColReq
Bases: Exception
Raised when the input does not meet the minimum requirement of 3 columns.
bed_to_df(bed_file, one_based=False)
Retrieve intervals from given BED file.
Examples:
Assuming we have a BED file with the following intervals:
│ chr6 ┆ 29909037 ┆ 29913661 │
│ chr6 ┆ 31236526 ┆ 31239869 │
│ chr6 ┆ 31321649 ┆ 31324964 │
>>> import polars as pl
>>> from tinyscibio import bed_to_df
>>> bed_file = "hla.bed"
>>> bed_to_df(bed_file)
shape: (3, 3)
┌──────────┬──────────┬──────────┐
│ column_1 ┆ column_2 ┆ column_3 │
│ ---      ┆ ---      ┆ ---      │
│ str      ┆ i64      ┆ i64      │
╞══════════╪══════════╪══════════╡
│ chr6     ┆ 29909037 ┆ 29913661 │
│ chr6     ┆ 31236526 ┆ 31239869 │
│ chr6     ┆ 31321649 ┆ 31324964 │
└──────────┴──────────┴──────────┘
Parameters:
Name | Type | Description | Default |
---|---|---|---|
bed_file | _PathLike | string or path object pointing to a BED file. | required |
one_based | bool | whether or not the given BED file is 1-based. | False |
Returns:
Type | Description |
---|---|
DataFrame | polars dataframe with input columns. |
Raises:
Type | Description |
---|---|
FileNotFoundError | when the given path to the BED file does not exist. |
NotSatisfyMinColReq | when the given BED file has fewer than 3 columns. |
NotInt64StartEnd | when the 2nd and 3rd columns of the BED file are not of type int64. |
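The checks listed under Raises can be illustrated with a minimal standard-library sketch. This is not the tinyscibio implementation: `read_bed_rows`, `TooFewColumns`, and `NonIntegerCoords` are hypothetical names introduced here, the file is assumed to be tab-separated, and the result is a plain list of tuples rather than a polars dataframe. The `one_based` option is deliberately left out because its exact behaviour is not described above.

```python
import csv
import os
import tempfile


class TooFewColumns(Exception):
    """Hypothetical stand-in for NotSatisfyMinColReq."""


class NonIntegerCoords(Exception):
    """Hypothetical stand-in for NotInt64StartEnd."""


def read_bed_rows(bed_file):
    """Parse a tab-separated BED file into (chrom, start, end) tuples."""
    if not os.path.exists(bed_file):
        raise FileNotFoundError(bed_file)
    rows = []
    with open(bed_file, newline="") as handle:
        for fields in csv.reader(handle, delimiter="\t"):
            if not fields:
                continue  # skip blank lines
            if len(fields) < 3:
                raise TooFewColumns(
                    f"expected at least 3 columns, got {len(fields)}"
                )
            try:
                start, end = int(fields[1]), int(fields[2])
            except ValueError as exc:
                raise NonIntegerCoords(
                    f"non-integer coordinates in {fields}"
                ) from exc
            rows.append((fields[0], start, end))
    return rows


# Write the three HLA intervals from the example above to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hla.bed")
    with open(path, "w") as out:
        out.write("chr6\t29909037\t29913661\n")
        out.write("chr6\t31236526\t31239869\n")
        out.write("chr6\t31321649\t31324964\n")
    rows = read_bed_rows(path)

print(rows)
```

Extra columns beyond the first three are ignored in this sketch, whereas `bed_to_df` is documented as returning a dataframe with the input columns.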
find_overlaps(qry, subj)
Find overlaps between query and subject intervals. Strand is not taken into consideration at the moment.
The function returns an empty (not None) polars dataframe if no overlapping intervals are found.
The suffixes "_q" and "_s" are automatically appended to the column names of the query and subject dataframes, respectively, to avoid column-name conflicts when overlapping intervals from query and subject are combined.
Examples:
>>> import polars as pl
>>> from tinyscibio import find_overlaps
>>> qry = pl.DataFrame(
... {
... "column_1": ["chr1", "chr2", "chr3"],
... "column_2": [1, 200, 800],
... "column_3": [51, 240, 850]
... }
... )
>>> subj = pl.DataFrame(
... {
... "column_1": ["chr1", "chr2"],
... "column_2": [38, 120],
... "column_3": [95, 330]
... }
... )
>>> find_overlaps(qry, subj)
shape: (2, 6)
┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
│ column_1_q ┆ column_2_q ┆ column_3_q ┆ column_1_s ┆ column_2_s ┆ column_3_s │
│ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        │
│ str        ┆ i64        ┆ i64        ┆ str        ┆ i64        ┆ i64        │
╞════════════╪════════════╪════════════╪════════════╪════════════╪════════════╡
│ chr1       ┆ 1          ┆ 51         ┆ chr1       ┆ 38         ┆ 95         │
│ chr2       ┆ 200        ┆ 240        ┆ chr2       ┆ 120        ┆ 330        │
└────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘
Parameters:
Name | Type | Description | Default |
---|---|---|---|
qry | DataFrame | query intervals in dataframe. | required |
subj | DataFrame | subject intervals against which qry is queried. | required |
Returns:
Type | Description |
---|---|
DataFrame | polars dataframe with overlapping intervals between query and subject inputs. |
Raises:
Type | Description |
---|---|
NotSatisfyMinColReq | when input qry and/or subj dataframes fail to have a minimum of 3 columns. |
NotInt64StartEnd | when 2nd and 3rd columns of qry and/or subj dataframes have non-int64 types. |
NoMatchingChr | when no matching chromosomes (1st column) found between qry and subj dataframes. |
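The overlap semantics described for find_overlaps can be sketched in plain Python. This is an illustrative sketch only, not the library's algorithm: `Interval` and `find_overlaps_sketch` are hypothetical names, the intervals are assumed to follow the usual BED half-open convention, and the sketch returns an empty list instead of raising NoMatchingChr when the inputs share no chromosome.

```python
from collections import defaultdict
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int


def find_overlaps_sketch(qry, subj):
    """Return (query, subject) pairs on the same chromosome that overlap."""
    # Group subject intervals by chromosome so that each query interval
    # is only compared against candidates on its own chromosome.
    subj_by_chrom = defaultdict(list)
    for s in subj:
        subj_by_chrom[s.chrom].append(s)

    pairs = []
    for q in qry:
        for s in subj_by_chrom.get(q.chrom, []):
            # Two intervals overlap when each starts before the other ends.
            if q.start < s.end and s.start < q.end:
                pairs.append((q, s))
    return pairs


# Same intervals as in the find_overlaps example above.
qry = [
    Interval("chr1", 1, 51),
    Interval("chr2", 200, 240),
    Interval("chr3", 800, 850),
]
subj = [Interval("chr1", 38, 95), Interval("chr2", 120, 330)]

pairs = find_overlaps_sketch(qry, subj)
for q, s in pairs:
    print(*q, *s)
```

The nested loop is quadratic per chromosome, which is acceptable for a handful of intervals; sorting the intervals or using an interval tree would be the usual choice for larger inputs.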