HGVS Models
Public Python data model for parsed HGVS variants.
The package is split into:
- :mod:`tinyhgvs.models.shared` for shared reference and coordinate models
- :mod:`tinyhgvs.models.nucleotide` for nucleotide coordinates, edits, and variants
- :mod:`tinyhgvs.models.protein` for protein coordinates, effects, and variants
Type Aliases:
| Name | Description |
|---|---|
| VariantDescription | Tagged union for supported top-level variant models. |
VariantDescription
module-attribute
VariantDescription: TypeAlias = NucleotideVariant | ProteinVariant
Tagged union for supported top-level variant models.
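As a minimal sketch of how a consumer might branch on this union, the block below narrows it with `isinstance`. The two dataclasses are simplified stand-ins written for illustration only (the real models live in `tinyhgvs.models` and carry richer field types), and `describe_kind` is a hypothetical helper, not part of the package.

```python
from dataclasses import dataclass
from typing import Union


# Stand-in models for illustration; field types are simplified to strings.
@dataclass(frozen=True)
class NucleotideVariant:
    location: str
    edit: str


@dataclass(frozen=True)
class ProteinVariant:
    effect: str


VariantDescription = Union[NucleotideVariant, ProteinVariant]


def describe_kind(description: VariantDescription) -> str:
    """Hypothetical helper: label a description by narrowing the union."""
    if isinstance(description, NucleotideVariant):
        return "nucleotide"
    if isinstance(description, ProteinVariant):
        return "protein"
    raise TypeError(f"unsupported description: {type(description).__name__}")
```

Raising on the fall-through branch keeps the helper from silently mislabeling a model type added to the union later.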
HgvsVariant
dataclass
HgvsVariant(reference: ReferenceSpec | None, coordinate_system: CoordinateSystem, description: VariantDescription)
Top-level model describing a parsed HGVS variant.
This is the root object returned by :func:`tinyhgvs.parse_hgvs`. It ties the
reference field, coordinate system, and parsed variant description together.
Attributes:
| Name | Type | Description |
|---|---|---|
| reference | ReferenceSpec \| None | Reference sequence field preceding the `:` in an HGVS string. |
| coordinate_system | CoordinateSystem | HGVS coordinate type. |
| description | VariantDescription | Model describing a nucleotide or protein variant. |
Examples:
A coding DNA splice-site substitution parsed into reference, location, and edit models:
>>> from tinyhgvs import parse_hgvs
>>> variant = parse_hgvs("NM_004006.2:c.357+1G>A")
>>> variant.reference.primary.id
'NM_004006.2'
>>> variant.coordinate_system.value
'c'
>>> variant_description = variant.description
>>> variant_location = variant_description.location
>>> variant_location.start.coordinate
357
>>> variant_location.start.offset
1
>>> variant_location.start.anchor
<NucleotideAnchor.ABSOLUTE: 'absolute'>
>>> variant_location.end is None
True
>>> variant_edit = variant_description.edit
>>> variant_edit
NucleotideSubstitutionEdit(reference='G', alternate='A', kind='substitution')
A 5' UTR substitution keeps the signed coordinate from the HGVS string:
>>> utr = parse_hgvs("NM_007373.4:c.-1C>T")
>>> utr.description.location.start.coordinate
-1
>>> utr.description.location.start.is_five_prime_utr
True
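To make the three top-level fields concrete, here is a rough sketch of how an HGVS string divides into a reference part, a coordinate-system letter, and the remaining description text. `split_top_level` is a hypothetical helper written for this page; it is not how `parse_hgvs` is implemented and it performs no validation of the pieces.

```python
from typing import Optional, Tuple


def split_top_level(expression: str) -> Tuple[Optional[str], str, str]:
    """Hypothetical helper: split into (reference, system letter, description text)."""
    head, colon, tail = expression.partition(":")
    # Without a colon there is no reference field, mirroring `reference: ReferenceSpec | None`.
    reference, rest = (head, tail) if colon else (None, head)
    system, dot, body = rest.partition(".")
    if not dot or not system or not body:
        raise ValueError(f"not a '<type>.<description>' expression: {expression!r}")
    return reference, system, body


print(split_top_level("NM_004006.2:c.357+1G>A"))
```

Splitting on the first `:` before looking for a `.` matters because accession versions such as `NM_004006.2` contain dots of their own.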
Shared Python model types for parsed HGVS variants.
This module groups the model pieces used by both nucleotide and protein variants: reference identifiers, coordinate-system labels, and generic intervals over a reference.
CoordinateSystem
Bases: str, Enum
Supported HGVS coordinate types.
The coordinate system tells users what kind of biological reference frame is being used by the parsed variant.
Attributes:
| Name | Type | Description |
|---|---|---|
| GENOMIC | | Genomic DNA coordinates written as `g.` |
| CIRCULAR_GENOMIC | | Circular genomic DNA coordinates written as `o.` |
| MITOCHONDRIAL | | Mitochondrial DNA coordinates written as `m.` |
| CODING_DNA | | Coding DNA coordinates written as `c.` |
| NON_CODING_DNA | | Non-coding DNA coordinates written as `n.` |
| RNA | | RNA coordinates written as `r.` |
| PROTEIN | | Protein coordinates written as `p.` |
Examples:
Genomic DNA variant:
>>> from tinyhgvs import parse_hgvs
>>> variant = parse_hgvs("NC_000023.11:g.33038255C>A")
>>> variant.coordinate_system
<CoordinateSystem.GENOMIC: 'g'>
Coding DNA variant:
>>> variant = parse_hgvs("NM_004006.2:c.357+1G>A")
>>> variant.coordinate_system
<CoordinateSystem.CODING_DNA: 'c'>
Protein variant:
>>> variant = parse_hgvs("NP_003997.1:p.Trp24Ter")
>>> variant.coordinate_system
<CoordinateSystem.PROTEIN: 'p'>
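Because the enum derives from both `str` and `Enum`, members compare equal to their one-letter values and can be looked up from them. The sketch below uses a stand-in enum to show that behaviour; the values for `g`, `c`, and `p` match the examples above, while `o`, `m`, `n`, and `r` are assumed from the standard HGVS prefixes. `is_nucleotide` is a hypothetical helper.

```python
from enum import Enum


class CoordinateSystem(str, Enum):
    """Stand-in for illustration; some member values are assumptions (see lead-in)."""

    GENOMIC = "g"
    CIRCULAR_GENOMIC = "o"
    MITOCHONDRIAL = "m"
    CODING_DNA = "c"
    NON_CODING_DNA = "n"
    RNA = "r"
    PROTEIN = "p"


def is_nucleotide(system: CoordinateSystem) -> bool:
    """Hypothetical helper: every system except protein describes nucleotides."""
    return system is not CoordinateSystem.PROTEIN


# Lookup by value returns the existing member.
coding = CoordinateSystem("c")
# The str mixin lets a member compare equal to its raw letter.
print(coding is CoordinateSystem.CODING_DNA, coding == "c", is_nucleotide(coding))
```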
Accession
dataclass
Sequence accession with optional version.
Attributes:
| Name | Type | Description |
|---|---|---|
| id | str | Accession string exactly as it appears in the HGVS expression. |
| version | int \| None | Parsed version suffix when one is present. |
Examples:
A RefSeq protein accession with version:
>>> from tinyhgvs import parse_hgvs
>>> variant = parse_hgvs("NP_003997.2:p.Val7del")
>>> variant.reference.primary.id
'NP_003997.2'
>>> variant.reference.primary.version
2
A transcript accession without an explicit contextual reference:
>>> variant = parse_hgvs("NM_007373.4:c.-1C>T")
>>> variant.reference.primary.id
'NM_007373.4'
>>> variant.reference.primary.version
4
A genomic accession with transcript context:
>>> variant = parse_hgvs("ENSG00000160190.9(ENST00000352133.2):c.1521+898G>A")
>>> variant.reference.primary.id
'ENSG00000160190.9'
>>> variant.reference.primary.version
9
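The examples above show that `id` keeps the full accession text while `version` holds the numeric suffix. A minimal sketch of that relationship follows, using a stand-in dataclass and a hypothetical `split_accession` helper; the package's own parsing rules may differ, for instance in how unversioned accessions are handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Accession:
    """Stand-in mirroring the two documented attributes."""

    id: str
    version: Optional[int]


def split_accession(text: str) -> Accession:
    """Hypothetical helper: keep the full string as id, read a trailing '.N' as version."""
    _, dot, suffix = text.rpartition(".")
    # Only a purely numeric suffix after the last dot counts as a version.
    version = int(suffix) if dot and suffix.isdecimal() else None
    return Accession(id=text, version=version)


print(split_accession("NP_003997.2"))
```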
ReferenceSpec
dataclass
Reference sequence field preceding the `:` in an HGVS string.
Attributes:
| Name | Type | Description |
|---|---|---|
| primary | Accession | Primary accession being described. |
| context | Accession \| None | Optional contextual accession, commonly a transcript nested inside a genomic reference. |
Examples:
A variant described directly on one reference sequence:
>>> from tinyhgvs import parse_hgvs
>>> variant = parse_hgvs("NC_000023.10:g.33038255C>A")
>>> variant.reference.primary.id
'NC_000023.10'
>>> variant.reference.context is None
True
A coding variant described on a genomic reference with transcript context:
>>> variant = parse_hgvs("NC_000023.11(NM_004006.2):c.3921dup")
>>> variant.reference.primary.id
'NC_000023.11'
>>> variant.reference.context.id
'NM_004006.2'
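The `primary(context)` layout in the second example can be illustrated with a small regular expression. This is only a sketch under the assumption that the context, when present, is a single parenthesised accession at the end of the reference field; `split_reference` is a hypothetical helper and returns plain strings rather than `Accession` objects.

```python
import re
from typing import Optional, Tuple

# Primary accession, optionally followed by one parenthesised context accession.
_REFERENCE = re.compile(r"(?P<primary>[^()]+)(?:\((?P<context>[^()]+)\))?")


def split_reference(text: str) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: return (primary, context) from a reference field."""
    match = _REFERENCE.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognised reference field: {text!r}")
    return match.group("primary"), match.group("context")


print(split_reference("NC_000023.11(NM_004006.2)"))
```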
Interval
dataclass
Bases: Generic[PositionT]
Inclusive interval used for nucleotide and protein locations.
When end is omitted, the interval represents a single coordinate.
Attributes:
| Name | Type | Description |
|---|---|---|
| start | PositionT | Start position in the range. |
| end | PositionT \| None | Optional inclusive end position. |
Examples:
A single-position nucleotide substitution has no end coordinate:
>>> from tinyhgvs import parse_hgvs
>>> variant = parse_hgvs("NM_004006.2:c.357+1G>A")
>>> variant_location = variant.description.location
>>> variant_location.start.coordinate
357
>>> variant_location.end is None
True
A protein deletion spanning multiple residues has both start and end:
>>> variant = parse_hgvs("NP_003997.2:p.Lys23_Val25del")
>>> variant_location = variant.description.effect.location
>>> variant_location.start.residue
'Lys'
>>> variant_location.end.residue
'Val'
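The inclusive, optional-end semantics can be sketched with a small generic dataclass. The class below is a stand-in that mirrors the documented `start` and `end` attributes; `inclusive_length` is a hypothetical helper, and it uses bare integers as positions for simplicity, whereas the package's position models carry more than a number.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

PositionT = TypeVar("PositionT")


@dataclass(frozen=True)
class Interval(Generic[PositionT]):
    """Stand-in: inclusive interval whose end may be omitted."""

    start: PositionT
    end: Optional[PositionT] = None


def inclusive_length(interval: Interval[int]) -> int:
    """Hypothetical helper: number of positions covered, counting both endpoints."""
    if interval.end is None:
        # A missing end means the interval is a single coordinate.
        return 1
    return interval.end - interval.start + 1


# Lys23_Val25 spans residues 23, 24, and 25.
print(inclusive_length(Interval(23, 25)), inclusive_length(Interval(357)))
```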