Nucleotide Models

Nucleotide-focused Python model types for parsed HGVS variants.

Type Aliases:

Name Description
NucleotideSequenceItem

Tagged union for supported inserted or replacement nucleotide sequence models.

NucleotideEdit

Tagged union for supported nucleotide edit models.

NucleotideSequenceItem module-attribute

NucleotideSequenceItem: TypeAlias = LiteralSequenceItem | RepeatSequenceItem | CopiedSequenceItem

Tagged union for supported inserted or replacement nucleotide components: LiteralSequenceItem, RepeatSequenceItem, and CopiedSequenceItem.

CoordinateSystem

Bases: str, Enum

Supported HGVS coordinate types.

The coordinate system tells users what kind of biological reference frame is being used by the parsed variant.

Attributes:

Name Type Description
GENOMIC

Genomic DNA coordinates, written with the g. prefix.

CIRCULAR_GENOMIC

Circular genomic DNA coordinates, written with the o. prefix.

MITOCHONDRIAL

Mitochondrial DNA coordinates, written with the m. prefix.

CODING_DNA

Coding DNA coordinates, written with the c. prefix.

NON_CODING_DNA

Non-coding DNA coordinates, written with the n. prefix.

RNA

RNA coordinates, written with the r. prefix.

PROTEIN

Protein coordinates, written with the p. prefix.

Examples:

Genomic DNA variant:

>>> from tinyhgvs import parse_hgvs
>>> variant = parse_hgvs("NC_000023.11:g.33038255C>A")
>>> variant.coordinate_system
<CoordinateSystem.GENOMIC: 'g'>

Coding DNA variant:

>>> variant = parse_hgvs("NM_004006.2:c.357+1G>A")
>>> variant.coordinate_system
<CoordinateSystem.CODING_DNA: 'c'>

Protein variant:

>>> variant = parse_hgvs("NP_003997.1:p.Trp24Ter")
>>> variant.coordinate_system
<CoordinateSystem.PROTEIN: 'p'>
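
Because the enum mixes in str, members compare equal to their one-letter prefix. The following is a minimal sketch of that behaviour using a stand-in enum, not the tinyhgvs class itself. The names CoordinateSystemSketch and coordinate_system_of are hypothetical, and the values 'o', 'm', 'n' and 'r' are assumed from the prefixes listed above; only 'g', 'c' and 'p' appear in the documented output.

```python
from enum import Enum


class CoordinateSystemSketch(str, Enum):
    """Stand-in enum for illustration; not imported from tinyhgvs."""

    GENOMIC = "g"
    CIRCULAR_GENOMIC = "o"  # assumed value
    MITOCHONDRIAL = "m"  # assumed value
    CODING_DNA = "c"
    NON_CODING_DNA = "n"  # assumed value
    RNA = "r"  # assumed value
    PROTEIN = "p"


def coordinate_system_of(hgvs: str) -> CoordinateSystemSketch:
    """Read the prefix between the first ':' and the following '.'."""
    _, _, description = hgvs.partition(":")
    prefix, dot, _ = description.partition(".")
    if not dot:
        raise ValueError(f"no coordinate prefix in {hgvs!r}")
    # Enum lookup by value raises ValueError for unknown prefixes.
    return CoordinateSystemSketch(prefix)


system = coordinate_system_of("NM_004006.2:c.357+1G>A")
print(system is CoordinateSystemSketch.CODING_DNA)
print(system == "c")  # the str mixin allows comparison with plain strings
```

Comparing with `is` against an enum member is the stricter check; comparing with a plain string is convenient when the prefix comes from other text processing.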

NucleotideAnchor

Bases: str, Enum

HGVS reference point used to interpret a nucleotide position.

Attributes:

Name Type Description
ABSOLUTE

Coordinate is read directly on the named reference sequence.

RELATIVE_CDS_START

Coordinate is read relative to the CDS start site.

RELATIVE_CDS_END

Coordinate is read relative to the CDS end site.

Examples:

An intronic splice-site substitution is read directly on the reference sequence, so its anchor is absolute:

>>> from tinyhgvs import parse_hgvs
>>> variant = parse_hgvs("NM_004006.2:c.357+1G>A")
>>> variant.description.location.start.anchor
<NucleotideAnchor.ABSOLUTE: 'absolute'>

A 5' UTR substitution is anchored to the CDS start:

>>> variant = parse_hgvs("NM_007373.4:c.-1C>T")
>>> variant.description.location.start.anchor
<NucleotideAnchor.RELATIVE_CDS_START: 'relative_cds_start'>

A 3' UTR substitution is anchored to the CDS end:

>>> variant = parse_hgvs("NM_001272071.2:c.*1C>T")
>>> variant.description.location.start.anchor
<NucleotideAnchor.RELATIVE_CDS_END: 'relative_cds_end'>

NucleotideCoordinate dataclass

NucleotideCoordinate(anchor: NucleotideAnchor, coordinate: int, offset: int = 0)

Nucleotide coordinate with anchor and signed offset semantics.

Attributes:

Name Type Description
anchor NucleotideAnchor

Reference point used to interpret the coordinate.

coordinate int

Primary HGVS coordinate as written. For example, c.-81 uses coordinate == -81 and c.*24 uses coordinate == 24.

offset int

Signed secondary displacement from the primary coordinate. Positive values move downstream and negative values move upstream.

Examples:

Duplication crossing an exon/intron border:

>>> from tinyhgvs import parse_hgvs
>>> variant = parse_hgvs("NC_000023.11(NM_004006.2):c.260_264+48dup")
>>> variant.description.location.start.coordinate
260
>>> variant.description.location.end.coordinate
264
>>> variant.description.location.end.offset
48

Upstream intronic substitution:

>>> variant = parse_hgvs("NG_012232.1(NM_004006.2):c.264-2A>G")
>>> variant.description.location.start.coordinate
264
>>> variant.description.location.start.offset
-2

5' UTR substitution:

>>> variant = parse_hgvs("NM_007373.4:c.-1C>T")
>>> variant.description.location.start.anchor
<NucleotideAnchor.RELATIVE_CDS_START: 'relative_cds_start'>
>>> variant.description.location.start.coordinate
-1
>>> variant.description.location.start.offset
0

3' UTR substitution:

>>> variant = parse_hgvs("NM_001272071.2:c.*1C>T")
>>> variant.description.location.start.anchor
<NucleotideAnchor.RELATIVE_CDS_END: 'relative_cds_end'>
>>> variant.description.location.start.coordinate
1
>>> variant.description.location.start.offset
0
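
The anchor, coordinate and offset semantics above can be sketched with a small regular expression over a single c. position token. This is an illustration only, assuming the three-field layout documented here; AnchorSketch, CoordinateSketch and parse_position are hypothetical names and do not reflect how tinyhgvs parses positions internally.

```python
import re
from dataclasses import dataclass
from enum import Enum


class AnchorSketch(str, Enum):
    """Stand-in for NucleotideAnchor; values taken from the documented output."""

    ABSOLUTE = "absolute"
    RELATIVE_CDS_START = "relative_cds_start"
    RELATIVE_CDS_END = "relative_cds_end"


@dataclass(frozen=True)
class CoordinateSketch:
    """Stand-in for NucleotideCoordinate."""

    anchor: AnchorSketch
    coordinate: int
    offset: int = 0


# Optional "-" (before CDS start) or "*" (after CDS end), the primary
# number, then an optional signed offset such as "+48" or "-2".
_POSITION = re.compile(r"(?P<prefix>[-*]?)(?P<number>\d+)(?P<offset>[+-]\d+)?")


def parse_position(token: str) -> CoordinateSketch:
    """Split one position token into anchor, coordinate and offset."""
    match = _POSITION.fullmatch(token)
    if match is None:
        raise ValueError(f"unsupported position: {token!r}")
    number = int(match["number"])
    offset = int(match["offset"] or 0)
    if match["prefix"] == "-":
        # c.-81 keeps its sign in the coordinate.
        return CoordinateSketch(AnchorSketch.RELATIVE_CDS_START, -number, offset)
    if match["prefix"] == "*":
        # c.*24 stores a positive coordinate with the CDS-end anchor.
        return CoordinateSketch(AnchorSketch.RELATIVE_CDS_END, number, offset)
    return CoordinateSketch(AnchorSketch.ABSOLUTE, number, offset)


for token in ("264+48", "264-2", "-1", "*1"):
    print(token, parse_position(token))
```

The sketch handles one position at a time; an interval such as 260_264+48 would be split on the underscore first.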

is_intronic property

is_intronic: bool

Return True for positions that carry an offset from their anchored coordinate, such as c.357+1 or c.264-2.

Examples:

>>> from tinyhgvs import parse_hgvs
>>> parse_hgvs("NM_004006.2:c.357+1G>A").description.location.start.is_intronic
True
>>> parse_hgvs("NG_012232.1(NM_004006.2):c.264-2A>G").description.location.start.is_intronic
True

is_five_prime_utr property

is_five_prime_utr: bool

Return True for positions in the 5' UTR.

Examples:

>>> from tinyhgvs import parse_hgvs
>>> position = parse_hgvs("NM_007373.4:c.-123C>T").description.location.start
>>> position.is_five_prime_utr
True
>>> position.is_three_prime_utr
False

is_three_prime_utr property

is_three_prime_utr: bool

Return True for positions in the 3' UTR.

Examples:

>>> from tinyhgvs import parse_hgvs
>>> position = parse_hgvs("NM_001272071.2:c.*1C>T").description.location.start
>>> position.is_three_prime_utr
True
>>> position.is_five_prime_utr
False
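
One plausible way to derive the three properties from the stored fields is sketched below. The rules are assumptions inferred from the examples on this page (a nonzero offset means intronic, the CDS-start anchor means 5' UTR, the CDS-end anchor means 3' UTR); PositionSketch is a hypothetical stand-in, and the library may apply different or stricter rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PositionSketch:
    """Stand-in for NucleotideCoordinate with assumed property rules."""

    anchor: str  # "absolute", "relative_cds_start" or "relative_cds_end"
    coordinate: int
    offset: int = 0

    @property
    def is_intronic(self) -> bool:
        # Assumption: any nonzero offset marks an intronic position.
        return self.offset != 0

    @property
    def is_five_prime_utr(self) -> bool:
        # Assumption: only c.-N positions use the CDS-start anchor.
        return self.anchor == "relative_cds_start"

    @property
    def is_three_prime_utr(self) -> bool:
        # Assumption: only c.*N positions use the CDS-end anchor.
        return self.anchor == "relative_cds_end"


splice_site = PositionSketch("absolute", 357, 1)  # c.357+1
five_prime = PositionSketch("relative_cds_start", -123)  # c.-123
three_prime = PositionSketch("relative_cds_end", 1)  # c.*1

print(splice_site.is_intronic, five_prime.is_five_prime_utr, three_prime.is_three_prime_utr)
```

Under these assumed rules a position such as c.-81+5 would be reported as both 5' UTR and intronic; whether tinyhgvs does the same is not stated here.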

CopiedSequenceItem dataclass

CopiedSequenceItem(source_reference: ReferenceSpec | None, source_coordinate_system: CoordinateSystem | None, source_location: Interval[NucleotideCoordinate], is_inverted: bool)

Copied nucleotide sequence used in an insertion or deletion-insertion.

Attributes:

Name Type Description
source_reference ReferenceSpec | None

Source reference when the copied sequence comes from a different accession. None means the same outer reference.

source_coordinate_system CoordinateSystem | None

Source coordinate system when it differs from the outer variant. None means the same outer coordinate system.

source_location Interval[NucleotideCoordinate]

Inclusive interval on the source reference.

is_inverted bool

Whether the copied sequence is inserted in reverse orientation.

Examples:

A stretch of sequence from the same transcript is inserted in reverse orientation:

>>> from tinyhgvs import parse_hgvs
>>> variant = parse_hgvs("NM_004006.2:c.849_850ins850_900inv")
>>> item = variant.description.edit.items[0]
>>> item.is_from_same_reference
True
>>> item.source_location.start.coordinate
850
>>> item.source_location.end.coordinate
900
>>> item.is_inverted
True

A copied sequence can also come from another chromosome:

>>> variant = parse_hgvs("NC_000002.11:g.47643464_47643465ins[NC_000022.10:g.35788169_35788352]")
>>> item = variant.description.edit.items[0]
>>> item.source_reference.primary.id
'NC_000022.10'
>>> item.source_coordinate_system
<CoordinateSystem.GENOMIC: 'g'>

LiteralSequenceItem dataclass

LiteralSequenceItem(value: str)

Sequence item holding literal nucleotide bases in an insertion or deletion-insertion.

Attributes:

Name Type Description
value str

Nucleotide bases.

Examples:

A literal insertion of three nucleotides:

>>> from tinyhgvs import LiteralSequenceItem, parse_hgvs
>>> variant = parse_hgvs("NC_000023.10:g.32862923_32862924insCCT")
>>> item = variant.description.edit.items[0]
>>> isinstance(item, LiteralSequenceItem)
True
>>> item.value
'CCT'

RepeatSequenceItem dataclass

RepeatSequenceItem(unit: str, count: int)

Sequence item describing a repeated unit in an insertion or deletion-insertion.

Attributes:

Name Type Description
unit str

Repeat unit of nucleotide bases.

count int

Number of units being repeated.

Examples:

The insertion contains 100 copies of N:

>>> from tinyhgvs import RepeatSequenceItem, parse_hgvs
>>> variant = parse_hgvs("NC_000023.10:g.32717298_32717299insN[100]")
>>> item = variant.description.edit.items[0]
>>> isinstance(item, RepeatSequenceItem)
True
>>> item.unit
'N'
>>> item.count
100
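
Code that consumes a tuple of sequence items usually needs to branch on the concrete item type. The sketch below shows that dispatch with isinstance, rendering each item back into HGVS-like text. It uses simplified stand-in dataclasses (LiteralItem, RepeatItem, CopiedItem) rather than the tinyhgvs models; in particular, the copied item here stores plain integers instead of an Interval[NucleotideCoordinate] and ignores cross-reference sources.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass(frozen=True)
class LiteralItem:
    value: str


@dataclass(frozen=True)
class RepeatItem:
    unit: str
    count: int


@dataclass(frozen=True)
class CopiedItem:
    start: int  # simplified: plain integers, same-reference copies only
    end: int
    is_inverted: bool = False


SequenceItem = Union[LiteralItem, RepeatItem, CopiedItem]


def render_item(item: SequenceItem) -> str:
    """Turn one item into the text it would have inside an ins/delins."""
    if isinstance(item, LiteralItem):
        return item.value
    if isinstance(item, RepeatItem):
        return f"{item.unit}[{item.count}]"
    if isinstance(item, CopiedItem):
        suffix = "inv" if item.is_inverted else ""
        return f"{item.start}_{item.end}{suffix}"
    raise TypeError(f"unsupported item: {item!r}")


def render_items(items: Sequence[SequenceItem]) -> str:
    """Join items; composite lists are bracketed and ';'-separated."""
    parts = [render_item(item) for item in items]
    if len(parts) == 1:
        return parts[0]
    return "[" + ";".join(parts) + "]"


print(render_items((LiteralItem("T"), CopiedItem(450, 470), LiteralItem("AGGG"))))
print(render_items((RepeatItem("N", 100),)))
```

With the real models, the same branching would apply to the LiteralSequenceItem, RepeatSequenceItem and CopiedSequenceItem classes documented above.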

NucleotideRepeatBlock dataclass

NucleotideRepeatBlock(count: int, unit: str | None = None, location: Interval[NucleotideCoordinate] | None = None)

One repeat block/unit in a nucleotide repeat description.

Attributes:

Name Type Description
count int

Number of repeated units.

unit str | None

Literal bases of the repeat unit. None when the block is written as location[count].

location Interval[NucleotideCoordinate] | None

Reference interval that defines the repeat unit. None when the block is written as unit[count].

Examples:

A literal 3 bp repeat unit with 23 copies:

>>> from tinyhgvs import parse_hgvs
>>> variant = parse_hgvs("NC_000014.8:g.123CAG[23]")
>>> block = variant.description.edit.blocks[0]
>>> block.unit
'CAG'
>>> block.count
23
>>> block.location is None
True

An RNA repeat variant composed of consecutive repeat blocks, each written as location[count] rather than unit[count]:

>>> variant = parse_hgvs("NM_004006.3:r.456_465[4]466_489[9]490_499[3]")
>>> block = variant.description.edit.blocks[1]
>>> block.unit is None
True
>>> block.location.start.coordinate
466
>>> block.location.end.coordinate
489

NucleotideSequenceOmittedEdit

Bases: str, Enum

Nucleotide edits whose altered sequence is not written explicitly.

Attributes:

Name Type Description
NO_CHANGE

No nucleotide change, written as =.

DELETION

Deletion of the reference interval, written as del.

DUPLICATION

Duplication of the reference interval, written as dup.

INVERSION

Inversion of the reference interval, written as inv.

Examples:

A coding DNA deletion:

>>> from tinyhgvs import NucleotideSequenceOmittedEdit, parse_hgvs
>>> variant = parse_hgvs("NM_004006.2:c.5697del")
>>> variant.description.edit is NucleotideSequenceOmittedEdit.DELETION
True

A genomic duplication:

>>> variant = parse_hgvs("NC_000001.11:g.1234_2345dup")
>>> variant.description.edit
<NucleotideSequenceOmittedEdit.DUPLICATION: 'duplication'>

NucleotideSubstitutionEdit dataclass

NucleotideSubstitutionEdit(reference: str, alternate: str)

Model describing nucleotide substitution.

Examples:

A reference base C is substituted by A at the described location.

>>> from tinyhgvs import parse_hgvs
>>> variant = parse_hgvs("NC_000023.10:g.33038255C>A")
>>> variant_edit = variant.description.edit
>>> variant_edit.reference
'C'
>>> variant_edit.alternate
'A'
>>> variant_edit.kind
'substitution'

NucleotideInsertionEdit dataclass

NucleotideInsertionEdit(items: tuple[NucleotideSequenceItem, ...])

Model describing nucleotide insertion.

Attributes:

Name Type Description
items tuple[NucleotideSequenceItem, ...]

Inserted sequence items in the order they appear in the HGVS expression.

kind Literal['insertion']

Edit kind.

Examples:

Literal nucleotide insertion:

>>> from tinyhgvs import parse_hgvs
>>> variant = parse_hgvs("NC_000023.10:g.32862923_32862924insCCT")
>>> variant_edit = variant.description.edit
>>> variant_edit.items[0].value
'CCT'

A composite insertion can mix literal and copied sequence:

>>> variant = parse_hgvs("LRG_199t1:c.419_420ins[T;450_470;AGGG]")
>>> variant_edit = variant.description.edit
>>> len(variant_edit.items)
3
>>> variant_edit.items[0].value
'T'
>>> variant_edit.items[1].is_from_same_reference
True
>>> variant_edit.items[2].value
'AGGG'

Insertion of unspecified repeated bases:

>>> variant = parse_hgvs("NC_000023.10:g.32717298_32717299insN[100]")
>>> variant_edit = variant.description.edit
>>> variant_edit.items[0].unit
'N'
>>> variant_edit.items[0].count
100

NucleotideDeletionInsertionEdit dataclass

NucleotideDeletionInsertionEdit(items: tuple[NucleotideSequenceItem, ...])

Model describing nucleotide deletion-insertion.

Attributes:

Name Type Description
items tuple[NucleotideSequenceItem, ...]

Replacement sequence items in the order they appear in the HGVS expression.

kind Literal['deletion_insertion']

Edit kind.

Examples:

A deleted interval is replaced by one literal sequence component.

>>> from tinyhgvs import parse_hgvs
>>> variant = parse_hgvs("LRG_199t1:c.850_901delinsTTCCTCGATGCCTG")
>>> variant_edit = variant.description.edit
>>> variant_edit.items[0].value
'TTCCTCGATGCCTG'

A deleted interval can be replaced by copied sequence from the same reference:

>>> variant = parse_hgvs("NC_000022.10:g.42522624_42522669delins42536337_42536382")
>>> variant_edit = variant.description.edit
>>> variant_edit.items[0].source_location.start.coordinate
42536337

A deleted interval is replaced by repeated unspecified bases.

>>> variant = parse_hgvs("NM_004006.2:c.812_829delinsN[12]")
>>> variant_edit = variant.description.edit
>>> variant_edit.items[0].unit
'N'
>>> variant_edit.items[0].count
12

NucleotideRepeatEdit dataclass

NucleotideRepeatEdit(blocks: tuple[NucleotideRepeatBlock, ...])

Model describing a top-level nucleotide repeat variant.

Attributes:

Name Type Description
blocks tuple[NucleotideRepeatBlock, ...]

Repeat blocks/units written in the HGVS description.

kind Literal['repeat']

Edit kind.

Examples:

A DNA repeat variant with explicit repeat unit:

>>> from tinyhgvs import parse_hgvs
>>> variant = parse_hgvs("NC_000014.8:g.123CAG[23]")
>>> variant_edit = variant.description.edit
>>> variant_edit.blocks[0].unit
'CAG'
>>> variant_edit.blocks[0].count
23

An RNA repeat variant composed of consecutive blocks, each represented as a location span:

>>> variant = parse_hgvs("NM_004006.3:r.456_465[4]466_489[9]490_499[3]")
>>> variant_edit = variant.description.edit
>>> len(variant_edit.blocks)
3
>>> variant_edit.blocks[2].count
3

NucleotideVariant dataclass

NucleotideVariant(location: Interval[NucleotideCoordinate], edit: NucleotideEdit)

Model describing a nucleotide-level variant.

Attributes:

Name Type Description
location Interval[NucleotideCoordinate]

Inclusive nucleotide interval where the edit is applied.

edit NucleotideEdit

Nucleotide edit applied at that interval.

Examples:

A splice-site substitution is represented by a nucleotide location and a nucleotide substitution edit.

>>> from tinyhgvs import NucleotideSubstitutionEdit, parse_hgvs
>>> variant = parse_hgvs("NM_004006.2:c.357+1G>A")
>>> isinstance(variant.description.edit, NucleotideSubstitutionEdit)
True
>>> variant_description = variant.description
>>> variant_description.location.start.coordinate
357
>>> variant_description.location.start.offset
1
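
The edit field can hold either an enum member (for edits whose sequence is omitted) or a dataclass with a kind tag. A consumer might therefore check for the enum first and branch on kind afterwards, as in the sketch below. All classes here are simplified stand-ins; the 'deletion' value is assumed, and whether the real enum members also expose kind is not stated on this page.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Literal, Tuple, Union


class OmittedEditSketch(str, Enum):
    """Stand-in for NucleotideSequenceOmittedEdit (two members only)."""

    DELETION = "deletion"  # assumed value
    DUPLICATION = "duplication"


@dataclass(frozen=True)
class SubstitutionSketch:
    reference: str
    alternate: str
    kind: Literal["substitution"] = "substitution"


@dataclass(frozen=True)
class InsertionSketch:
    items: Tuple[str, ...]  # simplified: plain strings instead of item models
    kind: Literal["insertion"] = "insertion"


EditSketch = Union[OmittedEditSketch, SubstitutionSketch, InsertionSketch]


def summarize(edit: EditSketch) -> str:
    """Produce a short label for an edit by inspecting its tag."""
    if isinstance(edit, OmittedEditSketch):
        return edit.value
    if edit.kind == "substitution":
        return f"{edit.reference}>{edit.alternate}"
    if edit.kind == "insertion":
        return f"insertion of {len(edit.items)} item(s)"
    raise ValueError(f"unsupported edit: {edit!r}")


print(summarize(SubstitutionSketch("G", "A")))
print(summarize(OmittedEditSketch.DUPLICATION))
print(summarize(InsertionSketch(("CCT",))))
```

Branching on the kind string keeps the dispatch independent of import paths, while isinstance checks against the real classes give type checkers more to work with.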