FGPYO

API

SAM/BAM/CRAM files

Utility Classes and Methods for SAM/BAM

This module contains utility classes for working with SAM/BAM files and the data contained within them. This includes i) utilities for opening SAM/BAM files for reading and writing, ii) functions for manipulating supplementary alignments, iii) classes and functions for manipulating CIGAR strings, and iv) a class for building SAM records and files for testing.

Motivation for Reader and Writer methods

The following are the reasons for choosing to implement methods to open a SAM/BAM file for reading and writing, rather than relying on pysam.AlignmentFile directly:

  1. Provides a centralized place for the implementation of opening a SAM/BAM for reading and writing. This is useful if any additional parameters are added, or changes to standards or defaults are made.

  2. Makes the requirement to provide a header when opening a file for writing more explicit.

  3. Adds support for Path.

  4. Removes the reliance on specifying the mode correctly, including specifying the file type (i.e. SAM, BAM, or CRAM), as well as additional options (ex. compression level). This makes the code more explicit and easier to read.

  5. An explicit check is performed to ensure the file type is specified when writing using a file-like object rather than a path to a file.

Examples of Opening a SAM/BAM for Reading or Writing

Opening a SAM/BAM file for reading, auto-recognizing the file-type by the file extension. See SamFileType for the supported file types.

>>> from fgpyo.sam import reader
>>> with reader("/path/to/sample.sam") as fh:
...     for record in fh:
...         print(record.query_name)  # do something
>>> with reader("/path/to/sample.bam") as fh:
...     for record in fh:
...         print(record.query_name)  # do something

Opening a SAM/BAM file for reading, explicitly passing the file type.

>>> from fgpyo.sam import SamFileType
>>> with reader(path="/path/to/sample.ext1", file_type=SamFileType.SAM) as fh:
...     for record in fh:
...         print(record.query_name)  # do something
>>> with reader(path="/path/to/sample.ext2", file_type=SamFileType.BAM) as fh:
...     for record in fh:
...         print(record.query_name)  # do something

Opening a SAM/BAM file for reading, using an existing file-like object

>>> with open("/path/to/sample.sam", "rb") as file_object:
...     with reader(path=file_object, file_type=SamFileType.SAM) as fh:
...         for record in fh:
...             print(record.query_name)  # do something

Opening a SAM/BAM file for writing is similar to the reader() method, but a SAM file header object is also required.

>>> from typing import Any, Dict
>>> from fgpyo.sam import writer
>>> header: Dict[str, Any] = {
...     "HD": {"VN": "1.5", "SO": "coordinate"},
...     "RG": [{"ID": "1", "SM": "1_AAAAAA", "LB": "lib", "PL": "ILLUMINA", "PU": "xxx.1"}],
...     "SQ": [
...         {"SN": "chr1", "LN": 249250621},
...         {"SN": "chr2", "LN": 243199373}
...     ]
... }
>>> with writer(path="/path/to/sample.bam", header=header) as fh:
...     pass  # do something

Examples of Manipulating Cigars

Creating a Cigar from a pysam.AlignedSegment.

>>> from fgpyo.sam import Cigar
>>> with reader("/path/to/sample.sam") as fh:
...     record = next(fh)
...     cigar = Cigar.from_cigartuples(record.cigartuples)
...     print(str(cigar))
50M2D5M10S

Creating a Cigar from a str.

>>> cigar = Cigar.from_cigarstring("50M2D5M10S")
>>> print(str(cigar))
50M2D5M10S

If the cigar string is invalid, the exception message will show you the problem character(s) in square brackets.

>>> cigar = Cigar.from_cigarstring("10M5U")
... CigarException("Malformed cigar: 10M5[U]")

The cigar contains a tuple of CigarElements; each element pairs a CigarOp with its associated operator length. A number of useful methods are part of both classes.

The number of bases aligned on the query (i.e. the number of bases consumed by the cigar from the query):

>>> cigar = Cigar.from_cigarstring("50M2D5M2I10S")
>>> [e.length_on_query for e in cigar.elements]
[50, 0, 5, 2, 10]
>>> [e.length_on_target for e in cigar.elements]
[50, 2, 5, 0, 0]
>>> [e.operator.is_indel for e in cigar.elements]
[False, True, False, True, False]

Examples of parsing the SA tag and individual supplementary alignments

>>> from fgpyo.sam import SupplementaryAlignment
>>> sup = SupplementaryAlignment.parse("chr1,123,+,50S100M,60,0")
>>> sup.reference_name
'chr1'
>>> sup.nm
0
>>> from typing import List
>>> sa_tag = "chr1,123,+,50S100M,60,0;chr2,456,-,75S75M,60,1"
>>> sups: List[SupplementaryAlignment] = SupplementaryAlignment.parse_sa_tag(tag=sa_tag)
>>> len(sups)
2
>>> [str(sup.cigar) for sup in sups]
['50S100M', '75S75M']

Module Contents

The module contains the following public classes:
  • SupplementaryAlignment – Stores a supplementary alignment record produced by BWA and stored in the SA SAM tag.

  • SamFileType – Enumeration of valid SAM/BAM/CRAM file types.

  • SamOrder – Enumeration of possible SAM/BAM/CRAM sort orders.

  • CigarOp – Enumeration of operators that can appear in a Cigar string.

  • CigarElement – Class representing an element in a Cigar string.

  • CigarParsingException – The exception raised specific to parsing a cigar.

  • Cigar – Class representing a cigar string.

  • ReadEditInfo – Class describing how a read differs from the reference

The module contains the following methods:

  • reader() – opens a SAM/BAM/CRAM file for reading.

  • writer() – opens a SAM/BAM/CRAM file for writing.

  • calculate_edit_info() – calculates how a read differs from the reference.

class fgpyo.sam.Cigar(elements: Tuple[fgpyo.sam.CigarElement, ...] = ())

Class representing a cigar string.

- elements

zero or more cigar elements

Type

Tuple[CigarElement, …]

classmethod from_cigarstring(cigarstring: str) fgpyo.sam.Cigar

Constructs a Cigar from a string returned by pysam.

If “*” is given, returns an empty Cigar.

classmethod from_cigartuples(cigartuples: Optional[List[Tuple[int, int]]]) fgpyo.sam.Cigar

Returns a Cigar from a list of tuples returned by pysam.

Each tuple denotes the operation and length. See CigarOp for more information on the various operators. If None is given, returns an empty Cigar.

length_on_query() int

Returns the length of the alignment on the query sequence.

length_on_target() int

Returns the length of the alignment on the target sequence.

reversed() fgpyo.sam.Cigar

Returns a copy of the Cigar with the elements in reverse order.

class fgpyo.sam.CigarElement(length: int, operator: fgpyo.sam.CigarOp)

Represents an element in a Cigar

- length

the length of the element

Type

int

- operator

the operator of the element

Type

CigarOp

property length_on_query: int

Returns the length of the element on the query sequence.

property length_on_target: int

Returns the length of the element on the target (often reference) sequence.

class fgpyo.sam.CigarOp(value)

Enumeration of operators that can appear in a Cigar string.

code

The pysam cigar operator code.

Type

int

character

The single character cigar operator.

Type

str

consumes_query

True if this operator consumes query bases, False otherwise.

Type

bool

consumes_target

True if this operator consumes target bases, False otherwise.

Type

bool

D = (2, 'D', False, True)

Deletion versus the reference

EQ = (7, '=', True, True)

Matches the reference

H = (5, 'H', False, False)

Hard clip

I = (1, 'I', True, False)

Insertion versus the reference

M = (0, 'M', True, True)

Match or Mismatch the reference

N = (3, 'N', False, True)

Skipped region from the reference

P = (6, 'P', False, False)

Padding

S = (4, 'S', True, False)

Soft clip

X = (8, 'X', True, True)

Mismatches the reference

static from_character(character: str) fgpyo.sam.CigarOp

Returns the operator from the single character.

static from_code(code: int) fgpyo.sam.CigarOp

Returns the operator from the given operator code.

Note: this is mainly used to get the operator from pysam.

property is_clipping: bool

Returns true if the operator is a soft/hard clip, false otherwise.

property is_indel: bool

Returns true if the operator is an indel, false otherwise.

exception fgpyo.sam.CigarParsingException

The exception raised specific to parsing a cigar.

fgpyo.sam.NO_REF_INDEX: int = -1

The reference index to use to indicate no reference in SAM/BAM.

fgpyo.sam.NO_REF_NAME: str = '*'

The reference name to use to indicate no reference in SAM/BAM.

fgpyo.sam.NO_REF_POS: int = -1

The reference position to use to indicate no position in SAM/BAM.

class fgpyo.sam.ReadEditInfo(matches: int, mismatches: int, insertions: int, inserted_bases: int, deletions: int, deleted_bases: int, nm: int)

Counts various stats about how a read compares to a reference sequence.

matches

the number of bases in the read that match the reference

Type

int

mismatches

the number of mismatches between the read sequence and the reference sequence as dictated by the alignment. As defined for the SAM NM tag computation, any base except A/C/G/T in the read is considered a mismatch.

Type

int

insertions

the number of insertions in the read vs. the reference. I.e. the number of I operators in the CIGAR string.

Type

int

inserted_bases

the total number of bases contained within insertions in the read

Type

int

deletions

the number of deletions in the read vs. the reference. I.e. the number of D operators in the CIGAR string.

Type

int

deleted_bases

the total number of bases that are deleted within the alignment (i.e. bases in the reference but not in the read).

Type

int

nm

the computed value of the SAM NM tag, calculated as mismatches + inserted_bases + deleted_bases

Type

int

class fgpyo.sam.SamFileType(value)

Enumeration of valid SAM/BAM/CRAM file types.

mode

The additional mode character to add when opening this file type.

Type

str

ext

The standard file extension for this file type.

Type

str

classmethod from_path(path: Union[pathlib.Path, str]) fgpyo.sam.SamFileType

Infers the file type based on the file extension.

Parameters

path – the path to the SAM/BAM/CRAM to read or write.
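
For example (a minimal sketch; the paths are illustrative):

>>> from fgpyo.sam import SamFileType
>>> SamFileType.from_path("/path/to/sample.bam") is SamFileType.BAM
True
>>> SamFileType.from_path("/path/to/sample.sam") is SamFileType.SAM
True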

class fgpyo.sam.SamOrder(value)

Enumeration of possible sort orders for a SAM file.

Coordinate = 'coordinate'

coordinate sorted

QueryName = 'queryname'

queryname sorted

Unsorted = 'unsorted'

the SAM / BAM / CRAM is unsorted

fgpyo.sam.SamPath

The valid base classes for opening a SAM/BAM/CRAM file.

alias of Union[IO[Any], pathlib.Path, str]

class fgpyo.sam.SupplementaryAlignment(reference_name: str, start: int, is_forward: bool, cigar: fgpyo.sam.Cigar, mapq: int, nm: int)

Stores a supplementary alignment record produced by BWA and stored in the SA SAM tag.

reference_name

the name of the reference (i.e. contig, chromosome) aligned to

Type

str

start

the 0-based start position of the alignment

Type

int

is_forward

true if the alignment is on the forward strand, false otherwise

Type

bool

cigar

the cigar for the alignment

Type

fgpyo.sam.Cigar

mapq

the mapping quality

Type

int

nm

the number of edits

Type

int

property end: int

The 0-based exclusive end position of the alignment.

classmethod from_read(read: pysam.libcalignedsegment.AlignedSegment) List[fgpyo.sam.SupplementaryAlignment]

Construct a list of SupplementaryAlignments from the SA tag in a pysam.AlignedSegment.

Parameters

read – An alignment. The presence of the “SA” tag is not required.

Returns

A list of all SupplementaryAlignments present in the SA tag. If the SA tag is not present, or it is empty, an empty list will be returned.
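
A minimal sketch of constructing a record with SamBuilder (documented below) and parsing its SA tag; the tag value is taken from the parsing examples above and the contig names assume the builder's default sequence dictionary:

>>> from fgpyo.sam import SupplementaryAlignment
>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder()
>>> rec = builder.add_single(chrom="chr1", start=100, attrs={"SA": "chr2,456,-,75S75M,60,1"})
>>> sups = SupplementaryAlignment.from_read(rec)
>>> len(sups)
1
>>> sups[0].reference_name
'chr2'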

static parse(string: str) fgpyo.sam.SupplementaryAlignment

Returns a supplementary alignment parsed from the given string. The various fields should be comma-delimited (ex. chr1,123,-,100M50S,60,4)

static parse_sa_tag(tag: str) List[fgpyo.sam.SupplementaryAlignment]

Parses an SA tag of supplementary alignments from a BAM file. If the tag is empty or contains just a single semi-colon then an empty list will be returned. Otherwise a list containing a SupplementaryAlignment per ;-separated value in the tag will be returned.

class fgpyo.sam.Template(name: str, r1: Optional[pysam.libcalignedsegment.AlignedSegment], r2: Optional[pysam.libcalignedsegment.AlignedSegment], r1_supplementals: List[pysam.libcalignedsegment.AlignedSegment], r2_supplementals: List[pysam.libcalignedsegment.AlignedSegment], r1_secondaries: List[pysam.libcalignedsegment.AlignedSegment], r2_secondaries: List[pysam.libcalignedsegment.AlignedSegment])

A container for alignment records corresponding to a single sequenced template or insert.

It is strongly preferred that new Template instances be created with Template.build(), which will ensure that reads are stored in the correct Template property and run basic validations of the Template by default. If constructing Template instances directly via the constructor, users are encouraged to call the validate() method post-construction.
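
As a minimal sketch (using SamBuilder, documented below, and assuming its default sequence dictionary contains "chr1"), a Template can be built from a pair of records as follows:

>>> from fgpyo.sam import Template
>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder()
>>> r1, r2 = builder.add_pair(chrom="chr1", start1=100, start2=200)
>>> template = Template.build([r1, r2])
>>> template.r1 is r1 and template.r2 is r2
True
>>> len(list(template.all_recs()))
2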

name

the name of the template/query

Type

str

r1

Primary alignment for read 1, or None if there is none

Type

Optional[pysam.libcalignedsegment.AlignedSegment]

r2

Primary alignment for read 2, or None if there is none

Type

Optional[pysam.libcalignedsegment.AlignedSegment]

r1_supplementals

Supplementary alignments for read 1

Type

List[pysam.libcalignedsegment.AlignedSegment]

r2_supplementals

Supplementary alignments for read 2

Type

List[pysam.libcalignedsegment.AlignedSegment]

r1_secondaries

Secondary (non-primary) alignments for read 1

Type

List[pysam.libcalignedsegment.AlignedSegment]

r2_secondaries

Secondary (non-primary) alignments for read 2

Type

List[pysam.libcalignedsegment.AlignedSegment]

all_recs() Iterator[pysam.libcalignedsegment.AlignedSegment]

Returns an iterator over all the records for the template.

static build(recs: Iterable[pysam.libcalignedsegment.AlignedSegment], validate: bool = True) fgpyo.sam.Template

Build a template from a set of records all with the same queryname.

static iterator(alns: Iterator[pysam.libcalignedsegment.AlignedSegment]) Iterator[fgpyo.sam.Template]

Returns an iterator over templates. Assumes the input iterable is queryname grouped, and gathers consecutive runs of records sharing a common query name into templates.

primary_recs() Iterator[pysam.libcalignedsegment.AlignedSegment]

Returns an iterator over all the primary records for the template.

set_tag(tag: str, value: Optional[Union[str, int, float]]) None

Add a tag to all records associated with the template.

Setting a tag to None will remove the tag.

Parameters
  • tag – The name of the tag.

  • value – The value of the tag.

validate() None

Performs sanity checks that all the records in the Template are as expected.

write_to(writer: pysam.libcalignmentfile.AlignmentFile, primary_only: bool = False) None

Write the records associated with the template to file.

Parameters
  • writer – An open, writable AlignmentFile.

  • primary_only – If True, only write primary alignments.

class fgpyo.sam.TemplateIterator(iterator: Iterator[pysam.libcalignedsegment.AlignedSegment])

An iterator that converts an iterator over query-grouped reads into an iterator over templates.

fgpyo.sam.calculate_edit_info(rec: pysam.libcalignedsegment.AlignedSegment, reference_sequence: str, reference_offset: Optional[int] = None) fgpyo.sam.ReadEditInfo

Constructs a ReadEditInfo instance giving summary stats about how the read aligns to the reference. Computes the number of mismatches, indels, indel bases and the SAM NM tag. The read must be aligned.

Parameters
  • rec – the read/record for which to calculate values

  • reference_sequence – the reference sequence (or fragment thereof) that the read is aligned to

  • reference_offset – if provided, assume that reference_sequence[reference_offset] is the first base aligned to in reference_sequence, otherwise use rec.reference_start

Returns

a ReadEditInfo with information about how the read differs from the reference
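
A minimal sketch (the read bases, reference fragment, and contig name are illustrative; "chr1" assumes the SamBuilder default sequence dictionary):

>>> from fgpyo.sam import calculate_edit_info
>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder(r1_len=10)
>>> rec = builder.add_single(bases="AAAAAAAAAA", chrom="chr1", start=100, cigar="10M")
>>> info = calculate_edit_info(rec, reference_sequence="AAAAACAAAA", reference_offset=0)
>>> (info.matches, info.mismatches, info.nm)
(9, 1, 1)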

fgpyo.sam.isize(r1: pysam.libcalignedsegment.AlignedSegment, r2: pysam.libcalignedsegment.AlignedSegment) int

Computes the insert size for a pair of records.

fgpyo.sam.reader(path: Union[IO[Any], pathlib.Path, str], file_type: Optional[fgpyo.sam.SamFileType] = None, unmapped: bool = False) pysam.libcalignmentfile.AlignmentFile

Opens a SAM/BAM/CRAM for reading.

Parameters
  • path – a file handle or path to the SAM/BAM/CRAM to read or write.

  • file_type – the file type to assume when opening the file. If None, then the file type will be auto-detected.

  • unmapped – True if the file is unmapped and has no sequence dictionary, False otherwise.

fgpyo.sam.set_pair_info(r1: pysam.libcalignedsegment.AlignedSegment, r2: pysam.libcalignedsegment.AlignedSegment, proper_pair: bool = True) None

Resets mate pair information between reads in a pair. Requires that both r1 and r2 are mapped. Can be handed reads that already have pairing flags set up, or independent R1 and R2 records that are currently flagged as SE reads.

Parameters
  • r1 – read 1

  • r2 – read 2 with the same queryname as r1
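
A sketch of pairing two independently-created records (the contig name assumes the SamBuilder default sequence dictionary; the mate-position check reflects the description above):

>>> from fgpyo.sam import set_pair_info
>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder()
>>> r1 = builder.add_single(name="q1", read_num=1, chrom="chr1", start=100)
>>> r2 = builder.add_single(name="q1", read_num=2, chrom="chr1", start=300, strand="-")
>>> set_pair_info(r1, r2)
>>> r1.next_reference_start
300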

fgpyo.sam.writer(path: Union[IO[Any], pathlib.Path, str], header: Union[str, Dict[str, Any], pysam.libcalignmentfile.AlignmentHeader], file_type: Optional[fgpyo.sam.SamFileType] = None) pysam.libcalignmentfile.AlignmentFile

Opens a SAM/BAM/CRAM for writing.

Parameters
  • path – a file handle or path to the SAM/BAM/CRAM to read or write.

  • header – Either a string to use for the header or a multi-level dictionary. The multi-level dictionary should be given as follows. The first level are the four types (‘HD’, ‘SQ’, …). The second level are a list of lines, with each line being a list of tag-value pairs. The header is constructed first from all the defined fields, followed by user tags in alphabetical order.

  • file_type – the file type to assume when opening the file. If None, then the file type will be auto-detected, in which case path must be a path-like object.

Classes for generating SAM and BAM files and records for testing

This module contains utility classes for the generation of SAM and BAM files and alignment records, for use in testing.

The module contains the following public classes:

  • SamBuilder – A builder class that allows the accumulation of alignment records and access as a list and writing to file.

class fgpyo.sam.builder.SamBuilder(r1_len: Optional[int] = None, r2_len: Optional[int] = None, base_quality: int = 30, mapping_quality: int = 60, sd: Optional[List[Dict[str, Any]]] = None, rg: Optional[Dict[str, str]] = None, extra_header: Optional[Dict[str, Any]] = None, seed: int = 42, sort_order: fgpyo.sam.SamOrder = SamOrder.Coordinate)

Builder for constructing one or more SAM records (AlignedSegments in pysam terms).

Provides the ability to manufacture records from minimal arguments, while generating any remaining attributes to ensure a valid record.

A builder is constructed with a handful of defaults including lengths for generated R1s and R2s, the default base quality score to use, a sequence dictionary and a single read group.

Records are then added using the add_pair() method. Once accumulated the records can be accessed in the order in which they were created through the to_unsorted_list() function, or in a list sorted by coordinate order via to_sorted_list(). The latter creates a temporary file to do the sorting and is somewhat slower as a result. Lastly, the records can be written to a temporary file using to_path().
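
For example (a minimal sketch; the contig name and start positions are illustrative and assume the default sequence dictionary):

>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder(r1_len=50, r2_len=50)
>>> r1, r2 = builder.add_pair(chrom="chr1", start1=1000, start2=1200)
>>> len(builder.to_unsorted_list())
2
>>> bam_path = builder.to_path()  # a coordinate-sorted (and indexed) temporary BAM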

add_pair(*, name: Optional[str] = None, bases1: Optional[str] = None, bases2: Optional[str] = None, quals1: Optional[List[int]] = None, quals2: Optional[List[int]] = None, chrom: Optional[str] = None, chrom1: Optional[str] = None, chrom2: Optional[str] = None, start1: int = - 1, start2: int = - 1, cigar1: Optional[str] = None, cigar2: Optional[str] = None, mapq1: Optional[int] = None, mapq2: Optional[int] = None, strand1: str = '+', strand2: str = '-', attrs: Optional[Dict[str, Any]] = None) Tuple[pysam.libcalignedsegment.AlignedSegment, pysam.libcalignedsegment.AlignedSegment]

Generates a new pair of reads, adds them to the internal collection, and returns them.

Most fields are optional.

Mapped pairs can be created by specifying both start1 and start2 and either chrom, for pairs where both reads map to the same contig, or both chrom1 and chrom2, for pairs where reads map to different contigs. i.e.:

  • add_pair(chrom, start1, start2) will create a mapped pair where both reads map to the same contig (chrom).

  • add_pair(chrom1, start1, chrom2, start2) will create a mapped pair where the reads map to different contigs (chrom1 and chrom2).

A pair with only one of the two reads mapped can be created by setting only one start position. Flags will automatically be set correctly for the unmapped mate.

  • add_pair(chrom, start1)

  • add_pair(chrom1, start1)

  • add_pair(chrom, start2)

  • add_pair(chrom2, start2)

An unmapped pair can be created by calling the method with no parameters (specifically, not setting chrom, chrom1, start1, chrom2, or start2). If either cigar is provided, it will be ignored.

For a given read (i.e. R1 or R2) the length of the read is determined based on the presence or absence of bases, quals, and cigar. If values are provided for one or more of these parameters, the lengths must match, and the length will be used to generate any unsupplied values. If none of bases, quals, and cigar are provided, all three will be synthesized based on either the r1_len or r2_len stored on the class as appropriate.

When synthesizing, bases are always a random sequence of bases, quals are all the default base quality (supplied when constructing a SamBuilder) and the cigar is always a single M operator of the read length.

Parameters
  • name – The name of the template. If None is given a unique name will be auto-generated.

  • bases1 – The bases for R1. If None is given a random sequence is generated.

  • bases2 – The bases for R2. If None is given a random sequence is generated.

  • quals1 – The list of int qualities for R1. If None, the default base quality is used.

  • quals2 – The list of int qualities for R2. If None, the default base quality is used.

  • chrom – The chromosome to which both reads are mapped. Defaults to the unmapped value.

  • chrom1 – The chromosome to which R1 is mapped. If None, chrom is used.

  • chrom2 – The chromosome to which R2 is mapped. If None, chrom is used.

  • start1 – The start position of R1. Defaults to the unmapped value.

  • start2 – The start position of R2. Defaults to the unmapped value.

  • cigar1 – The cigar string for R1. Defaults to None for unmapped reads, otherwise all M.

  • cigar2 – The cigar string for R2. Defaults to None for unmapped reads, otherwise all M.

  • mapq1 – Mapping quality for R1. Defaults to self.mapping_quality if None.

  • mapq2 – Mapping quality for R2. Defaults to self.mapping_quality if None.

  • strand1 – The strand for R1, either “+” or “-”. Defaults to “+”.

  • strand2 – The strand for R2, either “+” or “-”. Defaults to “-“.

  • attrs – An optional dictionary of SAM attributes to place on both R1 and R2.

Raises
  • ValueError – if either strand field is not “+” or “-”

  • ValueError – if bases/quals/cigar are set in a way that is not self-consistent

Returns

The pair of records created, R1 then R2.

Return type

Tuple[AlignedSegment, AlignedSegment]

add_single(*, name: Optional[str] = None, read_num: Optional[int] = None, bases: Optional[str] = None, quals: Optional[List[int]] = None, chrom: str = '*', start: int = - 1, cigar: Optional[str] = None, mapq: Optional[int] = None, strand: str = '+', secondary: bool = False, supplementary: bool = False, attrs: Optional[Dict[str, Any]] = None) pysam.libcalignedsegment.AlignedSegment

Generates a new single read, adds it to the internal collection, and returns it.

Most fields are optional.

If read_num is None (the default) an unpaired read will be created. If read_num is set to 1 or 2, the read will have its paired flag and read number flags set.

An unmapped read can be created by calling the method with no parameters (specifically, not setting chrom or start). If cigar is provided, it will be ignored.

A mapped read is created by providing chrom and start.

The length of the read is determined based on the presence or absence of bases, quals, and cigar. If values are provided for one or more of these parameters, the lengths must match, and the length will be used to generate any unsupplied values. If none of bases, quals, and cigar are provided, all three will be synthesized based on either the r1_len or r2_len stored on the class as appropriate.

When synthesizing, bases are always a random sequence of bases, quals are all the default base quality (supplied when constructing a SamBuilder) and the cigar is always a single M operator of the read length.

Parameters
  • name – The name of the template. If None is given a unique name will be auto-generated.

  • read_num – Either None, 1 for R1 or 2 for R2

  • bases – The bases for the read. If None is given a random sequence is generated.

  • quals – The list of qualities for the read. If None, the default base quality is used.

  • chrom – The chromosome to which the read is mapped. Defaults to the unmapped value.

  • start – The start position of the read. Defaults to the unmapped value.

  • cigar – The cigar string for the read. Defaults to None for unmapped reads, otherwise all M.

  • mapq – Mapping quality for the read. Default to self.mapping_quality if not given.

  • strand – The strand for the read, either “+” or “-”. Defaults to “+”.

  • secondary – If true the read will be flagged as secondary

  • supplementary – If true the read will be flagged as supplementary

  • attrs – An optional dictionary of SAM attributes to place on the read.

Raises
  • ValueError – if strand field is not “+” or “-”

  • ValueError – if read_num is not None, 1 or 2

  • ValueError – if bases/quals/cigar are set in a way that is not self-consistent

Returns

The record created

Return type

AlignedSegment

static default_rg() Dict[str, str]

Returns the default read group used by the SamBuilder, as a dictionary.

static default_sd() List[Dict[str, Any]]

Generates the sequence dictionary that is used by default by SamBuilder.

Matches the names and lengths of the HG19 reference in use in production.

Returns

A new copy of the sequence dictionary as a list of dictionaries, one per chromosome.

property header: pysam.libcalignmentfile.AlignmentHeader

Returns the builder’s SAM header.

rg() Dict[str, Any]

Returns the single read group that is defined in the header.

rg_id() str

Returns the ID of the single read group that is defined in the header.

to_path(path: Optional[pathlib.Path] = None, index: bool = True, pred: Callable[[pysam.libcalignedsegment.AlignedSegment], bool] = <function SamBuilder.<lambda>>) pathlib.Path

Writes the accumulated records to a file, sorts (and optionally indexes) it, and returns the Path. If a path is provided, it will be written to; otherwise a temporary file is created and returned.

Parameters
  • path – a path at which to write the file, otherwise a temp file is used.

  • index – if True and sort_order is Coordinate, an index is generated; otherwise not.

  • pred – optional predicate to specify which reads should be output

Returns

The path to the sorted (and possibly indexed) file.

Return type

Path

to_sorted_list() List[pysam.libcalignedsegment.AlignedSegment]

Returns the accumulated records in coordinate order.

to_unsorted_list() List[pysam.libcalignedsegment.AlignedSegment]

Returns the accumulated records in the order they were created.

Utility Functions for Soft-Clipping records in SAM/BAM Files

This module contains utility functions for soft-clipping reads. There are four variants that support clipping the beginnings and ends of reads, and specifying the amount to be clipped in terms of query bases or reference bases:

  • softclip_start_of_alignment_by_query() clips the start of the alignment in terms of query bases

  • softclip_end_of_alignment_by_query() clips the end of the alignment in terms of query bases

  • softclip_start_of_alignment_by_ref() clips the start of the alignment in terms of reference bases

  • softclip_end_of_alignment_by_ref() clips the end of the alignment in terms of reference bases

The difference between query and reference based versions is apparent only when there are insertions or deletions in the read as indels have lengths on either the query (insertions) or reference (deletions) but not both.

Upon clipping, a set of additional SAM tags is removed from reads, as they are likely invalid.

For example, to clip the last 10 query bases of all records and reduce the qualities to Q2:

>>> from fgpyo.sam import reader, clipping
>>> with reader("/path/to/sample.sam") as fh:
...     for rec in fh:
...         clipping.softclip_end_of_alignment_by_query(rec, 10, 2)
...         print(rec.cigarstring)
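
To illustrate the query- versus reference-based distinction, the following sketch uses SamBuilder (the cigar and position are illustrative, and the clipped-base counts assume the trailing insertion is absorbed into the soft clip):

>>> from fgpyo.sam import clipping
>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder(r1_len=20)
>>> rec = builder.add_single(chrom="chr1", start=100, cigar="8M2I10M")
>>> info = clipping.softclip_end_of_alignment_by_query(rec, bases_to_clip=12)
>>> rec.cigarstring
'8M12S'
>>> (info.query_bases_clipped, info.ref_bases_clipped)
(12, 10)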

It should be noted that any clipping potentially makes the common SAM tags NM, MD and UQ invalid, as well as potentially other alignment based SAM tags. Any clipping added to the start of an alignment changes the position (reference_start) of the record. Any reads that have no aligned bases after clipping are set to be unmapped. If writing the clipped reads back to a BAM it should be noted that:

  • Mate pairs may have incorrect information about their mate’s positions

  • Even if the input was coordinate sorted, the output may be out of order

To rectify these problems it is necessary to do the equivalent of:

cat clipped.bam | samtools sort -n | samtools fixmate | samtools sort | samtools calmd

class fgpyo.sam.clipping.ClippingInfo(query_bases_clipped: int, ref_bases_clipped: int)

Named tuple holding the number of bases clipped on the query and reference respectively.

query_bases_clipped: int

The number of query bases in the alignment that were clipped.

ref_bases_clipped: int

The number of reference bases in the alignment that were clipped.

fgpyo.sam.clipping.softclip_end_of_alignment_by_query(rec: pysam.libcalignedsegment.AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = ('MD', 'NM', 'UQ')) fgpyo.sam.clipping.ClippingInfo

Adds soft-clipping to the end of a read’s alignment.

Clipping is applied before any existing hard or soft clipping. E.g. a read with cigar 100M5S that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters
  • rec – the BAM record to clip

  • bases_to_clip – the number of additional bases of clipping desired in the read/query

  • clipped_base_quality – if not None, set bases in the clipped region to this quality

  • tags_to_invalidate – the set of extended attributes to remove upon clipping

Returns

a named tuple containing the number of query/read bases and the number of target/reference bases clipped.

Return type

ClippingInfo

fgpyo.sam.clipping.softclip_end_of_alignment_by_ref(rec: pysam.libcalignedsegment.AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = ('MD', 'NM', 'UQ')) fgpyo.sam.clipping.ClippingInfo

Soft-clips the end of an alignment by bases_to_clip bases on the reference.

Clipping is applied before any existing hard or soft clipping. E.g. a read with cigar 100M5S that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters
  • rec – the BAM record to clip

  • bases_to_clip – the number of additional bases of clipping desired on the reference

  • clipped_base_quality – if not None, set bases in the clipped region to this quality

  • tags_to_invalidate – the set of extended attributes to remove upon clipping

Returns

a named tuple containing the number of query/read bases and the number of target/reference bases clipped.

Return type

ClippingInfo

fgpyo.sam.clipping.softclip_start_of_alignment_by_query(rec: pysam.libcalignedsegment.AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = ('MD', 'NM', 'UQ')) fgpyo.sam.clipping.ClippingInfo

Adds soft-clipping to the start of a read’s alignment.

Clipping is applied after any existing hard or soft clipping. E.g. a read with cigar 5S100M that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters
  • rec – the BAM record to clip

  • bases_to_clip – the number of additional bases of clipping desired in the read/query

  • clipped_base_quality – if not None, set bases in the clipped region to this quality

  • tags_to_invalidate – the set of extended attributes to remove upon clipping

Returns

a named tuple containing the number of query/read bases and the number of target/reference bases clipped.

Return type

ClippingInfo

fgpyo.sam.clipping.softclip_start_of_alignment_by_ref(rec: pysam.libcalignedsegment.AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = ('MD', 'NM', 'UQ')) fgpyo.sam.clipping.ClippingInfo

Soft-clips the start of an alignment by bases_to_clip bases on the reference.

Clipping is applied after any existing hard or soft clipping. E.g. a read with cigar 5S100M that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters
  • rec – the BAM record to clip

  • bases_to_clip – the number of additional bases of clipping desired on the reference

  • clipped_base_quality – if not None, set bases in the clipped region to this quality

  • tags_to_invalidate – the set of extended attributes to remove upon clipping

Returns

a named tuple containing the number of query/read bases and the number of target/reference bases clipped.

Return type

ClippingInfo

VCF/BCF files

Classes for generating VCF and records for testing

This module contains utility classes for the generation of VCF files and variant records, for use in testing.

The module contains the following public classes:

  • VariantBuilder – A builder class that allows the accumulation of variant records and access as a list and writing to file.

Examples

Typically, we have VariantRecord records obtained from reading from a VCF file. The VariantBuilder class builds such records.

Variants are added with the add() method, which returns a pysam.VariantRecord.

>>> import pysam
>>> from fgpyo.vcf.builder import VariantBuilder
>>> builder: VariantBuilder = VariantBuilder()
>>> new_record_1: pysam.VariantRecord = builder.add()  # uses the defaults
>>> new_record_2: pysam.VariantRecord = builder.add(
...     contig="chr2", pos=1001, id="rs1234", ref="C", alts=["T"],
...     qual=40, filter=["PASS"]
... )

VariantBuilder can create sites-only, single-sample, or multi-sample VCF files. If not producing a sites-only VCF file, VariantBuilder must be created by passing a list of sample IDs:

>>> builder: VariantBuilder = VariantBuilder(sample_ids=["sample1", "sample2"])
>>> new_record_1: pysam.VariantRecord = builder.add()  # uses the defaults
>>> new_record_2: pysam.VariantRecord = builder.add(
...     samples={"sample1": {"GT": "0|1"}, "sample2": {"GT": "0|0"}}
... )

The variants stored in the builder can be retrieved as a coordinate sorted VCF file via the to_path() method:

>>> from pathlib import Path
>>> path_to_vcf: Path = builder.to_path()

The variants may also be retrieved in the order they were added via the to_unsorted_list() method and in coordinate sorted order via the to_sorted_list() method.

fgpyo.vcf.reader(path: Union[pathlib.Path, str, TextIO]) Generator[pysam.libcbcf.VariantFile, None, None]

Opens the given path for VCF reading

Parameters

path – the path to a VCF, or an open file handle

fgpyo.vcf.writer(path: Union[pathlib.Path, str, TextIO], header: pysam.libcbcf.VariantHeader) Generator[pysam.libcbcf.VariantFile, None, None]

Opens the given path for VCF writing.

Parameters
  • path – the path to a VCF, or an open filehandle

  • header – the source for the output VCF header. If you are modifying a VCF file that you are reading from, you can pass reader.header

Classes for generating VCF and records for testing

class fgpyo.vcf.builder.VariantBuilder(sample_ids: Optional[Iterable[str]] = None, sd: Optional[Dict[str, Dict[str, Any]]] = None)

Builder for constructing one or more variant records (pysam.VariantRecord) for a VCF. The VCF can be sites-only, single-sample, or multi-sample.

Provides the ability to manufacture variants from minimal arguments, while generating any remaining attributes to ensure a valid variant.

A builder is constructed with a handful of defaults including the sample name and sequence dictionary. If the VCF will not be sites-only, the list of sample IDs (“sample_ids”) must be provided to the VariantBuilder constructor.

Variants are then added using the add() method. Once accumulated the variants can be accessed in the order in which they were created through the to_unsorted_list() function, or in a list sorted by coordinate order via to_sorted_list(). Lastly, the records can be written to a temporary file using to_path().

sample_ids

the sample name(s)

Type

List[str]

sd

sequence dictionary, implemented as python dict from contig name to dictionary with contig properties. At a minimum, each contig dict in sd must contain “ID” (the same as contig_name) and “length”, the contig length. Other values will be added to the VCF header line for that contig.

Type

Dict[str, Dict[str, Any]]

seq_idx_lookup

dictionary mapping contig name to index of contig in sd

Type

Dict[str, int]

records

the list of variant records

Type

List[pysam.libcbcf.VariantRecord]

header

the pysam header

Type

pysam.libcbcf.VariantHeader

add(contig: Optional[str] = None, pos: int = 1000, id: str = '.', ref: str = 'A', alts: Union[None, str, Iterable[str]] = ('.',), qual: int = 60, filter: Union[None, str, Iterable[str]] = None, info: Optional[Dict[str, Any]] = None, samples: Optional[Dict[str, Dict[str, Any]]] = None) pysam.libcbcf.VariantRecord

Generates a new variant and adds it to the internal collection.

Notes:
  • Very little validation is done with respect to INFO and FORMAT keys being defined in the header.
  • VCFs are 1-based, but pysam is (mostly) 0-based. We define the function in terms of the VCF property “pos”, which is 1-based. pysam will also report “pos” as 1-based, so that is the property that should be accessed when using the records produced by this function (not “start”).

Parameters
  • contig – the chromosome name. If None, will use the first contig in the sequence dictionary.

  • pos – the 1-based position of the variant

  • id – the variant id

  • ref – the reference allele

  • alts – the list of alternate alleles, None if no alternates. If a single string is passed, that will be used as the only alt.

  • qual – the variant quality

  • filter – the list of filters, None if no filters (ex. PASS). If a single string is passed, that will be used as the only filter.

  • info – the dictionary of INFO key-value pairs

  • samples – the dictionary from sample name to FORMAT key-value pairs. if a sample property is supplied for any sample but omitted in some, it will be set to missing (“.”) for samples that don’t have that property explicitly assigned. If a sample in the VCF is omitted, all its properties will be set to missing.

add_filter_header(name: str, description: Optional[str] = None) None

Add a FILTER header field to the VCF header.

Parameters
  • name – the name of the field

  • description – the description of the field

add_format_header(name: str, field_type: fgpyo.vcf.builder.VcfFieldType, number: Union[int, fgpyo.vcf.builder.VcfFieldNumber] = VcfFieldNumber.NUM_GENOTYPES, description: Optional[str] = None) None

Add a FORMAT header field to the VCF header.

Parameters
  • name – the name of the field

  • field_type – the field_type of the field

  • number – the number of the field

  • description – the description of the field

add_header_line(line: str) None

Adds a header line to the header

add_info_header(name: str, field_type: fgpyo.vcf.builder.VcfFieldType, number: Union[int, fgpyo.vcf.builder.VcfFieldNumber] = 1, description: Optional[str] = None, source: Optional[str] = None, version: Optional[str] = None) None

Add an INFO header field to the VCF header.

Parameters
  • name – the name of the field

  • field_type – the field_type of the field

  • number – the number of the field

  • description – the description of the field

  • source – the source of the field

  • version – the version of the field
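
A sketch of combining these header helpers with add(); the field names, the values, and the VcfFieldType member names (FLOAT, INTEGER) are assumptions for illustration:

>>> from fgpyo.vcf.builder import VariantBuilder, VcfFieldType
>>> builder = VariantBuilder(sample_ids=["sample1"])
>>> builder.add_info_header("AF", field_type=VcfFieldType.FLOAT, description="Allele frequency")
>>> builder.add_format_header("DP", field_type=VcfFieldType.INTEGER, number=1, description="Read depth")
>>> rec = builder.add(pos=100, info={"AF": 0.25}, samples={"sample1": {"GT": "0/1", "DP": 31}})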

classmethod default_sd() Dict[str, Dict[str, Any]]

Generates the sequence dictionary that is used by default by VariantBuilder. Re-uses the dictionary from SamBuilder for consistency.

Returns

A new copy of the sequence dictionary as a map of contig name to dictionary, one per contig.

to_path(path: Optional[pathlib.Path] = None) pathlib.Path

Returns a path to a VCF for variants added to this builder.

Parameters

path – optional path to the VCF

to_sorted_list() List[pysam.libcbcf.VariantRecord]

Returns the accumulated records in coordinate order.

to_unsorted_list() List[pysam.libcbcf.VariantRecord]

Returns the accumulated records in the order they were created.

class fgpyo.vcf.builder.VcfFieldNumber(value)

Special codes for VCF field numbers

class fgpyo.vcf.builder.VcfFieldType(value)

Codes for VCF field types

FASTA files

Classes for generating fasta files and records for testing

This module contains utility classes for creating fasta files, indexed fasta files (.fai), and sequence dictionaries (.dict).

Examples of creating sets of contigs for writing to fasta

Writing a FASTA with two contigs each with 100 bases:

>>> import pathlib
>>> from fgpyo.fasta.builder import FastaBuilder
>>> builder = FastaBuilder()
>>> builder.add("chr10").add("AAAAAAAAAA", 10)
>>> builder.add("chr11").add("GGGGGGGGGG", 10)
>>> builder.to_file(path = pathlib.Path("test.fasta"))

Writing a FASTA with one contig with 100 A’s and 50 T’s:

>>> import pathlib
>>> from fgpyo.fasta.builder import FastaBuilder
>>> builder = FastaBuilder()
>>> builder.add("chr10").add("AAAAAAAAAA", 10).add("TTTTTTTTTT", 5)
>>> builder.to_file(path = pathlib.Path("test.fasta"))

Add bases to existing contig:

>>> from fgpyo.fasta.builder import FastaBuilder
>>> builder = FastaBuilder()
>>> contig_one = builder.add("chr10").add("AAAAAAAAAA", 1)
>>> contig_one.add("NNN", 1)
>>> contig_one.bases
'AAAAAAAAAANNN'

class fgpyo.fasta.builder.ContigBuilder(name: str, assembly: str, species: str)

Builder for constructing new contigs, and adding bases to existing contigs. Existing contigs cannot be overwritten; each contig name in FastaBuilder must be unique. Instances of ContigBuilder should be created using FastaBuilder.add(), where species and assembly are optional parameters and will default to FastaBuilder.assembly and FastaBuilder.species.

name

Unique contig ID, e.g., “chr10”

assembly

Assembly information, if None default is ‘testassembly’

species

Species information, if None default is ‘testspecies’

bases

The bases to be added to the contig, e.g. “A”

add(bases: str, times: int = 1) fgpyo.fasta.builder.ContigBuilder

Method for adding bases to a new or existing instance of ContigBuilder.

Parameters
  • bases – The bases to be added to the contig

  • times – The number of times the bases should be repeated

Example: add(“AAA”, 2) results in the bases “AAAAAA”.

class fgpyo.fasta.builder.FastaBuilder(assembly: str = 'testassembly', species: str = 'testspecies', line_length: int = 80)

Builder for constructing sets of one or more contigs.

Provides the ability to manufacture sets of contigs from minimal input, and automatically generates the information necessary for writing the FASTA file, index, and dictionary.

A builder is constructed from an assembly, species, and line length. All attributes have defaults; however, these can be overridden.

Contigs are added to FastaBuilder using FastaBuilder.add().

Bases are added to existing contigs using ContigBuilder.add().

Once accumulated the contigs can be written to a file using: to_file()

Calling to_file() will also generate the fasta index (.fai) and sequence dictionary (.dict).

assembly

Assembly information, if None default is ‘testassembly’

species

Species, if None default is ‘testspecies’

line_length

Desired line length, if None default is 80

contig_builders

Private dictionary of contig names and instances of ContigBuilder

add(name: str, assembly: Optional[str] = None, species: Optional[str] = None) fgpyo.fasta.builder.ContigBuilder

Creates and returns a new ContigBuilder for a contig with the provided name. Contig names must be unique; attempting to create two separate contigs with the same name will result in an error.

Parameters
  • name – Unique contig ID, e.g., “chr10”

  • assembly – Assembly information, if None default is ‘testassembly’

  • species – Species information, if None default is ‘testspecies’

to_file(path: pathlib.Path) None

Writes out the set of accumulated contigs to a FASTA file at the path given. Also generates the accompanying fasta index file (.fa.fai) and sequence dictionary file (.dict).

Contigs are emitted in the order they were added to the builder. Sequence lines in the FASTA file are wrapped to the line length given when the builder was constructed.

Parameters

path – Path to write files to.

Example: FastaBuilder.to_file(path = pathlib.Path(“my_fasta.fa”))

fgpyo.fasta.builder.pysam_dict(assembly: str, species: str, output_path: str, input_path: str) None

Calls pysam.dict and writes the sequence dictionary to the provided output path

Args

assembly: Assembly
species: Species
output_path: File path to write dictionary to
input_path: Path to fasta file

fgpyo.fasta.builder.pysam_faidx(input_path: str) None

Calls pysam.faidx and writes fasta index in the same file location as the fasta file

Args

input_path: Path to fasta file
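
A sketch of calling these helpers directly, assuming "test.fasta" already exists (e.g. written by FastaBuilder.to_file() above):

>>> from fgpyo.fasta.builder import pysam_dict, pysam_faidx
>>> pysam_faidx(input_path="test.fasta")  # writes test.fasta.fai next to the fasta
>>> pysam_dict(
...     assembly="testassembly",
...     species="testspecies",
...     output_path="test.dict",
...     input_path="test.fasta",
... )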

FASTX files

Zipping FASTX Files

Zipping a set of FASTA/FASTQ files into a single stream of data is a common task in bioinformatics and can be achieved with the FastxZipped context manager. The context manager facilitates opening of all input FASTA/FASTQ files and closing them after iteration is complete. For every iteration of FastxZipped, a tuple of the next FASTX records is returned (of type FastxRecord). An exception will be raised if any of the input files are malformed or truncated, or if record names are not equivalent and in sync.

Importantly, this context manager is optimized for fast streaming read-only usage and, by default, any previous records saved while advancing the iterator will not be correct as the underlying pointer in memory will refer to the most recent record only, and not any past records. To preserve the state of all previously iterated records, set the parameter persist to True.

>>> from fgpyo.fastx import FastxZipped
>>> with FastxZipped("r1.fq", "r2.fq", persist=False) as zipped:
...     for (r1, r2) in zipped:
...         print(f"{r1.name}: {r1.sequence}, {r2.name}: {r2.sequence}")
seq1: AAAA, seq1: CCCC
seq2: GGGG, seq2: TTTT

class fgpyo.fastx.FastxZipped(*paths: Union[pathlib.Path, str], persist: bool = False)

A context manager that will lazily zip over any number of FASTA/FASTQ files.

Parameters
  • paths – Paths to the FASTX files to zip over.

  • persist – Whether to persist the state of previous records during iteration.

close() None

Close the FastxZipped context manager by closing all FASTX files.

Metric files

Metrics

Module for storing, reading, and writing metric-like tab-delimited information.

Metric files are tab-delimited, contain a header, and zero or more rows for metric values. This makes it easy for them to be read in languages like R. For example, a row per person, with columns for age, gender, and address.

The Metric class makes it easy to read, write, and store one or more metrics of the same type, all the while preserving types for each value in a metric. It is an abstract base class; concrete metrics are decorated with @dataclass or @attr.s and have attributes storing one or more typed values. If using multiple layers of inheritance, keep in mind that it is not possible to mix these dataclass utilities, e.g. a dataclasses class derived from an attr class will not appropriately initialize the values of the attr superclass.

Examples

Defining a new metric class:

>>> from fgpyo.util.metric import Metric
>>> import dataclasses
>>> @dataclasses.dataclass(frozen=True)
... class Person(Metric["Person"]):
...     name: str
...     age: int

or using attr:

>>> from fgpyo.util.metric import Metric
>>> import attr
>>> @attr.s(auto_attribs=True, frozen=True)
... class Person(Metric["Person"]):
...     name: str
...     age: int

Getting the attributes for a metric class. These will be used for the header when reading and writing metric files.

>>> Person.header()
['name', 'age']

Getting the values from a metric class instance. The values are in the same order as the header.

>>> list(Person(name="Alice", age=47).values())
["Alice", 47]

Writing a list of metrics to a file:

>>> metrics = [
...     Person(name="Alice", age=47),
...     Person(name="Bob", age=24)
... ]
>>> from pathlib import Path
>>> Person.write(Path("/path/to/metrics.txt"), *metrics)

Then the contents of the written metrics file:

$ column -t /path/to/metrics.txt
name   age
Alice  47
Bob    24

Reading the metrics file back in:

>>> list(Person.read(Path("/path/to/metrics.txt")))
[Person(name='Alice', age=47), Person(name='Bob', age=24)]

Formatting and parsing the values for custom types is supported by overriding the _parsers() and format_value() methods.

>>> from typing import Any, Callable, Dict
>>> @dataclasses.dataclass(frozen=True)
... class Name:
...     first: str
...     last: str
...     @classmethod
...     def parse(cls, value: str) -> "Name":
...          fields = value.split(" ")
...          return Name(first=fields[0], last=fields[1])
>>> @dataclasses.dataclass(frozen=True)
... class Person(Metric["Person"]):
...     name: Name
...     age: int
...     @classmethod
...     def _parsers(cls) -> Dict[type, Callable[[str], Any]]:
...         return {Name: lambda value: Name.parse(value=value)}
...     @classmethod
...     def format_value(cls, value: Any) -> str:
...         if isinstance(value, (Name)):
...             return f"{value.first} {value.last}"
...         else:
...             return super().format_value(value=value)
>>> Person.parse(fields=["john doe", "42"])
Person(name=Name(first='john', last='doe'), age=42)
>>> Person(name=Name(first='john', last='doe'), age=42).formatted_values()
['john doe', '42']

class fgpyo.util.metric.Metric(*args, **kwds)

Abstract base class for all metric-like tab-delimited files

Metric files are tab-delimited, contain a header, and zero or more rows for metric values. This makes it easy for them to be read in languages like R.

Sub-classes of Metric can support parsing and formatting custom types with _parsers() and format_value().

classmethod format_value(value: Any) str

The default method to format values of a given type.

By default, this method will comma-delimit list, tuple, and set types, and apply str to all others.

Dictionaries / mappings will have keys and values separated by semicolons, and key-value pairs delimited by commas.

In addition, lists will be flanked with ‘[]’, tuples with ‘()’ and sets and dictionaries with ‘{}’

Parameters

value – the value to format.

formatted_values() List[str]

An iterator over formatted attribute values in the same order as the header.

classmethod header() List[str]

The list of header values for the metric.

classmethod parse(fields: List[str]) Any

Parses the string-representation of this metric. One string per attribute should be given.

classmethod read(path: pathlib.Path, ignore_extra_fields: bool = True) Iterator[Any]

Reads in zero or more metrics from the given path.

The metric file must contain a matching header.

Columns that are not present in the file but are optional in the metric class will be given their default values.

Parameters
  • path – the path to the metrics file.

  • ignore_extra_fields – True to ignore any extra columns, False to raise an exception.

classmethod to_list(value: str) List[Any]

Returns the value split into a list on the comma delimiter.

values() Iterator[Any]

An iterator over attribute values in the same order as the header.

classmethod write(path: pathlib.Path, *values: fgpyo.util.metric.MetricType) None

Writes zero or more metrics to the given path.

The header will always be written.

Parameters
  • path – Path to the output file.

  • values – Zero or more metrics.

fgpyo.util.inspect.attr_from(cls: Type[fgpyo.util.inspect._AttrFromType], kwargs: Dict[str, str], parsers: Optional[Dict[type, Callable[[str], Any]]] = None) fgpyo.util.inspect._AttrFromType

Builds an attr or dataclasses class from keyword arguments

Parameters
  • cls – the attr or dataclasses class to be built

  • kwargs – a dictionary of keyword arguments

  • parsers – a dictionary of parser functions to apply to specific types
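
A minimal sketch, assuming that built-in attribute types such as int are parsed from their string form automatically (the Point class is illustrative):

>>> import attr
>>> from fgpyo.util.inspect import attr_from
>>> @attr.s(auto_attribs=True, frozen=True)
... class Point:
...     x: int
...     y: int
>>> attr_from(cls=Point, kwargs={"x": "1", "y": "2"})
Point(x=1, y=2)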

See also

https://docs.python.org/3/library/dataclasses.html

Documentation for the dataclasses standard module

https://www.attrs.org/en/stable/examples.html

The attrs website for bringing back the joy to writing classes.

Read Structures

Classes for representing Read Structures

A Read Structure refers to a string that describes how the bases in a sequencing run should be allocated into logical reads. It serves a similar purpose to the --use-bases-mask option in Illumina’s bcl2fastq software, but provides some additional capabilities.

A Read Structure is a sequence of <number><operator> pairs or segments where, optionally, the last segment in the string is allowed to use + instead of a number for its length. The + translates to whatever bases are left after the other segments are processed and can be thought of as meaning [0..infinity].

See more at: https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures

Examples

>>> from fgpyo.read_structure import ReadStructure
>>> rs = ReadStructure.from_string("75T8B75T")
>>> [str(segment) for segment in rs]
['75T', '8B', '75T']
>>> rs[0]
ReadSegment(offset=0, length=75, kind=<SegmentType.Template: 'T'>)
>>> rs = rs.with_variable_last_segment()
>>> [str(segment) for segment in rs]
['75T', '8B', '+T']
>>> rs[-1]
ReadSegment(offset=83, length=None, kind=<SegmentType.Template: 'T'>)
>>> rs = ReadStructure.from_string("1B2M+T")
>>> [s.bases for s in rs.extract("A"*6)]
['A', 'AA', 'AAA']
>>> [s.bases for s in rs.extract("A"*5)]
['A', 'AA', 'AA']
>>> [s.bases for s in rs.extract("A"*4)]
['A', 'AA', 'A']
>>> [s.bases for s in rs.extract("A"*3)]
['A', 'AA', '']
>>> rs.template_segments()
(ReadSegment(offset=3, length=None, kind=<SegmentType.Template: 'T'>),)
>>> [str(segment) for segment in rs.template_segments()]
['+T']
>>> try:
...   ReadStructure.from_string("23T2TT23T")
... except ValueError as ex:
...   print(str(ex))
...
Read structure missing length information: 23T2T[T]23T

Module Contents

The module contains the following public classes:
  • ReadStructure – Describes the structure of a given read

  • ReadSegment – Describes all the information about a segment within a read structure

  • SegmentType – The type of segments that can show up in a read structure

  • SubReadWithoutQuals – Contains the bases that correspond to the given read segment

  • SubReadWithQuals – Contains the bases and qualities that correspond to the given read segment

fgpyo.read_structure.ANY_LENGTH_CHAR: str = '+'

A character that can be put in place of a number in a read structure to mean “0 or more bases”.

class fgpyo.read_structure.ReadSegment(*, offset: int, length: Optional[int], kind: fgpyo.read_structure.SegmentType)

Encapsulates all the information about a segment within a read structure. A segment can either have a definite length, in which case length must be an int, or an indefinite length (can be any length, 0 or more), in which case length must be None.

offset

The offset of the read segment in the read.

Type

int

length

The length of the segment, or None if it is variable length.

Type

Optional[int]

kind

The kind of read segment.

Type

fgpyo.read_structure.SegmentType

extract(bases: str) fgpyo.read_structure.SubReadWithoutQuals

Gets the bases associated with this read segment.

extract_with_quals(bases: str, quals: str) fgpyo.read_structure.SubReadWithQuals

Gets the bases and qualities associated with this read segment.

property fixed_length: int

The fixed length if there is one. Throws an exception on segments without fixed lengths!

property has_fixed_length: bool

True if the read segment has a defined length.

class fgpyo.read_structure.ReadStructure(*, segments: Tuple[fgpyo.read_structure.ReadSegment, ...])

Describes the structure of a given read. A read contains one or more read segments. A read segment describes a contiguous stretch of bases of the same type (ex. template bases) of some length and some offset from the start of the read.

segments

The segments composing the read structure

Type

Tuple[fgpyo.read_structure.ReadSegment, …]

extract(bases: str) Tuple[fgpyo.read_structure.SubReadWithoutQuals, ...]

Splits the given bases into sub-reads, each paired with its associated read segment.

extract_with_quals(bases: str, quals: str) Tuple[fgpyo.read_structure.SubReadWithQuals, ...]

Splits the given bases and qualities into sub-reads, each paired with its associated read segment.

property fixed_length: int

The fixed length if there is one. Throws an exception on segments without fixed lengths!

classmethod from_segments(segments: Tuple[fgpyo.read_structure.ReadSegment, ...], reset_offsets: bool = False) fgpyo.read_structure.ReadStructure

Creates a new ReadStructure, optionally resetting the offsets on each of the segments

property has_fixed_length: bool

True if the ReadStructure has a fixed (i.e. non-variable) length

property length: int

Length is defined as the number of segments (not bases!) in the read structure

segments_by_kind(kind: fgpyo.read_structure.SegmentType) Tuple[fgpyo.read_structure.ReadSegment, ...]

Returns just the segments of a given kind.

with_variable_last_segment() fgpyo.read_structure.ReadStructure

Generates a new ReadStructure that is the same as this one except that the last segment has undefined length
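
As a brief sketch following the doctest examples near the top of this section, extract_with_quals() and segments_by_kind() can be used together; the outputs below are inferred from the class descriptions above rather than taken from library output:

>>> from fgpyo.read_structure import ReadStructure, SegmentType
>>> rs = ReadStructure.from_string("2B+T")
>>> [(sr.bases, sr.quals) for sr in rs.extract_with_quals(bases="ACGTT", quals="IIIII")]
[('AC', 'II'), ('GTT', 'III')]
>>> [str(s) for s in rs.segments_by_kind(SegmentType.Template)]
['+T']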

class fgpyo.read_structure.SegmentType(value)

The type of segments that can show up in a read structure

MolecularBarcode = 'M'

The segment type for molecular barcode bases.

SampleBarcode = 'B'

The segment type for sample barcode bases.

Skip = 'S'

The segment type for bases that need to be skipped.

Template = 'T'

The segment type for template bases.

class fgpyo.read_structure.SubReadWithQuals(*, bases: str, quals: str, segment: ReadSegment)

Contains the bases and qualities that correspond to the given read segment

bases: str

The sub-read bases that correspond to the given read segment.

property kind: fgpyo.read_structure.SegmentType

The kind of read segment that corresponds to this sub-read.

quals: str

The sub-read base qualities that correspond to the given read segment.

segment: fgpyo.read_structure.ReadSegment

The segment of the read structure that describes this sub-read.

class fgpyo.read_structure.SubReadWithoutQuals(*, bases: str, segment: ReadSegment)

Contains the bases that correspond to the given read segment.

bases: str

The sub-read bases that correspond to the given read segment.

property kind: fgpyo.read_structure.SegmentType

The kind of read segment that corresponds to this sub-read.

segment: fgpyo.read_structure.ReadSegment

The segment of the read structure that describes this sub-read.

See also

https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures

The fgbio wiki for an explanation of Read Structures.

Manipulating DNA and RNA Sequences

Utility Functions for Manipulating DNA and RNA sequences.

This module contains utility functions for manipulating DNA and RNA sequences.

fgpyo.sequence.complement(base: str) str

Returns the complement of any base.

fgpyo.sequence.reverse_complement(bases: str) str

Reverse complements a base sequence.

Parameters

bases – the bases to be reverse complemented.

Returns

the reverse complement of the provided base string
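
As a small illustrative sketch of the two functions above (the results follow from standard Watson-Crick base pairing):

>>> from fgpyo.sequence import complement, reverse_complement
>>> complement("A")
'T'
>>> reverse_complement("AACC")
'GGTT'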

Collections

Functions for Working with Collections

This module contains classes and functions for working with collections and iterators.

Examples of a “Peekable” Iterator

“Peekable” iterators are useful for “peeking” at the next item in an iterator without consuming it. For example, this is useful when consuming items from an iterator while a predicate is true, without consuming the first element for which the predicate is false. See the takewhile() and dropwhile() methods.

Calling peek() on an empty peekable iterator raises StopIteration:

>>> from fgpyo.collections import PeekableIterator
>>> piter = PeekableIterator(iter([]))
>>> piter.peek()
StopIteration

A peekable iterator will return the next item before consuming it.

>>> piter = PeekableIterator([1, 2, 3])
>>> piter.peek()
1
>>> next(piter)
1
>>> [j for j in piter]
[2, 3]

The can_peek() function can be used to determine if the iterator can be peeked without StopIteration being thrown:

>>> piter = PeekableIterator([1])
>>> piter.peek() if piter.can_peek() else -1
1
>>> next(piter)
1
>>> piter.peek() if piter.can_peek() else -1
-1
>>> next(piter)
StopIteration

PeekableIterator’s constructor supports creation from iterable objects as well as iterators.

Module Contents

The module contains the following public classes:

  • PeekableIterator – an iterator that allows you to peek at the next value before calling next

class fgpyo.collections.PeekableIterator(source: Union[Iterator[fgpyo.collections.IterType], Iterable[fgpyo.collections.IterType]])

A peekable iterator wrapping an iterator or iterable.

This allows returning the next item without consuming it.

Parameters

source – an iterator over the objects

can_peek() bool

Returns true if there is a value that can be peeked at, false otherwise.

dropwhile(pred: Callable[[fgpyo.collections.IterType], bool]) fgpyo.collections.PeekableIterator[fgpyo.collections.IterType]

Drops elements from the iterator while the predicate is true.

Updates the iterator to point at the first non-matching element, or exhausts the iterator if all elements match the predicate.

Parameters

pred (Callable[[V], bool]) – a function that takes a value from the iterator and returns true or false.

Returns

a reference to this iterator, so calls can be chained

Return type

PeekableIterator[V]

peek() fgpyo.collections.IterType

Returns the next element without consuming it, raising StopIteration if the iterator is exhausted.

takewhile(pred: Callable[[fgpyo.collections.IterType], bool]) List[fgpyo.collections.IterType]

Consumes from the iterator while pred is true, and returns the result as a List.

The iterator is left pointing at the first non-matching item, or if all items match then the iterator will be exhausted.

Parameters

pred – a function that takes the next value from the iterator and returns true or false.

Returns

A list of the values from the iterator, in order, up until and excluding the first value that does not match the predicate.

Return type

List[V]
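
A short sketch of takewhile() and dropwhile(), following the method descriptions above:

>>> from fgpyo.collections import PeekableIterator
>>> piter = PeekableIterator([1, 2, 3, 4, 5])
>>> piter.takewhile(lambda x: x < 3)
[1, 2]
>>> piter.dropwhile(lambda x: x == 3).peek()
4
>>> [j for j in piter]
[4, 5]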

Logging

Methods for setting up logging for tools.

Progress Logging Examples

Input data (SAM/BAM/CRAM/VCF) are frequently iterated in genomic coordinate order. Progress logging is useful for reporting not only how many records have been consumed, but also their genomic coordinates. ProgressLogger can log progress after every fixed number of records, and can write either to a logging.Logger or to a custom print method.

>>> from fgpyo.util.logging import ProgressLogger
>>> logged_lines = []
>>> progress = ProgressLogger(
...     printer=lambda s: logged_lines.append(s),
...     verb="recorded",
...     noun="items",
...     unit=2
... )
>>> progress.record(reference_name="chr1", position=1)  # does not log
False
>>> progress.record(reference_name="chr1", position=2)  # logs
True
>>> progress.record(reference_name="chr1", position=3)  # does not log
False
>>> progress.log_last()  # will log the last recorded item, if not previously logged
True
>>> logged_lines  # show the lines logged
['recorded 2 items: chr1:2', 'recorded 3 items: chr1:3']

class fgpyo.util.logging.ProgressLogger(printer: Union[logging.Logger, Callable[[str], Any]], noun: str = 'records', verb: str = 'Read', unit: int = 100000)

A little class to track progress.

This will output a log message after every unit items recorded.

printer

either a Logger (in which case progress is logged at the INFO level) or a lambda that consumes a single string

noun

the noun to use in the log message

verb

the verb to use in the log message

unit

the number of items for every log message

count

the total count of items recorded

log_last() bool

Force logging the last record, for example when progress has completed.

record(reference_name: Optional[str] = None, position: Optional[int] = None) bool

Record an item at a given genomic coordinate.

Parameters
  • reference_name – the reference name of the item

  • position – the 1-based start position of the item

Returns

true if a message was logged, false otherwise

record_alignment(rec: pysam.libcalignedsegment.AlignedSegment) bool

Correctly record pysam.AlignedSegments (zero-based coordinates).

Parameters

rec – pysam.AlignedSegment object

Returns

true if a message was logged, false otherwise
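
As an illustrative sketch (the BAM path below is hypothetical), progress over a pysam alignment iterator might be tracked as follows:

>>> import logging
>>> from fgpyo.sam import reader
>>> from fgpyo.util.logging import ProgressLogger
>>> progress = ProgressLogger(printer=logging.getLogger("fgpyo"), verb="read", noun="alignments", unit=100000)
>>> with reader("/path/to/sample.bam") as fh:
...     for rec in fh:
...         logged: bool = progress.record_alignment(rec)
...
>>> _ = progress.log_last()  # log the final count if it was not already logged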

fgpyo.util.logging.setup_logging(level: str = 'INFO', name: str = 'fgpyo') None

Globally configure logging for all modules

Configures logging to run at a specific level and output messages to stderr with useful information preceding the actual log message.

Parameters
  • level – the default level for the logger

  • name – the name of the logger
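
A minimal sketch of configuring logging for fgpyo and then emitting a message through the standard logging module:

>>> import logging
>>> from fgpyo.util.logging import setup_logging
>>> setup_logging(level="INFO")
>>> logging.getLogger("fgpyo").info("pipeline started")  # written to stderr with the configured formatting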

IO

Module for reading and writing files

The functions in this module make it easy to:

  • check if a file exists and is writeable

  • check if a file and its parent directories exist and are writeable

  • check if a file exists and is readable

  • check if a path exists and is a directory

  • open an appropriate reader or writer based on the file extension

  • write items to a file, one per line

  • read lines from a file

fgpyo.io Examples:

>>> import fgpyo.io as fio
>>> from pathlib import Path

Assert that a path exists and is readable:

>>> path_flat: Path = Path("example.txt")
>>> path_compressed: Path = Path("example.txt.gz")
>>> fio.assert_path_is_readable(path_flat)
AssertionError: Cannot read non-existent path: example.txt
>>> fio.assert_path_is_readable(path_compressed)
AssertionError: Cannot read non-existent path: example.txt.gz

Write to and read from a path:

>>> fio.write_lines(path=path_flat, lines_to_write=["flat file", 10])
>>> fio.write_lines(path=path_compressed, lines_to_write=["gzip file", 10])

Read lines from a path into a generator:

>>> lines = fio.read_lines(path=path_flat)
>>> next(lines)
'flat file'
>>> next(lines)
'10'
>>> lines = fio.read_lines(path=path_compressed)
>>> next(lines)
'gzip file'
>>> next(lines)
'10'

fgpyo.io.assert_directory_exists(path: pathlib.Path) None

Asserts that a path exists and is a directory

Parameters

path – Path to check

Example

assert_directory_exists(path=Path("/example/directory/"))

fgpyo.io.assert_path_is_readable(path: pathlib.Path) None

Asserts that the path exists and is readable, raising an AssertionError otherwise

Parameters

path – a Path to check

Example

assert_path_is_readable(path=Path("some_file.csv"))

fgpyo.io.assert_path_is_writeable(path: pathlib.Path, parent_must_exist: bool = True) None

Asserts the following: if the file exists then it must also be writeable. Otherwise, if the path is not a file and parent_must_exist is true, then the parent directory must exist and be writeable. Otherwise, if the path is not a directory and parent_must_exist is false, then each parent directory is examined in turn until one is found that exists and is writeable.

Parameters
  • path – Path to check

  • parent_must_exist – True/False the parent directory must exist, the default is True

Example

assert_path_is_writeable(path=Path("example.txt"))

fgpyo.io.read_lines(path: pathlib.Path, strip: bool = False) Iterator[str]

Takes a path and reads each line into a generator, removing line terminators along the way. By default only line terminators (CR/LF) are stripped. The strip parameter may be used to strip both leading and trailing whitespace from each line.

Parameters
  • path – Path to read from

  • strip – True to strip lines of all leading and trailing whitespace, False to only remove trailing CR/LF characters.

Example

read_back = fio.read_lines(path)

fgpyo.io.redirect_to_dev_null(file_num: int) Generator[None, None, None]

A context manager that redirects the output of a file handle to /dev/null

Parameters

file_num – the number of the file handle (file descriptor) to redirect.

fgpyo.io.suppress_stderr() Generator[None, None, None]

A context manager that redirects output of stderr to /dev/null
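
A brief sketch of silencing stderr for a block of code, assuming the generator above is used as a context manager as its description suggests:

>>> import sys
>>> from fgpyo.io import suppress_stderr
>>> with suppress_stderr():
...     print("this goes to /dev/null", file=sys.stderr)
...
>>> print("stderr is restored here", file=sys.stderr)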

fgpyo.io.to_reader(path: pathlib.Path) Union[_io.TextIOWrapper, TextIO, IO[Any]]

Opens a Path for reading, using open() or gzip.open() based on the file extension

Parameters

path – Path to read from

Example

>>> reader = fio.to_reader(path = Path("reader.txt"))
>>> reader.readlines()
>>> reader.close()

fgpyo.io.to_writer(path: pathlib.Path, append: bool = False) Union[IO[Any], _io.TextIOWrapper]

Opens a Path for writing (or appending), using open() or gzip.open() based on the file extension

Parameters

path – Path to write (or append) to

Example

>>> writer = fio.to_writer(path=Path("writer.txt"))
>>> writer.write("something to write\n")
>>> writer.close()

fgpyo.io.write_lines(path: pathlib.Path, lines_to_write: Iterable[Any], append: bool = False) None

Writes (or appends) a file with one line per item in the provided iterable

Parameters
  • path – Path to write (or append) to

  • lines_to_write – items to write (or append) to file

Example

lines: List[Any] = ["things to write", 100]
path_to_write_to: Path = Path("file_to_write_to.txt")
fio.write_lines(path=path_to_write_to, lines_to_write=lines)

Utility Functions

class fgpyo.util.inspect.AttrsInstance(*args, **kwargs)
class fgpyo.util.inspect.DataclassesProtocol(*args, **kwargs)
fgpyo.util.inspect.FieldType

TypeAlias for dataclasses Fields or attrs Attributes. It corresponds to the correct field type for the corresponding _DataclassesOrAttrClass

alias of Union[dataclasses.Field, attr._make.Attribute]

exception fgpyo.util.inspect.ParserNotFoundException
fgpyo.util.inspect.attr_from(cls: Type[fgpyo.util.inspect._AttrFromType], kwargs: Dict[str, str], parsers: Optional[Dict[type, Callable[[str], Any]]] = None) fgpyo.util.inspect._AttrFromType

Builds an attr or dataclasses class from keyword arguments

Parameters
  • cls – the attr or dataclasses class to be built

  • kwargs – a dictionary of keyword arguments

  • parsers – a dictionary of parser functions to apply to specific types
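
A minimal sketch of building an instance from string keyword arguments. The Interval dataclass below is hypothetical (defined only for illustration), and parsing of the int fields from their string values is assumed to behave as the description above suggests:

>>> from dataclasses import dataclass
>>> from fgpyo.util.inspect import attr_from
>>> @dataclass
... class Interval:
...     chrom: str
...     start: int
...     end: int
...
>>> attr_from(cls=Interval, kwargs={"chrom": "chr1", "start": "1", "end": "10"})
Interval(chrom='chr1', start=1, end=10)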

fgpyo.util.inspect.dict_parser(cls: Type, type_: typing_extensions.TypeAlias, parsers: Optional[Dict[type, Callable[[str], Any]]] = None) functools.partial

Returns a function that parses a stringified dict into a Dict of the correct type.

Parameters
  • cls – the type of the class object this is being parsed for (used to get the default value for parsers)

  • type_ – the type of the attribute to be parsed

  • parsers – an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types)

fgpyo.util.inspect.get_fields(cls: Union[fgpyo.util.inspect.DataclassesProtocol, fgpyo.util.inspect.AttrsInstance, Type[Union[fgpyo.util.inspect.DataclassesProtocol, fgpyo.util.inspect.AttrsInstance]]]) Tuple[Union[dataclasses.Field, attr._make.Attribute], ...]

Get the fields tuple from either a dataclasses or attr dataclass (or instance)

fgpyo.util.inspect.get_fields_dict(cls: Union[fgpyo.util.inspect.DataclassesProtocol, fgpyo.util.inspect.AttrsInstance, Type[Union[fgpyo.util.inspect.DataclassesProtocol, fgpyo.util.inspect.AttrsInstance]]]) Mapping[str, Union[dataclasses.Field, attr._make.Attribute]]

Get the fields dict from either a dataclasses or attr dataclass (or instance)

fgpyo.util.inspect.is_attr_class(cls: type) bool

Return True if the class is an attr class, and False otherwise

fgpyo.util.inspect.list_parser(cls: Type, type_: typing_extensions.TypeAlias, parsers: Optional[Dict[type, Callable[[str], Any]]] = None) functools.partial

Returns a function that parses a stringified list into a List of the correct type.

Parameters
  • cls – the type of the class object this is being parsed for (used to get the default value for parsers)

  • type_ – the type of the attribute to be parsed

  • parsers – an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types)

fgpyo.util.inspect.set_parser(cls: Type, type_: typing_extensions.TypeAlias, parsers: Optional[Dict[type, Callable[[str], Any]]] = None) functools.partial

Returns a function that parses a stringified set into a Set of the correct type.

Parameters
  • cls – the type of the class object this is being parsed for (used to get the default value for parsers)

  • type_ – the type of the attribute to be parsed

  • parsers – an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types)

fgpyo.util.inspect.split_at_given_level(field: str, split_delim: str = ',', increase_depth_chars: Iterable[str] = ('{', '(', '['), decrease_depth_chars: Iterable[str] = ('}', ')', ']')) List[str]

Splits a nested field by its outer-most level

Note that this method may produce incorrect results for fields containing strings with unpaired characters that increase or decrease the depth.

Not currently smart enough to deal with fields enclosed in quotes ('' or "") - TODO
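
A brief sketch of the behavior described above; the output shown is inferred from the description rather than taken from library output:

>>> from fgpyo.util.inspect import split_at_given_level
>>> split_at_given_level("a,[b,c],d")
['a', '[b,c]', 'd']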

fgpyo.util.inspect.tuple_parser(cls: Type, type_: typing_extensions.TypeAlias, parsers: Optional[Dict[type, Callable[[str], Any]]] = None) functools.partial

Returns a function that parses a stringified tuple into a Tuple of the correct type.

Parameters
  • cls – the type of the class object this is being parsed for (used to get the default value for parsers)

  • type_ – the type of the attribute to be parsed

  • parsers – an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types)

fgpyo.util.string.column_it(rows: List[List[str]], delimiter: str = ' ') str

A simple version of Unix’s column utility. This assumes the table is NxM.

Parameters
  • rows – the rows to adjust. Each row must have the same number of delimited fields.

  • delimiter – the delimiter for each field in a row.
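
A brief sketch of column_it(); the exact padding and alignment of the result depend on the implementation, so no output is shown:

>>> from fgpyo.util.string import column_it
>>> rows = [
...     ["contig", "length"],
...     ["chr1", "249250621"],
...     ["chr2", "243199373"],
... ]
>>> table: str = column_it(rows=rows, delimiter=" ")  # one delimited, width-aligned line per row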

exception fgpyo.util.types.InspectException
fgpyo.util.types.is_constructible_from_str(type_: type) bool

Returns true if the provided type can be constructed from a string

fgpyo.util.types.is_list_like(type_: type) bool

Returns true if the value is a list or list-like object

fgpyo.util.types.make_enum_parser(enum: Type[fgpyo.util.types.EnumType]) functools.partial

Makes a parser function for enum classes

fgpyo.util.types.make_literal_parser(literal: Type[fgpyo.util.types.LiteralType], parsers: Iterable[Callable[[str], fgpyo.util.types.LiteralType]]) functools.partial

Generates a parser function for a Literal type object, given a set of parsers for the possible values of that Literal type

fgpyo.util.types.make_union_parser(union: Type[fgpyo.util.types.UnionType], parsers: Iterable[Callable[[str], type]]) functools.partial

Generates a parser function for a Union type object, given a set of parsers for the possible member types of that Union

fgpyo.util.types.none_parser(value: str) Literal[None]

Returns None if the value is ‘None’, else raises an error

fgpyo.util.types.parse_bool(string: str) bool

Parses strings into bools accounting for the many different text representations of bools that can be used
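
A brief sketch, assuming that common textual representations such as "true" and "False" are among those accepted:

>>> from fgpyo.util.types import parse_bool
>>> parse_bool("true")
True
>>> parse_bool("False")
False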