FGPYO

API

SAM/BAM/CRAM files

Utility Classes and Methods for SAM/BAM

This module contains utility classes for working with SAM/BAM files and the data contained within them. This includes i) utilities for opening SAM/BAM files for reading and writing, ii) functions for manipulating supplementary alignments, iii) classes and functions for manipulating CIGAR strings, and iv) a class for building SAM records and files for testing.

Motivation for Reader and Writer methods

The following are the reasons for choosing to implement methods to open a SAM/BAM file for reading and writing, rather than relying on pysam.AlignmentFile directly:

  1. Provides a centralized place for the implementation of opening a SAM/BAM for reading and writing. This is useful if any additional parameters are added, or changes to standards or defaults are made.

  2. Makes the requirement to provide a header when opening a file for writing more explicit.

  3. Adds support for Path.

  4. Removes the reliance on specifying the mode correctly, including specifying the file type (i.e. SAM, BAM, or CRAM), as well as additional options (ex. compression level). This makes the code more explicit and easier to read.

  5. An explicit check is performed to ensure the file type is specified when writing using a file-like object rather than a path to a file.

Examples of Opening a SAM/BAM for Reading or Writing

Opening a SAM/BAM file for reading, auto-recognizing the file-type by the file extension. See SamFileType for the supported file types.

>>> from fgpyo.sam import reader
>>> with reader("/path/to/sample.sam") as fh:
...     for record in fh:
...         print(record.query_name)  # do something
>>> with reader("/path/to/sample.bam") as fh:
...     for record in fh:
...         print(record.query_name)  # do something

Opening a SAM/BAM file for reading, explicitly passing the file type.

>>> from fgpyo.sam import SamFileType
>>> with reader(path="/path/to/sample.ext1", file_type=SamFileType.SAM) as fh:
...     for record in fh:
...         print(record.query_name)  # do something
>>> with reader(path="/path/to/sample.ext2", file_type=SamFileType.BAM) as fh:
...     for record in fh:
...         print(record.query_name)  # do something

Opening a SAM/BAM file for reading, using an existing file-like object

>>> with open("/path/to/sample.sam", "rb") as file_object:
...     with reader(path=file_object, file_type=SamFileType.SAM) as fh:
...         for record in fh:
...             print(record.query_name)  # do something

Opening a SAM/BAM file for writing is similar to the reader() method, but a SAM file header object is also required.

>>> from typing import Any, Dict
>>> from fgpyo.sam import writer
>>> header: Dict[str, Any] = {
...     "HD": {"VN": "1.5", "SO": "coordinate"},
...     "RG": [{"ID": "1", "SM": "1_AAAAAA", "LB": "lib", "PL": "ILLUMINA", "PU": "xxx.1"}],
...     "SQ": [
...         {"SN": "chr1", "LN": 249250621},
...         {"SN": "chr2", "LN": 243199373}
...     ]
... }
>>> with writer(path="/path/to/sample.bam", header=header) as fh:
...     pass  # do something

Examples of Manipulating Cigars

Creating a Cigar from a pysam.AlignedSegment.

>>> from fgpyo.sam import Cigar
>>> with reader("/path/to/sample.sam") as fh:
...     record = next(fh)
...     cigar = Cigar.from_cigartuples(record.cigartuples)
...     print(str(cigar))
50M2D5M10S

Creating a Cigar from a str.

>>> cigar = Cigar.from_cigarstring("50M2D5M10S")
>>> print(str(cigar))
50M2D5M10S

If the cigar string is invalid, the exception message will show you the problem character(s) in square brackets.

>>> cigar = Cigar.from_cigarstring("10M5U")
... CigarException("Malformed cigar: 10M5[U]")

The cigar contains a tuple of CigarElements; each element pairs a CigarOp with its associated operator length. A number of useful methods are part of both classes.

The number of bases aligned on the query (i.e. the number of bases consumed by the cigar from the query):

>>> cigar = Cigar.from_cigarstring("50M2D5M2I10S")
>>> [e.length_on_query for e in cigar.elements]
[50, 0, 5, 2, 10]
>>> [e.length_on_target for e in cigar.elements]
[50, 2, 5, 0, 0]
>>> [e.operator.is_indel for e in cigar.elements]
[False, True, False, True, False]

Examples of parsing the SA tag and individual supplementary alignments

>>> from fgpyo.sam import SupplementaryAlignment
>>> sup = SupplementaryAlignment.parse("chr1,123,+,50S100M,60,0")
>>> sup.reference_name
'chr1'
>>> sup.nm
0
>>> from typing import List
>>> sa_tag = "chr1,123,+,50S100M,60,0;chr2,456,-,75S75M,60,1"
>>> sups: List[SupplementaryAlignment] = SupplementaryAlignment.parse_sa_tag(tag=sa_tag)
>>> len(sups)
2
>>> [str(sup.cigar) for sup in sups]
['50S100M', '75S75M']

Module Contents

The module contains the following public classes:
  • SupplementaryAlignment – Stores a supplementary alignment record produced by BWA and stored in the SA SAM tag.

  • SamFileType – Enumeration of valid SAM/BAM/CRAM file types.

  • SamOrder – Enumeration of possible SAM/BAM/CRAM sort orders.

  • CigarOp – Enumeration of operators that can appear in a Cigar string.

  • CigarElement – Class representing an element in a Cigar string.

  • CigarParsingException – The exception raised specific to parsing a cigar.

  • Cigar – Class representing a cigar string.

  • ReadEditInfo – Class describing how a read differs from the reference

The module contains the following methods:

  • reader() – opens a SAM/BAM/CRAM file for reading.

  • writer() – opens a SAM/BAM/CRAM file for writing.

  • calculate_edit_info() – calculates how a read differs from the reference.

class fgpyo.sam.Cigar(elements: Tuple[fgpyo.sam.CigarElement, ...] = ())

Class representing a cigar string.

- elements

zero or more cigar elements

Type

Tuple[CigarElement, …]

classmethod from_cigarstring(cigarstring: str) fgpyo.sam.Cigar

Constructs a Cigar from a string returned by pysam.

If “*” is given, returns an empty Cigar.

classmethod from_cigartuples(cigartuples: Optional[List[Tuple[int, int]]]) fgpyo.sam.Cigar

Returns a Cigar from a list of tuples returned by pysam.

Each tuple denotes the operation and length. See CigarOp for more information on the various operators. If None is given, returns an empty Cigar.

length_on_query() int

Returns the length of the alignment on the query sequence.

length_on_target() int

Returns the length of the alignment on the target sequence.

reversed() fgpyo.sam.Cigar

Returns a copy of the Cigar with the elements in reverse order.

class fgpyo.sam.CigarElement(length: int, operator: fgpyo.sam.CigarOp)

Represents an element in a Cigar

- length

the length of the element

Type

int

- operator

the operator of the element

Type

CigarOp

property length_on_query: int

Returns the length of the element on the query sequence.

property length_on_target: int

Returns the length of the element on the target (often reference) sequence.

class fgpyo.sam.CigarOp(value)

Enumeration of operators that can appear in a Cigar string.

code

The pysam cigar operator code.

Type

int

character

The single character cigar operator.

Type

str

consumes_query

True if this operator consumes query bases, False otherwise.

Type

bool

consumes_target

True if this operator consumes target bases, False otherwise.

Type

bool

D = (2, 'D', False, True)

Deletion versus the reference

EQ = (7, '=', True, True)

Matches the reference

H = (5, 'H', False, False)

Hard clip

I = (1, 'I', True, False)

Insertion versus the reference

M = (0, 'M', True, True)

Match or Mismatch the reference

N = (3, 'N', False, True)

Skipped region from the reference

P = (6, 'P', False, False)

Padding

S = (4, 'S', True, False)

Soft clip

X = (8, 'X', True, True)

Mismatches the reference

static from_character(character: str) fgpyo.sam.CigarOp

Returns the operator from the single character.

static from_code(code: int) fgpyo.sam.CigarOp

Returns the operator from the given operator code.

Note: this is mainly used to get the operator from pysam.

property is_clipping: bool

Returns true if the operator is a soft/hard clip, false otherwise.

property is_indel: bool

Returns true if the operator is an indel, false otherwise.

exception fgpyo.sam.CigarParsingException

The exception raised specific to parsing a cigar.

fgpyo.sam.NO_REF_INDEX: int = -1

The reference index to use to indicate no reference in SAM/BAM.

fgpyo.sam.NO_REF_NAME: str = '*'

The reference name to use to indicate no reference in SAM/BAM.

fgpyo.sam.NO_REF_POS: int = -1

The reference position to use to indicate no position in SAM/BAM.

class fgpyo.sam.ReadEditInfo(matches: int, mismatches: int, insertions: int, inserted_bases: int, deletions: int, deleted_bases: int, nm: int)

Counts various stats about how a read compares to a reference sequence.

matches

the number of bases in the read that match the reference

Type

int

mismatches

the number of mismatches between the read sequence and the reference sequence as dictated by the alignment. As defined for the SAM NM tag computation, any base except A/C/G/T in the read is considered a mismatch.

Type

int

insertions

the number of insertions in the read vs. the reference. I.e. the number of I operators in the CIGAR string.

Type

int

inserted_bases

the total number of bases contained within insertions in the read

Type

int

deletions

the number of deletions in the read vs. the reference. I.e. the number of D operators in the CIGAR string.

Type

int

deleted_bases

the total number of bases that are deleted within the alignment (i.e. bases in the reference but not in the read).

Type

int

nm

the computed value of the SAM NM tag, calculated as mismatches + inserted_bases + deleted_bases

Type

int

class fgpyo.sam.SamFileType(value)

Enumeration of valid SAM/BAM/CRAM file types.

mode

The additional mode character to add when opening this file type.

Type

str

ext

The standard file extension for this file type.

Type

str

classmethod from_path(path: Union[pathlib.Path, str]) fgpyo.sam.SamFileType

Infers the file type based on the file extension.

Parameters

path – the path to the SAM/BAM/CRAM to read or write.
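
For example (a minimal sketch; the paths are illustrative):

>>> from fgpyo.sam import SamFileType
>>> SamFileType.from_path("/path/to/sample.bam") is SamFileType.BAM
True
>>> SamFileType.from_path("/path/to/sample.sam") is SamFileType.SAM
True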

class fgpyo.sam.SamOrder(value)

Enumeration of possible sort orders for a SAM file.

Coordinate = 'coordinate'

coordinate sorted

QueryName = 'queryname'

queryname sorted

Unsorted = 'unsorted'

the SAM / BAM / CRAM is unsorted

fgpyo.sam.SamPath

The valid base classes for opening a SAM/BAM/CRAM file.

alias of Union[IO[Any], pathlib.Path, str]

class fgpyo.sam.SupplementaryAlignment(reference_name: str, start: int, is_forward: bool, cigar: fgpyo.sam.Cigar, mapq: int, nm: int)

Stores a supplementary alignment record produced by BWA and stored in the SA SAM tag.

reference_name

the name of the reference (i.e. contig, chromosome) aligned to

Type

str

start

the 0-based start position of the alignment

Type

int

is_forward

true if the alignment is on the forward strand, false otherwise

Type

bool

cigar

the cigar for the alignment

Type

fgpyo.sam.Cigar

mapq

the mapping quality

Type

int

nm

the number of edits

Type

int

property end: int

The 0-based exclusive end position of the alignment.

classmethod from_read(read: pysam.libcalignedsegment.AlignedSegment) List[fgpyo.sam.SupplementaryAlignment]

Construct a list of SupplementaryAlignments from the SA tag in a pysam.AlignedSegment.

Parameters

read – An alignment. The presence of the “SA” tag is not required.

Returns

A list of all SupplementaryAlignments present in the SA tag. If the SA tag is not present, or it is empty, an empty list will be returned.
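
A minimal sketch of constructing a record with SamBuilder (documented below) and parsing its SA tag; the tag value is taken from the parsing examples above and the contig names assume the builder's default sequence dictionary:

>>> from fgpyo.sam import SupplementaryAlignment
>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder()
>>> rec = builder.add_single(chrom="chr1", start=100, attrs={"SA": "chr2,456,-,75S75M,60,1"})
>>> sups = SupplementaryAlignment.from_read(rec)
>>> len(sups)
1
>>> sups[0].reference_name
'chr2'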

static parse(string: str) fgpyo.sam.SupplementaryAlignment

Returns a supplementary alignment parsed from the given string. The various fields should be comma-delimited (ex. chr1,123,-,100M50S,60,4)

static parse_sa_tag(tag: str) List[fgpyo.sam.SupplementaryAlignment]

Parses an SA tag of supplementary alignments from a BAM file. If the tag is empty or contains just a single semi-colon then an empty list will be returned. Otherwise a list containing a SupplementaryAlignment per ;-separated value in the tag will be returned.

class fgpyo.sam.Template(name: str, r1: Optional[pysam.libcalignedsegment.AlignedSegment], r2: Optional[pysam.libcalignedsegment.AlignedSegment], r1_supplementals: List[pysam.libcalignedsegment.AlignedSegment], r2_supplementals: List[pysam.libcalignedsegment.AlignedSegment], r1_secondaries: List[pysam.libcalignedsegment.AlignedSegment], r2_secondaries: List[pysam.libcalignedsegment.AlignedSegment])

A container for alignment records corresponding to a single sequenced template or insert.

It is strongly preferred that new Template instances be created with Template.build(), which will ensure that reads are stored in the correct Template property and run basic validations of the Template by default. If constructing Template instances directly via the constructor, users are encouraged to call the validate() method post-construction.
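
As a minimal sketch (using SamBuilder, documented below, and assuming its default sequence dictionary contains "chr1"), a Template can be built from a pair of records as follows:

>>> from fgpyo.sam import Template
>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder()
>>> r1, r2 = builder.add_pair(chrom="chr1", start1=100, start2=200)
>>> template = Template.build([r1, r2])
>>> template.r1 is r1 and template.r2 is r2
True
>>> len(list(template.all_recs()))
2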

name

the name of the template/query

Type

str

r1

Primary alignment for read 1, or None if there is none

Type

Optional[pysam.libcalignedsegment.AlignedSegment]

r2

Primary alignment for read 2, or None if there is none

Type

Optional[pysam.libcalignedsegment.AlignedSegment]

r1_supplementals

Supplementary alignments for read 1

Type

List[pysam.libcalignedsegment.AlignedSegment]

r2_supplementals

Supplementary alignments for read 2

Type

List[pysam.libcalignedsegment.AlignedSegment]

r1_secondaries

Secondary (non-primary) alignments for read 1

Type

List[pysam.libcalignedsegment.AlignedSegment]

r2_secondaries

Secondary (non-primary) alignments for read 2

Type

List[pysam.libcalignedsegment.AlignedSegment]

all_recs() Iterator[pysam.libcalignedsegment.AlignedSegment]

Returns an iterator over all the records for the template.

static build(recs: Iterable[pysam.libcalignedsegment.AlignedSegment], validate: bool = True) fgpyo.sam.Template

Build a template from a set of records all with the same queryname.

static iterator(alns: Iterator[pysam.libcalignedsegment.AlignedSegment]) Iterator[fgpyo.sam.Template]

Returns an iterator over templates. Assumes the input iterable is queryname grouped, and gathers consecutive runs of records sharing a common query name into templates.

primary_recs() Iterator[pysam.libcalignedsegment.AlignedSegment]

Returns an iterator over all the primary records for the template.

set_tag(tag: str, value: Optional[Union[str, int, float]]) None

Add a tag to all records associated with the template.

Setting a tag to None will remove the tag.

Parameters
  • tag – The name of the tag.

  • value – The value of the tag.

validate() None

Performs sanity checks that all the records in the Template are as expected.

write_to(writer: pysam.libcalignmentfile.AlignmentFile, primary_only: bool = False) None

Write the records associated with the template to file.

Parameters
  • writer – An open, writable AlignmentFile.

  • primary_only – If True, only write primary alignments.

class fgpyo.sam.TemplateIterator(iterator: Iterator[pysam.libcalignedsegment.AlignedSegment])

An iterator that converts an iterator over query-grouped reads into an iterator over templates.

fgpyo.sam.calculate_edit_info(rec: pysam.libcalignedsegment.AlignedSegment, reference_sequence: str, reference_offset: Optional[int] = None) fgpyo.sam.ReadEditInfo

Constructs a ReadEditInfo instance giving summary stats about how the read aligns to the reference. Computes the number of mismatches, indels, indel bases and the SAM NM tag. The read must be aligned.

Parameters
  • rec – the read/record for which to calculate values

  • reference_sequence – the reference sequence (or fragment thereof) that the read is aligned to

  • reference_offset – if provided, assume that reference_sequence[reference_offset] is the first base aligned to in reference_sequence, otherwise use rec.reference_start

Returns

a ReadEditInfo with information about how the read differs from the reference
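
A minimal sketch (the read bases, reference fragment, and contig name are illustrative; "chr1" assumes the SamBuilder default sequence dictionary):

>>> from fgpyo.sam import calculate_edit_info
>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder(r1_len=10)
>>> rec = builder.add_single(bases="AAAAAAAAAA", chrom="chr1", start=100, cigar="10M")
>>> info = calculate_edit_info(rec, reference_sequence="AAAAACAAAA", reference_offset=0)
>>> (info.matches, info.mismatches, info.nm)
(9, 1, 1)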

fgpyo.sam.isize(r1: pysam.libcalignedsegment.AlignedSegment, r2: pysam.libcalignedsegment.AlignedSegment) int

Computes the insert size for a pair of records.

fgpyo.sam.reader(path: Union[IO[Any], pathlib.Path, str], file_type: Optional[fgpyo.sam.SamFileType] = None, unmapped: bool = False) pysam.libcalignmentfile.AlignmentFile

Opens a SAM/BAM/CRAM for reading.

Parameters
  • path – a file handle or path to the SAM/BAM/CRAM to read or write.

  • file_type – the file type to assume when opening the file. If None, then the file type will be auto-detected.

  • unmapped – True if the file is unmapped and has no sequence dictionary, False otherwise.

fgpyo.sam.set_pair_info(r1: pysam.libcalignedsegment.AlignedSegment, r2: pysam.libcalignedsegment.AlignedSegment, proper_pair: bool = True) None

Resets mate pair information between reads in a pair. Requires that both r1 and r2 are mapped. Can be handed reads that already have pairing flags set up, or independent R1 and R2 records that are currently flagged as SE reads.

Parameters
  • r1 – read 1

  • r2 – read 2 with the same queryname as r1
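
A sketch of pairing two independently-created records (the contig name assumes the SamBuilder default sequence dictionary; the mate-position check reflects the description above):

>>> from fgpyo.sam import set_pair_info
>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder()
>>> r1 = builder.add_single(name="q1", read_num=1, chrom="chr1", start=100)
>>> r2 = builder.add_single(name="q1", read_num=2, chrom="chr1", start=300, strand="-")
>>> set_pair_info(r1, r2)
>>> r1.next_reference_start
300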

fgpyo.sam.writer(path: Union[IO[Any], pathlib.Path, str], header: Union[str, Dict[str, Any], pysam.libcalignmentfile.AlignmentHeader], file_type: Optional[fgpyo.sam.SamFileType] = None) pysam.libcalignmentfile.AlignmentFile

Opens a SAM/BAM/CRAM for writing.

Parameters
  • path – a file handle or path to the SAM/BAM/CRAM to read or write.

  • header – Either a string to use for the header or a multi-level dictionary. The multi-level dictionary should be given as follows. The first level are the four types (‘HD’, ‘SQ’, …). The second level are a list of lines, with each line being a list of tag-value pairs. The header is constructed first from all the defined fields, followed by user tags in alphabetical order.

  • file_type – the file type to assume when opening the file. If None, then the file type will be auto-detected, in which case path must be a path-like object.

Classes for generating SAM and BAM files and records for testing

This module contains utility classes for the generation of SAM and BAM files and alignment records, for use in testing.

The module contains the following public classes:

  • SamBuilder – A builder class that allows the accumulation of alignment records and access as a list and writing to file.

class fgpyo.sam.builder.SamBuilder(r1_len: Optional[int] = None, r2_len: Optional[int] = None, base_quality: int = 30, mapping_quality: int = 60, sd: Optional[List[Dict[str, Any]]] = None, rg: Optional[Dict[str, str]] = None, extra_header: Optional[Dict[str, Any]] = None, seed: int = 42, sort_order: fgpyo.sam.SamOrder = SamOrder.Coordinate)

Builder for constructing one or more SAM records (AlignedSegments in pysam terms).

Provides the ability to manufacture records from minimal arguments, while generating any remaining attributes to ensure a valid record.

A builder is constructed with a handful of defaults including lengths for generated R1s and R2s, the default base quality score to use, a sequence dictionary and a single read group.

Records are then added using the add_pair() method. Once accumulated the records can be accessed in the order in which they were created through the to_unsorted_list() function, or in a list sorted by coordinate order via to_sorted_list(). The latter creates a temporary file to do the sorting and is somewhat slower as a result. Lastly, the records can be written to a temporary file using to_path().
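
For example (a minimal sketch; the contig name and start positions are illustrative and assume the default sequence dictionary):

>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder(r1_len=50, r2_len=50)
>>> r1, r2 = builder.add_pair(chrom="chr1", start1=1000, start2=1200)
>>> len(builder.to_unsorted_list())
2
>>> bam_path = builder.to_path()  # a coordinate-sorted (and indexed) temporary BAM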

add_pair(*, name: Optional[str] = None, bases1: Optional[str] = None, bases2: Optional[str] = None, quals1: Optional[List[int]] = None, quals2: Optional[List[int]] = None, chrom: Optional[str] = None, chrom1: Optional[str] = None, chrom2: Optional[str] = None, start1: int = - 1, start2: int = - 1, cigar1: Optional[str] = None, cigar2: Optional[str] = None, mapq1: Optional[int] = None, mapq2: Optional[int] = None, strand1: str = '+', strand2: str = '-', attrs: Optional[Dict[str, Any]] = None) Tuple[pysam.libcalignedsegment.AlignedSegment, pysam.libcalignedsegment.AlignedSegment]

Generates a new pair of reads, adds them to the internal collection, and returns them.

Most fields are optional.

Mapped pairs can be created by specifying both start1 and start2 and either chrom, for pairs where both reads map to the same contig, or both chrom1 and chrom2, for pairs where reads map to different contigs. i.e.:

  • add_pair(chrom, start1, start2) will create a mapped pair where both reads map to the same contig (chrom).

  • add_pair(chrom1, start1, chrom2, start2) will create a mapped pair where the reads map to different contigs (chrom1 and chrom2).

A pair with only one of the two reads mapped can be created by setting only one start position. Flags will automatically be set correctly for the unmapped mate.

  • add_pair(chrom, start1)

  • add_pair(chrom1, start1)

  • add_pair(chrom, start2)

  • add_pair(chrom2, start2)

An unmapped pair can be created by calling the method with no parameters (specifically, not setting chrom, chrom1, start1, chrom2, or start2). If either cigar is provided, it will be ignored.

For a given read (i.e. R1 or R2) the length of the read is determined based on the presence or absence of bases, quals, and cigar. If values are provided for one or more of these parameters, the lengths must match, and the length will be used to generate any unsupplied values. If none of bases, quals, and cigar are provided, all three will be synthesized based on either the r1_len or r2_len stored on the class as appropriate.

When synthesizing, bases are always a random sequence of bases, quals are all the default base quality (supplied when constructing a SamBuilder) and the cigar is always a single M operator of the read length.

Parameters
  • name – The name of the template. If None is given a unique name will be auto-generated.

  • bases1 – The bases for R1. If None is given a random sequence is generated.

  • bases2 – The bases for R2. If None is given a random sequence is generated.

  • quals1 – The list of int qualities for R1. If None, the default base quality is used.

  • quals2 – The list of int qualities for R2. If None, the default base quality is used.

  • chrom – The chromosome to which both reads are mapped. Defaults to the unmapped value.

  • chrom1 – The chromosome to which R1 is mapped. If None, chrom is used.

  • chrom2 – The chromosome to which R2 is mapped. If None, chrom is used.

  • start1 – The start position of R1. Defaults to the unmapped value.

  • start2 – The start position of R2. Defaults to the unmapped value.

  • cigar1 – The cigar string for R1. Defaults to None for unmapped reads, otherwise all M.

  • cigar2 – The cigar string for R2. Defaults to None for unmapped reads, otherwise all M.

  • mapq1 – Mapping quality for R1. Defaults to self.mapping_quality if None.

  • mapq2 – Mapping quality for R2. Defaults to self.mapping_quality if None.

  • strand1 – The strand for R1, either “+” or “-”. Defaults to “+”.

  • strand2 – The strand for R2, either “+” or “-”. Defaults to “-“.

  • attrs – An optional dictionary of SAM attributes to place on both R1 and R2.

Raises
  • ValueError – if either strand field is not “+” or “-”

  • ValueError – if bases/quals/cigar are set in a way that is not self-consistent

Returns

The pair of records created, R1 then R2.

Return type

Tuple[AlignedSegment, AlignedSegment]

add_single(*, name: Optional[str] = None, read_num: Optional[int] = None, bases: Optional[str] = None, quals: Optional[List[int]] = None, chrom: str = '*', start: int = - 1, cigar: Optional[str] = None, mapq: Optional[int] = None, strand: str = '+', secondary: bool = False, supplementary: bool = False, attrs: Optional[Dict[str, Any]] = None) pysam.libcalignedsegment.AlignedSegment

Generates a new single read, adds it to the internal collection, and returns it.

Most fields are optional.

If read_num is None (the default) an unpaired read will be created. If read_num is set to 1 or 2, the read will have its paired flag and read number flags set.

An unmapped read can be created by calling the method with no parameters (specifically, not setting chrom or start). If cigar is provided, it will be ignored.

A mapped read is created by providing chrom and start.

The length of the read is determined based on the presence or absence of bases, quals, and cigar. If values are provided for one or more of these parameters, the lengths must match, and the length will be used to generate any unsupplied values. If none of bases, quals, and cigar are provided, all three will be synthesized based on either the r1_len or r2_len stored on the class as appropriate.

When synthesizing, bases are always a random sequence of bases, quals are all the default base quality (supplied when constructing a SamBuilder) and the cigar is always a single M operator of the read length.

Parameters
  • name – The name of the template. If None is given a unique name will be auto-generated.

  • read_num – Either None, 1 for R1 or 2 for R2

  • bases – The bases for the read. If None is given a random sequence is generated.

  • quals – The list of qualities for the read. If None, the default base quality is used.

  • chrom – The chromosome to which the read is mapped. Defaults to the unmapped value.

  • start – The start position of the read. Defaults to the unmapped value.

  • cigar – The cigar string for the read. Defaults to None for unmapped reads, otherwise all M.

  • mapq – Mapping quality for the read. Default to self.mapping_quality if not given.

  • strand – The strand for the read, either “+” or “-”. Defaults to “+”.

  • secondary – If true the read will be flagged as secondary

  • supplementary – If true the read will be flagged as supplementary

  • attrs – An optional dictionary of SAM attributes to place on the read.

Raises
  • ValueError – if strand field is not “+” or “-”

  • ValueError – if read_num is not None, 1 or 2

  • ValueError – if bases/quals/cigar are set in a way that is not self-consistent

Returns

The record created

Return type

AlignedSegment

static default_rg() Dict[str, str]

Returns the default read group used by the SamBuilder, as a dictionary.

static default_sd() List[Dict[str, Any]]

Generates the sequence dictionary that is used by default by SamBuilder.

Matches the names and lengths of the HG19 reference in use in production.

Returns

A new copy of the sequence dictionary as a list of dictionaries, one per chromosome.

property header: pysam.libcalignmentfile.AlignmentHeader

Returns the builder’s SAM header.

rg() Dict[str, Any]

Returns the single read group that is defined in the header.

rg_id() str

Returns the ID of the single read group that is defined in the header.

to_path(path: Optional[pathlib.Path] = None, index: bool = True, pred: Callable[[pysam.libcalignedsegment.AlignedSegment], bool] = <function SamBuilder.<lambda>>) pathlib.Path

Writes the accumulated records to a file, sorts (and optionally indexes) it, and returns the Path. If a path is provided, it will be written to; otherwise a temporary file is created and returned.

Parameters
  • path – a path at which to write the file, otherwise a temp file is used.

  • index – if True and sort_order is Coordinate, an index is generated; otherwise not.

  • pred – optional predicate to specify which reads should be output

Returns

The path to the sorted (and possibly indexed) file.

Return type

Path

to_sorted_list() List[pysam.libcalignedsegment.AlignedSegment]

Returns the accumulated records in coordinate order.

to_unsorted_list() List[pysam.libcalignedsegment.AlignedSegment]

Returns the accumulated records in the order they were created.

Utility Functions for Soft-Clipping records in SAM/BAM Files

This module contains utility functions for soft-clipping reads. There are four variants that support clipping the beginnings and ends of reads, and specifying the amount to be clipped in terms of query bases or reference bases:

  • softclip_start_of_alignment_by_query() clips the start of the alignment in terms of query bases

  • softclip_end_of_alignment_by_query() clips the end of the alignment in terms of query bases

  • softclip_start_of_alignment_by_ref() clips the start of the alignment in terms of reference bases

  • softclip_end_of_alignment_by_ref() clips the end of the alignment in terms of reference bases

The difference between query and reference based versions is apparent only when there are insertions or deletions in the read as indels have lengths on either the query (insertions) or reference (deletions) but not both.

Upon clipping, a set of additional SAM tags is removed from reads, as they are likely invalid.

For example, to clip the last 10 query bases of all records and reduce the qualities to Q2:

>>> from fgpyo.sam import reader, clipping
>>> with reader("/path/to/sample.sam") as fh:
...     for rec in fh:
...         clipping.softclip_end_of_alignment_by_query(rec, 10, 2)
...         print(rec.cigarstring)
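
To illustrate the query- versus reference-based distinction, the following sketch uses SamBuilder (the cigar and position are illustrative, and the clipped-base counts assume the trailing insertion is absorbed into the soft clip):

>>> from fgpyo.sam import clipping
>>> from fgpyo.sam.builder import SamBuilder
>>> builder = SamBuilder(r1_len=20)
>>> rec = builder.add_single(chrom="chr1", start=100, cigar="8M2I10M")
>>> info = clipping.softclip_end_of_alignment_by_query(rec, bases_to_clip=12)
>>> rec.cigarstring
'8M12S'
>>> (info.query_bases_clipped, info.ref_bases_clipped)
(12, 10)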

It should be noted that any clipping potentially makes the common SAM tags NM, MD and UQ invalid, as well as potentially other alignment based SAM tags. Any clipping added to the start of an alignment changes the position (reference_start) of the record. Any reads that have no aligned bases after clipping are set to be unmapped. If writing the clipped reads back to a BAM it should be noted that:

  • Mate pairs may have incorrect information about their mate’s positions

  • Even if the input was coordinate sorted, the output may be out of order

To rectify these problems it is necessary to do the equivalent of:

cat clipped.bam | samtools sort -n | samtools fixmate | samtools sort | samtools calmd

class fgpyo.sam.clipping.ClippingInfo(query_bases_clipped: int, ref_bases_clipped: int)

Named tuple holding the number of bases clipped on the query and reference respectively.

query_bases_clipped: int

The number of query bases in the alignment that were clipped.

ref_bases_clipped: int

The number of reference bases in the alignment that were clipped.

fgpyo.sam.clipping.softclip_end_of_alignment_by_query(rec: pysam.libcalignedsegment.AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = ('MD', 'NM', 'UQ')) fgpyo.sam.clipping.ClippingInfo

Adds soft-clipping to the end of a read’s alignment.

Clipping is applied before any existing hard or soft clipping. E.g. a read with cigar 100M5S that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters
  • rec – the BAM record to clip

  • bases_to_clip – the number of additional bases of clipping desired in the read/query

  • clipped_base_quality – if not None, set bases in the clipped region to this quality

  • tags_to_invalidate – the set of extended attributes to remove upon clipping

Returns

a named tuple containing the number of query/read bases and the number of target/reference bases clipped.

Return type

ClippingInfo

fgpyo.sam.clipping.softclip_end_of_alignment_by_ref(rec: pysam.libcalignedsegment.AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = ('MD', 'NM', 'UQ')) fgpyo.sam.clipping.ClippingInfo

Soft-clips the end of an alignment by bases_to_clip bases on the reference.

Clipping is applied before any existing hard or soft clipping. E.g. a read with cigar 100M5S that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters
  • rec – the BAM record to clip

  • bases_to_clip – the number of additional bases of clipping desired on the reference

  • clipped_base_quality – if not None, set bases in the clipped region to this quality

  • tags_to_invalidate – the set of extended attributes to remove upon clipping

Returns

a named tuple containing the number of query/read bases and the number of target/reference bases clipped.

Return type

ClippingInfo

fgpyo.sam.clipping.softclip_start_of_alignment_by_query(rec: pysam.libcalignedsegment.AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = ('MD', 'NM', 'UQ')) fgpyo.sam.clipping.ClippingInfo

Adds soft-clipping to the start of a read’s alignment.

Clipping is applied after any existing hard or soft clipping. E.g. a read with cigar 5S100M that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters
  • rec – the BAM record to clip

  • bases_to_clip – the number of additional bases of clipping desired in the read/query

  • clipped_base_quality – if not None, set bases in the clipped region to this quality

  • tags_to_invalidate – the set of extended attributes to remove upon clipping

Returns

a named tuple containing the number of query/read bases and the number of target/reference bases clipped.

Return type

ClippingInfo

fgpyo.sam.clipping.softclip_start_of_alignment_by_ref(rec: pysam.libcalignedsegment.AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = ('MD', 'NM', 'UQ')) fgpyo.sam.clipping.ClippingInfo

Soft-clips the start of an alignment by bases_to_clip bases on the reference.

Clipping is applied after any existing hard or soft clipping. E.g. a read with cigar 5S100M that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters
  • rec – the BAM record to clip

  • bases_to_clip – the number of additional bases of clipping desired on the reference

  • clipped_base_quality – if not None, set bases in the clipped region to this quality

  • tags_to_invalidate – the set of extended attributes to remove upon clipping

Returns

a named tuple containing the number of query/read bases and the number of target/reference bases clipped.

Return type

ClippingInfo

VCF/BCF files

Classes for generating VCF and records for testing

This module contains utility classes for the generation of VCF files and variant records, for use in testing.

The module contains the following public classes:

  • VariantBuilder – A builder class that allows the accumulation of variant records and access as a list and writing to file.

Examples

Typically, we have VariantRecord records obtained from reading from a VCF file. The VariantBuilder class builds such records.

Variants are added with the add() method, which returns a pysam.VariantRecord.

>>> import pysam
>>> from fgpyo.vcf.builder import VariantBuilder
>>> builder: VariantBuilder = VariantBuilder()
>>> new_record_1: pysam.VariantRecord = builder.add()  # uses the defaults
>>> new_record_2: pysam.VariantRecord = builder.add(
...     contig="chr2", pos=1001, id="rs1234", ref="C", alts=["T"],
...     qual=40, filter=["PASS"]
... )

VariantBuilder can create sites-only, single-sample, or multi-sample VCF files. If not producing a sites-only VCF file, VariantBuilder must be created by passing a list of sample IDs:

>>> builder: VariantBuilder = VariantBuilder(sample_ids=["sample1", "sample2"])
>>> new_record_1: pysam.VariantRecord = builder.add()  # uses the defaults
>>> new_record_2: pysam.VariantRecord = builder.add(
...     samples={"sample1": {"GT": "0|1"}, "sample2": {"GT": "0|0"}}
... )

The variants stored in the builder can be retrieved as a coordinate sorted VCF file via the to_path() method:

>>> from pathlib import Path
>>> path_to_vcf: Path = builder.to_path()

The variants may also be retrieved in the order they were added via the to_unsorted_list() method and in coordinate sorted order via the to_sorted_list() method.

fgpyo.vcf.reader(path: Union[pathlib.Path, str, TextIO]) Generator[pysam.libcbcf.VariantFile, None, None]

Opens the given path for VCF reading

Parameters

path – the path to a VCF, or an open file handle

fgpyo.vcf.writer(path: Union[pathlib.Path, str, TextIO], header: pysam.libcbcf.VariantHeader) Generator[pysam.libcbcf.VariantFile, None, None]

Opens the given path for VCF writing.

Parameters
  • path – the path to a VCF, or an open filehandle

  • header – the source for the output VCF header. If you are modifying a VCF file that you are reading from, you can pass reader.header

Classes for generating VCF and records for testing

class fgpyo.vcf.builder.VariantBuilder(sample_ids: Optional[Iterable[str]] = None, sd: Optional[Dict[str, Dict[str, Any]]] = None)

Builder for constructing one or more variant records (pysam.VariantRecord) for a VCF. The VCF can be sites-only, single-sample, or multi-sample.

Provides the ability to manufacture variants from minimal arguments, while generating any remaining attributes to ensure a valid variant.

A builder is constructed with a handful of defaults including the sample name and sequence dictionary. If the VCF will not be sites-only, the list of sample IDs (“sample_ids”) must be provided to the VariantBuilder constructor.

Variants are then added using the add() method. Once accumulated the variants can be accessed in the order in which they were created through the to_unsorted_list() function, or in a list sorted by coordinate order via to_sorted_list(). Lastly, the records can be written to a temporary file using to_path().

sample_ids

the sample name(s)

Type

List[str]

sd

sequence dictionary, implemented as python dict from contig name to dictionary with contig properties. At a minimum, each contig dict in sd must contain “ID” (the same as contig_name) and “length”, the contig length. Other values will be added to the VCF header line for that contig.

Type

Dict[str, Dict[str, Any]]

seq_idx_lookup

dictionary mapping contig name to index of contig in sd

Type

Dict[str, int]

records

the list of variant records

Type

List[pysam.libcbcf.VariantRecord]

header

the pysam header

Type

pysam.libcbcf.VariantHeader

add(contig: Optional[str] = None, pos: int = 1000, id: str = '.', ref: str = 'A', alts: Union[None, str, Iterable[str]] = ('.',), qual: int = 60, filter: Union[None, str, Iterable[str]] = None, info: Optional[Dict[str, Any]] = None, samples: Optional[Dict[str, Dict[str, Any]]] = None) pysam.libcbcf.VariantRecord

Generates a new variant and adds it to the internal collection.

Notes:
  • Very little validation is done with respect to INFO and FORMAT keys being defined in the header.
  • VCFs are 1-based, but pysam is (mostly) 0-based. We define the function in terms of the VCF property “pos”, which is 1-based. pysam will also report “pos” as 1-based, so that is the property that should be accessed when using the records produced by this function (not “start”).

Parameters
  • contig – the chromosome name. If None, will use the first contig in the sequence dictionary.

  • pos – the 1-based position of the variant

  • id – the variant id

  • ref – the reference allele

  • alts – the list of alternate alleles, None if no alternates. If a single string is passed, that will be used as the only alt.

  • qual – the variant quality

  • filter – the list of filters, None if no filters (ex. PASS). If a single string is passed, that will be used as the only filter.

  • info – the dictionary of INFO key-value pairs

  • samples – the dictionary from sample name to FORMAT key-value pairs. if a sample property is supplied for any sample but omitted in some, it will be set to missing (“.”) for samples that don’t have that property explicitly assigned. If a sample in the VCF is omitted, all its properties will be set to missing.

add_filter_header(name: str, description: Optional[str] = None) None

Add a FILTER header field to the VCF header.

Parameters
  • name – the name of the field

  • description – the description of the field

add_format_header(name: str, field_type: fgpyo.vcf.builder.VcfFieldType, number: Union[int, fgpyo.vcf.builder.VcfFieldNumber] = VcfFieldNumber.NUM_GENOTYPES, description: Optional[str] = None) None

Add a FORMAT header field to the VCF header.

Parameters
  • name – the name of the field

  • field_type – the field_type of the field

  • number – the number of the field

  • description – the description of the field

add_header_line(line: str) None

Adds a header line to the header

add_info_header(name: str, field_type: fgpyo.vcf.builder.VcfFieldType, number: Union[int, fgpyo.vcf.builder.VcfFieldNumber] = 1, description: Optional[str] = None, source: Optional[str] = None, version: Optional[str] = None) None

Add an INFO header field to the VCF header.

Parameters
  • name – the name of the field

  • field_type – the field_type of the field

  • number – the number of the field

  • description – the description of the field

  • source – the source of the field

  • version – the version of the field
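
A sketch of combining these header helpers with add(); the field names, the values, and the VcfFieldType member names (FLOAT, INTEGER) are assumptions for illustration:

>>> from fgpyo.vcf.builder import VariantBuilder, VcfFieldType
>>> builder = VariantBuilder(sample_ids=["sample1"])
>>> builder.add_info_header("AF", field_type=VcfFieldType.FLOAT, description="Allele frequency")
>>> builder.add_format_header("DP", field_type=VcfFieldType.INTEGER, number=1, description="Read depth")
>>> rec = builder.add(pos=100, info={"AF": 0.25}, samples={"sample1": {"GT": "0/1", "DP": 31}})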

classmethod default_sd() Dict[str, Dict[str, Any]]

Generates the sequence dictionary that is used by default by VariantBuilder. Re-uses the dictionary from SamBuilder for consistency.

Returns

A new copy of the sequence dictionary as a map of contig name to dictionary, one per contig.

to_path(path: Optional[pathlib.Path] = None) pathlib.Path

Returns a path to a VCF for variants added to this builder.

Parameters

path – optional path to the VCF

to_sorted_list() List[pysam.libcbcf.VariantRecord]

Returns the accumulated records in coordinate order.

to_unsorted_list() List[pysam.libcbcf.VariantRecord]

Returns the accumulated records in the order they were created.

class fgpyo.vcf.builder.VcfFieldNumber(value)

Special codes for VCF field numbers

class fgpyo.vcf.builder.VcfFieldType(value)

Codes for VCF field types

FASTA files

Classes for generating fasta files and records for testing

This module contains utility classes for creating fasta files, indexed fasta files (.fai), and sequence dictionaries (.dict).

Examples of creating sets of contigs for writing to fasta

Writing a FASTA with two contigs each with 100 bases:

>>> import pathlib
>>> from fgpyo.fasta.builder import FastaBuilder
>>> builder = FastaBuilder()
>>> builder.add("chr10").add("AAAAAAAAAA", 10)
>>> builder.add("chr11").add("GGGGGGGGGG", 10)
>>> builder.to_file(path = pathlib.Path("test.fasta"))

Writing a FASTA with one contig with 100 A’s and 50 T’s:

>>> import pathlib
>>> from fgpyo.fasta.builder import FastaBuilder
>>> builder = FastaBuilder()
>>> builder.add("chr10").add("AAAAAAAAAA", 10).add("TTTTTTTTTT", 5)
>>> builder.to_file(path = pathlib.Path("test.fasta"))

Add bases to existing contig:

>>> from fgpyo.fasta.builder import FastaBuilder
>>> builder = FastaBuilder()
>>> contig_one = builder.add("chr10").add("AAAAAAAAAA", 1)
>>> contig_one.add("NNN", 1)
>>> contig_one.bases
'AAAAAAAAAANNN'

class fgpyo.fasta.builder.ContigBuilder(name: str, assembly: str, species: str)

Builder for constructing new contigs, and adding bases to existing contigs. Existing contigs cannot be overwritten; each contig name in FastaBuilder must be unique. Instances of ContigBuilder should be created using FastaBuilder.add(), where species and assembly are optional parameters and will default to FastaBuilder.assembly and FastaBuilder.species.

name

Unique contig ID, e.g., “chr10”

assembly

Assembly information, if None default is ‘testassembly’

species

Species information, if None default is ‘testspecies’

bases

The bases to be added to the contig, e.g. “A”

add(bases: str, times: int = 1) fgpyo.fasta.builder.ContigBuilder

Method for adding bases to a new or existing instance of ContigBuilder.

Parameters
  • bases – The bases to be added to the contig

  • times – The number of times the bases should be repeated

Example: add(“AAA”, 2) results in the bases “AAAAAA”.

class fgpyo.fasta.builder.FastaBuilder(assembly: str = 'testassembly', species: str = 'testspecies', line_length: int = 80)

Builder for constructing sets of one or more contigs.

Provides the ability to manufacture sets of contigs from minimal input, and automatically generates the information necessary for writing the FASTA file, index, and dictionary.

A builder is constructed from an assembly, species, and line length. All attributes have defaults; however, these can be overridden.

Contigs are added to FastaBuilder using FastaBuilder.add().

Bases are added to existing contigs using ContigBuilder.add().

Once accumulated the contigs can be written to a file using: to_file()

Calling to_file() will also generate the fasta index (.fai) and sequence dictionary (.dict).

assembly

Assembly information, if None default is ‘testassembly’

species

Species, if None default is ‘testspecies’

line_length

Desired line length, if None default is 80

contig_builders

Private dictionary of contig names and instances of ContigBuilder

add(name: str, assembly: Optional[str] = None, species: Optional[str] = None) fgpyo.fasta.builder.ContigBuilder

Creates and returns a new ContigBuilder for a contig with the provided name. Contig names must be unique; attempting to create two separate contigs with the same name will result in an error.

Parameters
  • name – Unique contig ID, e.g., “chr10”

  • assembly – Assembly information, if None default is ‘testassembly’

  • species – Species information, if None default is ‘testspecies’

to_file(path: pathlib.Path) None

Writes out the set of accumulated contigs to a FASTA file at the path given. Also generates the accompanying fasta index file (.fa.fai) and sequence dictionary file (.dict).

Contigs are emitted in the order they were added to the builder. Sequence lines in the FASTA file are wrapped to the line length given when the builder was constructed.

Parameters

path – Path to write files to.

Example: FastaBuilder.to_file(path = pathlib.Path(“my_fasta.fa”))

fgpyo.fasta.builder.pysam_dict(assembly: str, species: str, output_path: str, input_path: str) None

Calls pysam.dict and writes the sequence dictionary to the provided output path

Args

assembly: Assembly
species: Species
output_path: File path to write dictionary to
input_path: Path to fasta file

fgpyo.fasta.builder.pysam_faidx(input_path: str) None

Calls pysam.faidx and writes fasta index in the same file location as the fasta file

Args

input_path: Path to fasta file
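
A sketch of calling these helpers directly, assuming "test.fasta" already exists (e.g. written by FastaBuilder.to_file() above):

>>> from fgpyo.fasta.builder import pysam_dict, pysam_faidx
>>> pysam_faidx(input_path="test.fasta")  # writes test.fasta.fai next to the fasta
>>> pysam_dict(
...     assembly="testassembly",
...     species="testspecies",
...     output_path="test.dict",
...     input_path="test.fasta",
... )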

FASTX files

Zipping FASTX Files

Zipping a set of FASTA/FASTQ files into a single stream of data is a common task in bioinformatics and can be achieved with the FastxZipped context manager. The context manager facilitates opening of all input FASTA/FASTQ files and closing them after iteration is complete. For every iteration of FastxZipped, a tuple of the next FASTX records is returned (of type FastxRecord). An exception will be raised if any of the input files are malformed or truncated, or if record names are not equivalent and in sync.

Importantly, this context manager is optimized for fast streaming read-only usage and, by default, any previous records saved while advancing the iterator will not be correct as the underlying pointer in memory will refer to the most recent record only, and not any past records. To preserve the state of all previously iterated records, set the parameter persist to True.

>>> from fgpyo.fastx import FastxZipped
>>> with FastxZipped("r1.fq", "r2.fq", persist=False) as zipped:
...     for (r1, r2) in zipped:
...         print(f"{r1.name}: {r1.sequence}, {r2.name}: {r2.sequence}")
seq1: AAAA, seq1: CCCC
seq2: GGGG, seq2: TTTT

class fgpyo.fastx.FastxZipped(*paths: Union[pathlib.Path, str], persist: bool = False)

A context manager that will lazily zip over any number of FASTA/FASTQ files.

Parameters
  • paths – Paths to the FASTX files to zip over.

  • persist – Whether to persist the state of previous records during iteration.

close() None

Close the FastxZipped context manager by closing all FASTX files.

Metric files

Metrics

Module for storing, reading, and writing metric-like tab-delimited information.

Metric files are tab-delimited, contain a header, and zero or more rows for metric values. This makes it easy for them to be read in languages like R. For example, a row per person, with columns for age, gender, and address.

The Metric class makes it easy to read, write, and store one or more metrics of the same type, all the while preserving types for each value in a metric. It is an abstract base class; concrete metrics are decorated with @dataclass or @attr.s and have attributes storing one or more typed values. If using multiple layers of inheritance, keep in mind that it is not possible to mix these dataclass utilities, e.g. a dataclasses class derived from an attr class will not appropriately initialize the values of the attr superclass.

Examples

Defining a new metric class:

>>> from fgpyo.util.metric import Metric
>>> import dataclasses
>>> @dataclasses.dataclass(frozen=True)
... class Person(Metric["Person"]):
...     name: str
...     age: int

or using attr:

>>> from fgpyo.util.metric import Metric
>>> import attr
>>> @attr.s(auto_attribs=True, frozen=True)
... class Person(Metric["Person"]):
...     name: str
...     age: int

Getting the attributes for a metric class. These will be used for the header when reading and writing metric files.

>>> Person.header()
['name', 'age']

Getting the values from a metric class instance. The values are in the same order as the header.

>>> list(Person(name="Alice", age=47).values())
["Alice", 47]

Writing a list of metrics to a file:

>>> metrics = [
...     Person(name="Alice", age=47),
...     Person(name="Bob", age=24)
... ]
>>> from pathlib import Path
>>> Person.write(Path("/path/to/metrics.txt"), *metrics)

Then the contents of the written metrics file:

$ column -t /path/to/metrics.txt
name   age
Alice  47
Bob    24

Reading the metrics file back in:

>>> list(Person.read(Path("/path/to/metrics.txt")))
[Person(name='Alice', age=47), Person(name='Bob', age=24)]

Formatting and parsing the values for custom types is supported by overriding the _parsers() and format_value() methods.

>>> from typing import Any, Callable, Dict
>>> @dataclasses.dataclass(frozen=True)
... class Name:
...     first: str
...     last: str
...     @classmethod
...     def parse(cls, value: str) -> "Name":
...          fields = value.split(" ")
...          return Name(first=fields[0], last=fields[1])
>>> @dataclasses.dataclass(frozen=True)
... class Person(Metric["Person"]):
...     name: Name
...     age: int
...     @classmethod
...     def _parsers(cls) -> Dict[type, Callable[[str], Any]]:
...         return {Name: lambda value: Name.parse(value=value)}
...     @classmethod
...     def format_value(cls, value: Any) -> str:
...         if isinstance(value, (Name)):
...             return f"{value.first} {value.last}"
...         else:
...             return super().format_value(value=value)
>>> Person.parse(fields=["john doe", "42"])
Person(name=Name(first='john', last='doe'), age=42)
>>> Person(name=Name(first='john', last='doe'), age=42).formatted_values()
['john doe', '42']

class fgpyo.util.metric.Metric(*args, **kwds)

Abstract base class for all metric-like tab-delimited files

Metric files are tab-delimited, contain a header, and zero or more rows for metric values. This makes it easy for them to be read in languages like R.

Sub-classes of Metric can support parsing and formatting custom types with _parsers() and format_value().

classmethod format_value(value: Any) str

The default method to format values of a given type.

By default, this method will comma-delimit list, tuple, and set types, and apply str to all others.

Dictionaries / mappings will have keys and values separated by semicolons, and key-value pairs delimited by commas.

In addition, lists will be flanked with ‘[]’, tuples with ‘()’ and sets and dictionaries with ‘{}’

Parameters

value – the value to format.

formatted_values() List[str]

An iterator over formatted attribute values in the same order as the header.

classmethod header() List[str]

The list of header values for the metric.

classmethod parse(fields: List[str]) Any

Parses the string-representation of this metric. One string per attribute should be given.

classmethod read(path: pathlib.Path, ignore_extra_fields: bool = True) Iterator[Any]

Reads in zero or more metrics from the given path.

The metric file must contain a matching header.

Columns that are not present in the file but are optional in the metric class will be given their default values.

Parameters
  • path – the path to the metrics file.

  • ignore_extra_fields – True to ignore any extra columns, False to raise an exception.

classmethod to_list(value: str) List[Any]

Returns the value split into a list on the comma delimiter.

values() Iterator[Any]

An iterator over attribute values in the same order as the header.

classmethod write(path: pathlib.Path, *values: fgpyo.util.metric.MetricType) None

Writes zero or more metrics to the given path.

The header will always be written.

Parameters
  • path – Path to the output file.

  • values – Zero or more metrics.

fgpyo.util.inspect.attr_from(cls: Type[fgpyo.util.inspect._AttrFromType], kwargs: Dict[str, str], parsers: Optional[Dict[type, Callable[[str], Any]]] = None) fgpyo.util.inspect._AttrFromType

Builds an attr or dataclasses class from keyword arguments

Parameters
  • cls – the attr or dataclasses class to be built

  • kwargs – a dictionary of keyword arguments

  • parsers – a dictionary of parser functions to apply to specific types
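
A minimal sketch, assuming that built-in attribute types such as int are parsed from their string form automatically (the Point class is illustrative):

>>> import attr
>>> from fgpyo.util.inspect import attr_from
>>> @attr.s(auto_attribs=True, frozen=True)
... class Point:
...     x: int
...     y: int
>>> attr_from(cls=Point, kwargs={"x": "1", "y": "2"})
Point(x=1, y=2)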

See also

https://docs.python.org/3/library/dataclasses.html

Documentation for the dataclasses standard module

https://www.attrs.org/en/stable/examples.html

The attrs website for bringing back the joy to writing classes.

Read Structures

Classes for representing Read Structures

A Read Structure refers to a string that describes how the bases in a sequencing run should be allocated into logical reads. It serves a similar purpose to the --use-bases-mask option in Illumina’s bcl2fastq software, but provides some additional capabilities.

A Read Structure is a sequence of <number><operator> pairs or segments where, optionally, the last segment in the string is allowed to use + instead of a number for its length. The + translates to whatever bases are left after the other segments are processed and can be thought of as meaning [0..infinity].

See more at: https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures

Examples

>>> from fgpyo.read_structure import ReadStructure
>>> rs = ReadStructure.from_string("75T8B75T")
>>> [str(segment) for segment in rs]
['75T', '8B', '75T']
>>> rs[0]
ReadSegment(offset=0, length=75, kind=<SegmentType.Template: 'T'>)
>>> rs = rs.with_variable_last_segment()
>>> [str(segment) for segment in rs]
['75T', '8B', '+T']
>>> rs[-1]
ReadSegment(offset=83, length=None, kind=<SegmentType.Template: 'T'>)
>>> rs = ReadStructure.from_string("1B2M+T")
>>> [s.bases for s in rs.extract("A"*6)]
['A', 'AA', 'AAA']
>>> [s.bases for s in rs.extract("A"*5)]
['A', 'AA', 'AA']
>>> [s.bases for s in rs.extract("A"*4)]
['A', 'AA', 'A']
>>> [s.bases for s in rs.extract("A"*3)]
['A', 'AA', '']
>>> rs.template_segments()
(ReadSegment(offset=3, length=None, kind=<SegmentType.Template: 'T'>),)
>>> [str(segment) for segment in rs.template_segments()]
['+T']
>>> try:
...   ReadStructure.from_string("23T2TT23T")
... except ValueError as ex:
...   print(str(ex))
...
Read structure missing length information: 23T2T[T]23T

Module Contents

The module contains the following public classes:
  • ReadStructure – Describes the structure of a given read

  • ReadSegment – Describes all the information about a segment within a read structure

  • SegmentType – The type of segments that can show up in a read structure

  • SubReadWithoutQuals – Contains the bases that correspond to the given read segment

  • SubReadWithQuals – Contains the bases and qualities that correspond to the given read segment

fgpyo.read_structure.ANY_LENGTH_CHAR: str = '+'

A character that can be put in place of a number in a read structure to mean “0 or more bases”.

class fgpyo.read_structure.ReadSegment(*, offset: int, length: Optional[int], kind: fgpyo.read_structure.SegmentType)

Encapsulates all the information about a segment within a read structure. A segment can either have a definite length, in which case length must be an int, or an indefinite length (can be any length, 0 or more), in which case length must be None.

offset

The offset of the read segment in the read.

Type

int

length

The length of the segment, or None if it is variable length.

Type

Optional[int]

kind

The kind of read segment.

Type

fgpyo.read_structure.SegmentType

extract(bases: str) fgpyo.read_structure.SubReadWithoutQuals

Gets the bases associated with this read segment.

extract_with_quals(bases: str, quals: str) fgpyo.read_structure.SubReadWithQuals

Gets the bases and qualities associated with this read segment.

property fixed_length: int

The fixed length if there is one. Throws an exception on segments without fixed lengths!

property has_fixed_length: bool

True if the read segment has a defined length.

class fgpyo.read_structure.ReadStructure(*, segments: Tuple[fgpyo.read_structure.ReadSegment, ...])

Describes the structure of a given read. A read contains one or more read segments. A read segment describes a contiguous stretch of bases of the same type (ex. template bases) of some length and some offset from the start of the read.

segments

The segments composing the read structure

Type

Tuple[fgpyo.read_structure.ReadSegment, …]

extract(bases: str) Tuple[fgpyo.read_structure.SubReadWithoutQuals, ...]

Splits the given bases into sub-reads, each paired with its associated read segment.

extract_with_quals(bases: str, quals: str) Tuple[fgpyo.read_structure.SubReadWithQuals, ...]

Splits the given bases and qualities into sub-reads, each paired with its associated read segment.

property fixed_length: int

The fixed length if there is one. Throws an exception on segments without fixed lengths!

classmethod from_segments(segments: Tuple[fgpyo.read_structure.ReadSegment, ...], reset_offsets: bool = False) fgpyo.read_structure.ReadStructure

Creates a new ReadStructure, optionally resetting the offsets on each of the segments

property has_fixed_length: bool

True if the ReadStructure has a fixed (i.e. non-variable) length

property length: int

Length is defined as the number of segments (not bases!) in the read structure

segments_by_kind(kind: fgpyo.read_structure.SegmentType) Tuple[fgpyo.read_structure.ReadSegment, ...]

Returns just the segments of a given kind.

with_variable_last_segment() fgpyo.read_structure.ReadStructure

Generates a new ReadStructure that is the same as this one except that the last segment has undefined length
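
As a brief sketch following the doctest examples near the top of this section, extract_with_quals() and segments_by_kind() can be used together; the outputs below are inferred from the class descriptions above rather than taken from library output:

>>> from fgpyo.read_structure import ReadStructure, SegmentType
>>> rs = ReadStructure.from_string("2B+T")
>>> [(sr.bases, sr.quals) for sr in rs.extract_with_quals(bases="ACGTT", quals="IIIII")]
[('AC', 'II'), ('GTT', 'III')]
>>> [str(s) for s in rs.segments_by_kind(SegmentType.Template)]
['+T']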

class fgpyo.read_structure.SegmentType(value)

The type of segments that can show up in a read structure

MolecularBarcode = 'M'

The segment type for molecular barcode bases.

SampleBarcode = 'B'

The segment type for sample barcode bases.

Skip = 'S'

The segment type for bases that need to be skipped.

Template = 'T'

The segment type for template bases.

class fgpyo.read_structure.SubReadWithQuals(*, bases: str, quals: str, segment: ReadSegment)

Contains the bases and qualities that correspond to the given read segment

bases: str

The sub-read bases that correspond to the given read segment.

property kind: fgpyo.read_structure.SegmentType

The kind of read segment that corresponds to this sub-read.

quals: str

The sub-read base qualities that correspond to the given read segment.

segment: fgpyo.read_structure.ReadSegment

The segment of the read structure that describes this sub-read.

class fgpyo.read_structure.SubReadWithoutQuals(*, bases: str, segment: ReadSegment)

Contains the bases that correspond to the given read segment.

bases: str

The sub-read bases that correspond to the given read segment.

property kind: fgpyo.read_structure.SegmentType

The kind of read segment that corresponds to this sub-read.

segment: fgpyo.read_structure.ReadSegment

The segment of the read structure that describes this sub-read.

See also

https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures

The fgbio wiki for an explanation of Read Structures.

Manipulating DNA and RNA Sequences

Utility Functions for Manipulating DNA and RNA sequences.

This module contains utility functions for manipulating DNA and RNA sequences.

fgpyo.sequence.complement(base: str) str

Returns the complement of any base.

fgpyo.sequence.reverse_complement(bases: str) str

Reverse complements a base sequence.

Parameters

bases – the bases to be reverse complemented.

Returns

the reverse complement of the provided base string
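
As a small illustrative sketch of the two functions above (the results follow from standard Watson-Crick base pairing):

>>> from fgpyo.sequence import complement, reverse_complement
>>> complement("A")
'T'
>>> reverse_complement("AACC")
'GGTT'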

Collections

Functions for Working with Collections

This module contains classes and functions for working with collections and iterators.

Examples of a “Peekable” Iterator

“Peekable” iterators are useful for “peeking” at the next item in an iterator without consuming it. For example, this is useful when consuming items from an iterator while a predicate is true, without consuming the first element for which the predicate is false. See the takewhile() and dropwhile() methods.

Calling peek() on an empty peekable iterator raises StopIteration:

>>> from fgpyo.collections import PeekableIterator
>>> piter = PeekableIterator(iter([]))
>>> piter.peek()
StopIteration

A peekable iterator will return the next item before consuming it.

>>> piter = PeekableIterator([1, 2, 3])
>>> piter.peek()
1
>>> next(piter)
1
>>> [j for j in piter]
[2, 3]

The can_peek() function can be used to determine if the iterator can be peeked without StopIteration being thrown:

>>> piter = PeekableIterator([1])
>>> piter.peek() if piter.can_peek() else -1
1
>>> next(piter)
1
>>> piter.peek() if piter.can_peek() else -1
-1
>>> next(piter)
StopIteration

PeekableIterator’s constructor supports creation from iterable objects as well as iterators.

Module Contents

The module contains the following public classes:

  • PeekableIterator – an iterator that allows you to peek at the next value before calling next

class fgpyo.collections.PeekableIterator(source: Union[Iterator[fgpyo.collections.IterType], Iterable[fgpyo.collections.IterType]])

A peekable iterator wrapping an iterator or iterable.

This allows returning the next item without consuming it.

Parameters

source – an iterator over the objects

can_peek() bool

Returns true if there is a value that can be peeked at, false otherwise.

dropwhile(pred: Callable[[fgpyo.collections.IterType], bool]) fgpyo.collections.PeekableIterator[fgpyo.collections.IterType]

Drops elements from the iterator while the predicate is true.

Updates the iterator to point at the first non-matching element, or exhausts the iterator if all elements match the predicate.

Parameters

pred (Callable[[V], bool]) – a function that takes a value from the iterator and returns true or false.

Returns

a reference to this iterator, so calls can be chained

Return type

PeekableIterator[V]

peek() fgpyo.collections.IterType

Returns the next element without consuming it, raising StopIteration if the iterator is exhausted.

takewhile(pred: Callable[[fgpyo.collections.IterType], bool]) List[fgpyo.collections.IterType]

Consumes from the iterator while pred is true, and returns the result as a List.

The iterator is left pointing at the first non-matching item, or if all items match then the iterator will be exhausted.

Parameters

pred – a function that takes the next value from the iterator and returns true or false.

Returns

A list of the values from the iterator, in order, up until and excluding the first value that does not match the predicate.

Return type

List[V]
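
A short sketch of takewhile() and dropwhile(), following the method descriptions above:

>>> from fgpyo.collections import PeekableIterator
>>> piter = PeekableIterator([1, 2, 3, 4, 5])
>>> piter.takewhile(lambda x: x < 3)
[1, 2]
>>> piter.dropwhile(lambda x: x == 3).peek()
4
>>> [j for j in piter]
[4, 5]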

Logging

Methods for setting up logging for tools.

Progress Logging Examples

Input data (SAM/BAM/CRAM/VCF) are frequently iterated in genomic coordinate order. Progress logging is useful for reporting not only how many records have been consumed, but also their genomic coordinates. ProgressLogger can log progress after every fixed number of records, and can write either to a logging.Logger or to a custom print method.

>>> from fgpyo.util.logging import ProgressLogger
>>> logged_lines = []
>>> progress = ProgressLogger(
...     printer=lambda s: logged_lines.append(s),
...     verb="recorded",
...     noun="items",
...     unit=2
... )
>>> progress.record(reference_name="chr1", position=1)  # does not log
False
>>> progress.record(reference_name="chr1", position=2)  # logs
True
>>> progress.record(reference_name="chr1", position=3)  # does not log
False
>>> progress.log_last()  # will log the last recorded item, if not previously logged
True
>>> logged_lines  # show the lines logged
['recorded 2 items: chr1:2', 'recorded 3 items: chr1:3']

class fgpyo.util.logging.ProgressLogger(printer: Union[logging.Logger, Callable[[str], Any]], noun: str = 'records', verb: str = 'Read', unit: int = 100000)

A little class to track progress.

This will output a log message after every unit items recorded.

printer

either a Logger (in which case progress is logged at the INFO level) or a lambda that consumes a single string

noun

the noun to use in the log message

verb

the verb to use in the log message

unit

the number of items for every log message

count

the total count of items recorded

log_last() bool

Force logging the last record, for example when progress has completed.

record(reference_name: Optional[str] = None, position: Optional[int] = None) bool

Record an item at a given genomic coordinate.

Parameters
  • reference_name – the reference name of the item

  • position – the 1-based start position of the item

Returns

true if a message was logged, false otherwise

record_alignment(rec: pysam.libcalignedsegment.AlignedSegment) bool

Correctly record pysam.AlignedSegments (zero-based coordinates).

Parameters

rec – pysam.AlignedSegment object

Returns

true if a message was logged, false otherwise
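
As an illustrative sketch (the BAM path below is hypothetical), progress over a pysam alignment iterator might be tracked as follows:

>>> import logging
>>> from fgpyo.sam import reader
>>> from fgpyo.util.logging import ProgressLogger
>>> progress = ProgressLogger(printer=logging.getLogger("fgpyo"), verb="read", noun="alignments", unit=100000)
>>> with reader("/path/to/sample.bam") as fh:
...     for rec in fh:
...         logged: bool = progress.record_alignment(rec)
...
>>> _ = progress.log_last()  # log the final count if it was not already logged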

fgpyo.util.logging.setup_logging(level: str = 'INFO', name: str = 'fgpyo') None

Globally configure logging for all modules

Configures logging to run at a specific level and output messages to stderr with useful information preceding the actual log message.

Parameters
  • level – the default level for the logger

  • name – the name of the logger
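
A minimal sketch of configuring logging for fgpyo and then emitting a message through the standard logging module:

>>> import logging
>>> from fgpyo.util.logging import setup_logging
>>> setup_logging(level="INFO")
>>> logging.getLogger("fgpyo").info("pipeline started")  # written to stderr with the configured formatting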

IO

Module for reading and writing files

The functions in this module make it easy to:

  • check if a file exists and is writeable

  • check if a file and its parent directories exist and are writeable

  • check if a file exists and is readable

  • check if a path exists and is a directory

  • open an appropriate reader or writer based on the file extension

  • write items to a file, one per line

  • read lines from a file

fgpyo.io Examples:

>>> import fgpyo.io as fio
>>> from pathlib import Path

Assert that a path exists and is readable:

>>> path_flat: Path = Path("example.txt")
>>> path_compressed: Path = Path("example.txt.gz")
>>> fio.assert_path_is_readable(path_flat)
AssertionError: Cannot read non-existent path: example.txt
>>> fio.assert_path_is_readable(path_compressed)
AssertionError: Cannot read non-existent path: example.txt.gz

Write to and read from a path:

>>> fio.write_lines(path=path_flat, lines_to_write=["flat file", 10])
>>> fio.write_lines(path=path_compressed, lines_to_write=["gzip file", 10])

Read lines from a path into a generator:

>>> lines = fio.read_lines(path=path_flat)
>>> next(lines)
'flat file'
>>> next(lines)
'10'
>>> lines = fio.read_lines(path=path_compressed)
>>> next(lines)
'gzip file'
>>> next(lines)
'10'

fgpyo.io.assert_directory_exists(path: pathlib.Path) None

Asserts that a path exists and is a directory

Parameters

path – Path to check

Example

assert_directory_exists(path=Path("/example/directory/"))

fgpyo.io.assert_path_is_readable(path: pathlib.Path) None

Asserts that the path exists and is readable, raising an AssertionError otherwise

Parameters

path – a Path to check

Example

assert_path_is_readable(path=Path("some_file.csv"))

fgpyo.io.assert_path_is_writeable(path: pathlib.Path, parent_must_exist: bool = True) None

Asserts the following: if the file exists then it must also be writeable. Otherwise, if the path is not a file and parent_must_exist is true, then the parent directory must exist and be writeable. Otherwise, if the path is not a directory and parent_must_exist is false, then each parent directory is examined in turn until one is found that exists and is writeable.

Parameters
  • path – Path to check

  • parent_must_exist – True/False the parent directory must exist, the default is True

Example

assert_path_is_writeable(path=Path("example.txt"))

fgpyo.io.read_lines(path: pathlib.Path, strip: bool = False) Iterator[str]

Takes a path and reads each line into a generator, removing line terminators along the way. By default only line terminators (CR/LF) are stripped. The strip parameter may be used to strip both leading and trailing whitespace from each line.

Parameters
  • path – Path to read from

  • strip – True to strip lines of all leading and trailing whitespace, False to only remove trailing CR/LF characters.

Example

read_back = fio.read_lines(path)

fgpyo.io.redirect_to_dev_null(file_num: int) Generator[None, None, None]

A context manager that redirects the output of a file handle to /dev/null

Parameters

file_num – the number of the file handle (file descriptor) to redirect.

fgpyo.io.suppress_stderr() Generator[None, None, None]

A context manager that redirects output of stderr to /dev/null
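
A brief sketch of silencing stderr for a block of code, assuming the generator above is used as a context manager as its description suggests:

>>> import sys
>>> from fgpyo.io import suppress_stderr
>>> with suppress_stderr():
...     print("this goes to /dev/null", file=sys.stderr)
...
>>> print("stderr is restored here", file=sys.stderr)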

fgpyo.io.to_reader(path: pathlib.Path) Union[_io.TextIOWrapper, TextIO, IO[Any]]

Opens a Path for reading, using open() or gzip.open() based on the file extension

Parameters

path – Path to read from

Example

>>> reader = fio.to_reader(path = Path("reader.txt"))
>>> reader.readlines()
>>> reader.close()

fgpyo.io.to_writer(path: pathlib.Path, append: bool = False) Union[IO[Any], _io.TextIOWrapper]

Opens a Path for writing (or appending), using open() or gzip.open() based on the file extension

Parameters

path – Path to write (or append) to

Example

>>> writer = fio.to_writer(path=Path("writer.txt"))
>>> writer.write("something to write\n")
>>> writer.close()

fgpyo.io.write_lines(path: pathlib.Path, lines_to_write: Iterable[Any], append: bool = False) None

Writes (or appends) a file with one line per item in the provided iterable

Parameters
  • path – Path to write (or append) to

  • lines_to_write – items to write (or append) to file

Example

lines: List[Any] = ["things to write", 100]
path_to_write_to: Path = Path("file_to_write_to.txt")
fio.write_lines(path=path_to_write_to, lines_to_write=lines)

Utility Functions

class fgpyo.util.inspect.AttrsInstance(*args, **kwargs)
class fgpyo.util.inspect.DataclassesProtocol(*args, **kwargs)
fgpyo.util.inspect.FieldType

TypeAlias for dataclasses Fields or attrs Attributes. It corresponds to the correct field type for the corresponding _DataclassesOrAttrClass

alias of Union[dataclasses.Field, attr._make.Attribute]

exception fgpyo.util.inspect.ParserNotFoundException
fgpyo.util.inspect.attr_from(cls: Type[fgpyo.util.inspect._AttrFromType], kwargs: Dict[str, str], parsers: Optional[Dict[type, Callable[[str], Any]]] = None) fgpyo.util.inspect._AttrFromType

Builds an attr or dataclasses class from keyword arguments

Parameters
  • cls – the attr or dataclasses class to be built

  • kwargs – a dictionary of keyword arguments

  • parsers – a dictionary of parser functions to apply to specific types
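
A minimal sketch of building an instance from string keyword arguments. The Interval dataclass below is hypothetical (defined only for illustration), and parsing of the int fields from their string values is assumed to behave as the description above suggests:

>>> from dataclasses import dataclass
>>> from fgpyo.util.inspect import attr_from
>>> @dataclass
... class Interval:
...     chrom: str
...     start: int
...     end: int
...
>>> attr_from(cls=Interval, kwargs={"chrom": "chr1", "start": "1", "end": "10"})
Interval(chrom='chr1', start=1, end=10)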

fgpyo.util.inspect.dict_parser(cls: Type, type_: typing_extensions.TypeAlias, parsers: Optional[Dict[type, Callable[[str], Any]]] = None) functools.partial

Returns a function that parses a stringified dict into a Dict of the correct type.

Parameters
  • cls – the type of the class object this is being parsed for (used to get the default value for parsers)

  • type_ – the type of the attribute to be parsed

  • parsers – an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types)

fgpyo.util.inspect.get_fields(cls: Union[fgpyo.util.inspect.DataclassesProtocol, fgpyo.util.inspect.AttrsInstance, Type[Union[fgpyo.util.inspect.DataclassesProtocol, fgpyo.util.inspect.AttrsInstance]]]) Tuple[Union[dataclasses.Field, attr._make.Attribute], ...]

Get the fields tuple from either a dataclasses or attr dataclass (or instance)

fgpyo.util.inspect.get_fields_dict(cls: Union[fgpyo.util.inspect.DataclassesProtocol, fgpyo.util.inspect.AttrsInstance, Type[Union[fgpyo.util.inspect.DataclassesProtocol, fgpyo.util.inspect.AttrsInstance]]]) Mapping[str, Union[dataclasses.Field, attr._make.Attribute]]

Get the fields dict from either a dataclasses or attr dataclass (or instance)

fgpyo.util.inspect.is_attr_class(cls: type) bool

Return True if the class is an attr class, and False otherwise

fgpyo.util.inspect.list_parser(cls: Type, type_: typing_extensions.TypeAlias, parsers: Optional[Dict[type, Callable[[str], Any]]] = None) functools.partial

Returns a function that parses a stringified list into a List of the correct type.

Parameters
  • cls – the type of the class object this is being parsed for (used to get the default value for parsers)

  • type_ – the type of the attribute to be parsed

  • parsers – an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types)

fgpyo.util.inspect.set_parser(cls: Type, type_: typing_extensions.TypeAlias, parsers: Optional[Dict[type, Callable[[str], Any]]] = None) functools.partial

Returns a function that parses a stringified set into a Set of the correct type.

Parameters
  • cls – the type of the class object this is being parsed for (used to get the default value for parsers)

  • type_ – the type of the attribute to be parsed

  • parsers – an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types)

fgpyo.util.inspect.split_at_given_level(field: str, split_delim: str = ',', increase_depth_chars: Iterable[str] = ('{', '(', '['), decrease_depth_chars: Iterable[str] = ('}', ')', ']')) List[str]

Splits a nested field by its outer-most level

Note that this method may produce incorrect results for fields containing strings with unpaired characters that increase or decrease the depth.

Not currently smart enough to deal with fields enclosed in quotes ('' or "") - TODO
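
A brief sketch of the behavior described above; the output shown is inferred from the description rather than taken from library output:

>>> from fgpyo.util.inspect import split_at_given_level
>>> split_at_given_level("a,[b,c],d")
['a', '[b,c]', 'd']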

fgpyo.util.inspect.tuple_parser(cls: Type, type_: typing_extensions.TypeAlias, parsers: Optional[Dict[type, Callable[[str], Any]]] = None) functools.partial

Returns a function that parses a stringified tuple into a Tuple of the correct type.

Parameters
  • cls – the type of the class object this is being parsed for (used to get the default value for parsers)

  • type_ – the type of the attribute to be parsed

  • parsers – an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types)

fgpyo.util.string.column_it(rows: List[List[str]], delimiter: str = ' ') str

A simple version of Unix’s column utility. This assumes the table is NxM.

Parameters
  • rows – the rows to adjust. Each row must have the same number of delimited fields.

  • delimiter – the delimiter for each field in a row.
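
A brief sketch of column_it(); the exact padding and alignment of the result depend on the implementation, so no output is shown:

>>> from fgpyo.util.string import column_it
>>> rows = [
...     ["contig", "length"],
...     ["chr1", "249250621"],
...     ["chr2", "243199373"],
... ]
>>> table: str = column_it(rows=rows, delimiter=" ")  # one delimited, width-aligned line per row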

exception fgpyo.util.types.InspectException
fgpyo.util.types.is_constructible_from_str(type_: type) bool

Returns true if the provided type can be constructed from a string

fgpyo.util.types.is_list_like(type_: type) bool

Returns true if the value is a list or list-like object

fgpyo.util.types.make_enum_parser(enum: Type[fgpyo.util.types.EnumType]) functools.partial

Makes a parser function for enum classes

fgpyo.util.types.make_literal_parser(literal: Type[fgpyo.util.types.LiteralType], parsers: Iterable[Callable[[str], fgpyo.util.types.LiteralType]]) functools.partial

Generates a parser function for a Literal type object, given a set of parsers for the possible values of that Literal type

fgpyo.util.types.make_union_parser(union: Type[fgpyo.util.types.UnionType], parsers: Iterable[Callable[[str], type]]) functools.partial

Generates a parser function for a Union type object, given a set of parsers for the possible member types of that Union

fgpyo.util.types.none_parser(value: str) Literal[None]

Returns None if the value is ‘None’, else raises an error

fgpyo.util.types.parse_bool(string: str) bool

Parses strings into bools accounting for the many different text representations of bools that can be used
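
A brief sketch, assuming that common textual representations such as "true" and "False" are among those accepted:

>>> from fgpyo.util.types import parse_bool
>>> parse_bool("true")
True
>>> parse_bool("False")
False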