FGPYO
API
SAM/BAM/CRAM files
Utility Classes and Methods for SAM/BAM
This module contains utility classes for working with SAM/BAM files and the data contained within them. This includes i) utilities for opening SAM/BAM files for reading and writing, ii) functions for manipulating supplementary alignments, iii) classes and functions for manipulating CIGAR strings, and iv) a class for building SAM records and files for testing.
Motivation for Reader and Writer methods
The following are the reasons for choosing to implement methods to open a SAM/BAM file for
reading and writing, rather than relying on pysam.AlignmentFile
directly:
Provides a centralized place for the implementation of opening a SAM/BAM for reading and writing. This is useful if any additional parameters are added, or changes to standards or defaults are made.
Makes the requirement to provide a header when opening a file for writing more explicit.
Adds support for Path.
Removes the reliance on specifying the mode correctly, including specifying the file type (i.e. SAM, BAM, or CRAM), as well as additional options (ex. compression level). This makes the code more explicit and easier to read.
An explicit check is performed to ensure the file type is specified when writing using a file-like object rather than a path to a file.
Examples of Opening a SAM/BAM for Reading or Writing
Opening a SAM/BAM file for reading, auto-recognizing the file-type by the file extension. See
SamFileType
for the supported file types.
>>> from fgpyo.sam import reader
>>> with reader("/path/to/sample.sam") as fh:
... for record in fh:
... print(record.name) # do something
>>> with reader("/path/to/sample.bam") as fh:
... for record in fh:
... print(record.name) # do something
Opening a SAM/BAM file for reading, explicitly passing the file type.
>>> from fgpyo.sam import SamFileType
>>> with reader(path="/path/to/sample.ext1", file_type=SamFileType.SAM) as fh:
... for record in fh:
... print(record.name) # do something
>>> with reader(path="/path/to/sample.ext2", file_type=SamFileType.BAM) as fh:
... for record in fh:
... print(record.name) # do something
Opening a SAM/BAM file for reading, using an existing file-like object
>>> with open("/path/to/sample.sam", "rb") as file_object:
... with reader(path=file_object, file_type=SamFileType.SAM) as fh:
... for record in fh:
... print(record.name) # do something
Opening a SAM/BAM file for writing follows similarly to the reader() method, but the SAM file header object is required.
>>> from typing import Any, Dict
>>> from fgpyo.sam import writer
>>> header: Dict[str, Any] = {
... "HD": {"VN": "1.5", "SO": "coordinate"},
... "RG": [{"ID": "1", "SM": "1_AAAAAA", "LB": "lib", "PL": "ILLUMINA", "PU": "xxx.1"}],
... "SQ": [
... {"SN": "chr1", "LN": 249250621},
... {"SN": "chr2", "LN": 243199373}
... ]
... }
>>> with writer(path="/path/to/sample.bam", header=header) as fh:
... pass # do something
Examples of Manipulating Cigars
Creating a Cigar
from a pysam.AlignedSegment
.
>>> from fgpyo.sam import Cigar
>>> with reader("/path/to/sample.sam") as fh:
... record = next(fh)
... cigar = Cigar.from_cigartuples(record.cigartuples)
... print(str(cigar))
50M2D5M10S
>>> cigar = Cigar.from_cigarstring("50M2D5M10S")
>>> print(str(cigar))
50M2D5M10S
If the cigar string is invalid, the exception message will show you the problem character(s) in square brackets.
>>> cigar = Cigar.from_cigarstring("10M5U")
Traceback (most recent call last):
...
CigarParsingException: Malformed cigar: 10M5[U]
The cigar contains a tuple of CigarElements, each with an operator (CigarOp) and associated operator length. A number of useful methods are part of both classes.
The number of bases aligned on the query (i.e. the number of bases consumed by the cigar from the query):
>>> cigar = Cigar.from_cigarstring("50M2D5M2I10S")
>>> [e.length_on_query for e in cigar.elements]
[50, 0, 5, 2, 10]
>>> [e.length_on_target for e in cigar.elements]
[50, 2, 5, 0, 0]
>>> [e.operator.is_indel for e in cigar.elements]
[False, True, False, True, False]
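The consumes-query / consumes-reference semantics shown above can be sketched in plain Python using the SAM-spec operator table; this is an illustrative stand-in (names are hypothetical), not fgpyo's classes.

```python
import re

# Which operators consume query vs. reference bases, per the SAM spec.
CONSUMES_QUERY = set("MIS=X")      # M, I, S, =, X advance the query
CONSUMES_REFERENCE = set("MDN=X")  # M, D, N, =, X advance the reference

def cigar_elements(cigar: str):
    """Split a cigar string into (length, operator) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def length_on_query(length: int, op: str) -> int:
    return length if op in CONSUMES_QUERY else 0

def length_on_target(length: int, op: str) -> int:
    return length if op in CONSUMES_REFERENCE else 0

elements = cigar_elements("50M2D5M2I10S")
print([length_on_query(n, op) for n, op in elements])   # [50, 0, 5, 2, 10]
print([length_on_target(n, op) for n, op in elements])  # [50, 2, 5, 0, 0]
```

Note that deletions (D) contribute length only on the target, while insertions (I) and soft clips (S) contribute length only on the query.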
Examples of parsing the SA tag and individual supplementary alignments
>>> from fgpyo.sam import SupplementaryAlignment
>>> sup = SupplementaryAlignment.parse("chr1,123,+,50S100M,60,0")
>>> sup.reference_name
'chr1'
>>> sup.nm
0
>>> from typing import List
>>> sa_tag = "chr1,123,+,50S100M,60,0;chr2,456,-,75S75M,60,1"
>>> sups: List[SupplementaryAlignment] = SupplementaryAlignment.parse_sa_tag(tag=sa_tag)
>>> len(sups)
2
>>> [str(sup.cigar) for sup in sups]
['50S100M', '75S75M']
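The SA-tag format being parsed above can be sketched in plain Python. This is a minimal illustration of the format itself, not fgpyo's implementation; converting the 1-based SA position to a 0-based start is an assumption here.

```python
from typing import List, NamedTuple

class SA(NamedTuple):
    # Fields mirror one "rname,pos,strand,CIGAR,mapQ,NM" entry in the SA tag.
    reference_name: str
    start: int        # 0-based (assumption: the 1-based SA position is shifted)
    is_forward: bool
    cigar: str
    mapq: int
    nm: int

def parse_sa(value: str) -> SA:
    rname, pos, strand, cigar, mapq, nm = value.split(",")
    return SA(rname, int(pos) - 1, strand == "+", cigar, int(mapq), int(nm))

def parse_sa_tag(tag: str) -> List[SA]:
    # The SA tag conventionally ends with ';', so skip empty trailing fields.
    return [parse_sa(value) for value in tag.split(";") if value]

sups = parse_sa_tag("chr1,123,+,50S100M,60,0;chr2,456,-,75S75M,60,1;")
print(len(sups), sups[0].reference_name, sups[1].nm)  # 2 chr1 1
```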
Module Contents
The module contains the following public classes:
- SupplementaryAlignment – Stores a supplementary alignment record produced by BWA and stored in the SA SAM tag.
- SamFileType – Enumeration of valid SAM/BAM/CRAM file types.
- SamOrder – Enumeration of possible SAM/BAM/CRAM sort orders.
- CigarOp – Enumeration of operators that can appear in a Cigar string.
- CigarElement – Class representing an element in a Cigar string.
- CigarParsingException – The exception raised specific to parsing a cigar.
- Cigar – Class representing a cigar string.
- ReadEditInfo – Class describing how a read differs from the reference.
The module contains the following methods:
- class fgpyo.sam.Cigar(elements: Tuple[fgpyo.sam.CigarElement, ...] = ())
Class representing a cigar string.
- elements
zero or more cigar elements
- Type
Tuple[CigarElement, …]
- classmethod from_cigarstring(cigarstring: str) fgpyo.sam.Cigar
Constructs a Cigar from a string returned by pysam.
If “*” is given, returns an empty Cigar.
- classmethod from_cigartuples(cigartuples: Optional[List[Tuple[int, int]]]) fgpyo.sam.Cigar
Returns a Cigar from a list of tuples returned by pysam.
Each tuple denotes the operation and length. See
CigarOp
for more information on the various operators. If None is given, returns an empty Cigar.
- reversed() fgpyo.sam.Cigar
Returns a copy of the Cigar with the elements in reverse order.
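The element reversal performed by reversed() can be illustrated on a cigar string directly (a hypothetical one-liner, e.g. useful when a read maps to the opposite strand; not fgpyo's implementation):

```python
import re

def reverse_cigar(cigar: str) -> str:
    # Reverse the order of the elements, keeping each element intact.
    return "".join(reversed(re.findall(r"\d+[MIDNSHP=X]", cigar)))

print(reverse_cigar("50M2D5M10S"))  # 10S5M2D50M
```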
- class fgpyo.sam.CigarElement(length: int, operator: fgpyo.sam.CigarOp)
Represents an element in a Cigar
- length
the length of the element
- Type
int
- operator
the operator of the element
- Type
CigarOp
- class fgpyo.sam.CigarOp(value)
Enumeration of operators that can appear in a Cigar string.
- D = (2, 'D', False, True)
Deletion versus the reference
- EQ = (7, '=', True, True)
Matches the reference
- H = (5, 'H', False, False)
Hard clip
- I = (1, 'I', True, False)
Insertion versus the reference
- M = (0, 'M', True, True)
Match or Mismatch the reference
- N = (3, 'N', False, True)
Skipped region from the reference
- P = (6, 'P', False, False)
Padding
- S = (4, 'S', True, False)
Soft clip
- X = (8, 'X', True, True)
Mismatches the reference
- static from_character(character: str) fgpyo.sam.CigarOp
Returns the operator from the single character.
- static from_code(code: int) fgpyo.sam.CigarOp
Returns the operator from the given operator code.
Note: this is mainly used to get the operator from
pysam
.
- exception fgpyo.sam.CigarParsingException
The exception raised specific to parsing a cigar.
- class fgpyo.sam.ReadEditInfo(matches: int, mismatches: int, insertions: int, inserted_bases: int, deletions: int, deleted_bases: int, nm: int)
Counts various stats about how a read compares to a reference sequence.
- mismatches
the number of mismatches between the read sequence and the reference sequence as dictated by the alignment. As defined for the SAM NM tag computation, any base except A/C/G/T in the read is considered a mismatch.
- Type
int
- insertions
the number of insertions in the read vs. the reference, i.e. the number of I operators in the CIGAR string.
- Type
int
- deletions
the number of deletions in the read vs. the reference, i.e. the number of D operators in the CIGAR string.
- Type
int
- deleted_bases
the total number of bases that are deleted within the alignment (i.e. bases in the reference but not in the read).
- Type
int
- class fgpyo.sam.SamFileType(value)
Enumeration of valid SAM/BAM/CRAM file types.
- classmethod from_path(path: Union[pathlib.Path, str]) fgpyo.sam.SamFileType
Infers the file type based on the file extension.
- Parameters
path – the path to the SAM/BAM/CRAM to read or write.
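Extension-based inference like from_path() can be sketched as a small lookup; the names below are illustrative stand-ins (the real SamFileType enum also carries the pysam mode character for each type).

```python
from pathlib import Path

# Hypothetical mapping from file extension to file-type name.
EXTENSION_TO_TYPE = {".sam": "SAM", ".bam": "BAM", ".cram": "CRAM"}

def file_type_from_path(path) -> str:
    """Infer the SAM/BAM/CRAM file type from the file extension."""
    ext = Path(path).suffix.lower()
    if ext not in EXTENSION_TO_TYPE:
        raise ValueError(f"Could not infer file type from extension: {ext!r}")
    return EXTENSION_TO_TYPE[ext]

print(file_type_from_path("/path/to/sample.bam"))  # BAM
```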
- class fgpyo.sam.SamOrder(value)
Enumerations of possible sort orders for a SAM file.
- Coordinate = 'coordinate'
coordinate sorted
- QueryName = 'queryname'
queryname sorted
- Unsorted = 'unsorted'
the SAM / BAM / CRAM is unsorted
- fgpyo.sam.SamPath
The valid base classes for opening a SAM/BAM/CRAM file.
alias of Union[IO[Any], pathlib.Path, str]
- class fgpyo.sam.SupplementaryAlignment(reference_name: str, start: int, is_forward: bool, cigar: fgpyo.sam.Cigar, mapq: int, nm: int)
Stores a supplementary alignment record produced by BWA and stored in the SA SAM tag.
- cigar
the cigar for the alignment
- Type
Cigar
- classmethod from_read(read: pysam.libcalignedsegment.AlignedSegment) List[fgpyo.sam.SupplementaryAlignment]
Construct a list of SupplementaryAlignments from the SA tag in a pysam.AlignedSegment.
- Parameters
read – An alignment. The presence of the “SA” tag is not required.
- Returns
A list of all SupplementaryAlignments present in the SA tag. If the SA tag is not present, or it is empty, an empty list will be returned.
- static parse(string: str) fgpyo.sam.SupplementaryAlignment
Returns a supplementary alignment parsed from the given string. The various fields should be comma-delimited (ex. chr1,123,-,100M50S,60,4)
- static parse_sa_tag(tag: str) List[fgpyo.sam.SupplementaryAlignment]
Parses an SA tag of supplementary alignments from a BAM file. If the tag is empty or contains just a single semi-colon then an empty list will be returned. Otherwise a list containing a SupplementaryAlignment per ;-separated value in the tag will be returned.
- class fgpyo.sam.Template(name: str, r1: Optional[pysam.libcalignedsegment.AlignedSegment], r2: Optional[pysam.libcalignedsegment.AlignedSegment], r1_supplementals: List[pysam.libcalignedsegment.AlignedSegment], r2_supplementals: List[pysam.libcalignedsegment.AlignedSegment], r1_secondaries: List[pysam.libcalignedsegment.AlignedSegment], r2_secondaries: List[pysam.libcalignedsegment.AlignedSegment])
A container for alignment records corresponding to a single sequenced template or insert.
It is strongly preferred that new Template instances be created with Template.build(), which will ensure that reads are stored in the correct Template property and which runs basic validations of the Template by default. If constructing Template instances directly, users are encouraged to use the validate method post-construction.
- r1
Primary alignment for read 1, or None if there is none
- Type
Optional[pysam.libcalignedsegment.AlignedSegment]
- r2
Primary alignment for read 2, or None if there is none
- Type
Optional[pysam.libcalignedsegment.AlignedSegment]
- r1_supplementals
Supplementary alignments for read 1
- Type
List[pysam.libcalignedsegment.AlignedSegment]
- r2_supplementals
Supplementary alignments for read 2
- Type
List[pysam.libcalignedsegment.AlignedSegment]
- r1_secondaries
Secondary (non-primary) alignments for read 1
- Type
List[pysam.libcalignedsegment.AlignedSegment]
- r2_secondaries
Secondary (non-primary) alignments for read 2
- Type
List[pysam.libcalignedsegment.AlignedSegment]
- all_recs() Iterator[pysam.libcalignedsegment.AlignedSegment]
Returns an iterator over all the records for the template.
- static build(recs: Iterable[pysam.libcalignedsegment.AlignedSegment], validate: bool = True) fgpyo.sam.Template
Build a template from a set of records all with the same queryname.
- static iterator(alns: Iterator[pysam.libcalignedsegment.AlignedSegment]) Iterator[fgpyo.sam.Template]
Returns an iterator over templates. Assumes the input iterable is queryname grouped, and gathers consecutive runs of records sharing a common query name into templates.
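The gathering of consecutive same-name runs described above can be sketched with itertools.groupby; records here are stand-in (name, flag) tuples rather than pysam.AlignedSegment, and the helper name is illustrative.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

Rec = Tuple[str, int]  # stand-in for an alignment record: (query name, flag)

def gather_templates(recs: Iterable[Rec]) -> Iterator[Tuple[str, List[Rec]]]:
    # groupby only groups *consecutive* equal keys, which is why the input
    # must be queryname grouped, as stated above.
    for name, group in groupby(recs, key=lambda rec: rec[0]):
        yield name, list(group)

recs = [("q1", 0), ("q1", 1), ("q2", 0)]
print([(name, len(group)) for name, group in gather_templates(recs)])
```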
- primary_recs() Iterator[pysam.libcalignedsegment.AlignedSegment]
Returns an iterator over all the primary records for the template.
- class fgpyo.sam.TemplateIterator(iterator: Iterator[pysam.libcalignedsegment.AlignedSegment])
An iterator that converts an iterator over query-grouped reads into an iterator over templates.
- fgpyo.sam.calculate_edit_info(rec: pysam.libcalignedsegment.AlignedSegment, reference_sequence: str, reference_offset: Optional[int] = None) fgpyo.sam.ReadEditInfo
Constructs a ReadEditInfo instance giving summary stats about how the read aligns to the reference. Computes the number of mismatches, indels, indel bases and the SAM NM tag. The read must be aligned.
- Parameters
rec – the read/record for which to calculate values
reference_sequence – the reference sequence (or fragment thereof) that the read is aligned to
reference_offset – if provided, assume that reference_sequence[reference_offset] is the first base aligned to in reference_sequence, otherwise use r.reference_start
- Returns
a ReadEditInfo with information about how the read differs from the reference
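The mismatch counting can be sketched for the simplest case, assuming an ungapped, all-M alignment (the real calculate_edit_info also handles indels, clipping, and the NM tag; the helper name is hypothetical):

```python
def count_mismatches(read: str, ref: str, reference_offset: int = 0) -> int:
    aligned = ref[reference_offset : reference_offset + len(read)]
    # Per the NM-tag convention, any read base other than A/C/G/T counts
    # as a mismatch even if it equals the reference base.
    return sum(1 for r, t in zip(read, aligned) if r != t or r not in "ACGT")

print(count_mismatches("ACNT", "ACGT"))  # 1
```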
- fgpyo.sam.isize(r1: pysam.libcalignedsegment.AlignedSegment, r2: pysam.libcalignedsegment.AlignedSegment) int
Computes the insert size for a pair of records.
- fgpyo.sam.reader(path: Union[IO[Any], pathlib.Path, str], file_type: Optional[fgpyo.sam.SamFileType] = None, unmapped: bool = False) pysam.libcalignmentfile.AlignmentFile
Opens a SAM/BAM/CRAM for reading.
- Parameters
path – a file handle or path to the SAM/BAM/CRAM to read or write.
file_type – the file type to assume when opening the file. If None, then the file type will be auto-detected.
unmapped – True if the file is unmapped and has no sequence dictionary, False otherwise.
- fgpyo.sam.set_pair_info(r1: pysam.libcalignedsegment.AlignedSegment, r2: pysam.libcalignedsegment.AlignedSegment, proper_pair: bool = True) None
Resets mate pair information between reads in a pair. Requires that both r1 and r2 are mapped. Can be handed reads that already have pairing flags setup or independent R1 and R2 records that are currently flagged as SE reads.
- Parameters
r1 – read 1
r2 – read 2 with the same queryname as r1
- fgpyo.sam.writer(path: Union[IO[Any], pathlib.Path, str], header: Union[str, Dict[str, Any], pysam.libcalignmentfile.AlignmentHeader], file_type: Optional[fgpyo.sam.SamFileType] = None) pysam.libcalignmentfile.AlignmentFile
Opens a SAM/BAM/CRAM for writing.
- Parameters
path – a file handle or path to the SAM/BAM/CRAM to read or write.
header – Either a string to use for the header or a multi-level dictionary. The multi-level dictionary should be given as follows. The first level are the four types (‘HD’, ‘SQ’, …). The second level are a list of lines, with each line being a list of tag-value pairs. The header is constructed first from all the defined fields, followed by user tags in alphabetical order.
file_type – the file type to assume when opening the file. If None, then the filetype will be auto-detected and must be a path-like object.
Classes for generating SAM and BAM files and records for testing
This module contains utility classes for the generation of SAM and BAM files and alignment records, for use in testing.
The module contains the following public classes:
SamBuilder
– A builder class that allows the accumulation of alignment records and access as a list and writing to file.
- class fgpyo.sam.builder.SamBuilder(r1_len: Optional[int] = None, r2_len: Optional[int] = None, base_quality: int = 30, mapping_quality: int = 60, sd: Optional[List[Dict[str, Any]]] = None, rg: Optional[Dict[str, str]] = None, extra_header: Optional[Dict[str, Any]] = None, seed: int = 42, sort_order: fgpyo.sam.SamOrder = SamOrder.Coordinate)
Builder for constructing one or more sam records (AlignmentSegments in pysam terms).
Provides the ability to manufacture records from minimal arguments, while generating any remaining attributes to ensure a valid record.
A builder is constructed with a handful of defaults including lengths for generated R1s and R2s, the default base quality score to use, a sequence dictionary and a single read group.
Records are then added using the add_pair() method. Once accumulated, the records can be accessed in the order in which they were created through the to_unsorted_list() function, or in a list sorted by coordinate order via to_sorted_list(). The latter creates a temporary file to do the sorting and is somewhat slower as a result. Lastly, the records can be written to a temporary file using to_path().
- add_pair(*, name: Optional[str] = None, bases1: Optional[str] = None, bases2: Optional[str] = None, quals1: Optional[List[int]] = None, quals2: Optional[List[int]] = None, chrom: Optional[str] = None, chrom1: Optional[str] = None, chrom2: Optional[str] = None, start1: int = - 1, start2: int = - 1, cigar1: Optional[str] = None, cigar2: Optional[str] = None, mapq1: Optional[int] = None, mapq2: Optional[int] = None, strand1: str = '+', strand2: str = '-', attrs: Optional[Dict[str, Any]] = None) Tuple[pysam.libcalignedsegment.AlignedSegment, pysam.libcalignedsegment.AlignedSegment]
Generates a new pair of reads, adds them to the internal collection, and returns them.
Most fields are optional.
Mapped pairs can be created by specifying both start1 and start2 and either chrom, for pairs where both reads map to the same contig, or both chrom1 and chrom2, for pairs where reads map to different contigs. i.e.:
add_pair(chrom, start1, start2) will create a mapped pair where both reads map to the same contig (chrom).
add_pair(chrom1, start1, chrom2, start2) will create a mapped pair where the reads map to different contigs (chrom1 and chrom2).
A pair with only one of the two reads mapped can be created by setting only one start position. Flags will automatically be set correctly for the unmapped mate.
add_pair(chrom, start1)
add_pair(chrom1, start1)
add_pair(chrom, start2)
add_pair(chrom2, start2)
An unmapped pair can be created by calling the method with no parameters (specifically, not setting chrom, chrom1, start1, chrom2, or start2). If either cigar is provided, it will be ignored.
For a given read (i.e. R1 or R2) the length of the read is determined based on the presence or absence of bases, quals, and cigar. If values are provided for one or more of these parameters, the lengths must match, and the length will be used to generate any unsupplied values. If none of bases, quals, and cigar are provided, all three will be synthesized based on either the r1_len or r2_len stored on the class as appropriate.
When synthesizing, bases are always a random sequence of bases, quals are all the default base quality (supplied when constructing a SamBuilder) and the cigar is always a single M operator of the read length.
- Parameters
name – The name of the template. If None is given a unique name will be auto-generated.
bases1 – The bases for R1. If None is given a random sequence is generated.
bases2 – The bases for R2. If None is given a random sequence is generated.
quals1 – The list of int qualities for R1. If None, the default base quality is used.
quals2 – The list of int qualities for R2. If None, the default base quality is used.
chrom – The chromosome to which both reads are mapped. Defaults to the unmapped value.
chrom1 – The chromosome to which R1 is mapped. If None, chrom is used.
chrom2 – The chromosome to which R2 is mapped. If None, chrom is used.
start1 – The start position of R1. Defaults to the unmapped value.
start2 – The start position of R2. Defaults to the unmapped value.
cigar1 – The cigar string for R1. Defaults to None for unmapped reads, otherwise all M.
cigar2 – The cigar string for R2. Defaults to None for unmapped reads, otherwise all M.
mapq1 – Mapping quality for R1. Defaults to self.mapping_quality if None.
mapq2 – Mapping quality for R2. Defaults to self.mapping_quality if None.
strand1 – The strand for R1, either “+” or “-”. Defaults to “+”.
strand2 – The strand for R2, either “+” or “-”. Defaults to “-“.
attrs – An optional dictionary of SAM attributes to place on both R1 and R2.
- Raises
ValueError – if either strand field is not “+” or “-”
ValueError – if bases/quals/cigar are set in a way that is not self-consistent
- Returns
The pair of records created, R1 then R2.
- Return type
Tuple[AlignedSegment, AlignedSegment]
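The synthesis rules described above (random bases, uniform default base quality, and a single-M cigar of the read length) can be sketched as follows; the helper name and defaults are illustrative, not SamBuilder's internals.

```python
import random

def synthesize(read_len: int, base_quality: int = 30, seed: int = 42):
    rng = random.Random(seed)  # seeded so generated test data is reproducible
    bases = "".join(rng.choice("ACGT") for _ in range(read_len))
    quals = [base_quality] * read_len
    cigar = f"{read_len}M"
    return bases, quals, cigar

bases, quals, cigar = synthesize(10)
print(len(bases), quals[0], cigar)  # 10 30 10M
```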
- add_single(*, name: Optional[str] = None, read_num: Optional[int] = None, bases: Optional[str] = None, quals: Optional[List[int]] = None, chrom: str = '*', start: int = - 1, cigar: Optional[str] = None, mapq: Optional[int] = None, strand: str = '+', secondary: bool = False, supplementary: bool = False, attrs: Optional[Dict[str, Any]] = None) pysam.libcalignedsegment.AlignedSegment
Generates a new single read, adds it to the internal collection, and returns it.
Most fields are optional.
If read_num is None (the default) an unpaired read will be created. If read_num is set to 1 or 2, the read will have its paired flag and read number flags set.
An unmapped read can be created by calling the method with no parameters (specifically, not setting chrom or start). If cigar is provided, it will be ignored.
A mapped read is created by providing chrom and start.
The length of the read is determined based on the presence or absence of bases, quals, and cigar. If values are provided for one or more of these parameters, the lengths must match, and the length will be used to generate any unsupplied values. If none of bases, quals, and cigar are provided, all three will be synthesized based on either the r1_len or r2_len stored on the class as appropriate.
When synthesizing, bases are always a random sequence of bases, quals are all the default base quality (supplied when constructing a SamBuilder) and the cigar is always a single M operator of the read length.
- Parameters
name – The name of the template. If None is given a unique name will be auto-generated.
read_num – Either None, 1 for R1 or 2 for R2
bases – The bases for the read. If None is given a random sequence is generated.
quals – The list of qualities for the read. If None, the default base quality is used.
chrom – The chromosome to which the read is mapped. Defaults to the unmapped value.
start – The start position of the read. Defaults to the unmapped value.
cigar – The cigar string for the read. Defaults to None for unmapped reads, otherwise all M.
mapq – Mapping quality for the read. Defaults to self.mapping_quality if not given.
strand – The strand for the read, either “+” or “-”. Defaults to “+”.
secondary – If true the read will be flagged as secondary.
supplementary – If true the read will be flagged as supplementary.
attrs – An optional dictionary of SAM attributes to place on the read.
- Raises
ValueError – if strand field is not “+” or “-”
ValueError – if read_num is not None, 1 or 2
ValueError – if bases/quals/cigar are set in a way that is not self-consistent
- Returns
The record created
- Return type
AlignedSegment
- static default_rg() Dict[str, str]
Returns the default read group used by the SamBuilder, as a dictionary.
- static default_sd() List[Dict[str, Any]]
Generates the sequence dictionary that is used by default by SamBuilder.
Matches the names and lengths of the HG19 reference in use in production.
- Returns
A new copy of the sequence dictionary as a list of dictionaries, one per chromosome.
- property header: pysam.libcalignmentfile.AlignmentHeader
Returns the builder’s SAM header.
- to_path(path: Optional[pathlib.Path] = None, index: bool = True, pred: Callable[[pysam.libcalignedsegment.AlignedSegment], bool] = <function SamBuilder.<lambda>>) pathlib.Path
Writes the accumulated records to a file, sorts &amp; indexes it, and returns the Path. If a path is provided, it will be written to, otherwise a temporary file is created and returned.
- Parameters
path – a path at which to write the file, otherwise a temp file is used.
index – if True and sort_order is Coordinate index is generated, otherwise not.
pred – optional predicate to specify which reads should be output
- Returns
The path to the sorted (and possibly indexed) file.
- Return type
Path
- to_sorted_list() List[pysam.libcalignedsegment.AlignedSegment]
Returns the accumulated records in coordinate order.
- to_unsorted_list() List[pysam.libcalignedsegment.AlignedSegment]
Returns the accumulated records in the order they were created.
Utility Functions for Soft-Clipping records in SAM/BAM Files
This module contains utility functions for soft-clipping reads. There are four variants that support clipping the beginnings and ends of reads, and specifying the amount to be clipped in terms of query bases or reference bases:
softclip_start_of_alignment_by_query()
clips the start of the alignment in terms of query bases
softclip_end_of_alignment_by_query()
clips the end of the alignment in terms of query bases
softclip_start_of_alignment_by_ref()
clips the start of the alignment in terms of reference bases
softclip_end_of_alignment_by_ref()
clips the end of the alignment in terms of reference bases
The difference between query and reference based versions is apparent only when there are insertions or deletions in the read as indels have lengths on either the query (insertions) or reference (deletions) but not both.
Upon clipping, a set of additional SAM tags is removed from reads, as they are likely invalid.
For example, to clip the last 10 query bases of all records and reduce the qualities to Q2:
>>> from fgpyo.sam import reader, clipping
>>> with reader("/path/to/sample.sam") as fh:
... for rec in fh:
... clipping.softclip_end_of_alignment_by_query(rec, 10, 2)
... print(rec.cigarstring)
It should be noted that any clipping potentially makes the common SAM tags NM, MD and UQ invalid, as well as potentially other alignment based SAM tags. Any clipping added to the start of an alignment changes the position (reference_start) of the record. Any reads that have no aligned bases after clipping are set to be unmapped. If writing the clipped reads back to a BAM it should be noted that:
Mate pairs may have incorrect information about their mate’s positions
Even if the input was coordinate sorted, the output may be out of order
To rectify these problems it is necessary to do the equivalent of:
cat clipped.bam | samtools sort -n | samtools fixmate | samtools sort | samtools calmd
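The query-versus-reference distinction above can be made concrete with a small cigar walk (a hypothetical helper, not fgpyo's implementation): to clip a given number of reference bases from the start of an alignment, insertions add clipped query bases while deletions add none.

```python
import re

def query_bases_for_ref_clip(cigar: str, ref_bases: int) -> int:
    """Count query bases consumed while consuming `ref_bases` reference bases,
    walking the cigar from the start of the alignment."""
    query = 0
    for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(n)
        if op in "MDN=X":  # operator consumes reference bases
            take = min(n, ref_bases)
            ref_bases -= take
            if op in "MIS=X":  # operator also consumes query bases
                query += take
            if ref_bases == 0:
                return query
        elif op in "MIS=X":
            query += n  # insertions/soft clips consume query but no reference
    return query

# Clipping 5 reference bases of 2M3I8M consumes 2 + 3 + 3 = 8 query bases.
print(query_bases_for_ref_clip("2M3I8M", 5))  # 8
```

By contrast, clipping 5 reference bases of 2M3D8M consumes only 2 query bases, since the deletion spans 3 reference bases with no query bases at all.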
- class fgpyo.sam.clipping.ClippingInfo(query_bases_clipped: int, ref_bases_clipped: int)
Named tuple holding the number of bases clipped on the query and reference respectively.
- fgpyo.sam.clipping.softclip_end_of_alignment_by_query(rec: pysam.libcalignedsegment.AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = ('MD', 'NM', 'UQ')) fgpyo.sam.clipping.ClippingInfo
Adds soft-clipping to the end of a read’s alignment.
Clipping is applied before any existing hard or soft clipping. E.g. a read with cigar 100M5S that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.
If the read is unmapped or bases_to_clip < 1 then nothing is done.
If the read has fewer clippable bases than requested the read will be unmapped.
- Parameters
rec – the BAM record to clip
bases_to_clip – the number of additional bases of clipping desired in the read/query
clipped_base_quality – if not None, set bases in the clipped region to this quality
tags_to_invalidate – the set of extended attributes to remove upon clipping
- Returns
a named tuple containing the number of query/read bases and the number of target/reference bases clipped.
- Return type
ClippingInfo
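The 100M5S to 90M15S example above can be sketched as simple cigar arithmetic, assuming a plain xM[yS] cigar (the real function handles arbitrary cigars, indels, and clipped base qualities; the helper is hypothetical):

```python
import re

def softclip_end(cigar: str, bases_to_clip: int) -> str:
    """Fold `bases_to_clip` query bases into the trailing soft clip."""
    m = re.fullmatch(r"(\d+)M(?:(\d+)S)?", cigar)
    matched, soft = int(m.group(1)), int(m.group(2) or 0)
    return f"{matched - bases_to_clip}M{soft + bases_to_clip}S"

print(softclip_end("100M5S", 10))  # 90M15S
```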
- fgpyo.sam.clipping.softclip_end_of_alignment_by_ref(rec: pysam.libcalignedsegment.AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = ('MD', 'NM', 'UQ')) fgpyo.sam.clipping.ClippingInfo
Soft-clips the end of an alignment by bases_to_clip bases on the reference.
Clipping is applied before any existing hard or soft clipping. E.g. a read with cigar 100M5S that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.
If the read is unmapped or bases_to_clip < 1 then nothing is done.
If the read has fewer clippable bases than requested the read will be unmapped.
- Parameters
rec – the BAM record to clip
bases_to_clip – the number of additional bases of clipping desired on the reference
clipped_base_quality – if not None, set bases in the clipped region to this quality
tags_to_invalidate – the set of extended attributes to remove upon clipping
- Returns
a named tuple containing the number of query/read bases and the number of target/reference bases clipped.
- Return type
ClippingInfo
- fgpyo.sam.clipping.softclip_start_of_alignment_by_query(rec: pysam.libcalignedsegment.AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = ('MD', 'NM', 'UQ')) fgpyo.sam.clipping.ClippingInfo
Adds soft-clipping to the start of a read’s alignment.
Clipping is applied after any existing hard or soft clipping. E.g. a read with cigar 5S100M that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.
If the read is unmapped or bases_to_clip < 1 then nothing is done.
If the read has fewer clippable bases than requested the read will be unmapped.
- Parameters
rec – the BAM record to clip
bases_to_clip – the number of additional bases of clipping desired in the read/query
clipped_base_quality – if not None, set bases in the clipped region to this quality
tags_to_invalidate – the set of extended attributes to remove upon clipping
- Returns
a named tuple containing the number of query/read bases and the number of target/reference bases clipped.
- Return type
ClippingInfo
- fgpyo.sam.clipping.softclip_start_of_alignment_by_ref(rec: pysam.libcalignedsegment.AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = ('MD', 'NM', 'UQ')) fgpyo.sam.clipping.ClippingInfo
Soft-clips the start of an alignment by bases_to_clip bases on the reference.
Clipping is applied after any existing hard or soft clipping. E.g. a read with cigar 5S100M that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.
If the read is unmapped or bases_to_clip < 1 then nothing is done.
If the read has fewer clippable bases than requested the read will be unmapped.
- Parameters
rec – the BAM record to clip
bases_to_clip – the number of additional bases of clipping desired on the reference
clipped_base_quality – if not None, set bases in the clipped region to this quality
tags_to_invalidate – the set of extended attributes to remove upon clipping
- Returns
a named tuple containing the number of query/read bases and the number of target/reference bases clipped.
- Return type
ClippingInfo
VCF/BCF files
Classes for generating VCF and records for testing
This module contains utility classes for the generation of VCF files and variant records, for use in testing.
The module contains the following public classes:
VariantBuilder
– A builder class that allows the accumulation of variant records and access as a list and writing to file.
Examples
Typically, we have VariantRecord
records obtained from reading from a VCF file.
The VariantBuilder
class builds such records.
Variants are added with the add()
method, which
returns a pysam.VariantRecord
.
>>> import pysam
>>> from fgpyo.vcf.builder import VariantBuilder
>>> builder: VariantBuilder = VariantBuilder()
>>> new_record_1: pysam.VariantRecord = builder.add() # uses the defaults
>>> new_record_2: pysam.VariantRecord = builder.add(
...     contig="chr2", pos=1001, id="rs1234", ref="C", alts=["T"],
...     qual=40, filter=["PASS"]
... )
VariantBuilder can create sites-only, single-sample, or multi-sample VCF files. If not producing a sites-only VCF file, VariantBuilder must be created by passing a list of sample IDs:
>>> builder: VariantBuilder = VariantBuilder(sample_ids=["sample1", "sample2"])
>>> new_record_1: pysam.VariantRecord = builder.add() # uses the defaults
>>> new_record_2: pysam.VariantRecord = builder.add(
...     samples={"sample1": {"GT": "0|1"}, "sample2": {"GT": "0|0"}}
... )
The variants stored in the builder can be retrieved as a coordinate sorted VCF file via the
to_path()
method:
>>> from pathlib import Path
>>> path_to_vcf: Path = builder.to_path()
The variants may also be retrieved in the order they were added via the
to_unsorted_list()
method and in coordinate sorted
order via the to_sorted_list()
method.
- fgpyo.vcf.reader(path: Union[pathlib.Path, str, TextIO]) Generator[pysam.libcbcf.VariantFile, None, None]
Opens the given path for VCF reading
- Parameters
path – the path to a VCF, or an open file handle
- fgpyo.vcf.writer(path: Union[pathlib.Path, str, TextIO], header: pysam.libcbcf.VariantHeader) Generator[pysam.libcbcf.VariantFile, None, None]
Opens the given path for VCF writing.
- Parameters
path – the path to a VCF, or an open file handle
header – the source for the output VCF header. If you are modifying a VCF file that you are reading from, you can pass reader.header
Classes for generating VCF files and records for testing
- class fgpyo.vcf.builder.VariantBuilder(sample_ids: Optional[Iterable[str]] = None, sd: Optional[Dict[str, Dict[str, Any]]] = None)
Builder for constructing one or more variant records (pysam.VariantRecord) for a VCF. The VCF can be sites-only, single-sample, or multi-sample.
Provides the ability to manufacture variants from minimal arguments, while generating any remaining attributes to ensure a valid variant.
A builder is constructed with a handful of defaults, including the sample name and sequence dictionary. If the VCF will not be sites-only, the list of sample IDs ("sample_ids") must be provided to the VariantBuilder constructor.
Variants are then added using the
add()
method. Once accumulated, the variants can be accessed in the order in which they were created through the to_unsorted_list()
function, or in a list sorted by coordinate order via to_sorted_list()
. Lastly, the records can be written to a temporary file using to_path()
.
- sd
sequence dictionary, implemented as python dict from contig name to dictionary with contig properties. At a minimum, each contig dict in sd must contain “ID” (the same as contig_name) and “length”, the contig length. Other values will be added to the VCF header line for that contig.
- records
the list of variant records
- Type
List[pysam.libcbcf.VariantRecord]
- header
the pysam header
- Type
pysam.libcbcf.VariantHeader
- add(contig: Optional[str] = None, pos: int = 1000, id: str = '.', ref: str = 'A', alts: Union[None, str, Iterable[str]] = ('.',), qual: int = 60, filter: Union[None, str, Iterable[str]] = None, info: Optional[Dict[str, Any]] = None, samples: Optional[Dict[str, Dict[str, Any]]] = None) pysam.libcbcf.VariantRecord
Generates a new variant and adds it to the internal collection.
Notes:
* Very little validation is done with respect to INFO and FORMAT keys being defined in the header.
* VCFs are 1-based, but pysam is (mostly) 0-based. We define the function in terms of the VCF property "pos", which is 1-based. pysam will also report "pos" as 1-based, so that is the property that should be accessed when using the records produced by this function (not "start").
- Parameters
contig – the chromosome name. If None, will use the first contig in the sequence dictionary.
pos – the 1-based position of the variant
id – the variant id
ref – the reference allele
alts – the list of alternate alleles, None if no alternates. If a single string is passed, that will be used as the only alt.
qual – the variant quality
filter – the list of filters, None if no filters (ex. PASS). If a single string is passed, that will be used as the only filter.
info – the dictionary of INFO key-value pairs
samples – the dictionary from sample name to FORMAT key-value pairs. If a sample property is supplied for some samples but omitted in others, it will be set to missing (".") for the samples that don't have it explicitly assigned. If a sample in the VCF is omitted entirely, all of its properties will be set to missing.
- add_filter_header(name: str, description: Optional[str] = None) None
Add a FILTER header field to the VCF header.
- Parameters
name – the name of the field
description – the description of the field
- add_format_header(name: str, field_type: fgpyo.vcf.builder.VcfFieldType, number: Union[int, fgpyo.vcf.builder.VcfFieldNumber] = VcfFieldNumber.NUM_GENOTYPES, description: Optional[str] = None) None
Add a FORMAT header field to the VCF header.
- Parameters
name – the name of the field
field_type – the field_type of the field
number – the number of the field
description – the description of the field
- add_info_header(name: str, field_type: fgpyo.vcf.builder.VcfFieldType, number: Union[int, fgpyo.vcf.builder.VcfFieldNumber] = 1, description: Optional[str] = None, source: Optional[str] = None, version: Optional[str] = None) None
Add an INFO header field to the VCF header.
- Parameters
name – the name of the field
field_type – the field_type of the field
number – the number of the field
description – the description of the field
source – the source of the field
version – the version of the field
- classmethod default_sd() Dict[str, Dict[str, Any]]
Generates the sequence dictionary that is used by default by VariantBuilder. Re-uses the dictionary from SamBuilder for consistency.
- Returns
A new copy of the sequence dictionary as a map of contig name to dictionary, one per contig.
- to_path(path: Optional[pathlib.Path] = None) pathlib.Path
Returns a path to a VCF for variants added to this builder.
- Parameters
path – optional path to the VCF
- to_sorted_list() List[pysam.libcbcf.VariantRecord]
Returns the accumulated records in coordinate order.
- to_unsorted_list() List[pysam.libcbcf.VariantRecord]
Returns the accumulated records in the order they were created.
- class fgpyo.vcf.builder.VcfFieldNumber(value)
Special codes for VCF field numbers
- class fgpyo.vcf.builder.VcfFieldType(value)
Codes for VCF field types
FASTA files
Classes for generating fasta files and records for testing
This module contains utility classes for creating fasta files, indexed fasta files (.fai), and sequence dictionaries (.dict).
Examples of creating sets of contigs for writing to fasta
Writing a FASTA with two contigs each with 100 bases:
>>> import pathlib
>>> from fgpyo.fasta.builder import FastaBuilder
>>> builder = FastaBuilder()
>>> builder.add("chr10").add("AAAAAAAAAA", 10)
>>> builder.add("chr11").add("GGGGGGGGGG", 10)
>>> builder.to_file(path = pathlib.Path("test.fasta"))
Writing a FASTA with one contig with 100 A’s and 50 T’s:
>>> import pathlib
>>> from fgpyo.fasta.builder import FastaBuilder
>>> builder = FastaBuilder()
>>> builder.add("chr10").add("AAAAAAAAAA", 10).add("TTTTTTTTTT", 5)
>>> builder.to_file(path = pathlib.Path("test.fasta"))
Add bases to existing contig:
>>> from fgpyo.fasta.builder import FastaBuilder
>>> builder = FastaBuilder()
>>> contig_one = builder.add("chr10").add("AAAAAAAAAA", 1)
>>> contig_one.add("NNN", 1)
>>> contig_one.bases
'AAAAAAAAAANNN'
- class fgpyo.fasta.builder.ContigBuilder(name: str, assembly: str, species: str)
Builder for constructing new contigs and adding bases to existing contigs. Existing contigs cannot be overwritten; each contig name in FastaBuilder must be unique. Instances of ContigBuilder should be created using FastaBuilder.add(), where species and assembly are optional parameters that default to FastaBuilder.species and FastaBuilder.assembly.
- name
Unique contig ID, e.g. "chr10"
- assembly
Assembly information, if None default is ‘testassembly’
- species
Species information, if None default is ‘testspecies’
- bases
The bases to be added to the contig, e.g. "A"
- add(bases: str, times: int = 1) fgpyo.fasta.builder.ContigBuilder
Method for adding bases to a new or existing instance of ContigBuilder.
- Parameters
bases – The bases to be added to the contig
times – The number of times the bases should be repeated
Example: add("AAA", 2) results in the bases "AAAAAA"
- class fgpyo.fasta.builder.FastaBuilder(assembly: str = 'testassembly', species: str = 'testspecies', line_length: int = 80)
Builder for constructing sets of one or more contigs.
Provides the ability to manufacture sets of contigs from minimal input, and automatically generates the information necessary for writing the FASTA file, index, and dictionary.
A builder is constructed from an assembly, species, and line length. All attributes have defaults; however, these can be overwritten.
Contigs are added to FastaBuilder using:
add()
Bases are added to existing contigs using:
add()
Once accumulated the contigs can be written to a file using:
to_file()
Calling to_file() will also generate the fasta index (.fai) and sequence dictionary (.dict).
- assembly
Assembly information, if None default is ‘testassembly’
- species
Species, if None default is ‘testspecies’
- line_length
Desired line length, if None default is 80
- contig_builders
Private dictionary of contig names and instances of ContigBuilder
- add(name: str, assembly: Optional[str] = None, species: Optional[str] = None) fgpyo.fasta.builder.ContigBuilder
Creates and returns a new ContigBuilder for a contig with the provided name. Contig names must be unique; attempting to create two separate contigs with the same name will result in an error.
- Parameters
name – Unique contig ID, e.g. "chr10"
assembly – Assembly information, if None default is ‘testassembly’
species – Species information, if None default is ‘testspecies’
- to_file(path: pathlib.Path) None
Writes out the set of accumulated contigs to a FASTA file at the path given. Also generates the accompanying fasta index file (.fa.fai) and sequence dictionary file (.dict).
Contigs are emitted in the order they were added to the builder. Sequence lines in the FASTA file are wrapped to the line length given when the builder was constructed.
- Parameters
path – Path to write files to.
Example: FastaBuilder.to_file(path = pathlib.Path(“my_fasta.fa”))
FASTX files
Zipping FASTX Files
Zipping a set of FASTA/FASTQ files into a single stream of data is a common task in bioinformatics
and can be achieved with the FastxZipped
context manager. The context manager
facilitates opening of all input FASTA/FASTQ files and closing them after iteration is complete.
For every iteration of FastxZipped
, a tuple of the next FASTX records are
returned (of type FastxRecord
). An exception will be raised if any of the input
files are malformed or truncated, or if record names are out of sync.
Importantly, this context manager is optimized for fast streaming read-only usage and, by default,
any previous records saved while advancing the iterator will not be correct as the underlying
pointer in memory will refer to the most recent record only, and not any past records. To preserve
the state of all previously iterated records, set the parameter persist
to True
.
>>> from fgpyo.fastx import FastxZipped
>>> with FastxZipped("r1.fq", "r2.fq", persist=False) as zipped:
... for (r1, r2) in zipped:
... print(f"{r1.name}: {r1.sequence}, {r2.name}: {r2.sequence}")
seq1: AAAA, seq1: CCCC
seq2: GGGG, seq2: TTTT
- class fgpyo.fastx.FastxZipped(*paths: Union[pathlib.Path, str], persist: bool = False)
A context manager that will lazily zip over any number of FASTA/FASTQ files.
- Parameters
paths – Paths to the FASTX files to zip over.
persist – Whether to persist the state of previous records during iteration.
- close() None
Close the
FastxZipped
context manager by closing all FASTX files.
Metric files
Metrics
Module for storing, reading, and writing metric-like tab-delimited information.
Metric files are tab-delimited, contain a header, and zero or more rows for metric values. This makes it easy for them to be read in languages like R. For example, a row per person, with columns for age, gender, and address.
The Metric
class makes it easy to read, write, and store one or more metrics
of the same type, all the while preserving types for each value in a metric. It is an abstract
base class decorated by @dataclass, or
@attr.s, with attributes storing one or more
typed values. If using multiple layers of inheritance, keep in mind that it’s not possible to mix
these dataclass utils, e.g. a dataclasses class derived from an attr class will not appropriately
initialize the values of the attr superclass.
Examples
Defining a new metric class:
>>> from fgpyo.util.metric import Metric
>>> import dataclasses
>>> @dataclasses.dataclass(frozen=True)
... class Person(Metric["Person"]):
... name: str
... age: int
or using attr:
>>> from fgpyo.util.metric import Metric
>>> import attr
>>> @attr.s(auto_attribs=True, frozen=True)
... class Person(Metric["Person"]):
... name: str
... age: int
Getting the attributes for a metric class. These will be used for the header when reading and writing metric files.
>>> Person.header()
['name', 'age']
Getting the values from a metric class instance. The values are in the same order as the header.
>>> list(Person(name="Alice", age=47).values())
["Alice", 47]
Writing a list of metrics to a file:
>>> metrics = [
... Person(name="Alice", age=47),
... Person(name="Bob", age=24)
... ]
>>> from pathlib import Path
>>> Person.write(Path("/path/to/metrics.txt"), *metrics)
Then the contents of the written metrics file:
$ column -t /path/to/metrics.txt
name age
Alice 47
Bob 24
Reading the metrics file back in:
>>> list(Person.read(Path("/path/to/metrics.txt")))
[Person(name='Alice', age=47), Person(name='Bob', age=24)]
Formatting and parsing the values for custom types is supported by overriding the
_parsers()
and
format_value()
methods.
>>> @dataclasses.dataclass(frozen=True)
... class Name:
... first: str
... last: str
... @classmethod
... def parse(cls, value: str) -> "Name":
... fields = value.split(" ")
... return Name(first=fields[0], last=fields[1])
>>> from typing import Any, Callable, Dict
>>> @dataclasses.dataclass(frozen=True)
... class Person(Metric["Person"]):
...     name: Name
...     age: int
...     @classmethod
...     def _parsers(cls) -> Dict[type, Callable[[str], Any]]:
...         return {Name: lambda value: Name.parse(value=value)}
... @classmethod
... def format_value(cls, value: Any) -> str:
... if isinstance(value, (Name)):
... return f"{value.first} {value.last}"
... else:
... return super().format_value(value=value)
>>> Person.parse(fields=["john doe", "42"])
Person(name=Name(first='john', last='doe'), age=42)
>>> Person(name=Name(first='john', last='doe'), age=42).formatted_values()
['john doe', '42']
- class fgpyo.util.metric.Metric(*args, **kwds)
Abstract base class for all metric-like tab-delimited files
Metric files are tab-delimited, contain a header, and zero or more rows for metric values. This makes it easy for them to be read in languages like R.
Sub-classes of
Metric
can support parsing and formatting custom types with _parsers()
and format_value()
.
- classmethod format_value(value: Any) str
The default method to format values of a given type.
By default, this method will comma-delimit list, tuple, and set types, and apply str to all others.
Dictionaries / mappings will have keys and values separated by semicolons, and key-value pairs delimited by commas.
In addition, lists will be flanked with ‘[]’, tuples with ‘()’ and sets and dictionaries with ‘{}’
- Parameters
value – the value to format.
- formatted_values() List[str]
An iterator over formatted attribute values in the same order as the header.
- classmethod parse(fields: List[str]) Any
Parses the string-representation of this metric. One string per attribute should be given.
- classmethod read(path: pathlib.Path, ignore_extra_fields: bool = True) Iterator[Any]
Reads in zero or more metrics from the given path.
The metric file must contain a matching header.
Columns that are not present in the file but are optional in the metric class will be set to their default values.
- Parameters
path – the path to the metrics file.
ignore_extra_fields – True to ignore any extra columns, False to raise an exception.
- values() Iterator[Any]
An iterator over attribute values in the same order as the header.
- classmethod write(path: pathlib.Path, *values: fgpyo.util.metric.MetricType) None
Writes zero or more metrics to the given path.
The header will always be written.
- Parameters
path – Path to the output file.
values – Zero or more metrics.
- fgpyo.util.inspect.attr_from(cls: Type[fgpyo.util.inspect._AttrFromType], kwargs: Dict[str, str], parsers: Optional[Dict[type, Callable[[str], Any]]] = None) fgpyo.util.inspect._AttrFromType
Builds an attr or dataclasses class from key-word arguments
- Parameters
cls – the attr or dataclasses class to be built
kwargs – a dictionary of keyword arguments
parsers – a dictionary of parser functions to apply to specific types
See also
- https://docs.python.org/3/library/dataclasses.html
Documentation for the dataclasses standard module
https://www.attrs.org/en/stable/examples.html
The attrs website for bringing back the joy to writing classes.
Read Structures
Classes for representing Read Structures
A Read Structure refers to a String that describes how the bases in a sequencing run should be allocated into logical reads. It serves a similar purpose to the --use-bases-mask option in Illumina's bcl2fastq software, but provides some additional capabilities.
A Read Structure is a sequence of <number><operator> pairs or segments where, optionally, the last segment in the string is allowed to use + instead of a number for its length. The + translates to whatever bases are left after the other segments are processed and can be thought of as meaning [0..infinity].
See more at: https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures
Examples
>>> from fgpyo.read_structure import ReadStructure
>>> rs = ReadStructure.from_string("75T8B75T")
>>> [str(segment) for segment in rs]
['75T', '8B', '75T']
>>> rs[0]
ReadSegment(offset=0, length=75, kind=<SegmentType.Template: 'T'>)
>>> rs = rs.with_variable_last_segment()
>>> [str(segment) for segment in rs]
['75T', '8B', '+T']
>>> rs[-1]
ReadSegment(offset=83, length=None, kind=<SegmentType.Template: 'T'>)
>>> rs = ReadStructure.from_string("1B2M+T")
>>> [s.bases for s in rs.extract("A"*6)]
['A', 'AA', 'AAA']
>>> [s.bases for s in rs.extract("A"*5)]
['A', 'AA', 'AA']
>>> [s.bases for s in rs.extract("A"*4)]
['A', 'AA', 'A']
>>> [s.bases for s in rs.extract("A"*3)]
['A', 'AA', '']
>>> rs.template_segments()
(ReadSegment(offset=3, length=None, kind=<SegmentType.Template: 'T'>),)
>>> [str(segment) for segment in rs.template_segments()]
['+T']
>>> try:
... ReadStructure.from_string("23T2TT23T")
... except ValueError as ex:
... print(str(ex))
...
Read structure missing length information: 23T2T[T]23T
Module Contents
The module contains the following public classes:
ReadStructure
– Describes the structure of a given read
ReadSegment
– Describes all the information about a segment within a read structure
SegmentType
– The type of segments that can show up in a read structure
SubReadWithoutQuals
– Contains the bases that correspond to the given read segment
SubReadWithQuals
– Contains the bases and qualities that correspond to the given read segment
- fgpyo.read_structure.ANY_LENGTH_CHAR: str = '+'
A character that can be put in place of a number in a read structure to mean “0 or more bases”.
- class fgpyo.read_structure.ReadSegment(*, offset: int, length: Optional[int], kind: fgpyo.read_structure.SegmentType)
Encapsulates all the information about a segment within a read structure. A segment can either have a definite length, in which case length must be an int, or an indefinite length (can be any length, 0 or more), in which case length must be None.
- kind
The kind of read segment.
- extract(bases: str) fgpyo.read_structure.SubReadWithoutQuals
Gets the bases associated with this read segment.
- extract_with_quals(bases: str, quals: str) fgpyo.read_structure.SubReadWithQuals
Gets the bases and qualities associated with this read segment.
- class fgpyo.read_structure.ReadStructure(*, segments: Tuple[fgpyo.read_structure.ReadSegment, ...])
Describes the structure of a given read. A read contains one or more read segments. A read segment describes a contiguous stretch of bases of the same type (e.g. template bases) of some length and some offset from the start of the read.
- segments
The segments composing the read structure
- Type
Tuple[fgpyo.read_structure.ReadSegment, …]
- extract(bases: str) Tuple[fgpyo.read_structure.SubReadWithoutQuals, ...]
Splits the given bases into tuples with its associated read segment.
- extract_with_quals(bases: str, quals: str) Tuple[fgpyo.read_structure.SubReadWithQuals, ...]
Splits the given bases and qualities into triples with its associated read segment.
- property fixed_length: int
The fixed length if there is one. Throws an exception on segments without fixed lengths!
- classmethod from_segments(segments: Tuple[fgpyo.read_structure.ReadSegment, ...], reset_offsets: bool = False) fgpyo.read_structure.ReadStructure
Creates a new ReadStructure, optionally resetting the offsets on each of the segments
- property length: int
Length is defined as the number of segments (not bases!) in the read structure
- segments_by_kind(kind: fgpyo.read_structure.SegmentType) Tuple[fgpyo.read_structure.ReadSegment, ...]
Returns just the segments of a given kind.
- with_variable_last_segment() fgpyo.read_structure.ReadStructure
Generates a new ReadStructure that is the same as this one except that the last segment has undefined length
- class fgpyo.read_structure.SegmentType(value)
The type of segments that can show up in a read structure
- MolecularBarcode = 'M'
The segment type for molecular barcode bases.
- SampleBarcode = 'B'
The segment type for sample barcode bases.
- Skip = 'S'
The segment type for bases that need to be skipped.
- Template = 'T'
The segment type for template bases.
- class fgpyo.read_structure.SubReadWithQuals(*, bases: str, quals: str, segment: ReadSegment)
Contains the bases and qualities that correspond to the given read segment
- property kind: fgpyo.read_structure.SegmentType
The kind of read segment that corresponds to this sub-read.
- segment: fgpyo.read_structure.ReadSegment
The segment of the read structure that describes this sub-read.
- class fgpyo.read_structure.SubReadWithoutQuals(*, bases: str, segment: ReadSegment)
Contains the bases that correspond to the given read segment.
- property kind: fgpyo.read_structure.SegmentType
The kind of read segment that corresponds to this sub-read.
- segment: fgpyo.read_structure.ReadSegment
The segment of the read structure that describes this sub-read.
See also
https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures
The fgbio wiki for an explanation of Read Structures.
Manipulating DNA and RNA Sequences
Utility Functions for Manipulating DNA and RNA sequences.
This module contains utility functions for manipulating DNA and RNA sequences.
Collections
Functions for Working with Collections
This module contains classes and functions for working with collections and iterators.
Examples of a “Peekable” Iterator
“Peekable” iterators are useful to “peek” at the next item in an iterator without consuming it.
For example, this is useful when consuming items in an iterator while a predicate is true, without
consuming the first element for which the predicate is false. See the
takewhile()
and
dropwhile()
methods.
An empty peekable iterator throws StopIteration:
>>> from fgpyo.collections import PeekableIterator
>>> piter = PeekableIterator(iter([]))
>>> piter.peek()
Traceback (most recent call last):
...
StopIteration
A peekable iterator will return the next item before consuming it.
>>> piter = PeekableIterator([1, 2, 3])
>>> piter.peek()
1
>>> next(piter)
1
>>> [j for j in piter]
[2, 3]
The can_peek() function can be used to determine if the iterator can be peeked without StopIteration being thrown:
>>> piter = PeekableIterator([1])
>>> piter.peek() if piter.can_peek() else -1
1
>>> next(piter)
1
>>> piter.peek() if piter.can_peek() else -1
-1
>>> next(piter)
Traceback (most recent call last):
...
StopIteration
PeekableIterator’s constructor supports creation from iterable objects as well as iterators.
Module Contents
The module contains the following public classes:
PeekableIterator
– Iterator that allows you to peek at the next value before calling next
- class fgpyo.collections.PeekableIterator(source: Union[Iterator[fgpyo.collections.IterType], Iterable[fgpyo.collections.IterType]])
A peekable iterator wrapping an iterator or iterable.
This allows returning the next item without consuming it.
- Parameters
source – an iterator over the objects
- dropwhile(pred: Callable[[fgpyo.collections.IterType], bool]) fgpyo.collections.PeekableIterator[fgpyo.collections.IterType]
Drops elements from the iterator while the predicate is true.
Updates the iterator to point at the first non-matching element, or exhausts the iterator if all elements match the predicate.
- Parameters
pred (Callable[[V], bool]) – a function that takes a value from the iterator and returns true or false.
- Returns
a reference to this iterator, so calls can be chained
- Return type
- peek() fgpyo.collections.IterType
Returns the next element without consuming it, or raises StopIteration if the iterator is exhausted.
- takewhile(pred: Callable[[fgpyo.collections.IterType], bool]) List[fgpyo.collections.IterType]
Consumes from the iterator while pred is true, and returns the result as a List.
The iterator is left pointing at the first non-matching item, or if all items match then the iterator will be exhausted.
- Parameters
pred – a function that takes the next value from the iterator and returns true or false.
- Returns
A list of the values from the iterator, in order, up until and excluding the first value that does not match the predicate.
- Return type
List[V]
Logging
Methods for setting up logging for tools.
Progress Logging Examples
Frequently, input data (SAM/BAM/CRAM/VCF) are iterated in genomic coordinate order. Logging
progress is useful to not only log how many inputs have been consumed, but also their genomic
coordinate. ProgressLogger
can log progress every fixed number of
records. Logging can be written to a logging.Logger
as well as to a custom print method.
>>> from fgpyo.util.logging import ProgressLogger
>>> logged_lines = []
>>> progress = ProgressLogger(
... printer=lambda s: logged_lines.append(s),
... verb="recorded",
... noun="items",
... unit=2
... )
>>> progress.record(reference_name="chr1", position=1) # does not log
False
>>> progress.record(reference_name="chr1", position=2) # logs
True
>>> progress.record(reference_name="chr1", position=3) # does not log
False
>>> progress.log_last() # will log the last recorded item, if not previously logged
True
>>> logged_lines # show the lines logged
['recorded 2 items: chr1:2', 'recorded 3 items: chr1:3']
- class fgpyo.util.logging.ProgressLogger(printer: Union[logging.Logger, Callable[[str], Any]], noun: str = 'records', verb: str = 'Read', unit: int = 100000)
A little class to track progress.
This will output a log message every unit number times recorded.
- printer
either a Logger (in which case progress will be printed at Info) or a lambda that consumes a single string
- noun
the noun to use in the log message
- verb
the verb to use in the log message
- unit
the number of items for every log message
- count
the total count of items recorded
- fgpyo.util.logging.setup_logging(level: str = 'INFO', name: str = 'fgpyo') None
Globally configure logging for all modules
Configures logging to run at a specific level and output messages to stderr with useful information preceding the actual log message.
- Parameters
level – the default level for the logger
name – the name of the logger
IO
Module for reading and writing files
The functions in this module make it easy to:
check if a file exists and is writeable
check if a file and its parent directories exist and are writeable
check if a file exists and is readable
check if a path exists and is a directory
open an appropriate reader or writer based on the file extension
write items to a file, one per line
read lines from a file
fgpyo.io Examples:
>>> import fgpyo.io as fio
>>> from pathlib import Path
Assert that a path exists and is readable
>>> path_flat: Path = Path("example.txt")
>>> path_compressed: Path = Path("example.txt.gz")
>>> fio.assert_path_is_readable(path_flat)
AssertionError: Cannot read non-existent path: example.txt
>>> fio.assert_path_is_readable(path_compressed)
AssertionError: Cannot read non-existent path: example.txt.gz
Write to and read from path
>>> fio.write_lines(path = path_flat, lines_to_write=["flat file", 10])
>>> fio.write_lines(path = path_compressed, lines_to_write=["gzip file", 10])
Read lines from a path into a generator
>>> lines = fio.read_lines(path = path_flat)
>>> next(lines)
'flat file'
>>> next(lines)
'10'
>>> lines = fio.read_lines(path = path_compressed)
>>> next(lines)
'gzip file'
>>> next(lines)
'10'
- fgpyo.io.assert_directory_exists(path: pathlib.Path) None
Asserts that a path exists and is a directory
- Parameters
path – Path to check
Example
assert_directory_exists(path = Path("/example/directory/"))
- fgpyo.io.assert_path_is_readable(path: pathlib.Path) None
Asserts that the path exists and is readable, else raises AssertionError
- Parameters
path – a Path to check
Example
assert_path_is_readable(path = Path("some_file.csv"))
- fgpyo.io.assert_path_is_writeable(path: pathlib.Path, parent_must_exist: bool = True) None
Asserts the following: if the file exists, then it must also be writeable. Otherwise, if the path is not a file and parent_must_exist is true, then assert that the parent directory exists and is writeable. Otherwise, if the path is not a directory and parent_must_exist is false, then look at each parent directory until one is found that exists and is writeable.
- Parameters
path – Path to check
parent_must_exist – True/False the parent directory must exist, the default is True
Example
assert_path_is_writeable(path = Path("example.txt"))
- fgpyo.io.read_lines(path: pathlib.Path, strip: bool = False) Iterator[str]
Takes a path and reads each line into a generator, removing line terminators along the way. By default only line terminators (CR/LF) are stripped. The strip parameter may be used to strip both leading and trailing whitespace from each line.
- Parameters
path – Path to read from
strip – True to strip lines of all leading and trailing whitespace, False to only remove trailing CR/LF characters.
Example
read_back = fio.read_lines(path)
- fgpyo.io.redirect_to_dev_null(file_num: int) Generator[None, None, None]
A context manager that redirects output of file handle to /dev/null
- Parameters
file_num – number of filehandle to redirect.
- fgpyo.io.suppress_stderr() Generator[None, None, None]
A context manager that redirects output of stderr to /dev/null
- fgpyo.io.to_reader(path: pathlib.Path) Union[_io.TextIOWrapper, TextIO, IO[Any]]
Opens a Path for reading and based on extension uses open() or gzip.open()
- Parameters
path – Path to read from
Example
>>> reader = fio.to_reader(path = Path("reader.txt"))
>>> reader.readlines()
>>> reader.close()
- fgpyo.io.to_writer(path: pathlib.Path, append: bool = False) Union[IO[Any], _io.TextIOWrapper]
Opens a Path for writing (or appending) and based on extension uses open() or gzip.open()
- Parameters
path – Path to write (or append) to
Example
>>> writer = fio.to_writer(path = Path("writer.txt"))
>>> writer.write(f'{something}\n')
>>> writer.close()
- fgpyo.io.write_lines(path: pathlib.Path, lines_to_write: Iterable[Any], append: bool = False) None
Writes (or appends) a file with one line per item in provided iterable
- Parameters
path – Path to write (or append) to
lines_to_write – items to write (or append) to file
Example
lines: List[Any] = ["things to write", 100]
path_to_write_to: Path = Path("file_to_write_to.txt")
fio.write_lines(path = path_to_write_to, lines_to_write = lines)
Utility Functions
- class fgpyo.util.inspect.AttrsInstance(*args, **kwargs)
- class fgpyo.util.inspect.DataclassesProtocol(*args, **kwargs)
- fgpyo.util.inspect.FieldType
TypeAlias for dataclass Fields or attrs Attributes. It will correspond to the correct type for the corresponding _DataclassesOrAttrClass
alias of
Union
[dataclasses.Field
,attr._make.Attribute
]
- exception fgpyo.util.inspect.ParserNotFoundException
- fgpyo.util.inspect.attr_from(cls: Type[fgpyo.util.inspect._AttrFromType], kwargs: Dict[str, str], parsers: Optional[Dict[type, Callable[[str], Any]]] = None) fgpyo.util.inspect._AttrFromType
Builds an instance of an attrs or dataclasses class from keyword arguments
- Parameters
cls – the attr or dataclasses class to be built
kwargs – a dictionary of keyword arguments
parsers – a dictionary of parser functions to apply to specific types
- fgpyo.util.inspect.dict_parser(cls: Type, type_: typing_extensions.TypeAlias, parsers: Optional[Dict[type, Callable[[str], Any]]] = None) functools.partial
Returns a function that parses a stringified dict into a Dict of the correct type.
- Parameters
cls – the type of the class object this is being parsed for (used to get the default value for parsers)
type_ – the type of the attribute to be parsed
parsers – an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types)
- fgpyo.util.inspect.get_fields(cls: Union[fgpyo.util.inspect.DataclassesProtocol, fgpyo.util.inspect.AttrsInstance, Type[Union[fgpyo.util.inspect.DataclassesProtocol, fgpyo.util.inspect.AttrsInstance]]]) Tuple[Union[dataclasses.Field, attr._make.Attribute], ...]
Get the fields tuple from either a dataclasses or attr dataclass (or instance)
- fgpyo.util.inspect.get_fields_dict(cls: Union[fgpyo.util.inspect.DataclassesProtocol, fgpyo.util.inspect.AttrsInstance, Type[Union[fgpyo.util.inspect.DataclassesProtocol, fgpyo.util.inspect.AttrsInstance]]]) Mapping[str, Union[dataclasses.Field, attr._make.Attribute]]
Get the fields dict from either a dataclasses or attr dataclass (or instance)
- fgpyo.util.inspect.is_attr_class(cls: type) bool
Return True if the class is an attr class, and False otherwise
- fgpyo.util.inspect.list_parser(cls: Type, type_: typing_extensions.TypeAlias, parsers: Optional[Dict[type, Callable[[str], Any]]] = None) functools.partial
Returns a function that parses a stringified list into a List of the correct type.
- Parameters
cls – the type of the class object this is being parsed for (used to get the default value for parsers)
type_ – the type of the attribute to be parsed
parsers – an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types)
- fgpyo.util.inspect.set_parser(cls: Type, type_: typing_extensions.TypeAlias, parsers: Optional[Dict[type, Callable[[str], Any]]] = None) functools.partial
Returns a function that parses a stringified set into a Set of the correct type.
- Parameters
cls – the type of the class object this is being parsed for (used to get the default value for parsers)
type_ – the type of the attribute to be parsed
parsers – an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types)
- fgpyo.util.inspect.split_at_given_level(field: str, split_delim: str = ',', increase_depth_chars: Iterable[str] = ('{', '(', '['), decrease_depth_chars: Iterable[str] = ('}', ')', ']')) List[str]
Splits a nested field at its outermost level
Note that this method may produce incorrect results for fields containing strings with unpaired characters that increase or decrease the depth
Not currently smart enough to deal with fields enclosed in quotes ('' or "") - TODO
- fgpyo.util.inspect.tuple_parser(cls: Type, type_: typing_extensions.TypeAlias, parsers: Optional[Dict[type, Callable[[str], Any]]] = None) functools.partial
Returns a function that parses a stringified tuple into a Tuple of the correct type.
- Parameters
cls – the type of the class object this is being parsed for (used to get the default value for parsers)
type_ – the type of the attribute to be parsed
parsers – an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types)
- fgpyo.util.string.column_it(rows: List[List[str]], delimiter: str = ' ') str
A simple version of Unix’s column utility. This assumes the table is NxM.
- Parameters
rows – the rows to adjust. Each row must have the same number of delimited fields.
delimiter – the delimiter for each field in a row.
- exception fgpyo.util.types.InspectException
- fgpyo.util.types.is_constructible_from_str(type_: type) bool
Returns true if the provided type can be constructed from a string
- fgpyo.util.types.is_list_like(type_: type) bool
Returns true if the value is a list or list-like object
- fgpyo.util.types.make_enum_parser(enum: Type[fgpyo.util.types.EnumType]) functools.partial
Makes a parser function for enum classes
- fgpyo.util.types.make_literal_parser(literal: Type[fgpyo.util.types.LiteralType], parsers: Iterable[Callable[[str], fgpyo.util.types.LiteralType]]) functools.partial
Generates a parser function for a Literal type object, given a set of parsers for the possible values of that Literal type object
- fgpyo.util.types.make_union_parser(union: Type[fgpyo.util.types.UnionType], parsers: Iterable[Callable[[str], type]]) functools.partial
Generates a parser function for a Union type object, given a set of parsers for the possible member types of that Union type object