fgpyo

Classes¶

RequirementError ¶

Bases: Exception

Exception raised when a requirement is not satisfied.

Source code in fgpyo/_requirements.py

class RequirementError(Exception):
    """Exception raised when a requirement is not satisfied."""

Functions¶

require ¶

require(condition: bool, message: str | Callable[[], str] | None = None) -> None

Require a condition be satisfied.

Parameters:

Name	Type	Description	Default
`condition`	`bool`	The condition to satisfy.	required
`message`	`str \| Callable[[], str] \| None`	An optional message to include with the error when the condition is false. The message may be provided as either a string literal or a function returning a string. The function will not be evaluated unless the condition is false.	`None`

Raises:

Type	Description
`RequirementError`	If the condition is false.

Source code in fgpyo/_requirements.py

def require(condition: bool, message: str | Callable[[], str] | None = None) -> None:
    """
    Require a condition be satisfied.

    Args:
        condition: The condition to satisfy.
        message: An optional message to include with the error when the condition is false.
            The message may be provided as either a string literal or a function returning a string.
            The function will not be evaluated unless the condition is false.

    Raises:
        RequirementError: If the condition is false.
    """
    if not condition:
        if message is None:
            raise RequirementError()
        elif isinstance(message, str):
            raise RequirementError(message)
        else:
            raise RequirementError(message())
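
As a rough illustration of the lazy-message behaviour, the sketch below reproduces the function above under local names so that it runs without fgpyo installed. The names `expensive_message` and `calls` are hypothetical and exist only for this example.

```python
from typing import Callable, List, Optional, Union


class RequirementError(Exception):
    """Local stand-in for the exception shown above."""


def require(condition: bool, message: Union[str, Callable[[], str], None] = None) -> None:
    """Same logic as the source above, reproduced so the sketch is self-contained."""
    if not condition:
        if message is None:
            raise RequirementError()
        elif isinstance(message, str):
            raise RequirementError(message)
        else:
            raise RequirementError(message())


calls: List[str] = []  # records each time the lazy message is actually built


def expensive_message() -> str:
    """Hypothetical message factory that notes when it is invoked."""
    calls.append("built")
    return "value must be positive"


# Condition holds: nothing is raised and the callable is never invoked.
require(1 > 0, expensive_message)
message_built_on_success = bool(calls)

# Condition fails: the callable is invoked once to build the error text.
error_text: Optional[str] = None
try:
    require(-1 > 0, expensive_message)
except RequirementError as error:
    error_text = str(error)
```

Passing a callable is mainly worthwhile when building the message is costly, for example when it formats a large object.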

Modules¶

collections ¶

Custom Collections and Collection Functions.¶

This module contains classes and functions for working with collections and iterators.

Helpful Functions for Working with Collections¶

To test if an iterable is sorted or not:

>>> from fgpyo.collections import is_sorted
>>> is_sorted([])
True
>>> is_sorted([1])
True
>>> is_sorted([1, 2, 2, 3])
True
>>> is_sorted([1, 2, 4, 3])
False

Examples of a "Peekable" Iterator¶

"Peekable" iterators are useful to "peek" at the next item in an iterator without consuming it. For example, this is useful when consuming items from an iterator while a predicate is true, without consuming the first item for which the predicate is false. See the takewhile() and dropwhile() methods.

Calling peek() on an empty peekable iterator raises StopIteration:

>>> from fgpyo.collections import PeekableIterator
>>> piter = PeekableIterator(iter([]))
>>> piter.peek()
Traceback (most recent call last):
    ...
StopIteration

A peekable iterator will return the next item before consuming it:

>>> piter = PeekableIterator([1, 2, 3])
>>> piter.peek()
1
>>> next(piter)
1
>>> [j for j in piter]
[2, 3]

The can_peek() method can be used to determine whether the iterator can be peeked at without a StopIteration being raised:

>>> piter = PeekableIterator([1])
>>> piter.peek() if piter.can_peek() else -1
1
>>> next(piter)
1
>>> piter.peek() if piter.can_peek() else -1
-1
>>> next(piter)
Traceback (most recent call last):
    ...
StopIteration

PeekableIterator's constructor supports creation from iterable objects as well as iterators.

Attributes¶

LessThanOrEqualType `module-attribute` ¶

LessThanOrEqualType = TypeVar('LessThanOrEqualType', bound=SupportsLessThanOrEqual)

A type variable for an object that supports less-than-or-equal comparisons.

Classes¶

PeekableIterator ¶

Bases: Generic[IterType], Iterator[IterType]

A peekable iterator wrapping an iterator or iterable.

This allows returning the next item without consuming it.

Parameters:

Name	Type	Description	Default
`source`	`Iterator[IterType] \| Iterable[IterType]`	An iterator or iterable over the objects.	required

Source code in fgpyo/collections/__init__.py

class PeekableIterator(Generic[IterType], Iterator[IterType]):
    """
    A peekable iterator wrapping an iterator or iterable.

    This allows returning the next item without consuming it.

    Args:
        source: an iterator or iterable over the objects
    """

    def __init__(self, source: Iterator[IterType] | Iterable[IterType]) -> None:
        """Initializes the PeekableIterator with the given source."""
        self._iter: Iterator[IterType] = iter(source)
        self._sentinel: Any = object()
        self.__update_peek()

    def __iter__(self) -> Iterator[IterType]:
        """Returns self as the iterator."""
        return self

    def __next__(self) -> IterType:
        """Returns the next item and advances the iterator."""
        to_return = self.peek()
        self.__update_peek()
        return to_return

    def __update_peek(self) -> None:
        self._peek = next(self._iter, self._sentinel)

    def can_peek(self) -> bool:
        """Returns true if there is a value that can be peeked at, false otherwise."""
        return self._peek is not self._sentinel

    def peek(self) -> IterType:
        """Returns the next element without consuming it, or raises StopIteration if exhausted."""
        if self.can_peek():
            return self._peek
        else:
            raise StopIteration

    def takewhile(self, pred: Callable[[IterType], bool]) -> list[IterType]:
        """
        Consumes from the iterator while pred is true, and returns the result as a List.

        The iterator is left pointing at the first non-matching item, or if all items match
        then the iterator will be exhausted.

        Args:
            pred: a function that takes the next value from the iterator and returns
                  true or false.

        Returns:
            A list of the values from the iterator, in order, up to but excluding
            the first value that does not match the predicate.
        """
        xs: list[IterType] = []
        while self.can_peek() and pred(self._peek):
            xs.append(next(self))
        return xs

    def dropwhile(self, pred: Callable[[IterType], bool]) -> "PeekableIterator[IterType]":
        """
        Drops elements from the iterator while the predicate is true.

        Updates the iterator to point at the first non-matching element, or exhausts the
        iterator if all elements match the predicate.

        Args:
            pred: a function that takes a value from the iterator
                and returns true or false.

        Returns:
            A reference to this iterator, so calls can be chained.
        """
        while self.can_peek() and pred(self._peek):
            self.__update_peek()
        return self

Functions¶

__init__ ¶

__init__(source: Iterator[IterType] | Iterable[IterType]) -> None

Initializes the PeekableIterator with the given source.

Source code in fgpyo/collections/__init__.py

def __init__(self, source: Iterator[IterType] | Iterable[IterType]) -> None:
    """Initializes the PeekableIterator with the given source."""
    self._iter: Iterator[IterType] = iter(source)
    self._sentinel: Any = object()
    self.__update_peek()

__iter__ ¶

__iter__() -> Iterator[IterType]

Returns self as the iterator.

Source code in fgpyo/collections/__init__.py

def __iter__(self) -> Iterator[IterType]:
    """Returns self as the iterator."""
    return self

__next__ ¶

__next__() -> IterType

Returns the next item and advances the iterator.

Source code in fgpyo/collections/__init__.py

def __next__(self) -> IterType:
    """Returns the next item and advances the iterator."""
    to_return = self.peek()
    self.__update_peek()
    return to_return

can_peek ¶

can_peek() -> bool

Returns true if there is a value that can be peeked at, false otherwise.

Source code in fgpyo/collections/__init__.py

def can_peek(self) -> bool:
    """Returns true if there is a value that can be peeked at, false otherwise."""
    return self._peek is not self._sentinel

dropwhile ¶

dropwhile(pred: Callable[[IterType], bool]) -> PeekableIterator[IterType]

Drops elements from the iterator while the predicate is true.

Updates the iterator to point at the first non-matching element, or exhausts the iterator if all elements match the predicate.

Parameters:

Name	Type	Description	Default
`pred`	`Callable[[IterType], bool]`	A function that takes a value from the iterator and returns true or false.	required

Returns:

Type	Description
`PeekableIterator[IterType]`	A reference to this iterator, so calls can be chained.

Source code in fgpyo/collections/__init__.py

def dropwhile(self, pred: Callable[[IterType], bool]) -> "PeekableIterator[IterType]":
    """
    Drops elements from the iterator while the predicate is true.

    Updates the iterator to point at the first non-matching element, or exhausts the
    iterator if all elements match the predicate.

    Args:
        pred: a function that takes a value from the iterator
            and returns true or false.

    Returns:
        A reference to this iterator, so calls can be chained.
    """
    while self.can_peek() and pred(self._peek):
        self.__update_peek()
    return self

peek ¶

peek() -> IterType

Returns the next element without consuming it, or raises StopIteration if the iterator is exhausted.

Source code in fgpyo/collections/__init__.py

def peek(self) -> IterType:
    """Returns the next element without consuming it, or raises StopIteration if exhausted."""
    if self.can_peek():
        return self._peek
    else:
        raise StopIteration

takewhile ¶

takewhile(pred: Callable[[IterType], bool]) -> list[IterType]

Consumes from the iterator while pred is true, and returns the result as a List.

The iterator is left pointing at the first non-matching item, or if all items match then the iterator will be exhausted.

Parameters:

Name	Type	Description	Default
`pred`	`Callable[[IterType], bool]`	a function that takes the next value from the iterator and returns true or false.	required

Returns:

Type	Description
`list[IterType]`	A list of the values from the iterator, in order, up to but excluding the first value that does not match the predicate.

Source code in fgpyo/collections/__init__.py

def takewhile(self, pred: Callable[[IterType], bool]) -> list[IterType]:
    """
    Consumes from the iterator while pred is true, and returns the result as a List.

    The iterator is left pointing at the first non-matching item, or if all items match
    then the iterator will be exhausted.

    Args:
        pred: a function that takes the next value from the iterator and returns
              true or false.

    Returns:
        A list of the values from the iterator, in order, up to but excluding
        the first value that does not match the predicate.
    """
    xs: list[IterType] = []
    while self.can_peek() and pred(self._peek):
        xs.append(next(self))
    return xs
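
The difference from `itertools.takewhile`, which consumes and discards the first non-matching item of an iterator, may be easier to see in a small run. The sketch below uses `MiniPeekable`, a hypothetical trimmed-down stand-in written only for this example so that it runs without fgpyo; it follows the sentinel approach of the source above but is not the library class itself.

```python
from typing import Any, Callable, Generic, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


class MiniPeekable(Generic[T], Iterator[T]):
    """Hypothetical, trimmed-down stand-in for PeekableIterator."""

    def __init__(self, source: Iterable[T]) -> None:
        self._iter: Iterator[T] = iter(source)
        self._sentinel: Any = object()  # marks "nothing left to peek at"
        self._peek: Any = next(self._iter, self._sentinel)

    def __iter__(self) -> "MiniPeekable[T]":
        return self

    def __next__(self) -> T:
        if self._peek is self._sentinel:
            raise StopIteration
        value = self._peek
        self._peek = next(self._iter, self._sentinel)
        return value

    def can_peek(self) -> bool:
        return self._peek is not self._sentinel

    def takewhile(self, pred: Callable[[T], bool]) -> List[T]:
        taken: List[T] = []
        while self.can_peek() and pred(self._peek):
            taken.append(next(self))
        return taken

    def dropwhile(self, pred: Callable[[T], bool]) -> "MiniPeekable[T]":
        while self.can_peek() and pred(self._peek):
            next(self)
        return self


piter = MiniPeekable([1, 2, 3, 10, 11, 4])
small = piter.takewhile(lambda x: x < 10)        # consumes 1, 2, 3 and stops before 10
rest = list(piter.dropwhile(lambda x: x >= 10))  # skips 10 and 11, leaving 4
```

Because the first non-matching item stays in the peek slot, no element is lost between the two calls.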

SupportsLessThanOrEqual ¶

Bases: Protocol

A structural type for objects that support less-than-or-equal comparison.

Source code in fgpyo/collections/__init__.py

class SupportsLessThanOrEqual(Protocol):
    """A structural type for objects that support less-than-or-equal comparison."""

    def __le__(self, other: Any) -> bool:
        """Return True if self is less than or equal to other."""
        ...

Functions¶

__le__ ¶

__le__(other: Any) -> bool

Return True if self is less than or equal to other.

Source code in fgpyo/collections/__init__.py

def __le__(self, other: Any) -> bool:
    """Return True if self is less than or equal to other."""
    ...

Functions¶

is_sorted ¶

is_sorted(iterable: Iterable[LessThanOrEqualType]) -> bool

Tests lazily if an iterable of comparable objects is sorted or not.

Parameters:

Name	Type	Description	Default
`iterable`	`Iterable[LessThanOrEqualType]`	An iterable of comparable objects.	required

Raises:

Type	Description
`TypeError`	If there is more than 1 element in `iterable` and any of the elements are not comparable.

Source code in fgpyo/collections/__init__.py

def is_sorted(iterable: Iterable[LessThanOrEqualType]) -> bool:
    """
    Tests lazily if an iterable of comparable objects is sorted or not.

    Args:
        iterable: An iterable of comparable objects.

    Raises:
        TypeError: If there is more than 1 element in ``iterable`` and any of the elements are not
            comparable.
    """
    return all(map(lambda pair: le(*pair), pairwise(iterable)))

fasta ¶

Modules¶

builder ¶

Classes for generating fasta files and records for testing.¶

This module contains utility classes for creating fasta files, indexed fasta files (.fai), and sequence dictionaries (.dict).

Examples of creating sets of contigs for writing to fasta¶

Writing a FASTA with two contigs each with 100 bases:

>>> from pathlib import Path
>>> from fgpyo.fasta.builder import FastaBuilder
>>> builder = FastaBuilder()
>>> builder.add("chr10").add("AAAAAAAAAA", 10)
<fgpyo.fasta.builder.ContigBuilder object at ...>
>>> builder.add("chr11").add("GGGGGGGGGG", 10)
<fgpyo.fasta.builder.ContigBuilder object at ...>
>>> fasta_path = Path(getfixture("tmp_path")) / "test.fasta"
>>> builder.to_file(path=fasta_path)

Writing a FASTA with one contig with 100 A's and 50 T's:

>>> from fgpyo.fasta.builder import FastaBuilder
>>> builder = FastaBuilder()
>>> builder.add("chr10").add("AAAAAAAAAA", 10).add("TTTTTTTTTT", 5)  
<fgpyo.fasta.builder.ContigBuilder object at ...>
>>> builder.to_file(path=fasta_path)

Add bases to existing contig:

>>> from fgpyo.fasta.builder import FastaBuilder
>>> builder = FastaBuilder()
>>> contig_one = builder.add("chr10").add("AAAAAAAAAA", 1)
>>> contig_one.add("NNN", 1)  
<fgpyo.fasta.builder.ContigBuilder object at ...>
>>> contig_one.bases
'AAAAAAAAAANNN'

Classes¶

ContigBuilder ¶

Builder for constructing new contigs, and adding bases to existing contigs.

Existing contigs cannot be overwritten; each contig name in a FastaBuilder must be unique. Instances of ContigBuilder should be created using FastaBuilder.add(), where assembly and species are optional parameters that default to FastaBuilder.assembly and FastaBuilder.species.

Attributes:

Name	Type	Description
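`name`		Unique contig ID, e.g., "chr10"
`assembly`		Assembly information; FastaBuilder.add() passes the builder's assembly ('testassembly' by default) when none is given
`species`		Species information; FastaBuilder.add() passes the builder's species ('testspecies' by default) when none is given
`bases`		The bases accumulated for the contig, e.g., "A"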

Source code in fgpyo/fasta/builder.py

class ContigBuilder:
    """
    Builder for constructing new contigs, and adding bases to existing contigs.

    Existing contigs cannot be overwritten; each contig name in a FastaBuilder must
    be unique. Instances of ContigBuilder should be created using FastaBuilder.add(),
    where assembly and species are optional parameters that default to
    FastaBuilder.assembly and FastaBuilder.species.

    Attributes:
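        name: Unique contig ID, e.g., "chr10"
        assembly: Assembly information; FastaBuilder.add() passes the builder's assembly
            ('testassembly' by default) when none is given
        species: Species information; FastaBuilder.add() passes the builder's species
            ('testspecies' by default) when none is given
        bases: The bases accumulated for the contig, e.g., "A"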

    """

    def __init__(
        self,
        name: str,
        assembly: str,
        species: str,
    ):
        """Initializes a ContigBuilder with the given name, assembly, and species."""
        self.name = name
        self.assembly = assembly
        self.species = species
        self.bases = ""

    def add(self, bases: str, times: int = 1) -> "ContigBuilder":
        """
        Method for adding bases to a new or existing instance of ContigBuilder.

        Args:
            bases: The bases to be added to the contig
            times: The number of times the bases should be repeated

        Example:
        add("AAA", 2) results in the following bases -> "AAAAAA"
        """
        # Remove any spaces in string and enforce upper case format
        bases = bases.replace(" ", "").upper()
        self.bases += str(bases * times)
        return self

Functions¶

__init__ ¶

__init__(name: str, assembly: str, species: str)

Initializes a ContigBuilder with the given name, assembly, and species.

Source code in fgpyo/fasta/builder.py

def __init__(
    self,
    name: str,
    assembly: str,
    species: str,
):
    """Initializes a ContigBuilder with the given name, assembly, and species."""
    self.name = name
    self.assembly = assembly
    self.species = species
    self.bases = ""

add ¶

add(bases: str, times: int = 1) -> ContigBuilder

Method for adding bases to a new or existing instance of ContigBuilder.

Parameters:

Name	Type	Description	Default
`bases`	`str`	The bases to be added to the contig	required
`times`	`int`	The number of times the bases should be repeated	`1`

Example: add("AAA", 2) results in the following bases -> "AAAAAA"

Source code in fgpyo/fasta/builder.py

def add(self, bases: str, times: int = 1) -> "ContigBuilder":
    """
    Method for adding bases to a new or existing instance of ContigBuilder.

    Args:
        bases: The bases to be added to the contig
        times: The number of times the bases should be repeated

    Example:
    add("AAA", 2) results in the following bases -> "AAAAAA"
    """
    # Remove any spaces in string and enforce upper case format
    bases = bases.replace(" ", "").upper()
    self.bases += str(bases * times)
    return self
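
The normalization performed before the bases are appended can be sketched without fgpyo. `repeat_bases` below is a hypothetical helper that applies the same two string operations as the method body above; it is not part of the library.

```python
def repeat_bases(existing: str, bases: str, times: int = 1) -> str:
    """Hypothetical helper applying the same normalization as ContigBuilder.add()."""
    # Strip spaces and upper-case, exactly as the method body does.
    normalized = bases.replace(" ", "").upper()
    return existing + normalized * times


sequence = ""
sequence = repeat_bases(sequence, "aa a", 2)  # "AAA" repeated twice
sequence = repeat_bases(sequence, "n n n")    # appended once

# Only the space character is removed; other whitespace such as tabs is kept.
with_tab = repeat_bases("", "ac\tgt")
```

Note that the bases are not checked against any nucleotide alphabet, so any characters passed in end up in the contig.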

FastaBuilder ¶

Builder for constructing sets of one or more contigs.

Provides the ability to manufacture sets of contigs from minimal input, and automatically generates the information necessary for writing the FASTA file, index, and dictionary.

A builder is constructed from an assembly, species, and line length. All attributes have defaults, but these can be overridden.

Contigs are added to a FastaBuilder using FastaBuilder.add().

Bases are added to existing contigs using ContigBuilder.add().

Once accumulated, the contigs can be written to a file using FastaBuilder.to_file().

Calling to_file() will also generate the fasta index (.fai) and sequence dictionary (.dict).

Attributes:

Name	Type	Description
`assembly`	`str`	Assembly information; defaults to 'testassembly'
`species`	`str`	Species; defaults to 'testspecies'
`line_length`	`int`	Desired line length; defaults to 80
`contig_builders`	`dict[str, ContigBuilder]`	Private dictionary of contig names and instances of ContigBuilder

Source code in fgpyo/fasta/builder.py

class FastaBuilder:
    """
    Builder for constructing sets of one or more contigs.

    Provides the ability to manufacture sets of contigs from minimal input, and automatically
    generates the information necessary for writing the FASTA file, index, and dictionary.

    A builder is constructed from an assembly, species, and line length. All attributes have
    defaults, but these can be overridden.

    Contigs are added to FastaBuilder using:
    [`add()`][fgpyo.fasta.builder.FastaBuilder.add]

    Bases are added to existing contigs using:
    [`add()`][fgpyo.fasta.builder.ContigBuilder.add]

    Once accumulated the contigs can be written to a file using:
    [`to_file()`][fgpyo.fasta.builder.FastaBuilder.to_file]

    Calling to_file() will also generate the fasta index (.fai) and sequence dictionary (.dict).

    Attributes:
        assembly: Assembly information; defaults to 'testassembly'
        species: Species; defaults to 'testspecies'
        line_length: Desired line length; defaults to 80
        contig_builders: Private dictionary of contig names and instances of ContigBuilder
    """

    def __init__(
        self,
        assembly: str = "testassembly",
        species: str = "testspecies",
        line_length: int = 80,
    ):
        """Initializes a FastaBuilder with the given assembly, species, and line length."""
        self.assembly: str = assembly
        self.species: str = species
        self.line_length: int = line_length
        self.__contig_builders: dict[str, ContigBuilder] = {}

    def __getitem__(self, key: str) -> ContigBuilder:
        """Access instance of ContigBuilder by name."""
        return self.__contig_builders[key]

    def add(
        self,
        name: str,
        assembly: str | None = None,
        species: str | None = None,
    ) -> ContigBuilder:
        """
        Creates and returns a new ContigBuilder for a contig with the provided name.

        Contig names must be unique; attempting to create two separate contigs with the same
        name will result in an error.

        Args:
            name: Unique contig ID, e.g., "chr10"
            assembly: Assembly information; if None, the builder's assembly is used
            species: Species information; if None, the builder's species is used
        """
        # Fall back to self.assembly and self.species when the parameter is None
        assembly = assembly if assembly is not None else self.assembly
        species = species if species is not None else self.species

        # Assert that the provided name does not already exist
        assert name not in self.__contig_builders, (
            f"The contig {name} already exists, see docstring for methods on "
            f"adding bases to existing contigs"
        )
        builder: ContigBuilder = ContigBuilder(name=name, assembly=assembly, species=species)
        self.__contig_builders[name] = builder
        return builder

    def to_file(
        self,
        path: Path,
    ) -> None:
        """
        Writes out the set of accumulated contigs to a FASTA file at the `path` given.

        Also generates the accompanying fasta index file (`.fai` appended to `path`) and
        sequence dictionary file (`.dict` appended to `path`).

        Contigs are emitted in the order they were added to the builder.  Sequence
        lines in the FASTA file are wrapped to the line length given when the builder
        was constructed.

        Args:
            path: Path to write files to.

        Example:
        FastaBuilder.to_file(path = pathlib.Path("my_fasta.fa"))
        """
        assert_path_is_writable(path)

        with path.open("w") as writer:
            for contig in self.__contig_builders.values():
                try:
                    writer.write(f">{contig.name}")
                    writer.write("\n")
                    for line in textwrap.wrap(contig.bases, self.line_length):
                        writer.write(line)
                        writer.write("\n")
                except OSError as error:
                    raise Exception(f"Could not write to {writer}") from error

        # Index fasta
        pysam_faidx(str(path))

        # Write dictionary
        pysam_dict(
            assembly=self.assembly,
            species=self.species,
            output_path=str(f"{path}.dict"),
            input_path=str(path),
        )

    @contextmanager
    def to_fasta_file_handle(self, path: Path) -> Iterator[FastaFile]:
        """
        Writes out the set of accumulated contigs to a FASTA file and returns an open FastaFile.

        This is a convenience method that combines `to_file()` with opening the resulting
        file as a `pysam.FastaFile`.

        Args:
            path: Path to which to write the FASTA file.

        Yields:
            An open `pysam.FastaFile` for the written FASTA.
        """
        self.to_file(path)
        with FastaFile(f"{path}") as fasta:
            yield fasta

Functions¶

__getitem__ ¶

__getitem__(key: str) -> ContigBuilder

Access instance of ContigBuilder by name.

Source code in fgpyo/fasta/builder.py

def __getitem__(self, key: str) -> ContigBuilder:
    """Access instance of ContigBuilder by name."""
    return self.__contig_builders[key]

__init__ ¶

__init__(assembly: str = 'testassembly', species: str = 'testspecies', line_length: int = 80)

Initializes a FastaBuilder with the given assembly, species, and line length.

Source code in fgpyo/fasta/builder.py

def __init__(
    self,
    assembly: str = "testassembly",
    species: str = "testspecies",
    line_length: int = 80,
):
    """Initializes a FastaBuilder with the given assembly, species, and line length."""
    self.assembly: str = assembly
    self.species: str = species
    self.line_length: int = line_length
    self.__contig_builders: dict[str, ContigBuilder] = {}

add ¶

add(name: str, assembly: str | None = None, species: str | None = None) -> ContigBuilder

Creates and returns a new ContigBuilder for a contig with the provided name.

Contig names must be unique; attempting to create two separate contigs with the same name will result in an error.

Parameters:

Name	Type	Description	Default
`name`	`str`	Unique contig ID, e.g., "chr10"	required
`assembly`	`str \| None`	Assembly information; if None, the builder's assembly is used ('testassembly' by default)	`None`
`species`	`str \| None`	Species information; if None, the builder's species is used ('testspecies' by default)	`None`

Source code in fgpyo/fasta/builder.py

def add(
    self,
    name: str,
    assembly: str | None = None,
    species: str | None = None,
) -> ContigBuilder:
    """
    Creates and returns a new ContigBuilder for a contig with the provided name.

    Contig names must be unique; attempting to create two separate contigs with the same
    name will result in an error.

    Args:
        name: Unique contig ID, e.g., "chr10"
        assembly: Assembly information; if None, the builder's assembly is used
        species: Species information; if None, the builder's species is used
    """
    # Fall back to self.assembly and self.species when the parameter is None
    assembly = assembly if assembly is not None else self.assembly
    species = species if species is not None else self.species

    # Assert that the provided name does not already exist
    assert name not in self.__contig_builders, (
        f"The contig {name} already exists, see docstring for methods on "
        f"adding bases to existing contigs"
    )
    builder: ContigBuilder = ContigBuilder(name=name, assembly=assembly, species=species)
    self.__contig_builders[name] = builder
    return builder
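
The defaulting and uniqueness rules can be illustrated in isolation. `MiniFastaBuilder` below is a hypothetical stand-in that keeps only those two pieces of logic from the source above and stores plain tuples instead of ContigBuilder instances, so it runs without fgpyo or pysam.

```python
from typing import Dict, Optional, Tuple


class MiniFastaBuilder:
    """Hypothetical stand-in keeping only the naming and defaulting logic of add()."""

    def __init__(self, assembly: str = "testassembly", species: str = "testspecies") -> None:
        self.assembly = assembly
        self.species = species
        self._contigs: Dict[str, Tuple[str, str]] = {}

    def add(
        self, name: str, assembly: Optional[str] = None, species: Optional[str] = None
    ) -> Tuple[str, str]:
        # None falls back to the builder-level value, not to a fixed string.
        assembly = assembly if assembly is not None else self.assembly
        species = species if species is not None else self.species
        assert name not in self._contigs, f"The contig {name} already exists"
        self._contigs[name] = (assembly, species)
        return self._contigs[name]


builder = MiniFastaBuilder(assembly="hg38")
inherited = builder.add("chr10")                   # picks up the builder's assembly
overridden = builder.add("chr11", assembly="hs1")  # per-contig override

duplicate_error: Optional[str] = None
try:
    builder.add("chr10")
except AssertionError as error:
    duplicate_error = str(error)
```

Because the uniqueness check is a plain `assert`, it raises AssertionError and is skipped entirely when Python runs with the `-O` flag.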

to_fasta_file_handle ¶

to_fasta_file_handle(path: Path) -> Iterator[FastaFile]

Writes out the set of accumulated contigs to a FASTA file and returns an open FastaFile.

This is a convenience method that combines to_file() with opening the resulting file as a pysam.FastaFile.

Parameters:

Name	Type	Description	Default
`path`	`Path`	Path to which to write the FASTA file.	required

Yields:

Type	Description
`FastaFile`	An open `pysam.FastaFile` for the written FASTA.

Source code in fgpyo/fasta/builder.py

@contextmanager
def to_fasta_file_handle(self, path: Path) -> Iterator[FastaFile]:
    """
    Writes out the set of accumulated contigs to a FASTA file and returns an open FastaFile.

    This is a convenience method that combines `to_file()` with opening the resulting
    file as a `pysam.FastaFile`.

    Args:
        path: Path to which to write the FASTA file.

    Yields:
        An open `pysam.FastaFile` for the written FASTA.
    """
    self.to_file(path)
    with FastaFile(f"{path}") as fasta:
        yield fasta
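
The write-then-yield pattern used here can be sketched with a plain text handle in place of `pysam.FastaFile`. The names `write_records` and `written_file_handle` are hypothetical and only mirror the shape of the method above.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import IO, Iterator


def write_records(path: Path) -> None:
    """Hypothetical stand-in for to_file(): writes one tiny FASTA record."""
    path.write_text(">chr10\nAAAA\n")


@contextmanager
def written_file_handle(path: Path) -> Iterator[IO[str]]:
    """Write first, then yield an open handle that is closed on exit."""
    write_records(path)
    with path.open("r") as handle:
        yield handle


with tempfile.TemporaryDirectory() as tmp_dir:
    with written_file_handle(Path(tmp_dir) / "test.fasta") as handle:
        first_line = handle.readline().strip()
    closed_after_block = handle.closed
```

Since the method is decorated with `@contextmanager`, it is meant to be used in a `with` statement; the handle is only valid inside that block.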

to_file ¶

to_file(path: Path) -> None

Writes out the set of accumulated contigs to a FASTA file at the path given.

Also generates the accompanying fasta index file (.fai appended to path, e.g. my_fasta.fa.fai) and sequence dictionary file (.dict appended to path, e.g. my_fasta.fa.dict).

Contigs are emitted in the order they were added to the builder. Sequence lines in the FASTA file are wrapped to the line length given when the builder was constructed.

Parameters:

Name	Type	Description	Default
`path`	`Path`	Path to write files to.	required

Example: FastaBuilder.to_file(path = pathlib.Path("my_fasta.fa"))

Source code in fgpyo/fasta/builder.py

def to_file(
    self,
    path: Path,
) -> None:
    """
    Writes out the set of accumulated contigs to a FASTA file at the `path` given.

    Also generates the accompanying fasta index file (`.fai` appended to `path`) and
    sequence dictionary file (`.dict` appended to `path`).

    Contigs are emitted in the order they were added to the builder.  Sequence
    lines in the FASTA file are wrapped to the line length given when the builder
    was constructed.

    Args:
        path: Path to write files to.

    Example:
    FastaBuilder.to_file(path = pathlib.Path("my_fasta.fa"))
    """
    assert_path_is_writable(path)

    with path.open("w") as writer:
        for contig in self.__contig_builders.values():
            try:
                writer.write(f">{contig.name}")
                writer.write("\n")
                for line in textwrap.wrap(contig.bases, self.line_length):
                    writer.write(line)
                    writer.write("\n")
            except OSError as error:
                raise Exception(f"Could not write to {writer}") from error

    # Index fasta
    pysam_faidx(str(path))

    # Write dictionary
    pysam_dict(
        assembly=self.assembly,
        species=self.species,
        output_path=str(f"{path}.dict"),
        input_path=str(path),
    )
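
The line wrapping in the write loop can be tried out without touching the file system. `fasta_lines` below is a hypothetical helper that mirrors the loop above but collects the lines in a list instead of writing them.

```python
import textwrap
from typing import Dict, List


def fasta_lines(contigs: Dict[str, str], line_length: int = 80) -> List[str]:
    """Hypothetical helper mirroring the write loop of to_file(), minus the I/O."""
    lines: List[str] = []
    for name, bases in contigs.items():  # dicts preserve insertion order
        lines.append(f">{name}")
        lines.extend(textwrap.wrap(bases, line_length))
    return lines


lines = fasta_lines({"chr10": "A" * 100, "chr11": "G" * 30}, line_length=80)

# textwrap.wrap returns an empty list for an empty string,
# so a contig without bases produces a header line only.
header_only = fasta_lines({"empty": ""})
```

With a line length of 80, the 100-base contig is split into one line of 80 bases and one of 20.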

Functions¶

pysam_dict ¶

pysam_dict(assembly: str, species: str, output_path: str, input_path: str) -> None

Calls pysam.dict and writes the sequence dictionary to the provided output path.

Parameters:

Name	Type	Description	Default
`assembly`	`str`	Assembly	required
`species`	`str`	Species	required
`output_path`	`str`	File path to write dictionary to	required
`input_path`	`str`	Path to fasta file	required

Source code in fgpyo/fasta/builder.py

def pysam_dict(assembly: str, species: str, output_path: str, input_path: str) -> None:
    """
    Calls pysam.dict and writes the sequence dictionary to the provided output path.

    Args:
        assembly: Assembly
        species: Species
        output_path: File path to write dictionary to
        input_path: Path to fasta file
    """
    samtools_dict("-a", assembly, "-s", species, "-o", output_path, input_path)

pysam_faidx ¶

pysam_faidx(input_path: str) -> None

Calls pysam.faidx and writes the fasta index alongside the fasta file.

Parameters:

Name	Type	Description	Default
`input_path`	`str`	Path to fasta file	required

Source code in fgpyo/fasta/builder.py

def pysam_faidx(input_path: str) -> None:
    """
    Calls pysam.faidx and writes the fasta index alongside the fasta file.

    Args:
        input_path: Path to fasta file
    """
    samtools_faidx(input_path)
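
For tests that inspect the generated index, it can help to predict its rows. The sketch below assumes the five-column samtools faidx layout (name, length, byte offset of the first base, bases per line, bytes per line), ASCII content, and `\n` line endings as written by to_file(). `expected_fai_rows` is a hypothetical helper, not part of fgpyo or pysam, and it does not cover contigs with no bases.

```python
from typing import Dict, List, Tuple

FaiRow = Tuple[str, int, int, int, int]  # name, length, offset, linebases, linewidth


def expected_fai_rows(contigs: Dict[str, str], line_length: int = 80) -> List[FaiRow]:
    """Predicts index rows for a FASTA laid out as a header line plus wrapped lines."""
    rows: List[FaiRow] = []
    position = 0  # running byte offset into the file
    for name, bases in contigs.items():
        position += len(f">{name}\n")  # skip the header line
        line_bases = min(len(bases), line_length)
        rows.append((name, len(bases), position, line_bases, line_bases + 1))
        full_lines, remainder = divmod(len(bases), line_length)
        newlines = full_lines + (1 if remainder else 0)
        position += len(bases) + newlines  # skip the sequence lines
    return rows


rows = expected_fai_rows({"chr10": "A" * 100, "chr11": "G" * 100})
```

For the two 100-base contigs, the first sequence starts at byte 7 (after `>chr10` and its newline) and the second at byte 116.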

sequence_dictionary ¶

Classes for representing sequence dictionaries.¶

Examples of building and using sequence dictionaries¶

Building a sequence dictionary from a pysam.AlignmentHeader:

>>> import pysam
>>> from fgpyo.fasta.sequence_dictionary import SequenceDictionary
>>> sd: SequenceDictionary
>>> with pysam.AlignmentFile("./tests/fgpyo/sam/data/valid.sam") as fh:
...     sd = SequenceDictionary.from_sam(fh.header)
>>> print(sd)  
@SQ     SN:chr1 LN:101
@SQ     SN:chr2 LN:101
@SQ     SN:chr3 LN:101
@SQ     SN:chr4 LN:101
@SQ     SN:chr5 LN:101
@SQ     SN:chr6 LN:101
@SQ     SN:chr7 LN:404
@SQ     SN:chr8 LN:202

Query based on index:

>>> print(sd[3])  
@SQ     SN:chr4 LN:101

Query based on name:

>>> print(sd["chr6"])  
@SQ     SN:chr6 LN:101

Add, get, and delete attributes:

>>> from fgpyo.fasta.sequence_dictionary import Keys
>>> meta = sd[0]
>>> print(meta)  
@SQ     SN:chr1 LN:101
>>> meta[Keys.ASSEMBLY] = "hg38"
>>> print(meta)  
@SQ     SN:chr1 LN:101  AS:hg38
>>> meta.get(Keys.ASSEMBLY)
'hg38'
>>> meta.get(Keys.SPECIES) is None
True
>>> Keys.MD5 in meta
False
>>> del meta[Keys.ASSEMBLY]
>>> print(meta)  
@SQ     SN:chr1 LN:101

Get a sequence based on one of its aliases:

>>> meta[Keys.ALIASES] = "foo,bar,car"
>>> sd = SequenceDictionary(infos=[meta] + sd.infos[1:])
>>> print(sd)  
@SQ     SN:chr1 LN:101  AN:foo,bar,car
@SQ     SN:chr2 LN:101
@SQ     SN:chr3 LN:101
@SQ     SN:chr4 LN:101
@SQ     SN:chr5 LN:101
@SQ     SN:chr6 LN:101
@SQ     SN:chr7 LN:404
@SQ     SN:chr8 LN:202
>>> print(sd["chr1"])  
@SQ     SN:chr1 LN:101  AN:foo,bar,car
>>> print(sd["bar"])  
@SQ     SN:chr1 LN:101  AN:foo,bar,car

Create a pysam.AlignmentHeader from a sequence dictionary:

>>> sd.to_sam_header()  
<pysam.libcalignmentfile.AlignmentHeader object at ...>
>>> print(sd.to_sam_header())  
@HD     VN:1.5
@SQ     SN:chr1 LN:101  AN:foo,bar,car
@SQ     SN:chr2 LN:101
@SQ     SN:chr3 LN:101
@SQ     SN:chr4 LN:101
@SQ     SN:chr5 LN:101
@SQ     SN:chr6 LN:101
@SQ     SN:chr7 LN:404
@SQ     SN:chr8 LN:202

Create a pysam.AlignmentHeader from a sequence dictionary with extra header items:

>>> sd.to_sam_header(
...     extra_header={"RG": [{"ID": "A", "LB": "a-library"}, {"ID": "B", "LB": "b-library"}]}
... )  
<pysam.libcalignmentfile.AlignmentHeader object at ...>
>>> print(sd.to_sam_header(
...     extra_header={"RG": [{"ID": "A", "LB": "a-library"}, {"ID": "B", "LB": "b-library"}]}
... ))  
@HD     VN:1.5
@SQ     SN:chr1 LN:101  AN:foo,bar,car
@SQ     SN:chr2 LN:101
@SQ     SN:chr3 LN:101
@SQ     SN:chr4 LN:101
@SQ     SN:chr5 LN:101
@SQ     SN:chr6 LN:101
@SQ     SN:chr7 LN:404
@SQ     SN:chr8 LN:202
@RG     ID:A    LB:a-library
@RG     ID:B    LB:b-library

Attributes¶

SEQUENCE_NAME_PATTERN module-attribute ¶

SEQUENCE_NAME_PATTERN: Pattern = compile('^[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*$')

Regular expression for valid reference sequence names according to the SAM spec.
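A minimal sketch of how the pattern can be used to check candidate names. The pattern is copied from the attribute above; `is_valid_name` is a hypothetical helper for illustration and is not part of fgpyo.

```python
import re

# Pattern copied from the module attribute above.
SEQUENCE_NAME_PATTERN = re.compile(
    r"^[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*$"
)


def is_valid_name(name: str) -> bool:
    """Return True if `name` matches the reference-name pattern."""
    return SEQUENCE_NAME_PATTERN.search(name) is not None


print(is_valid_name("chr1"))         # a plain contig name
print(is_valid_name("HLA-A*01:01"))  # '*' and ':' are allowed after the first character
print(is_valid_name("*chr1"))        # '*' may not be the first character
print(is_valid_name("chr 1"))        # whitespace is not in either character class
```

The first character class omits `*` and `=`, which is why the third name is rejected while the second is accepted.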

Classes¶

AlternateLocus dataclass ¶

Stores an alternate locus for an associated sequence (1-based inclusive).

Source code in fgpyo/fasta/sequence_dictionary.py

@dataclass(frozen=True, init=True)
class AlternateLocus:
    """Stores an alternate locus for an associated sequence (1-based inclusive)."""

    name: str
    start: int
    end: int

    def __post_init__(self) -> None:
        """Any post initialization validation should go here."""
        if self.start > self.end:
            raise ValueError(f"start > end: {self.start} > {self.end}")
        if self.start < 1:
            raise ValueError(f"start < 1: {self.start}")

    def __str__(self) -> str:
        """Returns the string representation as name:start-end."""
        return f"{self.name}:{self.start}-{self.end}"

    def __len__(self) -> int:
        """Returns the length of the genomic span."""
        return self.end - self.start + 1

    @staticmethod
    def parse(value: str) -> "AlternateLocus":
        """Parse the genomic interval of format: `<contig>:<start>-<end>`."""
        name, rest = value.split(":", maxsplit=1)
        start, end = rest.split("-", maxsplit=1)
        return AlternateLocus(name=name, start=int(start), end=int(end))

Functions¶

__len__ ¶

__len__() -> int

Returns the length of the genomic span.

Source code in fgpyo/fasta/sequence_dictionary.py

def __len__(self) -> int:
    """Returns the length of the genomic span."""
    return self.end - self.start + 1

__post_init__ ¶

__post_init__() -> None

Any post initialization validation should go here.

Source code in fgpyo/fasta/sequence_dictionary.py

def __post_init__(self) -> None:
    """Any post initialization validation should go here."""
    if self.start > self.end:
        raise ValueError(f"start > end: {self.start} > {self.end}")
    if self.start < 1:
        raise ValueError(f"start < 1: {self.start}")

__str__ ¶

__str__() -> str

Returns the string representation as name:start-end.

Source code in fgpyo/fasta/sequence_dictionary.py

def __str__(self) -> str:
    """Returns the string representation as name:start-end."""
    return f"{self.name}:{self.start}-{self.end}"

parse staticmethod ¶

parse(value: str) -> AlternateLocus

Parse the genomic interval of format: <contig>:<start>-<end>.

Source code in fgpyo/fasta/sequence_dictionary.py

@staticmethod
def parse(value: str) -> "AlternateLocus":
    """Parse the genomic interval of format: `<contig>:<start>-<end>`."""
    name, rest = value.split(":", maxsplit=1)
    start, end = rest.split("-", maxsplit=1)
    return AlternateLocus(name=name, start=int(start), end=int(end))
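The parsing and length arithmetic above can be sketched without fgpyo installed. The helpers `parse_locus` and `span_length` are hypothetical names for this illustration; they mirror the splitting, validation, and 1-based inclusive length shown in the source listings, returning a plain tuple instead of an `AlternateLocus`.

```python
def parse_locus(value: str) -> tuple[str, int, int]:
    """Split `<contig>:<start>-<end>` into (name, start, end) with the same checks as above."""
    name, rest = value.split(":", maxsplit=1)
    start_str, end_str = rest.split("-", maxsplit=1)
    start, end = int(start_str), int(end_str)
    if start > end:
        raise ValueError(f"start > end: {start} > {end}")
    if start < 1:
        raise ValueError(f"start < 1: {start}")
    return name, start, end


def span_length(start: int, end: int) -> int:
    """Length of a 1-based inclusive span: both endpoints are counted."""
    return end - start + 1


name, start, end = parse_locus("chr1:100-200")
print(name, start, end, span_length(start, end))
```

Because both endpoints are included, a span whose start equals its end has length 1, not 0.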

Keys ¶

Bases: StrEnum

Enumeration of tags/attributes available on a sequence record/metadata (SAM @SQ line).

Source code in fgpyo/fasta/sequence_dictionary.py

@unique
class Keys(StrEnum):
    """Enumeration of tags/attributes available on a sequence record/metadata (SAM @SQ line)."""

    ALIASES = "AN"
    ALTERNATE_LOCUS = "AH"
    ASSEMBLY = "AS"
    DESCRIPTION = "DS"
    SEQUENCE_LENGTH = "LN"
    MD5 = "M5"
    SEQUENCE_NAME = "SN"
    SPECIES = "SP"
    TOPOLOGY = "TP"
    URI = "UR"

    @staticmethod
    def attributes() -> list[str]:
        """
        The list of keys that are allowed to be attributes in `SequenceMetadata`.

        Notably, `SEQUENCE_LENGTH` and `SEQUENCE_NAME` are not allowed.
        """
        return [key for key in Keys if key != Keys.SEQUENCE_NAME and key != Keys.SEQUENCE_LENGTH]

Functions¶

attributes staticmethod ¶

attributes() -> list[str]

The list of keys that are allowed to be attributes in SequenceMetadata.

Notably, SEQUENCE_LENGTH and SEQUENCE_NAME are not allowed.

Source code in fgpyo/fasta/sequence_dictionary.py

@staticmethod
def attributes() -> list[str]:
    """
    The list of keys that are allowed to be attributes in `SequenceMetadata`.

    Notably, `SEQUENCE_LENGTH` and `SEQUENCE_NAME` are not allowed.
    """
    return [key for key in Keys if key != Keys.SEQUENCE_NAME and key != Keys.SEQUENCE_LENGTH]

SequenceDictionary dataclass ¶

Bases: Mapping[str | int, SequenceMetadata]

Contains an ordered collection of sequences.

A specific SequenceMetadata may be retrieved by name (str) or index (int), either by using the generic get method or by the correspondingly named by_name and by_index methods. The latter methods provide faster retrieval when the type is known.

This mapping collection iterates over the keys. To iterate over each SequenceMetadata, either use the typical values() method or access the metadata directly with infos.

Attributes:

Name	Type	Description
`infos`	`list[SequenceMetadata]`	the ordered collection of sequence metadata
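The name-or-index lookup described above can be illustrated with a small standalone mapping. `Info` and `TinyDictionary` are hypothetical stand-ins written for this sketch (they are not fgpyo classes); they copy only the lookup, iteration, and length logic from the source listing that follows.

```python
from collections.abc import Iterator, Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class Info:
    """Hypothetical stand-in for SequenceMetadata."""

    name: str
    length: int
    aliases: tuple[str, ...] = ()


class TinyDictionary(Mapping):
    """Hypothetical stand-in for SequenceDictionary."""

    def __init__(self, infos: list[Info]) -> None:
        self.infos = infos
        self._dict: dict[str, Info] = {}
        for info in infos:
            # Register the primary name and every alias under the same record.
            for name in (info.name, *info.aliases):
                if name in self._dict:
                    raise ValueError(f"Found duplicate sequence name: {name}")
                self._dict[name] = info

    def __getitem__(self, key: "str | int") -> Info:
        # Strings go through the name lookup; anything else indexes the ordered list.
        return self._dict[key] if isinstance(key, str) else self.infos[key]

    def __iter__(self) -> Iterator[str]:
        return iter(self._dict)

    def __len__(self) -> int:
        return len(self.infos)


sd = TinyDictionary([Info("chr1", 101, aliases=("1",)), Info("chr2", 202)])
print(sd["chr2"].length)                 # lookup by name
print(sd[0].name)                        # lookup by index
print(sd["1"] is sd["chr1"])             # an alias resolves to the same record
print([info.name for info in sd.infos])  # one entry per sequence, in order
```

In this stand-in, iterating the mapping itself yields every registered name, aliases included, so `infos` is the simpler route when exactly one record per sequence is wanted.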

Source code in fgpyo/fasta/sequence_dictionary.py

@dataclass(frozen=True, init=True)
class SequenceDictionary(Mapping[str | int, SequenceMetadata]):
    """
    Contains an ordered collection of sequences.

    A specific `SequenceMetadata` may be retrieved by name (`str`) or index (`int`), either by
    using the generic `get` method or by the correspondingly named `by_name` and `by_index` methods.
    The latter methods provide faster retrieval when the type is known.

    This _mapping_ collection iterates over the _keys_.  To iterate over each `SequenceMetadata`,
    either use the typical `values()` method or access the metadata directly with `infos`.

    Attributes:
        infos: the ordered collection of sequence metadata
    """

    infos: list[SequenceMetadata]
    _dict: dict[str, SequenceMetadata] = field(init=False, repr=False)

    def __post_init__(self) -> None:
        """Builds the internal name-to-metadata lookup dictionary."""
        # Initialize a mapping from sequence name to the sequence metadata for all names
        self_dict: dict[str, SequenceMetadata] = {}
        for index, info in enumerate(self.infos):
            if info.index != index:
                raise ValueError(
                    "Infos must be given with index set correctly."
                    + f"  See {index}th with name: {info.name}"
                )
            for name in info.all_names:
                if name in self_dict:
                    raise ValueError(f"Found duplicate sequence name: {name}")
                self_dict[name] = info
        object.__setattr__(self, "_dict", self_dict)

    def same_as(self, other: "SequenceDictionary") -> bool:
        """
        Returns True if all sequences in the two dictionaries are the same.

        Sequences are considered the same if they share a common reference name (including
        aliases), have the same length, and have the same MD5 (if both have MD5s).
        """
        if len(self) != len(other):
            return False
        return all(this.same_as(that) for this, that in zip(self.infos, other.infos, strict=True))

    def to_sam(self) -> list[dict[str, Any]]:
        """Converts the sequence dictionary to a list of dictionaries, one per sequence."""
        return [meta.to_sam() for meta in self.infos]

    def to_sam_header(
        self,
        extra_header: dict[str, Any] | None = None,
    ) -> pysam.AlignmentHeader:
        """
        Converts the sequence dictionary to a `pysam.AlignmentHeader`.

        Args:
            extra_header: a dictionary of extra values to add to the header, None otherwise.  See
                          `pysam.AlignmentHeader` for more details.
        """
        header_dict: dict[str, Any] = {
            "HD": {"VN": "1.5"},
            "SQ": self.to_sam(),
        }
        if extra_header is not None:
            header_dict = {**header_dict, **extra_header}
        return pysam.AlignmentHeader.from_dict(header_dict=header_dict)

    @staticmethod
    @overload
    def from_sam(data: Path) -> "SequenceDictionary": ...

    @staticmethod
    @overload
    def from_sam(data: pysam.AlignmentFile) -> "SequenceDictionary": ...

    @staticmethod
    @overload
    def from_sam(data: pysam.AlignmentHeader) -> "SequenceDictionary": ...

    @staticmethod
    @overload
    def from_sam(data: list[dict[str, Any]]) -> "SequenceDictionary": ...

    @staticmethod
    def from_sam(
        data: Path | pysam.AlignmentFile | pysam.AlignmentHeader | list[dict[str, Any]],
    ) -> "SequenceDictionary":
        """
        Creates a `SequenceDictionary` from a SAM file or its header.

        Args:
            data: The input may be any of:
                - a path to a SAM file
                - an open `pysam.AlignmentFile`
                - the `pysam.AlignmentHeader` associated with a `pysam.AlignmentFile`
                - the contents of a header's `SQ` fields, as returned by `AlignmentHeader.to_dict()`
        Returns:
            A `SequenceDictionary` mapping reference names to their metadata.
        """
        seq_dict: SequenceDictionary
        if isinstance(data, pysam.AlignmentHeader):
            seq_dict = SequenceDictionary.from_sam(data.to_dict()["SQ"])
        elif isinstance(data, pysam.AlignmentFile):
            seq_dict = SequenceDictionary.from_sam(data.header.to_dict()["SQ"])
        elif isinstance(data, Path):
            with sam.reader(data) as fh:
                seq_dict = SequenceDictionary.from_sam(fh.header)
        else:  # assuming `data` is a `list[dict[str, Any]]`
            try:
                infos: list[SequenceMetadata] = [
                    SequenceMetadata.from_sam(meta=meta, index=index)
                    for index, meta in enumerate(data)
                ]
                seq_dict = SequenceDictionary(infos=infos)
            except Exception as e:
                raise ValueError(f"Could not parse sequence information from data: {data}") from e

        return seq_dict

    def __getitem__(self, key: str | int) -> SequenceMetadata:
        """Returns the SequenceMetadata by name or index."""
        return self._dict[key] if isinstance(key, str) else self.infos[key]

    def get_by_name(self, name: str) -> SequenceMetadata | None:
        """
        Gets a `SequenceMetadata` explicitly by `name`.

        Returns:
            The corresponding SequenceMetadata.
            None if the name does not exist in this dictionary.
        """
        return self._dict.get(name)

    def by_name(self, name: str) -> SequenceMetadata:
        """Gets a `SequenceMetadata` explicitly by `name`.  The name must exist."""
        return self._dict[name]

    def by_index(self, index: int) -> SequenceMetadata:
        """
        Gets a `SequenceMetadata` explicitly by `index`.

        Raises:
            IndexError: if the index is out of bounds.
        """
        return self.infos[index]

    def __iter__(self) -> Iterator[str]:
        """Iterates over the sequence names."""
        return iter(self._dict)

    def __len__(self) -> int:
        """Returns the number of sequences in the dictionary."""
        return len(self.infos)

    def __str__(self) -> str:
        """Returns the SAM-formatted string of all sequences."""
        return "\n".join(f"{info}" for info in self.infos)

Functions¶

__getitem__ ¶

__getitem__(key: str | int) -> SequenceMetadata

Returns the SequenceMetadata by name or index.

Source code in fgpyo/fasta/sequence_dictionary.py

def __getitem__(self, key: str | int) -> SequenceMetadata:
    """Returns the SequenceMetadata by name or index."""
    return self._dict[key] if isinstance(key, str) else self.infos[key]

__iter__ ¶

__iter__() -> Iterator[str]

Iterates over the sequence names.

Source code in fgpyo/fasta/sequence_dictionary.py

def __iter__(self) -> Iterator[str]:
    """Iterates over the sequence names."""
    return iter(self._dict)

__len__ ¶

__len__() -> int

Returns the number of sequences in the dictionary.

Source code in fgpyo/fasta/sequence_dictionary.py

def __len__(self) -> int:
    """Returns the number of sequences in the dictionary."""
    return len(self.infos)

__post_init__ ¶

__post_init__() -> None

Builds the internal name-to-metadata lookup dictionary.

Source code in fgpyo/fasta/sequence_dictionary.py

def __post_init__(self) -> None:
    """Builds the internal name-to-metadata lookup dictionary."""
    # Initialize a mapping from sequence name to the sequence metadata for all names
    self_dict: dict[str, SequenceMetadata] = {}
    for index, info in enumerate(self.infos):
        if info.index != index:
            raise ValueError(
                "Infos must be given with index set correctly."
                + f"  See {index}th with name: {info.name}"
            )
        for name in info.all_names:
            if name in self_dict:
                raise ValueError(f"Found duplicate sequence name: {name}")
            self_dict[name] = info
    object.__setattr__(self, "_dict", self_dict)

__str__ ¶

__str__() -> str

Returns the SAM-formatted string of all sequences.

Source code in fgpyo/fasta/sequence_dictionary.py

def __str__(self) -> str:
    """Returns the SAM-formatted string of all sequences."""
    return "\n".join(f"{info}" for info in self.infos)

by_index ¶

by_index(index: int) -> SequenceMetadata

Gets a SequenceMetadata explicitly by index.

Raises:

Type	Description
`IndexError`	if the index is out of bounds.

Source code in fgpyo/fasta/sequence_dictionary.py

def by_index(self, index: int) -> SequenceMetadata:
    """
    Gets a `SequenceMetadata` explicitly by `index`.

    Raises:
        IndexError: if the index is out of bounds.
    """
    return self.infos[index]

by_name ¶

by_name(name: str) -> SequenceMetadata

Gets a SequenceMetadata explicitly by name. The name must exist.

Source code in fgpyo/fasta/sequence_dictionary.py

def by_name(self, name: str) -> SequenceMetadata:
    """Gets a `SequenceMetadata` explicitly by `name`.  The name must exist."""
    return self._dict[name]

from_sam staticmethod ¶

from_sam(data: Path) -> SequenceDictionary

from_sam(data: AlignmentFile) -> SequenceDictionary

from_sam(data: AlignmentHeader) -> SequenceDictionary

from_sam(data: list[dict[str, Any]]) -> SequenceDictionary

from_sam(data: Path | AlignmentFile | AlignmentHeader | list[dict[str, Any]]) -> SequenceDictionary

Creates a SequenceDictionary from a SAM file or its header.

Parameters:

Name	Type	Description	Default
`data`	`Path \| AlignmentFile \| AlignmentHeader \| list[dict[str, Any]]`	The input may be any of: a path to a SAM file; an open `pysam.AlignmentFile`; the `pysam.AlignmentHeader` associated with a `pysam.AlignmentFile`; or the contents of a header's `SQ` fields, as returned by `AlignmentHeader.to_dict()`.	required

Returns: A SequenceDictionary mapping reference names to their metadata.

Source code in fgpyo/fasta/sequence_dictionary.py

@staticmethod
def from_sam(
    data: Path | pysam.AlignmentFile | pysam.AlignmentHeader | list[dict[str, Any]],
) -> "SequenceDictionary":
    """
    Creates a `SequenceDictionary` from a SAM file or its header.

    Args:
        data: The input may be any of:
            - a path to a SAM file
            - an open `pysam.AlignmentFile`
            - the `pysam.AlignmentHeader` associated with a `pysam.AlignmentFile`
            - the contents of a header's `SQ` fields, as returned by `AlignmentHeader.to_dict()`
    Returns:
        A `SequenceDictionary` mapping reference names to their metadata.
    """
    seq_dict: SequenceDictionary
    if isinstance(data, pysam.AlignmentHeader):
        seq_dict = SequenceDictionary.from_sam(data.to_dict()["SQ"])
    elif isinstance(data, pysam.AlignmentFile):
        seq_dict = SequenceDictionary.from_sam(data.header.to_dict()["SQ"])
    elif isinstance(data, Path):
        with sam.reader(data) as fh:
            seq_dict = SequenceDictionary.from_sam(fh.header)
    else:  # assuming `data` is a `list[dict[str, Any]]`
        try:
            infos: list[SequenceMetadata] = [
                SequenceMetadata.from_sam(meta=meta, index=index)
                for index, meta in enumerate(data)
            ]
            seq_dict = SequenceDictionary(infos=infos)
        except Exception as e:
            raise ValueError(f"Could not parse sequence information from data: {data}") from e

    return seq_dict
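For the final branch, where `data` is already a list of `SQ` dictionaries, the per-record work amounts to numbering each entry and separating the name and length from the remaining tags. The sketch below shows that step on plain dictionaries; `split_sq_records` is a hypothetical helper that returns tuples rather than `SequenceMetadata` objects, so it runs without fgpyo or pysam.

```python
import copy
from typing import Any


def split_sq_records(
    data: list[dict[str, Any]],
) -> list[tuple[int, str, int, dict[str, Any]]]:
    """Turn raw SQ dictionaries into (index, name, length, attributes) tuples."""
    records = []
    for index, meta in enumerate(data):
        attributes = copy.deepcopy(meta)  # leave the caller's dictionaries untouched
        name = attributes.pop("SN")       # raises KeyError if the name is missing
        length = attributes.pop("LN")     # raises KeyError if the length is missing
        records.append((index, name, length, attributes))
    return records


data = [{"SN": "chr1", "LN": 101, "AS": "hg38"}, {"SN": "chr2", "LN": 202}]
for record in split_sq_records(data):
    print(record)
```

The index comes from the position in the list, which matches the index check performed when the dictionary is constructed.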

get_by_name ¶

get_by_name(name: str) -> SequenceMetadata | None

Gets a SequenceMetadata explicitly by name.

Returns:

Type	Description
`SequenceMetadata \| None`	The corresponding SequenceMetadata, or None if the name does not exist in this dictionary.

Source code in fgpyo/fasta/sequence_dictionary.py

def get_by_name(self, name: str) -> SequenceMetadata | None:
    """
    Gets a `SequenceMetadata` explicitly by `name`.

    Returns:
        The corresponding SequenceMetadata.
        None if the name does not exist in this dictionary.
    """
    return self._dict.get(name)

same_as ¶

same_as(other: SequenceDictionary) -> bool

Returns True if all sequences in the two dictionaries are the same.

Sequences are considered the same if they share a common reference name (including aliases), have the same length, and have the same MD5 (if both have MD5s).

Source code in fgpyo/fasta/sequence_dictionary.py

def same_as(self, other: "SequenceDictionary") -> bool:
    """
    Returns True if all sequences in the two dictionaries are the same.

    Sequences are considered the same if they share a common reference name (including
    aliases), have the same length, and have the same MD5 (if both have MD5s).
    """
    if len(self) != len(other):
        return False
    return all(this.same_as(that) for this, that in zip(self.infos, other.infos, strict=True))
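The comparison rule can be sketched on a reduced record type. `Seq`, `same_sequence`, and `same_dictionary` are hypothetical names for this illustration; the per-sequence logic follows the `SequenceMetadata.same_as` listing further down this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Seq:
    """Hypothetical stand-in holding only the fields the comparison needs."""

    name: str
    length: int
    aliases: tuple[str, ...] = ()
    md5: Optional[str] = None


def same_sequence(this: Seq, that: Seq) -> bool:
    if this.length != that.length:
        return False
    # `that`'s primary name must be `this`'s primary name or one of `this`'s aliases.
    if that.name not in (this.name, *this.aliases):
        return False
    # MD5s only count against a match when both sides have one.
    if this.md5 is None or that.md5 is None:
        return True
    return this.md5 == that.md5


def same_dictionary(these: list[Seq], those: list[Seq]) -> bool:
    if len(these) != len(those):
        return False
    return all(same_sequence(a, b) for a, b in zip(these, those))


ucsc = Seq("chr1", 101, aliases=("1",), md5="abc")
ensembl = Seq("1", 101)
print(same_sequence(ucsc, ensembl))       # alias match, one MD5 missing
print(same_dictionary([ucsc], [ensembl]))
```

Note that the name check in this sketch consults only the left-hand side's aliases, so the result can depend on which record is treated as `this`.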

to_sam ¶

to_sam() -> list[dict[str, Any]]

Converts the sequence dictionary to a list of dictionaries, one per sequence.

Source code in fgpyo/fasta/sequence_dictionary.py

def to_sam(self) -> list[dict[str, Any]]:
    """Converts the sequence dictionary to a list of dictionaries, one per sequence."""
    return [meta.to_sam() for meta in self.infos]

to_sam_header ¶

to_sam_header(extra_header: dict[str, Any] | None = None) -> AlignmentHeader

Converts the sequence dictionary to a pysam.AlignmentHeader.

Parameters:

Name	Type	Description	Default
`extra_header`	`dict[str, Any] \| None`	a dictionary of extra values to add to the header, None otherwise. See `pysam.AlignmentHeader` for more details.	`None`

Source code in fgpyo/fasta/sequence_dictionary.py

def to_sam_header(
    self,
    extra_header: dict[str, Any] | None = None,
) -> pysam.AlignmentHeader:
    """
    Converts the sequence dictionary to a `pysam.AlignmentHeader`.

    Args:
        extra_header: a dictionary of extra values to add to the header, None otherwise.  See
                      `pysam.AlignmentHeader` for more details.
    """
    header_dict: dict[str, Any] = {
        "HD": {"VN": "1.5"},
        "SQ": self.to_sam(),
    }
    if extra_header is not None:
        header_dict = {**header_dict, **extra_header}
    return pysam.AlignmentHeader.from_dict(header_dict=header_dict)
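The dictionary merge in the listing above determines what happens when `extra_header` repeats a key. The sketch below reproduces only the merge step on plain dictionaries, without calling pysam; `build_header_dict` is a hypothetical helper name.

```python
from typing import Any, Optional


def build_header_dict(
    sq: list[dict[str, Any]], extra_header: Optional[dict[str, Any]] = None
) -> dict[str, Any]:
    """Assemble the dictionary that the source passes to `pysam.AlignmentHeader.from_dict`."""
    header_dict: dict[str, Any] = {"HD": {"VN": "1.5"}, "SQ": sq}
    if extra_header is not None:
        # Later keys win, so an "HD" or "SQ" entry in extra_header replaces the default.
        header_dict = {**header_dict, **extra_header}
    return header_dict


sq = [{"SN": "chr1", "LN": 101}]
print(build_header_dict(sq, {"RG": [{"ID": "A", "LB": "a-library"}]}))
print(build_header_dict(sq, {"HD": {"VN": "1.6", "SO": "coordinate"}})["HD"])
```

Under this merge, new record types such as `RG` are appended after `HD` and `SQ`, while a caller-supplied `HD` replaces the default version line entirely rather than being combined with it.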

SequenceMetadata dataclass ¶

Bases: MutableMapping[Keys | str, str]

Stores information about a single Sequence (ex. chromosome, contig).

Implements the mutable mapping interface, which provides access to the attributes of this sequence, including name and length, but not index. When using the mapping interface (for example getting, setting, and deleting, as well as iterating over keys, values, and items), the values will always be strings (str type). For example, the length will be a str when accessed via get; access the length directly or use len to return an int. Similarly, use the aliases property to return a list[str] of aliases, the alternate property to return an AlternateLocus-typed instance, and the topology property to return a Topology-typed instance.

All attributes except name and length may be set. To change the name or length, use dataclasses.replace to create a new copy.

Important: The len method returns the length of the sequence, not the length of the attributes. Use len(meta.attributes) for the latter.

Attributes:

Name	Type	Description
`name`	`str`	the primary name of the sequence
`length`	`int`	the length of the sequence, or zero if unknown
`index`	`int`	the index in the sequence dictionary
`attributes`	`dict[Keys \| str, str]`	attributes of this sequence
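The difference between the mapping view (string values) and the sequence length returned by `len` can be seen in a reduced stand-in. `Meta` is a hypothetical class for this sketch, not the fgpyo implementation; it uses the literal tags `"SN"` and `"LN"` in place of the `Keys` enumeration.

```python
import itertools
from collections.abc import Iterator, MutableMapping


class Meta(MutableMapping):
    """Hypothetical stand-in for SequenceMetadata's mapping behaviour."""

    def __init__(self, name: str, length: int) -> None:
        self.name = name
        self.length = length
        self.attributes: dict[str, str] = {}

    def __getitem__(self, key: str) -> str:
        if key == "SN":
            return self.name
        if key == "LN":
            return f"{self.length}"  # the mapping view always yields strings
        return self.attributes[key]

    def __setitem__(self, key: str, value: str) -> None:
        if key in ("SN", "LN"):
            raise KeyError(f"Cannot set '{key}'")
        self.attributes[key] = value

    def __delitem__(self, key: str) -> None:
        if key in ("SN", "LN"):
            raise KeyError(f"Cannot delete '{key}'")
        del self.attributes[key]

    def __iter__(self) -> Iterator[str]:
        return itertools.chain(("SN", "LN"), self.attributes)

    def __len__(self) -> int:
        return self.length  # sequence length, not the number of keys


meta = Meta("chr1", 101)
meta["AS"] = "hg38"
print(meta["LN"], type(meta["LN"]).__name__)  # length as a string via the mapping
print(len(meta), len(meta.attributes))        # sequence length vs. attribute count
print(list(meta))                             # keys start with name and length
```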

Source code in fgpyo/fasta/sequence_dictionary.py

@dataclass(frozen=True, init=True)
class SequenceMetadata(MutableMapping[Keys | str, str]):
    """
    Stores information about a single Sequence (ex. chromosome, contig).

    Implements the mutable mapping interface, which provides access to the attributes of this
    sequence, including name, length, but not index.  When using the mapping interface, for example
    getting, setting, deleting, as well as iterating over keys, values, and items, the _values_ will
    always be strings (`str` type).  For example, the length will be a `str` when accessing via
    `get`; access the length directly or use `len` to return an `int`.  Similarly, use the
    `aliases` property to return a `list[str]` of aliases, use the `alternate` property to return
    an `AlternateLocus`-typed instance, and `topology` property to return a `Topology`-typed
    instance.

    All attributes except name and length may be set.  To change the name or length, use
    `dataclasses.replace` to create a new copy.

    Important: The `len` method returns the length of the sequence, not the length of the
    attributes.  Use `len(meta.attributes)` for the latter.

    Attributes:
      name: the primary name of the sequence
      length: the length of the sequence, or zero if unknown
      index: the index in the sequence dictionary
      attributes: attributes of this sequence
    """

    name: str
    length: int
    index: int
    attributes: dict[Keys | str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        """Any post initialization validation should go here."""
        if self.length < 0:
            raise ValueError(f"Length must be >= 0 for '{self.name}'")
        if re.search(SEQUENCE_NAME_PATTERN, self.name) is None:
            raise ValueError(f"Illegal name: '{self.name}'")
        if Keys.SEQUENCE_NAME in self.attributes:
            raise ValueError(f"'{Keys.SEQUENCE_NAME}' should not be given in the list of attributes")
        if Keys.SEQUENCE_LENGTH in self.attributes:
            raise ValueError(f"'{Keys.SEQUENCE_LENGTH}' should not be given in the list of attributes")

    @property
    def aliases(self) -> list[str]:
        """The aliases (not including the primary name)."""
        aliases = self.attributes.get(Keys.ALIASES)
        return [] if aliases is None else aliases.split(",")

    @property
    def all_names(self) -> list[str]:
        """A list of all names, including the primary name and aliases, in that order."""
        return [self.name] + self.aliases

    @property
    def alternate(self) -> AlternateLocus | None:
        """Gets the alternate locus for this sequence."""
        if Keys.ALTERNATE_LOCUS not in self.attributes:
            return None
        value = self.attributes[Keys.ALTERNATE_LOCUS]
        if value == "*":
            return None
        locus = AlternateLocus.parse(value)
        if locus.name == "=":
            locus = replace(locus, name=self.name)
        return locus

    @property
    def is_alternate(self) -> bool:
        """True if there is an alternate locus defined, False otherwise."""
        return self.alternate is not None

    @property
    def md5(self) -> str | None:
        """Returns the MD5 checksum of the sequence, or None."""
        return self.get(Keys.MD5)

    @property
    def assembly(self) -> str | None:
        """Returns the assembly name, or None."""
        return self.get(Keys.ASSEMBLY)

    @property
    def uri(self) -> str | None:
        """Returns the URI of the sequence, or None."""
        return self.get(Keys.URI)

    @property
    def species(self) -> str | None:
        """Returns the species name, or None."""
        return self.get(Keys.SPECIES)

    @property
    def description(self) -> str | None:
        """Returns the description, or None."""
        return self.get(Keys.DESCRIPTION)

    @property
    def topology(self) -> Topology | None:
        """Returns the topology (linear or circular), or None."""
        value = self.get(Keys.TOPOLOGY)
        return None if value is None else Topology[value]

    def same_as(self, other: "SequenceMetadata") -> bool:
        """
        Returns True if the two sequences are the same.

        Sequences are considered the same if they share a common reference name (including aliases),
        have the same length, and have the same MD5 (if both have MD5s).
        """
        if self.length != other.length:
            return False
        elif self.name != other.name and other.name not in self.all_names:
            return False
        self_m5 = self.md5
        other_m5 = other.md5
        if self_m5 is None or other_m5 is None:
            return True
        else:
            return self_m5 == other_m5

    def to_sam(self) -> dict[str, Any]:
        """
        Converts the sequence metadata to a SAM-formatted dictionary.

        Equivalent to one item in the list of sequences from
        `pysam.AlignmentHeader#to_dict()["SQ"]`.
        """
        meta_dict: dict[str, Any] = {
            f"{Keys.SEQUENCE_NAME}": self.name,
            f"{Keys.SEQUENCE_LENGTH}": self.length,
        }
        if len(self.attributes) > 0:
            meta_dict = {**meta_dict, **self.attributes}

        return meta_dict

    @staticmethod
    def from_sam(meta: dict[Keys | str, Any], index: int) -> "SequenceMetadata":
        """
        Builds a `SequenceMetadata` from a dictionary.

        The keys must include the sequence name (`Keys.SEQUENCE_NAME`) and length
        (`Keys.SEQUENCE_LENGTH`). All other keys from `Keys` will be stored in the resulting
        attributes.

        Args:
            meta: the python dictionary with keys from `Keys`.  This is typically the dictionary
                  stored in the `"SQ"` level of the two-level dictionary returned by the
                  `pysam.AlignmentHeader#to_dict()` method.
            index: the 0-based index to use for this sequence
        """
        name = meta[Keys.SEQUENCE_NAME]
        length = meta[Keys.SEQUENCE_LENGTH]
        attributes = copy.deepcopy(meta)
        del attributes[Keys.SEQUENCE_NAME]
        del attributes[Keys.SEQUENCE_LENGTH]
        return SequenceMetadata(name=name, length=length, index=index, attributes=attributes)

    def __getitem__(self, key: Keys | str) -> Any:
        """Returns the value for the given key."""
        if key == Keys.SEQUENCE_NAME.value:
            return self.name
        elif key == Keys.SEQUENCE_LENGTH.value:
            return f"{self.length}"
        return self.attributes[key]

    def __setitem__(self, key: Keys | str, value: str) -> None:
        """Sets the value for the given attribute key."""
        if key == Keys.SEQUENCE_NAME or key == Keys.SEQUENCE_LENGTH:
            raise KeyError(f"Cannot set '{key}' on SequenceMetadata with name '{self.name}'")
        self.attributes[key] = value

    def __delitem__(self, key: Keys | str) -> None:
        """Deletes the given attribute key."""
        if key == Keys.SEQUENCE_NAME or key == Keys.SEQUENCE_LENGTH:
            raise KeyError(f"Cannot delete '{key}' on SequenceMetadata with name '{self.name}'")
        del self.attributes[key]

    def __iter__(self) -> Iterator[Keys | str]:
        """Iterates over all keys, starting with name and length."""
        pre_iter = iter((Keys.SEQUENCE_NAME, Keys.SEQUENCE_LENGTH))
        return itertools.chain(pre_iter, iter(self.attributes))

    def __len__(self) -> int:
        """Returns the sequence length."""
        return self.length

    def __str__(self) -> str:
        """Returns the SAM-formatted @SQ line."""
        return "@SQ\t" + "\t".join(f"{key}:{value}" for key, value in self.to_sam().items())

    def __index__(self) -> int:
        """Returns the index of this sequence in the dictionary."""
        return self.index

Attributes¶

aliases property ¶

aliases: list[str]

The aliases (not including the primary name).

all_names property ¶

all_names: list[str]

A list of all names, including the primary name and aliases, in that order.

alternate property ¶

alternate: AlternateLocus | None

Gets the alternate locus for this sequence.
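How the property interprets the `AH` value can be sketched as a plain function. `resolve_alternate` is a hypothetical helper that follows the property's source listing above: a missing value or `*` yields `None`, and a locus name of `=` is replaced by the sequence's own name. It returns a tuple instead of an `AlternateLocus` so that it runs standalone.

```python
from typing import Optional


def resolve_alternate(
    sequence_name: str, ah_value: Optional[str]
) -> Optional[tuple[str, int, int]]:
    """Interpret an AH value as (name, start, end), or None when no locus is given."""
    if ah_value is None or ah_value == "*":
        return None
    name, rest = ah_value.split(":", maxsplit=1)
    start, end = rest.split("-", maxsplit=1)
    if name == "=":
        # The source swaps a locus name of "=" for this sequence's own name.
        name = sequence_name
    return name, int(start), int(end)


print(resolve_alternate("chr1_alt", "chr1:100-200"))
print(resolve_alternate("chr1_alt", "=:100-200"))
print(resolve_alternate("chr1_alt", "*"))
```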

assembly property ¶

assembly: str | None

Returns the assembly name, or None.

description property ¶

description: str | None

Returns the description, or None.

is_alternate property ¶

is_alternate: bool

True if there is an alternate locus defined, False otherwise.

md5 property ¶

md5: str | None

Returns the MD5 checksum of the sequence, or None.

species property ¶

species: str | None

Returns the species name, or None.

topology property ¶

topology: Topology | None

Returns the topology (linear or circular), or None.

uri property ¶

uri: str | None

Returns the URI of the sequence, or None.

Functions¶

__delitem__ ¶

__delitem__(key: Keys | str) -> None

Deletes the given attribute key.

Source code in fgpyo/fasta/sequence_dictionary.py

def __delitem__(self, key: Keys | str) -> None:
    """Deletes the given attribute key."""
    if key == Keys.SEQUENCE_NAME or key == Keys.SEQUENCE_LENGTH:
        raise KeyError(f"Cannot delete '{key}' on SequenceMetadata with name '{self.name}'")
    del self.attributes[key]

__getitem__ ¶

__getitem__(key: Keys | str) -> Any

Returns the value for the given key.

Source code in fgpyo/fasta/sequence_dictionary.py

def __getitem__(self, key: Keys | str) -> Any:
    """Returns the value for the given key."""
    if key == Keys.SEQUENCE_NAME.value:
        return self.name
    elif key == Keys.SEQUENCE_LENGTH.value:
        return f"{self.length}"
    return self.attributes[key]

__index__ ¶

__index__() -> int

Returns the index of this sequence in the dictionary.

Source code in fgpyo/fasta/sequence_dictionary.py

def __index__(self) -> int:
    """Returns the index of this sequence in the dictionary."""
    return self.index

__iter__ ¶

__iter__() -> Iterator[Keys | str]

Iterates over all keys, starting with name and length.

Source code in fgpyo/fasta/sequence_dictionary.py

def __iter__(self) -> Iterator[Keys | str]:
    """Iterates over all keys, starting with name and length."""
    pre_iter = iter((Keys.SEQUENCE_NAME, Keys.SEQUENCE_LENGTH))
    return itertools.chain(pre_iter, iter(self.attributes))

__len__ ¶

__len__() -> int

Returns the sequence length.

Source code in fgpyo/fasta/sequence_dictionary.py

def __len__(self) -> int:
    """Returns the sequence length."""
    return self.length

__post_init__ ¶

__post_init__() -> None

Any post initialization validation should go here.

Source code in fgpyo/fasta/sequence_dictionary.py

def __post_init__(self) -> None:
    """Any post initialization validation should go here."""
    if self.length < 0:
        raise ValueError(f"Length must be >= 0 for '{self.name}'")
    if re.search(SEQUENCE_NAME_PATTERN, self.name) is None:
        raise ValueError(f"Illegal name: '{self.name}'")
    if Keys.SEQUENCE_NAME in self.attributes:
        raise ValueError(f"'{Keys.SEQUENCE_NAME}' should not be given in the list of attributes")
    if Keys.SEQUENCE_LENGTH in self.attributes:
        raise ValueError(f"'{Keys.SEQUENCE_LENGTH}' should not be given in the list of attributes")

__setitem__ ¶

__setitem__(key: Keys | str, value: str) -> None

Sets the value for the given attribute key.

Source code in fgpyo/fasta/sequence_dictionary.py

def __setitem__(self, key: Keys | str, value: str) -> None:
    """Sets the value for the given attribute key."""
    if key == Keys.SEQUENCE_NAME or key == Keys.SEQUENCE_LENGTH:
        raise KeyError(f"Cannot set '{key}' on SequenceMetadata with name '{self.name}'")
    self.attributes[key] = value

__str__ ¶

__str__() -> str

Returns the SAM-formatted @SQ line.

Source code in fgpyo/fasta/sequence_dictionary.py

def __str__(self) -> str:
    """Returns the SAM-formatted @SQ line."""
    return "@SQ\t" + "\t".join(f"{key}:{value}" for key, value in self.to_sam().items())

from_sam staticmethod ¶

from_sam(meta: dict[Keys | str, Any], index: int) -> SequenceMetadata

Builds a SequenceMetadata from a dictionary.

The keys must include the sequence name (Keys.SEQUENCE_NAME) and length (Keys.SEQUENCE_LENGTH). All other keys from Keys will be stored in the resulting attributes.

Parameters:

Name	Type	Description	Default
`meta`	`dict[Keys \| str, Any]`	the python dictionary with keys from `Keys`. This is typically the dictionary stored in the `"SQ"` level of the two-level dictionary returned by the `pysam.AlignmentHeader#to_dict()` method.	required
`index`	`int`	the 0-based index to use for this sequence	required

Source code in fgpyo/fasta/sequence_dictionary.py

@staticmethod
def from_sam(meta: dict[Keys | str, Any], index: int) -> "SequenceMetadata":
    """
    Builds a `SequenceMetadata` from a dictionary.

    The keys must include the sequence name (`Keys.SEQUENCE_NAME`) and length
    (`Keys.SEQUENCE_LENGTH`). All other keys from `Keys` will be stored in the resulting
    attributes.

    Args:
        meta: the python dictionary with keys from `Keys`.  This is typically the dictionary
              stored in the `"SQ"` level of the two-level dictionary returned by the
              `pysam.AlignmentHeader#to_dict()` method.
        index: the 0-based index to use for this sequence
    """
    name = meta[Keys.SEQUENCE_NAME]
    length = meta[Keys.SEQUENCE_LENGTH]
    attributes = copy.deepcopy(meta)
    del attributes[Keys.SEQUENCE_NAME]
    del attributes[Keys.SEQUENCE_LENGTH]
    return SequenceMetadata(name=name, length=length, index=index, attributes=attributes)
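The split that `from_sam` performs can be sketched without fgpyo or pysam. This is a minimal illustration, assuming that `Keys.SEQUENCE_NAME` and `Keys.SEQUENCE_LENGTH` correspond to the SAM `@SQ` tags `SN` and `LN`; the helper name `split_sq_entry` is hypothetical and not part of the library.

```python
import copy
from typing import Any

# Assumed string values of Keys.SEQUENCE_NAME and Keys.SEQUENCE_LENGTH (SAM @SQ tags).
SEQUENCE_NAME = "SN"
SEQUENCE_LENGTH = "LN"


def split_sq_entry(meta: dict[str, Any]) -> tuple[str, int, dict[str, Any]]:
    """Split one @SQ dictionary into (name, length, remaining attributes)."""
    attributes = copy.deepcopy(meta)  # work on a copy so the caller's dictionary is untouched
    name = attributes.pop(SEQUENCE_NAME)
    length = attributes.pop(SEQUENCE_LENGTH)
    return name, length, attributes


sq_entry = {"SN": "chr1", "LN": 248956422, "AS": "GRCh38"}
name, length, attributes = split_sq_entry(sq_entry)
```

Because the input is deep-copied first, `sq_entry` still contains its `SN` and `LN` entries afterwards, while `attributes` holds only the remaining keys.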

same_as ¶

same_as(other: SequenceMetadata) -> bool

Returns True if the two sequences are the same.

Sequences are considered the same if they share a common reference name (including aliases), have the same length, and have the same MD5 (if both have MD5s).

Source code in fgpyo/fasta/sequence_dictionary.py

def same_as(self, other: "SequenceMetadata") -> bool:
    """
    Returns True if the two sequences are the same.

    Sequences are considered the same if they share a common reference name (including aliases),
    have the same length, and have the same MD5 (if both have MD5s).
    """
    if self.length != other.length:
        return False
    elif self.name != other.name and other.name not in self.all_names:
        return False
    self_m5 = self.md5
    other_m5 = other.md5
    if self_m5 is None or other_m5 is None:
        return True
    else:
        return self_m5 == other_m5
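The comparison rule can be tried out with a small stand-in class. This is a sketch only: `Seq` is a hypothetical type, not `SequenceMetadata`, and it assumes that `all_names` is the primary name followed by any aliases, which is not shown in the listing above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Seq:
    """Hypothetical stand-in holding only the fields that same_as consults."""

    name: str
    length: int
    aliases: tuple[str, ...] = ()
    md5: Optional[str] = None

    @property
    def all_names(self) -> list[str]:
        # Assumption: the primary name followed by any aliases.
        return [self.name, *self.aliases]

    def same_as(self, other: "Seq") -> bool:
        if self.length != other.length:
            return False
        # As in the listing, only other's primary name is checked against self's names.
        if other.name not in self.all_names:
            return False
        # MD5s are compared only when both sequences carry one.
        if self.md5 is None or other.md5 is None:
            return True
        return self.md5 == other.md5


chr1 = Seq(name="chr1", length=100, aliases=("1",), md5="abc")
alias_match = chr1.same_as(Seq(name="1", length=100))
md5_mismatch = chr1.same_as(Seq(name="chr1", length=100, md5="xyz"))
length_mismatch = chr1.same_as(Seq(name="chr1", length=99, md5="abc"))
```

In this sketch `alias_match` is `True` because the second sequence has no MD5 to compare, whereas `md5_mismatch` and `length_mismatch` are both `False`.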

to_sam ¶

to_sam() -> dict[str, Any]

Converts the sequence metadata to a SAM-formatted dictionary.

Equivalent to one item in the list of sequences from pysam.AlignmentHeader#to_dict()["SQ"].

Source code in fgpyo/fasta/sequence_dictionary.py

def to_sam(self) -> dict[str, Any]:
    """
    Converts the sequence metadata to a SAM-formatted dictionary.

    Equivalent to one item in the list of sequences from
    `pysam.AlignmentHeader#to_dict()["SQ"]`.
    """
    meta_dict: dict[str, Any] = {
        f"{Keys.SEQUENCE_NAME}": self.name,
        f"{Keys.SEQUENCE_LENGTH}": self.length,
    }
    if len(self.attributes) > 0:
        meta_dict = {**meta_dict, **self.attributes}

    return meta_dict
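As a rough illustration of how the dictionary from `to_sam` feeds the `@SQ` line built by `__str__`, the formatting step can be reproduced on a plain dictionary. The function name `format_sq_line` is hypothetical, and the `SN`, `LN`, and `M5` keys are the standard SAM tags assumed for the example.

```python
from typing import Any


def format_sq_line(meta: dict[str, Any]) -> str:
    """Join key:value pairs with tabs behind an @SQ prefix, in insertion order."""
    return "@SQ\t" + "\t".join(f"{key}:{value}" for key, value in meta.items())


sq_line = format_sq_line({"SN": "chr1", "LN": 100, "M5": "abc"})
```

Since name and length are placed first in the dictionary, they also come first in the formatted line.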

Topology ¶

Bases: StrEnum

Enumeration for the topology of reference sequences (SAM @SQ.TP).

Source code in fgpyo/fasta/sequence_dictionary.py

@unique
class Topology(StrEnum):
    """Enumeration for the topology of reference sequences (SAM @SQ.TP)."""

    LINEAR = "LINEAR"
    CIRCULAR = "CIRCULAR"

Modules¶

fastx ¶

Zipping FASTX Files.¶

Zipping a set of FASTA/FASTQ files into a single stream of data is a common task in bioinformatics and can be achieved with the FastxZipped() context manager. The context manager opens all input FASTA/FASTQ files and closes them after iteration is complete. On every iteration, FastxZipped() returns a tuple of the next FASTX records (each of type pysam.FastxRecord). An exception is raised if any of the input files is malformed or truncated, or if record names are not equivalent and in sync.

Importantly, this context manager is optimized for fast, streaming, read-only usage. By default, records saved from earlier iterations will not remain correct, because the underlying pointer in memory refers only to the most recent record and not to any past records. To preserve the state of all previously iterated records, set the parameter persist to True.

>>> from fgpyo.fastx import FastxZipped
>>> with FastxZipped("r1.fq", "r2.fq", persist=False) as zipped:
...     for (r1, r2) in zipped:
...         print(f"{r1.name}: {r1.sequence}, {r2.name}: {r2.sequence}")
seq1: AAAA, seq1: CCCC
seq2: GGGG, seq2: TTTT

Classes¶

FastxZipped ¶

Bases: AbstractContextManager, Iterator[tuple[FastxRecord, ...]]

A context manager that will lazily zip over any number of FASTA/FASTQ files.

Parameters:

Name	Type	Description	Default
`paths`	`Path \| str`	Paths to the FASTX files to zip over.	`()`
`persist`	`bool`	Whether to persist the state of previous records during iteration.	`False`

Source code in fgpyo/fastx/__init__.py

class FastxZipped(AbstractContextManager, Iterator[tuple[FastxRecord, ...]]):
    """
    A context manager that will lazily zip over any number of FASTA/FASTQ files.

    Args:
        paths: Paths to the FASTX files to zip over.
        persist: Whether to persist the state of previous records during iteration.

    """

    def __init__(self, *paths: Path | str, persist: bool = False) -> None:
        """Instantiate a `FastxZipped` context manager and iterator."""
        if len(paths) <= 0:
            raise ValueError(f"Must provide at least one FASTX to {self.__class__.__name__}")
        self._persist: bool = persist
        self._paths: tuple[Path | str, ...] = paths
        self._fastx = tuple(FastxFile(str(path), persist=self._persist) for path in self._paths)

    @staticmethod
    def _name_minus_ordinal(name: str) -> str:
        """Return the name of the FASTX record minus its ordinal suffix (e.g. "/1" or "/2")."""
        return name[: len(name) - 2] if len(name) >= 2 and name[-2] == "/" else name

    def __next__(self) -> tuple[FastxRecord, ...]:
        """Return the next set of FASTX records from the zipped FASTX files."""
        records = tuple(next(handle, None) for handle in self._fastx)

        if all(record is None for record in records):
            raise StopIteration
        elif not all_not_none(records):
            non_none_names: list[str | None] = [
                record.name for record in records if record is not None
            ]
            assert all_not_none(non_none_names)  # type narrowing
            # We know there is at least one non-None record because the previous conditional
            # covers the case where all records are None, so it is safe to index into the first
            # element of non_none_names.
            sequence_name: str = non_none_names[0]
            raise ValueError(
                "One or more of the FASTX files is truncated for sequence "
                + f"{self._name_minus_ordinal(sequence_name)}:\n\t"
                + "\n\t".join(
                    str(self._paths[i]) for i, record in enumerate(records) if record is None
                )
            )

        names_with_ordinals: list[str | None] = [record.name for record in records]
        assert all_not_none(names_with_ordinals)  # type narrowing
        record_names: list[str] = [self._name_minus_ordinal(name) for name in names_with_ordinals]
        if len(set(record_names)) != 1:
            raise ValueError(f"FASTX record names do not all match, found: {record_names}")

        return records

    def __exit__(
        self,
        exc_type: type[BaseException] | None,
        exc_val: BaseException | None,
        exc_tb: TracebackType | None,
    ) -> bool | None:
        """Exit the `FastxZipped` context manager by closing all FASTX files."""
        self.close()
        return None

    def close(self) -> None:
        """Close the `FastxZipped` context manager by closing all FASTX files."""
        for fastx in self._fastx:
            fastx.close()

Functions¶

__exit__ ¶

__exit__(exc_type: type[BaseException] | None, exc_val: BaseException | None, exc_tb: TracebackType | None) -> bool | None

Exit the FastxZipped context manager by closing all FASTX files.

Source code in fgpyo/fastx/__init__.py

def __exit__(
    self,
    exc_type: type[BaseException] | None,
    exc_val: BaseException | None,
    exc_tb: TracebackType | None,
) -> bool | None:
    """Exit the `FastxZipped` context manager by closing all FASTX files."""
    self.close()
    return None

__init__ ¶

__init__(*paths: Path | str, persist: bool = False) -> None

Instantiate a FastxZipped context manager and iterator.

Source code in fgpyo/fastx/__init__.py

def __init__(self, *paths: Path | str, persist: bool = False) -> None:
    """Instantiate a `FastxZipped` context manager and iterator."""
    if len(paths) <= 0:
        raise ValueError(f"Must provide at least one FASTX to {self.__class__.__name__}")
    self._persist: bool = persist
    self._paths: tuple[Path | str, ...] = paths
    self._fastx = tuple(FastxFile(str(path), persist=self._persist) for path in self._paths)

__next__ ¶

__next__() -> tuple[FastxRecord, ...]

Return the next set of FASTX records from the zipped FASTX files.

Source code in fgpyo/fastx/__init__.py

def __next__(self) -> tuple[FastxRecord, ...]:
    """Return the next set of FASTX records from the zipped FASTX files."""
    records = tuple(next(handle, None) for handle in self._fastx)

    if all(record is None for record in records):
        raise StopIteration
    elif not all_not_none(records):
        non_none_names: list[str | None] = [
            record.name for record in records if record is not None
        ]
        assert all_not_none(non_none_names)  # type narrowing
        # We know there is at least one non-None record because the previous conditional
        # covers the case where all records are None, so it is safe to index into the first
        # element of non_none_names.
        sequence_name: str = non_none_names[0]
        raise ValueError(
            "One or more of the FASTX files is truncated for sequence "
            + f"{self._name_minus_ordinal(sequence_name)}:\n\t"
            + "\n\t".join(
                str(self._paths[i]) for i, record in enumerate(records) if record is None
            )
        )

    names_with_ordinals: list[str | None] = [record.name for record in records]
    assert all_not_none(names_with_ordinals)  # type narrowing
    record_names: list[str] = [self._name_minus_ordinal(name) for name in names_with_ordinals]
    if len(set(record_names)) != 1:
        raise ValueError(f"FASTX record names do not all match, found: {record_names}")

    return records

close ¶

close() -> None

Close the FastxZipped context manager by closing all FASTX files.

Source code in fgpyo/fastx/__init__.py

def close(self) -> None:
    """Close the `FastxZipped` context manager by closing all FASTX files."""
    for fastx in self._fastx:
        fastx.close()

Functions¶

io ¶

Module for reading and writing files.¶

The functions in this module make it easy to:

check if a file exists and is writable
check if a file and its parent directories exist and are writable
check if a file exists and is readable
check if a path exists and is a directory
open an appropriate reader or writer based on the file extension
write items to a file, one per line
read lines from a file

fgpyo.io Examples:¶

>>> import fgpyo.io as fio
>>> from fgpyo.io import write_lines, read_lines
>>> from pathlib import Path

Assert that a path exists and is readable:

>>> tmp_dir = Path(getfixture("tmp_path"))
>>> path_flat: Path = tmp_dir / "example.txt"
>>> fio.assert_path_is_readable(path_flat)  
Traceback (most recent call last):
    ...
AssertionError: Cannot read non-existent path: ...

Write to and read from path:

>>> path_flat = tmp_dir / "example.txt"
>>> path_compressed = tmp_dir / "example.txt.gz"
>>> write_lines(path=path_flat, lines_to_write=["flat file", 10])
>>> write_lines(path=path_compressed, lines_to_write=["gzip file", 10])

Read lines from a path into a generator:

>>> lines = read_lines(path=path_flat)
>>> next(lines)
'flat file'
>>> next(lines)
'10'
>>> lines = read_lines(path=path_compressed)
>>> next(lines)
'gzip file'
>>> next(lines)
'10'

Functions¶

assert_directory_exists ¶

assert_directory_exists(path: Path) -> None

Asserts that a path exists and is a directory.

Parameters:

Name	Type	Description	Default
`path`	`Path`	Path to check	required

Example

assert_directory_exists(path = Path("/example/directory/"))

Source code in fgpyo/io/__init__.py

def assert_directory_exists(path: Path) -> None:
    """
    Asserts that a path exists and is a directory.

    Args:
        path: Path to check

    Example:
        assert_directory_exists(path = Path("/example/directory/"))
    """
    assert path.exists(), f"Path does not exist: {path}"
    assert path.is_dir(), f"Path exists but is not a directory: {path}"

assert_fasta_indexed ¶

assert_fasta_indexed(fasta: Path, /, dictionary: bool = False, bwa: bool = False) -> None

Verify that a FASTA is readable and has the expected index files.

The existence of the FASTA index generated by samtools faidx will always be verified. The existence of the index files generated by samtools dict and bwa index may be optionally verified.

Parameters:

Name	Type	Description	Default
`fasta`	`Path`	Path to the FASTA file.	required
`dictionary`	`bool`	If True, check for the index file generated by `samtools dict` (`{fasta}.dict`).	`False`
`bwa`	`bool`	If True, check for the index files generated by `bwa index` (`{fasta}.{suffix}`, for all suffixes in ["amb", "ann", "bwt", "pac", "sa"]).	`False`

Raises:

Type	Description
`AssertionError`	If the FASTA or any of the expected index files are missing or not readable.

Source code in fgpyo/io/__init__.py

def assert_fasta_indexed(
    fasta: Path,
    /,
    dictionary: bool = False,
    bwa: bool = False,
) -> None:
    """
    Verify that a FASTA is readable and has the expected index files.

    The existence of the FASTA index generated by `samtools faidx` will always be verified. The
    existence of the index files generated by `samtools dict` and `bwa index` may be optionally
    verified.

    Args:
        fasta: Path to the FASTA file.
        dictionary: If True, check for the index file generated by `samtools dict` (`{fasta}.dict`).
        bwa: If True, check for the index files generated by `bwa index` (`{fasta}.{suffix}`, for
            all suffixes in ["amb", "ann", "bwt", "pac", "sa"]).

    Raises:
        AssertionError: If the FASTA or any of the expected index files are missing or not readable.
    """
    fai_index = Path(f"{fasta}.fai")
    assert_path_is_readable(fai_index)

    if dictionary:
        dict_index = Path(f"{fasta}.dict")
        assert_path_is_readable(dict_index)

    if bwa:
        suffixes = ["amb", "ann", "bwt", "pac", "sa"]
        for suffix in suffixes:
            bwa_index = Path(f"{fasta}.{suffix}")
            assert_path_is_readable(bwa_index)
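To see which files the check looks for, the path construction can be separated from the readability assertions. This is a sketch with a hypothetical helper, `expected_index_paths`, that only builds the paths and does not touch the filesystem.

```python
from pathlib import Path


def expected_index_paths(fasta: Path, dictionary: bool = False, bwa: bool = False) -> list[Path]:
    """List the index files checked for a FASTA, following the listing above."""
    paths = [Path(f"{fasta}.fai")]  # the samtools faidx index is always expected
    if dictionary:
        paths.append(Path(f"{fasta}.dict"))
    if bwa:
        paths.extend(Path(f"{fasta}.{suffix}") for suffix in ["amb", "ann", "bwt", "pac", "sa"])
    return paths


index_paths = expected_index_paths(Path("ref.fa"), dictionary=True, bwa=True)
```

Each suffix is appended to the full FASTA file name, so the dictionary is expected at `ref.fa.dict` rather than `ref.dict`.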

assert_path_is_readable ¶

assert_path_is_readable(path: Path) -> None

Asserts that a path exists, is a file, and is readable; raises AssertionError otherwise.

Parameters:

Name	Type	Description	Default
`path`	`Path`	a Path to check	required

Example

assert_path_is_readable(path = Path("some_file.csv"))

Source code in fgpyo/io/__init__.py

def assert_path_is_readable(path: Path) -> None:
    """
    Asserts that a path exists, is a file, and is readable; raises AssertionError otherwise.

    Args:
        path: a Path to check

    Example:
        assert_path_is_readable(path = Path("some_file.csv"))
    """
    # stdin is readable
    if path == Path("/dev/stdin"):
        return

    assert path.exists(), f"Cannot read non-existent path: {path}"
    assert path.is_file(), f"Cannot read path because it is not a file: {path}"
    assert os.access(path, os.R_OK), f"Path exists but is not readable: {path}"

assert_path_is_writable ¶

assert_path_is_writable(path: Path, parent_must_exist: bool = True) -> None

Assert that a filepath is writable.

Specifically:

- If the file exists then it must also be writable.
- Else if the path is not a file and parent_must_exist is true, then assert that the parent directory exists and is writable.
- Else if the path is not a directory and parent_must_exist is false, then look at each parent directory until one is found that exists and is writable.

Parameters:

Name	Type	Description	Default
`path`	`Path`	Path to check	required
`parent_must_exist`	`bool`	If True, the file's parent directory must exist. Otherwise, at least one directory in the path's components must exist.	`True`

Raises:

Type	Description
`AssertionError`	If any of the above conditions are not met.

Example

assert_path_is_writable(path = Path("example.txt"))

Source code in fgpyo/io/__init__.py

def assert_path_is_writable(path: Path, parent_must_exist: bool = True) -> None:
    """
    Assert that a filepath is writable.

    Specifically:
    - If the file exists then it must also be writable.
    - Else if the path is not a file and `parent_must_exist` is true, then assert that the parent
      directory exists and is writable.
    - Else if the path is not a directory and `parent_must_exist` is false, then look at each parent
      directory until one is found that exists and is writable.

    Args:
        path: Path to check
        parent_must_exist: If True, the file's parent directory must exist. Otherwise, at least one
            directory in the path's components must exist.

    Raises:
        AssertionError: If any of the above conditions are not met.

    Example:
        assert_path_is_writable(path = Path("example.txt"))
    """
    # stdout is writable
    if path == Path("/dev/stdout"):
        return

    # If path exists, it must be a writable file
    if path.exists():
        assert path.is_file(), f"Cannot write to path because it is not a file: {path}"
        assert os.access(path, os.W_OK), f"File exists but is not writable: {path}"

    # Else if file doesn't exist and parent_must_exist is True then check
    # that path.absolute().parent exists, is a directory and is writable
    elif parent_must_exist:
        parent = path.absolute().parent
        assert parent.exists(), f"Parent directory does not exist: {parent}"
        assert parent.is_dir(), f"Parent directory exists but is not a directory: {parent}"
        assert os.access(parent, os.W_OK), f"Parent directory exists but is not writable: {parent}"

    # Else if file doesn't exist and parent_must_exist is False, test parent until
    # you find the first extant path, and check that it is a directory and is writable.
    else:
        for parent in path.absolute().parents:
            if parent.exists():
                assert os.access(parent, os.W_OK), f"Parent directory is not writable: {parent}"
                break
        else:
            raise AssertionError(f"No parent directories exist for: {path}")
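The `parent_must_exist=False` branch walks up the path until it reaches a directory that already exists. A minimal sketch of that walk, using only the standard library and a hypothetical helper named `first_existing_parent`:

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def first_existing_parent(path: Path) -> Optional[Path]:
    """Return the nearest ancestor of path that exists, or None if there is none."""
    for parent in path.absolute().parents:  # nearest parent first, filesystem root last
        if parent.exists():
            return parent
    return None


with tempfile.TemporaryDirectory() as tmp:
    # None of the intermediate directories below exist yet.
    target = Path(tmp) / "not" / "yet" / "created" / "out.txt"
    ancestor = first_existing_parent(target)
    ancestor_is_tmp = ancestor is not None and ancestor.samefile(tmp)
    ancestor_is_writable = ancestor is not None and os.access(ancestor, os.W_OK)
```

In this sketch the walk stops at the temporary directory itself, which is the ancestor whose write permission would then be checked.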

assert_path_is_writeable ¶

assert_path_is_writeable(path: Path, parent_must_exist: bool = True) -> None

A deprecated alias for assert_path_is_writable().

Source code in fgpyo/io/__init__.py

def assert_path_is_writeable(path: Path, parent_must_exist: bool = True) -> None:
    """A deprecated alias for `assert_path_is_writable()`."""
    warnings.warn(
        "assert_path_is_writeable is deprecated, use assert_path_is_writable instead",
        DeprecationWarning,
        stacklevel=2,
    )

    assert_path_is_writable(path=path, parent_must_exist=parent_must_exist)

read_lines ¶

read_lines(path: Path, strip: bool = False, threads: int | None = None) -> Iterator[str]

Reads each line from a path into a generator, removing line terminators.

By default, only line terminators (CR/LF) are stripped. The strip parameter may be used to strip both leading and trailing whitespace from each line.

Parameters:

Name	Type	Description	Default
`path`	`Path`	Path to read from	required
`strip`	`bool`	True to strip lines of all leading and trailing whitespace, False to only remove trailing CR/LF characters.	`False`
`threads`	`int \| None`	the number of threads to use when decompressing gzip files	`None`

Example

import fgpyo.io as fio
read_back = fio.read_lines(path)

Source code in fgpyo/io/__init__.py

def read_lines(path: Path, strip: bool = False, threads: int | None = None) -> Iterator[str]:
    """
    Reads each line from a path into a generator, removing line terminators.

    By default, only line terminators (CR/LF) are stripped.  The `strip`
    parameter may be used to strip both leading and trailing whitespace from each line.

    Args:
        path: Path to read from
        strip: True to strip lines of all leading and trailing whitespace,
            False to only remove trailing CR/LF characters.
        threads: the number of threads to use when decompressing gzip files

    Example:
        >>> import fgpyo.io as fio
        >>> read_back = fio.read_lines(path)

    """
    with to_reader(path=path, threads=threads) as reader:
        if strip:
            for line in reader:
                yield line.strip()
        else:
            for line in reader:
                yield line.rstrip("\r\n")
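The difference between the two modes is easiest to see on lines that carry leading or trailing whitespace. The sketch below mirrors the loop over an in-memory reader; `iter_lines` is a hypothetical stand-in that takes an open text stream rather than a path.

```python
import io
from typing import Iterator, TextIO


def iter_lines(reader: TextIO, strip: bool = False) -> Iterator[str]:
    """Yield lines without terminators, optionally stripping all surrounding whitespace."""
    for line in reader:
        yield line.strip() if strip else line.rstrip("\r\n")


text = "  a b \n\tc\r\n"
default_lines = list(iter_lines(io.StringIO(text)))
stripped_lines = list(iter_lines(io.StringIO(text), strip=True))
```

With the default, `default_lines` keeps the surrounding spaces and the leading tab (`["  a b ", "\tc"]`), while `stripped_lines` is `["a b", "c"]`.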

redirect_to_dev_null ¶

redirect_to_dev_null(file_num: int) -> Generator[None, None, None]

A context manager that redirects output of file handle to /dev/null.

Parameters:

Name	Type	Description	Default
`file_num`	`int`	number of filehandle to redirect.	required

Source code in fgpyo/io/__init__.py

@contextmanager
def redirect_to_dev_null(file_num: int) -> Generator[None, None, None]:
    """
    A context manager that redirects output of file handle to /dev/null.

    Args:
        file_num: number of filehandle to redirect.
    """
    f_devnull = save_fd = None
    try:
        # open /dev/null for writing
        f_devnull = os.open(os.devnull, os.O_RDWR)
        # save old file descriptor and redirect it to /dev/null
        save_fd = os.dup(file_num)
        os.dup2(f_devnull, file_num)
        yield
    finally:
        # restore file descriptor and close devnull
        if save_fd is not None:
            os.dup2(save_fd, file_num)
            os.close(save_fd)
        if f_devnull is not None:
            os.close(f_devnull)
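The effect of the descriptor swap can be observed on an ordinary file instead of stderr. The sketch below uses a local stand-in, `silence_fd`, that follows the same `os.dup`/`os.dup2` steps so that it runs without fgpyo; the name and the temporary-file setup are illustrative assumptions.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def silence_fd(file_num: int) -> Iterator[None]:
    """Local stand-in: point file_num at the null device, then restore it."""
    devnull = os.open(os.devnull, os.O_RDWR)
    saved = os.dup(file_num)  # keep a handle on the original target
    try:
        os.dup2(devnull, file_num)
        yield
    finally:
        os.dup2(saved, file_num)  # restore the original target
        os.close(saved)
        os.close(devnull)


with tempfile.TemporaryFile() as handle:
    fd = handle.fileno()
    os.write(fd, b"before\n")
    with silence_fd(fd):
        os.write(fd, b"hidden\n")  # discarded by the null device
    os.write(fd, b"after\n")
    handle.seek(0)
    captured = handle.read()
```

Only the writes made outside the `with silence_fd(fd)` block reach the file, so `captured` holds the "before" and "after" lines.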

suppress_stderr ¶

suppress_stderr() -> Generator[None, None, None]

A context manager that redirects output of stderr to /dev/null.

Source code in fgpyo/io/__init__.py

@contextmanager
def suppress_stderr() -> Generator[None, None, None]:
    """A context manager that redirects output of stderr to /dev/null."""
    with redirect_to_dev_null(file_num=sys.stderr.fileno()):
        yield

to_reader ¶

to_reader(path: Path, threads: int | None = None) -> TextIOWrapper

Opens a Path for reading and based on extension uses open() or gzip_ng.open().

Parameters:

Name	Type	Description	Default
`path`	`Path`	Path to read from	required
`threads`	`int \| None`	the number of threads to use when decompressing gzip files	`None`

Example

import fgpyo.io as fio
reader = fio.to_reader(path=Path("reader.txt"))
reader.readlines()
reader.close()

Source code in fgpyo/io/__init__.py

def to_reader(path: Path, threads: int | None = None) -> TextIOWrapper:
    """
    Opens a Path for reading and based on extension uses open() or gzip_ng.open().

    Args:
        path: Path to read from
        threads: the number of threads to use when decompressing gzip files

    Example:
        >>> import fgpyo.io as fio
        >>> reader = fio.to_reader(path=Path("reader.txt"))
        >>> reader.readlines()
        >>> reader.close()

    """
    if path.suffix in COMPRESSED_FILE_EXTENSIONS:
        if threads is None:
            reader = gzip_ng.open(path, mode="rb")  # type: ignore[no-untyped-call]
        else:
            reader = gzip_ng_threaded.open(path, mode="rb", threads=threads)  # type: ignore[no-untyped-call]
        return TextIOWrapper(cast(IO[bytes], reader), encoding="utf-8")
    else:
        return path.open(mode="r")

to_writer ¶

to_writer(path: Path, append: bool = False, threads: int | None = None) -> TextIOWrapper

Opens a Path for writing (or appending) and based on extension uses open() or gzip_ng.open().

Parameters:

Name	Type	Description	Default
`path`	`Path`	Path to write (or append) to	required
`append`	`bool`	open the file for appending	`False`
`threads`	`int \| None`	the number of threads to use when compressing gzip files	`None`

Example

import fgpyo.io as fio
writer = fio.to_writer(path=Path("writer.txt"))
writer.write("something\n")
writer.close()

Source code in fgpyo/io/__init__.py

def to_writer(path: Path, append: bool = False, threads: int | None = None) -> TextIOWrapper:
    r"""
    Opens a Path for writing (or appending) and based on extension uses open() or gzip_ng.open().

    Args:
        path: Path to write (or append) to
        append: open the file for appending
        threads: the number of threads to use when compressing gzip files

    Example:
        >>> import fgpyo.io as fio
        >>> writer = fio.to_writer(path=Path("writer.txt"))
        >>> writer.write("something\\n")
        >>> writer.close()

    """
    mode_prefix: str = "a" if append else "w"

    if path.suffix in COMPRESSED_FILE_EXTENSIONS:
        if threads is None:
            reader = gzip_ng.open(path, mode=mode_prefix + "b")  # type: ignore[no-untyped-call]
        else:
            reader = gzip_ng_threaded.open(path, mode=mode_prefix + "b", threads=threads)  # type: ignore[no-untyped-call]
        return TextIOWrapper(
            cast(IO[bytes], reader),
            encoding="utf-8",
        )
    else:
        # NB: the `cast` here is necessary because `path.open()` may return
        # other types, depending on the specified `mode`.
        # Within the scope of this function, `mode_prefix` is guaranteed to be
        # either "w" or "a", both of which result in a `TextIOWrapper`, but
        # mypy can't follow that logic.
        return cast(TextIOWrapper, path.open(mode=mode_prefix))
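The extension-based dispatch shared by `to_reader` and `to_writer` can be approximated with the standard library. Note the substitutions: this sketch uses the stdlib `gzip` module in place of `gzip_ng`, ignores the `threads` option, and assumes `.gz` as the only compressed extension, since the contents of `COMPRESSED_FILE_EXTENSIONS` are not shown here. The helper name `open_text` is hypothetical.

```python
import gzip
import tempfile
from pathlib import Path
from typing import TextIO

# Assumption: ".gz" stands in for fgpyo's COMPRESSED_FILE_EXTENSIONS.
GZIP_EXTENSIONS = {".gz"}


def open_text(path: Path, mode: str = "r") -> TextIO:
    """Open path as UTF-8 text, transparently handling gzip by file extension."""
    if path.suffix in GZIP_EXTENSIONS:
        return gzip.open(path, mode=mode + "t", encoding="utf-8")
    return path.open(mode=mode, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    round_trips = {}
    for filename in ("example.txt", "example.txt.gz"):
        path = Path(tmp) / filename
        with open_text(path, mode="w") as writer:
            writer.write("flat or gzip\n")
        with open_text(path) as reader:
            round_trips[filename] = reader.read()
    # The first two bytes of a gzip stream are the magic number 0x1f 0x8b.
    gz_magic = (Path(tmp) / "example.txt.gz").read_bytes()[:2]
```

Both files read back the same text, and only the `.gz` file starts with the gzip magic bytes, which shows that the choice of opener depends on the suffix alone.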

write_lines ¶

write_lines(path: Path, lines_to_write: Iterable[Any], append: bool = False, threads: int | None = None) -> None

Writes (or appends) a file with one line per item in provided iterable.

Parameters:

Name	Type	Description	Default
`path`	`Path`	Path to write (or append) to	required
`lines_to_write`	`Iterable[Any]`	items to write (or append) to file	required
`append`	`bool`	open the file for appending	`False`
`threads`	`int \| None`	the number of threads to use when compressing gzip files	`None`

Example

lines: List[Any] = ["things to write", 100]
path_to_write_to: Path = Path("file_to_write_to.txt")
fio.write_lines(path = path_to_write_to, lines_to_write = lines)

Source code in fgpyo/io/__init__.py

def write_lines(
    path: Path, lines_to_write: Iterable[Any], append: bool = False, threads: int | None = None
) -> None:
    """
    Writes (or appends) a file with one line per item in provided iterable.

    Args:
        path: Path to write (or append) to
        lines_to_write: items to write (or append) to file
        append: open the file for appending
        threads: the number of threads to use when compressing gzip files

    Example:
        lines: List[Any] = ["things to write", 100]
        path_to_write_to: Path = Path("file_to_write_to.txt")
        fio.write_lines(path = path_to_write_to, lines_to_write = lines)
    """
    with to_writer(path=path, append=append, threads=threads) as writer:
        for line in lines_to_write:
            writer.write(str(line))
            writer.write("\n")

platform ¶

Modules¶

illumina ¶

Methods for working with Illumina-specific UMIs in SAM files.

The functions in this module make it easy to:

check whether a UMI is valid
extract UMI(s) from an Illumina-style read name
copy a UMI from an alignment's read name to its RX SAM tag

Attributes¶

SAM_UMI_DELIMITER module-attribute ¶

SAM_UMI_DELIMITER: str = '-'

Delimiter between multiple UMIs, which the SAM specification recommends be a hyphen; see the specification: https://samtools.github.io/hts-specs/SAMtags.pdf

Functions¶

copy_umi_from_read_name ¶

copy_umi_from_read_name(rec: AlignedSegment, strict: bool = False, remove_umi: bool = False, read_name_delimiter: str = _ILLUMINA_READ_NAME_DELIMITER, umi_delimiter: str = _ILLUMINA_UMI_DELIMITER) -> bool

Copy a UMI from an alignment's read name to its RX SAM tag.

The UMI will not be copied to RX tag if it is invalid.

strict, read_name_delimiter, and umi_delimiter are forwarded to extract_umis_from_read_name — see that function for their semantics.

Parameters:

Name	Type	Description	Default
`rec`	`AlignedSegment`	The alignment record to update.	required
`strict`	`bool`	If `True` and the UMI is invalid, an exception is raised.	`False`
`remove_umi`	`bool`	If `True`, the UMI will be removed from the read name after copying.	`False`
`read_name_delimiter`	`str`	The delimiter separating the components of the read name. Also used to strip the UMI segment when `remove_umi` is `True`.	`_ILLUMINA_READ_NAME_DELIMITER`
`umi_delimiter`	`str`	The delimiter separating multiple UMIs.	`_ILLUMINA_UMI_DELIMITER`

Returns:

Type	Description
`bool`	`True` if the UMI was successfully extracted, `False` otherwise.

Raises:

Type	Description
`ValueError`	If the read name does not end with a valid UMI.
`ValueError`	If the record already has a populated `RX` SAM tag.

Source code in fgpyo/platform/illumina.py

def copy_umi_from_read_name(
    rec: AlignedSegment,
    strict: bool = False,
    remove_umi: bool = False,
    read_name_delimiter: str = _ILLUMINA_READ_NAME_DELIMITER,
    umi_delimiter: str = _ILLUMINA_UMI_DELIMITER,
) -> bool:
    """
    Copy a UMI from an alignment's read name to its `RX` SAM tag.

    The UMI will not be copied to RX tag if it is invalid.

    `strict`, `read_name_delimiter`, and `umi_delimiter` are forwarded to
    [`extract_umis_from_read_name`][fgpyo.platform.illumina.extract_umis_from_read_name] — see
    that function for their semantics.

    Args:
        rec: The alignment record to update.
        strict: If `True` and the UMI is invalid, an exception is raised.
        remove_umi: If `True`, the UMI will be removed from the read name after copying.
        read_name_delimiter: The delimiter separating the components of the read name.
            Also used to strip the UMI segment when `remove_umi` is `True`.
        umi_delimiter: The delimiter separating multiple UMIs.

    Returns:
        `True` if the UMI was successfully extracted, `False` otherwise.

    Raises:
        ValueError: If the read name does not end with a valid UMI.
        ValueError: If the record already has a populated `RX` SAM tag.
    """
    # NB: Keep the signature of this function in sync with `extract_umis_from_read_name`.
    assert rec.query_name is not None, "Alignment record must have a query name"

    umi = extract_umis_from_read_name(
        read_name=rec.query_name,
        strict=strict,
        read_name_delimiter=read_name_delimiter,
        umi_delimiter=umi_delimiter,
    )
    if umi is not None:
        if rec.has_tag("RX"):
            raise ValueError(f"Record {rec.query_name} already has a populated RX tag")
        rec.set_tag(tag="RX", value=umi)
        if remove_umi:
            last_index = rec.query_name.rfind(read_name_delimiter)
            rec.query_name = rec.query_name[:last_index] if last_index != -1 else rec.query_name
        return True
    elif strict:
        raise ValueError(f"Invalid UMI {umi} extracted from {rec.query_name}")
    else:
        return False
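The tag-then-trim flow above can be illustrated without pysam. The following is a minimal sketch, not fgpyo's implementation: `FakeRecord` and `copy_umi_sketch` are hypothetical names, the record is a plain dataclass standing in for `AlignedSegment`, the UMI is assumed to have been extracted already, and the `strict` branch is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FakeRecord:
    """Hypothetical stand-in for an alignment record (not a pysam type)."""

    query_name: str
    tags: Dict[str, str] = field(default_factory=dict)


def copy_umi_sketch(
    rec: FakeRecord,
    umi: Optional[str],
    remove_umi: bool = False,
    read_name_delimiter: str = ":",  # assumed delimiter for this sketch
) -> bool:
    """Set the RX tag, then optionally drop the last field of the read name."""
    if umi is None:
        return False
    if "RX" in rec.tags:
        raise ValueError(f"Record {rec.query_name} already has a populated RX tag")
    rec.tags["RX"] = umi
    if remove_umi:
        # Trim everything from the last delimiter onward, if a delimiter exists.
        last_index = rec.query_name.rfind(read_name_delimiter)
        if last_index != -1:
            rec.query_name = rec.query_name[:last_index]
    return True


rec = FakeRecord(query_name="inst:1:cell:1:1101:1000:2000:ACGT")
copied = copy_umi_sketch(rec, umi="ACGT", remove_umi=True)
```

Note that the read name is trimmed only after the tag has been set, so a record that already carries an `RX` tag is left unmodified when the error is raised.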

extract_umis_from_read_name ¶

extract_umis_from_read_name(read_name: str, read_name_delimiter: str = _ILLUMINA_READ_NAME_DELIMITER, umi_delimiter: str = _ILLUMINA_UMI_DELIMITER, strict: bool = False) -> str | None

Extract UMI(s) from an Illumina-style read name.

The UMI is expected to be the final component of the read name, delimited by the read_name_delimiter. Multiple UMIs may be present, delimited by the umi_delimiter. This delimiter will be replaced by the SAM-standard -.

Parameters:

Name	Type	Description	Default
`read_name`	`str`	The read name to extract the UMI from.	required
`read_name_delimiter`	`str`	The delimiter separating the components of the read name.	`_ILLUMINA_READ_NAME_DELIMITER`
`umi_delimiter`	`str`	The delimiter separating multiple UMIs.	`_ILLUMINA_UMI_DELIMITER`
`strict`	`bool`	If `strict` is `True`, the read name must contain either 7 or 8 colon-separated segments. The UMI is assumed to be the last one in the case of 8 segments and `None` in the case of 7 segments. `strict` requires the UMI to be valid and consistent with Illumina's allowed UMI characters. If `strict` is `False`, the last segment is returned so long as it appears to be a valid UMI.	`False`

Returns:

Type	Description
`str \| None`	The UMI extracted from the read name, or None if no UMI was found. Multiple UMIs are returned in a single string, separated by a hyphen (`-`).

Raises:

Type	Description
`ValueError`	If the read name does not end with a valid UMI.

Source code in fgpyo/platform/illumina.py

def extract_umis_from_read_name(
    read_name: str,
    read_name_delimiter: str = _ILLUMINA_READ_NAME_DELIMITER,
    umi_delimiter: str = _ILLUMINA_UMI_DELIMITER,
    strict: bool = False,
) -> str | None:
    """
    Extract UMI(s) from an Illumina-style read name.

    The UMI is expected to be the final component of the read name, delimited by the
    `read_name_delimiter`. Multiple UMIs may be present, delimited by the `umi_delimiter`. This
    delimiter will be replaced by the SAM-standard `-`.

    Args:
        read_name: The read name to extract the UMI from.
        read_name_delimiter: The delimiter separating the components of the read name.
        umi_delimiter: The delimiter separating multiple UMIs.
        strict: If `strict` is `True`, the read name must contain either 7 or 8 colon-separated
            segments. The UMI is assumed to be the last one in the case of 8 segments and `None`
            in the case of 7 segments. `strict` requires the UMI to be valid and consistent with
            Illumina's allowed UMI characters. If `strict` is `False`, the last segment is returned
            so long as it appears to be a valid UMI.

    Returns:
        The UMI extracted from the read name, or None if no UMI was found. Multiple UMIs are
        returned in a single string, separated by a hyphen (`-`).

    Raises:
        ValueError: If the read name does not end with a valid UMI.
    """
    if strict:
        colons = read_name.count(":")
        if colons == 6:  # number of fields is 7
            return None
        elif colons != 7:
            raise ValueError(
                f"Trying to extract UMIs from read with {colons + 1} parts "
                f"(7 or 8 expected): {read_name}"
            )
    raw_umi = read_name.split(read_name_delimiter)[-1]
    # Check each UMI individually
    umis = raw_umi.split(umi_delimiter)
    # Strip the "r" from rev-comped UMIs
    # (NB: for consistency with UMI_tools, the UMI is not revcomped)
    umis = [umi.lstrip("r") for umi in umis]

    invalid_umis = [umi for umi in umis if not _is_valid_umi(umi)]
    if len(invalid_umis) == 0:
        return SAM_UMI_DELIMITER.join(umis)
    elif strict:
        raise ValueError(
            f"Invalid UMIs found in read name: {read_name}",
            f"  (Invalid UMIs: {', '.join(invalid_umis)})",
        )
    else:
        return None
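The non-strict path can be sketched in isolation. This is an illustration under stated assumptions rather than the library code: the delimiter values (`:` and `+`) and the `ACGTN` alphabet are assumptions, because the module constants and `_is_valid_umi` are not shown in this section, and `looks_like_umi` and `extract_umis_sketch` are hypothetical names.

```python
from typing import Optional

# Assumptions for this sketch only; the real constants are not shown in this section.
READ_NAME_DELIMITER = ":"
UMI_DELIMITER = "+"
SAM_UMI_DELIMITER = "-"
ASSUMED_UMI_ALPHABET = set("ACGTN")


def looks_like_umi(umi: str) -> bool:
    """Hypothetical validity check: non-empty and drawn only from the assumed alphabet."""
    return len(umi) > 0 and set(umi) <= ASSUMED_UMI_ALPHABET


def extract_umis_sketch(read_name: str) -> Optional[str]:
    """Non-strict extraction: take the last field, split it, and re-join with '-'."""
    raw_umi = read_name.split(READ_NAME_DELIMITER)[-1]
    # Drop the leading "r" that marks a reverse-complemented UMI.
    umis = [umi.lstrip("r") for umi in raw_umi.split(UMI_DELIMITER)]
    if all(looks_like_umi(umi) for umi in umis):
        return SAM_UMI_DELIMITER.join(umis)
    return None


dual = extract_umis_sketch("inst:1:cell:1:1101:1000:2000:ACGT+rTTGA")
missing = extract_umis_sketch("inst:1:cell:1:1101:1000:2000")
```

Here `dual` holds both UMIs joined by a hyphen, while `missing` is `None` because the last field of a seven-part name is a coordinate, not a UMI.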

read_structure ¶

Classes for representing Read Structures.¶

A Read Structure refers to a string that describes how the bases in a sequencing run should be allocated into logical reads. It serves a similar purpose to the --use-bases-mask in Illumina's bcl2fastq software, but provides some additional capabilities.

A Read Structure is a sequence of <number><operator> pairs or segments where, optionally, the last segment in the string is allowed to use + instead of a number for its length. The + translates to whatever bases are left after the other segments are processed and can be thought of as meaning [0..infinity].

See more at: https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures
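The grammar described above is small enough to sketch with a regular expression. The following is a hedged illustration, not the parser fgpyo uses: `parse_read_structure` and `SEGMENT_PATTERN` are hypothetical names, and segments are returned as plain `(offset, length, kind)` tuples instead of `ReadSegment` objects.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical pattern: a run of digits or "+", followed by one operator character.
SEGMENT_PATTERN = re.compile(r"(\d+|\+)([TBMCS])")


def parse_read_structure(text: str) -> List[Tuple[int, Optional[int], str]]:
    """Return (offset, length, kind) triples; length is None for a trailing '+'."""
    tidied = "".join(text.upper().split())  # upper-case and remove whitespace
    segments: List[Tuple[int, Optional[int], str]] = []
    offset = 0
    position = 0
    while position < len(tidied):
        match = SEGMENT_PATTERN.match(tidied, position)
        if match is None:
            raise ValueError(f"Cannot parse read structure at index {position}: {tidied}")
        length_text, kind = match.groups()
        length = None if length_text == "+" else int(length_text)
        if length is None and match.end() != len(tidied):
            raise ValueError(f"'+' is only allowed in the last segment: {tidied}")
        if length == 0:
            raise ValueError(f"Zero-length segment in read structure: {tidied}")
        segments.append((offset, length, kind))
        offset += length or 0
        position = match.end()
    return segments


paired = parse_read_structure("75T8B75T")
variable = parse_read_structure("1b 2m +t")
```

Offsets accumulate from the fixed lengths, which is why the variable-length segment can only come last: nothing after it would have a well-defined offset.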

Examples¶

>>> from fgpyo.read_structure import ReadStructure
>>> rs = ReadStructure.from_string("75T8B75T")
>>> [str(segment) for segment in rs]
['75T', '8B', '75T']
>>> rs[0]
ReadSegment(offset=0, length=75, kind=<SegmentType.Template: 'T'>)
>>> rs = rs.with_variable_last_segment()
>>> [str(segment) for segment in rs]
['75T', '8B', '+T']
>>> rs[-1]
ReadSegment(offset=83, length=None, kind=<SegmentType.Template: 'T'>)
>>> rs = ReadStructure.from_string("1B2M+T")
>>> [s.bases for s in rs.extract("A"*6)]
['A', 'AA', 'AAA']
>>> [s.bases for s in rs.extract("A"*5)]
['A', 'AA', 'AA']
>>> [s.bases for s in rs.extract("A"*4)]
['A', 'AA', 'A']
>>> [s.bases for s in rs.extract("A"*3)]
['A', 'AA', '']
>>> rs.template_segments()
(ReadSegment(offset=3, length=None, kind=<SegmentType.Template: 'T'>),)
>>> [str(segment) for segment in rs.template_segments()]
['+T']
>>> try:
...   ReadStructure.from_string("23T2TT23T")
... except ValueError as ex:
...   print(str(ex))
Read structure missing length information: 23T2T[T]23T

Attributes¶

ANY_LENGTH_CHAR `module-attribute` ¶

ANY_LENGTH_CHAR: str = '+'

A character that can be put in place of a number in a read structure to mean "0 or more bases".

Classes¶

ReadSegment ¶

Encapsulates all the information about a segment within a read structure.

A segment can either have a definite length, in which case length must be an int, or an indefinite length (can be any length, 0 or more), in which case length must be None.

Attributes:

Name	Type	Description
`offset`	`int`	The offset of the read segment in the read.
`length`	`int \| None`	The length of the segment, or None if it is variable length.
`kind`	`SegmentType`	The kind of read segment.

Source code in fgpyo/read_structure.py

@attr.s(frozen=True, kw_only=True, auto_attribs=True)
class ReadSegment:
    """
    Encapsulates all the information about a segment within a read structure.

    A segment can either have a definite length, in which case length must be an int, or an
    indefinite length (can be any length, 0 or more), in which case length must be None.

    Attributes:
        offset: The offset of the read segment in the read.
        length: The length of the segment, or None if it is variable length.
        kind: The kind of read segment.

    """

    offset: int
    length: int | None
    kind: SegmentType

    @property
    def has_fixed_length(self) -> bool:
        """True if the read segment has a defined length."""
        return self.length is not None

    @property
    def fixed_length(self) -> int:
        """
        The fixed length of this segment.

        Raises:
            AttributeError: If the segment does not have a fixed length.
        """
        if not self.has_fixed_length:
            raise AttributeError(f"fixed_length called on a variable length segment: {self}")

        assert self.length is not None  # type narrowing

        return self.length

    def extract(self, bases: str) -> SubReadWithoutQuals:
        """Gets the bases associated with this read segment."""
        end = self._calculate_end(bases)
        return SubReadWithoutQuals(bases=bases[self.offset : end], segment=self._resized(end))

    def extract_with_quals(self, bases: str, quals: str) -> SubReadWithQuals:
        """Gets the bases and qualities associated with this read segment."""
        assert len(bases) == len(quals), f"Bases and quals differ in length: {bases} {quals}"
        end = self._calculate_end(bases)
        return SubReadWithQuals(
            bases=bases[self.offset : end],
            quals=quals[self.offset : end],
            segment=self._resized(end),
        )

    def _calculate_end(self, bases: str) -> int:
        """
        Calculates the end position for the segment for the given read.

        Checks that the read is long enough to contain this segment.
        """
        bases_len = len(bases)
        assert bases_len >= self.offset, f"Read ends before the segment starts: {self}"
        assert self.length is None or bases_len >= self.offset + self.length, (
            f"Read ends before end of segment: {self}"
        )
        if self.has_fixed_length:
            return min(self.offset + self.fixed_length, bases_len)
        else:
            return bases_len

    def _resized(self, end: int) -> "ReadSegment":
        new_length = end - self.offset
        if self.has_fixed_length and self.fixed_length == new_length:
            return self
        else:
            return attr.evolve(self, length=new_length)

    def __str__(self) -> str:
        """Returns the string representation of this segment (e.g. '10T' or '+T')."""
        if self.has_fixed_length:
            return f"{self.length}{self.kind.value}"
        else:
            return f"{ANY_LENGTH_CHAR}{self.kind.value}"

Attributes¶

fixed_length property ¶

fixed_length: int

The fixed length of this segment.

Raises:

Type	Description
`AttributeError`	If the segment does not have a fixed length.

has_fixed_length property ¶

has_fixed_length: bool

True if the read segment has a defined length.

Functions¶

__str__ ¶

__str__() -> str

Returns the string representation of this segment (e.g. '10T' or '+T').

Source code in fgpyo/read_structure.py

def __str__(self) -> str:
    """Returns the string representation of this segment (e.g. '10T' or '+T')."""
    if self.has_fixed_length:
        return f"{self.length}{self.kind.value}"
    else:
        return f"{ANY_LENGTH_CHAR}{self.kind.value}"

extract ¶

extract(bases: str) -> SubReadWithoutQuals

Gets the bases associated with this read segment.

Source code in fgpyo/read_structure.py

def extract(self, bases: str) -> SubReadWithoutQuals:
    """Gets the bases associated with this read segment."""
    end = self._calculate_end(bases)
    return SubReadWithoutQuals(bases=bases[self.offset : end], segment=self._resized(end))
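How the end position is chosen for fixed and variable segments can be shown with plain strings. This is a minimal sketch, not the library code: `segment_end` and `extract_bases` are hypothetical helpers, and they raise `ValueError` where the source above uses assertions.

```python
from typing import Optional


def segment_end(read_length: int, offset: int, length: Optional[int]) -> int:
    """End index (exclusive) of a segment within a read of `read_length` bases."""
    if read_length < offset:
        raise ValueError("Read ends before the segment starts")
    if length is None:
        # Variable-length segment: consume everything left in the read.
        return read_length
    if read_length < offset + length:
        raise ValueError("Read ends before end of segment")
    return offset + length


def extract_bases(bases: str, offset: int, length: Optional[int]) -> str:
    """Slice out the bases that belong to one segment."""
    return bases[offset : segment_end(len(bases), offset, length)]


# Segments of the read structure "1B2M+T" as (offset, length) pairs.
layout = [(0, 1), (1, 2), (3, None)]
six = [extract_bases("AAAAAA", offset, length) for offset, length in layout]
three = [extract_bases("AAA", offset, length) for offset, length in layout]
```

A variable-length segment may legitimately come back empty, as in `three`, because "0 or more bases" includes the case where the read ends exactly where the segment starts.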

extract_with_quals ¶

extract_with_quals(bases: str, quals: str) -> SubReadWithQuals

Gets the bases and qualities associated with this read segment.

Source code in fgpyo/read_structure.py

def extract_with_quals(self, bases: str, quals: str) -> SubReadWithQuals:
    """Gets the bases and qualities associated with this read segment."""
    assert len(bases) == len(quals), f"Bases and quals differ in length: {bases} {quals}"
    end = self._calculate_end(bases)
    return SubReadWithQuals(
        bases=bases[self.offset : end],
        quals=quals[self.offset : end],
        segment=self._resized(end),
    )

ReadStructure ¶

Bases: Iterable[ReadSegment]

Describes the structure of a given read.

A read contains one or more read segments. A read segment describes a contiguous stretch of bases of the same type (ex. template bases) of some length and some offset from the start of the read.

Attributes:

Name	Type	Description
`segments`	`tuple[ReadSegment, ...]`	The segments composing the read structure

Source code in fgpyo/read_structure.py

@attr.s(frozen=True, kw_only=True, auto_attribs=True)
class ReadStructure(Iterable[ReadSegment]):
    """
    Describes the structure of a given read.

    A read contains one or more read segments. A read segment describes a contiguous stretch of
    bases of the same type (ex. template bases) of some length and some offset from the start
    of the read.

    Attributes:
         segments: The segments composing the read structure

    """

    segments: tuple[ReadSegment, ...]

    @property
    def _min_length(self) -> int:
        """The minimum length read that this read structure can process."""
        return sum(segment.length for segment in self.segments if segment.has_fixed_length)  # type: ignore[misc]

    @property
    def has_fixed_length(self) -> bool:
        """True if the ReadStructure has a fixed (i.e. non-variable) length."""
        return self.segments[-1].has_fixed_length

    @property
    def fixed_length(self) -> int:
        """
        The fixed length of this read structure.

        Raises:
            AttributeError: If the read structure does not have a fixed length.
        """
        if not self.has_fixed_length:
            raise AttributeError(f"fixed_length called on a variable length read structure: {self}")
        return self._min_length

    @property
    def length(self) -> int:
        """Length is defined as the number of segments (not bases!) in the read structure."""
        return len(self.segments)

    def with_variable_last_segment(self) -> "ReadStructure":
        """Returns a copy with the last segment changed to undefined length."""
        last_segment = self.segments[-1]
        if not last_segment.has_fixed_length:
            return self
        else:
            last_segment = attr.evolve(last_segment, length=None)
            return ReadStructure(segments=self.segments[:-1] + (last_segment,))

    def extract(self, bases: str) -> tuple[SubReadWithoutQuals, ...]:
        """Splits the given bases into tuples with its associated read segment."""
        return tuple([segment.extract(bases=bases) for segment in self])

    def extract_with_quals(self, bases: str, quals: str) -> tuple[SubReadWithQuals, ...]:
        """Splits the given bases and qualities into triples with its associated read segment."""
        return tuple([segment.extract_with_quals(bases=bases, quals=quals) for segment in self])

    def segments_by_kind(self, kind: SegmentType) -> tuple[ReadSegment, ...]:
        """Returns just the segments of a given kind."""
        return tuple([segment for segment in self if segment.kind == kind])

    def template_segments(self) -> tuple[ReadSegment, ...]:
        """Returns segments of kind Template."""
        return self.segments_by_kind(kind=SegmentType.Template)

    def sample_barcode_segments(self) -> tuple[ReadSegment, ...]:
        """Returns segments of kind SampleBarcode."""
        return self.segments_by_kind(kind=SegmentType.SampleBarcode)

    def molecular_barcode_segments(self) -> tuple[ReadSegment, ...]:
        """Returns segments of kind MolecularBarcode."""
        return self.segments_by_kind(kind=SegmentType.MolecularBarcode)

    def cell_barcode_segments(self) -> tuple[ReadSegment, ...]:
        """Returns segments of kind CellBarcode."""
        return self.segments_by_kind(kind=SegmentType.CellBarcode)

    def skip_segments(self) -> tuple[ReadSegment, ...]:
        """Returns segments of kind Skip."""
        return self.segments_by_kind(kind=SegmentType.Skip)

    def __iter__(self) -> Iterator[ReadSegment]:
        """Iterates over the read segments."""
        return iter(self.segments)

    def __str__(self) -> str:
        """Returns the string representation of the full read structure."""
        return "".join(str(s) for s in self.segments)

    def __len__(self) -> int:
        """Returns the number of segments (not bases) in the read structure."""
        return self.length

    def __getitem__(self, index: int) -> ReadSegment:
        """Returns the segment at the given index."""
        return self.segments[index]

    @classmethod
    def from_segments(
        cls, segments: tuple[ReadSegment, ...], reset_offsets: bool = False
    ) -> "ReadStructure":
        """Creates a new ReadStructure, optionally resetting the offsets on each of the segments."""
        # Check that none but the last segment has an indefinite length
        assert all(s.has_fixed_length for s in segments[:-1]), (
            f"Variable length ({ANY_LENGTH_CHAR}) can only be used in the last segment: "
            + "".join(str(s) for s in segments)
        )

        if reset_offsets:
            off = 0
            segs = []
            for seg in segments:
                seg = attr.evolve(seg, offset=off)
                off += seg.length if seg.has_fixed_length else 0  # type: ignore[operator]

                segs.append(seg)
            segments = tuple(segs)

        assert all(s.length is None or s.length > 0 for s in segments), (
            "Read structure contained zero length segments" + "".join(str(s) for s in segments)
        )

        return ReadStructure(segments=segments)

    @classmethod
    def from_string(cls, segments: str) -> "ReadStructure":
        """Parses a read structure from its string representation."""
        # Upper-case the string and remove all whitespace before parsing
        tidied = "".join(ch for ch in segments.upper() if not ch.isspace())
        return cls.from_segments(segments=cls._from_string(string=tidied), reset_offsets=True)

    @classmethod
    def _from_string(cls, string: str) -> tuple[ReadSegment, ...]:
        index = 0
        segments: list[ReadSegment] = []
        while index < len(string):
            # Stash the beginning position of our parsing so we can highlight what we're having
            # trouble with
            parse_index = index

            seg_length: int | None = None
            # Parse out the length segment which may be 1 or more digits or the AnyLengthChar
            if string[index] == ANY_LENGTH_CHAR:
                index += 1
                seg_length = None
            elif string[index].isdigit():
                seg_length = 0
                while index < len(string) and string[index].isdigit():
                    seg_length = (seg_length * 10) + int(string[index])
                    index += 1
            else:
                cls._invalid(
                    msg="Read structure missing length information",
                    rs=string,
                    start=parse_index,
                    end=parse_index + 1,
                )

            # Parse out the operator and make a segment
            if index == len(string):
                cls._invalid(
                    msg="Read structure with invalid segment",
                    rs=string,
                    start=parse_index,
                    end=index,
                )
            code = string[index]
            index += 1
            kind: SegmentType
            try:
                kind = SegmentType(code)
            except ValueError:
                cls._invalid(
                    msg="Read structure segment had unknown type",
                    rs=string,
                    start=parse_index,
                    end=parse_index + 1,
                )
            segments.append(ReadSegment(offset=0, length=seg_length, kind=kind))

        return tuple(segments)

    @classmethod
    def _invalid(cls, msg: str, rs: str, start: int, end: int) -> None:
        """Inserts square brackets around the error-causing characters in the read structure."""
        prefix = rs[:start]
        error = rs[start:end]
        suffix = "" if end == len(rs) else rs[end:]
        raise ValueError(f"{msg}: {prefix}[{error}]{suffix}")

Attributes¶

fixed_length property ¶

fixed_length: int

The fixed length of this read structure.

Raises:

Type	Description
`AttributeError`	If the read structure does not have a fixed length.

has_fixed_length property ¶

has_fixed_length: bool

True if the ReadStructure has a fixed (i.e. non-variable) length.

length property ¶

length: int

Length is defined as the number of segments (not bases!) in the read structure.
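The difference between the segment count and the fixed length in bases is easy to mix up, so here is a small sketch of both. It is an illustration only: segments are modeled as hypothetical `(length, kind)` tuples, not `ReadSegment` objects, and the helper names merely mirror the properties above.

```python
from typing import List, Optional, Tuple

Segment = Tuple[Optional[int], str]  # hypothetical (length, kind) pair


def has_fixed_length(segments: List[Segment]) -> bool:
    """Only the last segment may be variable, so it alone decides."""
    return segments[-1][0] is not None


def fixed_length(segments: List[Segment]) -> int:
    """Total number of bases, available only when every segment is fixed."""
    if not has_fixed_length(segments):
        raise AttributeError("fixed_length called on a variable length read structure")
    return sum(length for length, _ in segments if length is not None)


fixed = [(75, "T"), (8, "B"), (75, "T")]
variable = [(75, "T"), (8, "B"), (None, "T")]

segment_count = len(fixed)     # analogous to `length`: counts segments
total_bases = fixed_length(fixed)  # analogous to `fixed_length`: counts bases
```

For the structure `75T8B75T`, the segment count is 3 while the fixed length is 158 bases.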

Functions¶

__getitem__ ¶

__getitem__(index: int) -> ReadSegment

Returns the segment at the given index.

Source code in fgpyo/read_structure.py

def __getitem__(self, index: int) -> ReadSegment:
    """Returns the segment at the given index."""
    return self.segments[index]

__iter__ ¶

__iter__() -> Iterator[ReadSegment]

Iterates over the read segments.

Source code in fgpyo/read_structure.py

def __iter__(self) -> Iterator[ReadSegment]:
    """Iterates over the read segments."""
    return iter(self.segments)

__len__ ¶

__len__() -> int

Returns the number of segments (not bases) in the read structure.

Source code in fgpyo/read_structure.py

def __len__(self) -> int:
    """Returns the number of segments (not bases) in the read structure."""
    return self.length

__str__ ¶

__str__() -> str

Returns the string representation of the full read structure.

Source code in fgpyo/read_structure.py

def __str__(self) -> str:
    """Returns the string representation of the full read structure."""
    return "".join(str(s) for s in self.segments)

cell_barcode_segments ¶

cell_barcode_segments() -> tuple[ReadSegment, ...]

Returns segments of kind CellBarcode.

Source code in fgpyo/read_structure.py

def cell_barcode_segments(self) -> tuple[ReadSegment, ...]:
    """Returns segments of kind CellBarcode."""
    return self.segments_by_kind(kind=SegmentType.CellBarcode)

extract ¶

extract(bases: str) -> tuple[SubReadWithoutQuals, ...]

Splits the given bases into tuples with its associated read segment.

Source code in fgpyo/read_structure.py

def extract(self, bases: str) -> tuple[SubReadWithoutQuals, ...]:
    """Splits the given bases into tuples with its associated read segment."""
    return tuple([segment.extract(bases=bases) for segment in self])

extract_with_quals ¶

extract_with_quals(bases: str, quals: str) -> tuple[SubReadWithQuals, ...]

Splits the given bases and qualities into triples with its associated read segment.

Source code in fgpyo/read_structure.py

def extract_with_quals(self, bases: str, quals: str) -> tuple[SubReadWithQuals, ...]:
    """Splits the given bases and qualities into triples with its associated read segment."""
    return tuple([segment.extract_with_quals(bases=bases, quals=quals) for segment in self])

from_segments classmethod ¶

from_segments(segments: tuple[ReadSegment, ...], reset_offsets: bool = False) -> ReadStructure

Creates a new ReadStructure, optionally resetting the offsets on each of the segments.

Source code in fgpyo/read_structure.py

@classmethod
def from_segments(
    cls, segments: tuple[ReadSegment, ...], reset_offsets: bool = False
) -> "ReadStructure":
    """Creates a new ReadStructure, optionally resetting the offsets on each of the segments."""
    # Check that none but the last segment has an indefinite length
    assert all(s.has_fixed_length for s in segments[:-1]), (
        f"Variable length ({ANY_LENGTH_CHAR}) can only be used in the last segment: "
        + "".join(str(s) for s in segments)
    )

    if reset_offsets:
        off = 0
        segs = []
        for seg in segments:
            seg = attr.evolve(seg, offset=off)
            off += seg.length if seg.has_fixed_length else 0  # type: ignore[operator]

            segs.append(seg)
        segments = tuple(segs)

    assert all(s.length is None or s.length > 0 for s in segments), (
        "Read structure contained zero length segments" + "".join(str(s) for s in segments)
    )

    return ReadStructure(segments=segments)

from_string classmethod ¶

from_string(segments: str) -> ReadStructure

Parses a read structure from its string representation.

Source code in fgpyo/read_structure.py

@classmethod
def from_string(cls, segments: str) -> "ReadStructure":
    """Parses a read structure from its string representation."""
    # Upper-case the string and remove all whitespace before parsing
    tidied = "".join(ch for ch in segments.upper() if not ch.isspace())
    return cls.from_segments(segments=cls._from_string(string=tidied), reset_offsets=True)

molecular_barcode_segments ¶

molecular_barcode_segments() -> tuple[ReadSegment, ...]

Returns segments of kind MolecularBarcode.

Source code in fgpyo/read_structure.py

def molecular_barcode_segments(self) -> tuple[ReadSegment, ...]:
    """Returns segments of kind MolecularBarcode."""
    return self.segments_by_kind(kind=SegmentType.MolecularBarcode)

sample_barcode_segments ¶

sample_barcode_segments() -> tuple[ReadSegment, ...]

Returns segments of kind SampleBarcode.

Source code in fgpyo/read_structure.py

def sample_barcode_segments(self) -> tuple[ReadSegment, ...]:
    """Returns segments of kind SampleBarcode."""
    return self.segments_by_kind(kind=SegmentType.SampleBarcode)

segments_by_kind ¶

segments_by_kind(kind: SegmentType) -> tuple[ReadSegment, ...]

Returns just the segments of a given kind.

Source code in fgpyo/read_structure.py

def segments_by_kind(self, kind: SegmentType) -> tuple[ReadSegment, ...]:
    """Returns just the segments of a given kind."""
    return tuple([segment for segment in self if segment.kind == kind])

skip_segments ¶

skip_segments() -> tuple[ReadSegment, ...]

Returns segments of kind Skip.

Source code in fgpyo/read_structure.py

def skip_segments(self) -> tuple[ReadSegment, ...]:
    """Returns segments of kind Skip."""
    return self.segments_by_kind(kind=SegmentType.Skip)

template_segments ¶

template_segments() -> tuple[ReadSegment, ...]

Returns segments of kind Template.

Source code in fgpyo/read_structure.py

def template_segments(self) -> tuple[ReadSegment, ...]:
    """Returns segments of kind Template."""
    return self.segments_by_kind(kind=SegmentType.Template)

with_variable_last_segment ¶

with_variable_last_segment() -> ReadStructure

Returns a copy with the last segment changed to undefined length.

Source code in fgpyo/read_structure.py

def with_variable_last_segment(self) -> "ReadStructure":
    """Returns a copy with the last segment changed to undefined length."""
    last_segment = self.segments[-1]
    if not last_segment.has_fixed_length:
        return self
    else:
        last_segment = attr.evolve(last_segment, length=None)
        return ReadStructure(segments=self.segments[:-1] + (last_segment,))

SegmentType ¶

Bases: Enum

The type of segments that can show up in a read structure.

Source code in fgpyo/read_structure.py

@enum.unique
class SegmentType(enum.Enum):
    """The type of segments that can show up in a read structure."""

    Template = "T"
    """The segment type for template bases."""

    SampleBarcode = "B"
    """The segment type for sample barcode bases."""

    MolecularBarcode = "M"
    """The segment type for molecular barcode bases."""

    CellBarcode = "C"
    """The segment type for cell barcode bases."""

    Skip = "S"
    """The segment type for bases that need to be skipped."""

    def __str__(self) -> str:
        """Returns the single-character value of this segment type."""
        return self.value

Attributes¶

CellBarcode class-attribute instance-attribute ¶

CellBarcode = 'C'

The segment type for cell barcode bases.

MolecularBarcode class-attribute instance-attribute ¶

MolecularBarcode = 'M'

The segment type for molecular barcode bases.

SampleBarcode class-attribute instance-attribute ¶

SampleBarcode = 'B'

The segment type for sample barcode bases.

Skip class-attribute instance-attribute ¶

Skip = 'S'

The segment type for bases that need to be skipped.

Template class-attribute instance-attribute ¶

Template = 'T'

The segment type for template bases.

Functions¶

__str__ ¶

__str__() -> str

Returns the single-character value of this segment type.

Source code in fgpyo/read_structure.py

def __str__(self) -> str:
    """Returns the single-character value of this segment type."""
    return self.value

SubReadWithQuals ¶

Contains the bases and qualities that correspond to the given read segment.

Source code in fgpyo/read_structure.py

@attr.s(frozen=True, kw_only=True, auto_attribs=True)
class SubReadWithQuals:
    """Contains the bases and qualities that correspond to the given read segment."""

    bases: str
    """The sub-read bases that correspond to the given read segment."""

    quals: str
    """The sub-read base qualities that correspond to the given read segment."""

    segment: "ReadSegment"
    """The segment of the read structure that describes this sub-read."""

    @property
    def kind(self) -> SegmentType:
        """The kind of read segment that corresponds to this sub-read."""
        return self.segment.kind

Attributes¶

bases instance-attribute ¶

bases: str

The sub-read bases that correspond to the given read segment.

kind property ¶

kind: SegmentType

The kind of read segment that corresponds to this sub-read.

quals instance-attribute ¶

quals: str

The sub-read base qualities that correspond to the given read segment.

segment instance-attribute ¶

segment: ReadSegment

The segment of the read structure that describes this sub-read.

SubReadWithoutQuals ¶

Contains the bases that correspond to the given read segment.

Source code in fgpyo/read_structure.py

@attr.s(frozen=True, kw_only=True, auto_attribs=True)
class SubReadWithoutQuals:
    """Contains the bases that correspond to the given read segment."""

    bases: str
    """The sub-read bases that correspond to the given read segment."""

    segment: "ReadSegment"
    """The segment of the read structure that describes this sub-read."""

    @property
    def kind(self) -> SegmentType:
        """The kind of read segment that corresponds to this sub-read."""
        return self.segment.kind

Attributes¶

bases instance-attribute ¶

bases: str

The sub-read bases that correspond to the given read segment.

kind property ¶

kind: SegmentType

The kind of read segment that corresponds to this sub-read.

segment instance-attribute ¶

segment: ReadSegment

The segment of the read structure that describes this sub-read.

sam ¶

Utility Classes and Methods for SAM/BAM.¶

This module contains utility classes for working with SAM/BAM files and the data contained within them. This includes i) utilities for opening SAM/BAM files for reading and writing, ii) functions for manipulating supplementary alignments, iii) classes and functions for manipulating CIGAR strings, and iv) a class for building SAM records and files for testing.

Motivation for Reader and Writer methods¶

The following are the reasons for choosing to implement methods to open a SAM/BAM file for reading and writing, rather than relying on pysam.AlignmentFile directly:

1. Provides a centralized place for the implementation of opening a SAM/BAM for reading and writing. This is useful if any additional parameters are added, or changes to standards or defaults are made.
2. Makes the requirement to provide a header when opening a file for writing more explicit.
3. Adds support for pathlib.Path.
4. Removes the reliance on specifying the mode correctly, including specifying the file type (i.e. SAM, BAM, or CRAM), as well as additional options (e.g. compression level). This makes the code more explicit and easier to read.
5. Performs an explicit check to ensure the file type is specified when writing using a file-like object rather than a path to a file.

Examples of Opening a SAM/BAM for Reading or Writing¶

Opening a SAM/BAM file for reading, auto-recognizing the file-type by the file extension. See SamFileType() for the supported file types.

>>> from fgpyo.sam import reader
>>> with reader("/path/to/sample.sam") as fh:  
...     for record in fh:
...         print(record.query_name)  # do something
>>> with reader("/path/to/sample.bam") as fh:  
...     for record in fh:
...         print(record.query_name)  # do something

Opening a SAM/BAM file for reading, explicitly passing the file type.

>>> from fgpyo.sam import SamFileType
>>> with reader(path="/path/to/sample.ext1", file_type=SamFileType.SAM) as fh:  
...     for record in fh:
...         print(record.query_name)  # do something
>>> with reader(path="/path/to/sample.ext2", file_type=SamFileType.BAM) as fh:  
...     for record in fh:
...         print(record.query_name)  # do something

Opening a SAM/BAM file for reading, using an existing file-like object.

>>> with open("/path/to/sample.bam", "rb") as file_object:
...     with reader(path=file_object, file_type=SamFileType.BAM) as fh:
...         for record in fh:
...             print(record.query_name)  # do something

Opening a SAM/BAM file for writing works similarly to the reader() method, but the SAM file header object is required.

>>> from typing import Any, Dict
>>> from fgpyo.sam import writer
>>> header: Dict[str, Any] = {
...     "HD": {"VN": "1.5", "SO": "coordinate"},
...     "RG": [{"ID": "1", "SM": "1_AAAAAA", "LB": "lib", "PL": "ILLUMINA", "PU": "xxx.1"}],
...     "SQ":  [
...         {"SN": "chr1", "LN": 249250621},
...         {"SN": "chr2", "LN": 243199373}
...     ]
... }  
>>> with writer(path="/path/to/sample.bam", header=header) as fh:  
...     pass  # do something
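
To make the shape of the header dictionary more concrete, the following sketch renders such a dictionary as the SAM header text it corresponds to. It is a rough illustration only: `header_to_text` is a hypothetical helper, it assumes every value is either a dict (as for "HD") or a list of dicts (as for "SQ" and "RG"), and it does not handle comment ("CO") lines.

```python
from typing import Any, Dict, List


def header_to_text(header: Dict[str, Any]) -> str:
    """Render a pysam-style header dictionary as SAM header lines."""
    lines: List[str] = []
    for record_type, value in header.items():
        # "HD" maps to a single dict; "SQ" and "RG" map to lists of dicts.
        records = [value] if isinstance(value, dict) else value
        for record in records:
            fields = "\t".join(f"{tag}:{val}" for tag, val in record.items())
            lines.append(f"@{record_type}\t{fields}")
    return "\n".join(lines)


example = {
    "HD": {"VN": "1.5", "SO": "coordinate"},
    "SQ": [{"SN": "chr1", "LN": 249250621}],
}
print(header_to_text(example))
```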

Examples of Manipulating Cigars¶

Creating a Cigar from a pysam.AlignedSegment.

>>> from fgpyo.sam import Cigar
>>> with reader("/path/to/sample.sam") as fh:  
...     record = next(fh)
...     cigar = Cigar.from_cigartuples(record.cigartuples)
...     print(str(cigar))
50M2D5M10S

Creating a Cigar from a str().

>>> cigar = Cigar.from_cigarstring("50M2D5M10S")
>>> print(str(cigar))
50M2D5M10S

If the cigar string is invalid, the exception message will show you the problem character(s) in square brackets.

>>> cigar = Cigar.from_cigarstring("10M5U")
Traceback (most recent call last):
    ...
fgpyo.sam.CigarParsingException: Malformed cigar: 10M5[U]
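
The bracketed message can be reproduced without fgpyo. The sketch below is a standalone, regex-based approximation: `parse_cigar` is a hypothetical function that returns plain (length, operator) tuples and raises ValueError rather than CigarParsingException, and it does not check that lengths are positive.

```python
import re
from typing import List, Tuple

# One CIGAR element: a run of digits followed by a single operator character.
_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigarstring: str) -> List[Tuple[int, str]]:
    """Parse a CIGAR string into (length, operator) pairs."""
    if cigarstring == "*":
        return []
    if not cigarstring:
        raise ValueError("Cigar string was empty")
    elements: List[Tuple[int, str]] = []
    pos = 0
    while pos < len(cigarstring):
        match = _ELEMENT.match(cigarstring, pos)
        if match is None:
            # Step over any leading digits so the brackets land on the bad character.
            bad = pos
            while bad < len(cigarstring) and cigarstring[bad].isdigit():
                bad += 1
            marked = f"{cigarstring[:bad]}[{cigarstring[bad:bad + 1]}]{cigarstring[bad + 1:]}"
            raise ValueError(f"Malformed cigar: {marked}")
        elements.append((int(match.group(1)), match.group(2)))
        pos = match.end()
    return elements
```

When the string ends before an operator is seen (for example "10M5"), the brackets are empty because there is no character to highlight.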

The cigar contains a tuple of CigarElement()s. Each element contains the cigar operator (CigarOp()) and associated operator length. A number of useful methods are part of both classes.

The number of bases aligned on the query (i.e. the number of bases consumed by the cigar from the query):

>>> cigar = Cigar.from_cigarstring("50M2D5M2I10S")
>>> [e.length_on_query for e in cigar.elements]
[50, 0, 5, 2, 10]
>>> [e.length_on_target for e in cigar.elements]
[50, 2, 5, 0, 0]
>>> [e.operator.is_indel for e in cigar.elements]
[False, True, False, True, False]
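
The per-element values above follow from which operators consume query bases (M, I, S, =, X) and which consume reference bases (M, D, N, =, X). A minimal standalone sketch of the same bookkeeping, using a hypothetical `cigar_lengths` helper and assuming a well-formed CIGAR string:

```python
import re
from typing import Tuple

# Operators that consume query bases and reference bases, respectively.
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def cigar_lengths(cigarstring: str) -> Tuple[int, int]:
    """Return (query_length, reference_length) for a well-formed CIGAR string."""
    query = reference = 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigarstring):
        if op in CONSUMES_QUERY:
            query += int(length)
        if op in CONSUMES_REFERENCE:
            reference += int(length)
    return query, reference


# 50 + 5 + 2 + 10 query bases and 50 + 2 + 5 reference bases.
print(cigar_lengths("50M2D5M2I10S"))
```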

Any element can be accessed directly by indexing .elements; negative indexes and slices also work:

>>> cigar = Cigar.from_cigarstring("50M2D5M2I10S")
>>> cigar.elements[0].length
50
>>> cigar.elements[1].operator
<CigarOp.D: (2, 'D', False, True)>
>>> cigar.elements[-1].operator
<CigarOp.S: (4, 'S', True, False)>
>>> tuple(x.operator.character for x in cigar.elements[1:3])
('D', 'M')
>>> tuple(x.operator.character for x in cigar.elements[-2:])
('I', 'S')

Examples of parsing the SA tag and individual supplementary alignments¶

>>> from fgpyo.sam import SupplementaryAlignment
>>> sup = SupplementaryAlignment.parse("chr1,123,+,50S100M,60,0")
>>> sup.reference_name
'chr1'
>>> sup.nm
0
>>> from typing import List
>>> sa_tag = "chr1,123,+,50S100M,60,0;chr2,456,-,75S75M,60,1"
>>> sups: List[SupplementaryAlignment] = SupplementaryAlignment.parse_sa_tag(tag=sa_tag)
>>> len(sups)
2
>>> [str(sup.cigar) for sup in sups]
['50S100M', '75S75M']
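
For readers unfamiliar with the SA tag layout, each entry is a comma-separated list of reference name, position, strand, CIGAR, mapping quality, and edit distance, and entries are separated by semicolons. The sketch below splits a tag by hand; `SaEntry` and `parse_sa_tag` are hypothetical names for illustration, and the position is kept exactly as written in the tag rather than converted to another coordinate system.

```python
from typing import List, NamedTuple


class SaEntry(NamedTuple):
    """One entry of an SA tag (hypothetical container for this example)."""

    reference_name: str
    position: int
    is_forward: bool
    cigar: str
    mapq: int
    nm: int


def parse_sa_tag(tag: str) -> List[SaEntry]:
    """Split an SA tag value into its individual entries."""
    entries = []
    # Entries may be terminated by ';', so drop any empty pieces.
    for item in filter(None, tag.split(";")):
        rname, pos, strand, cigar, mapq, nm = item.split(",")
        entries.append(SaEntry(rname, int(pos), strand == "+", cigar, int(mapq), int(nm)))
    return entries
```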

Attributes¶

DefaultProperlyPairedOrientations `module-attribute` ¶

DefaultProperlyPairedOrientations: set[PairOrientation] = {FR}

The default orientations for properly paired reads.

NO_QUERY_BASES `module-attribute` ¶

NO_QUERY_BASES: str = '*'

The string to use for a SAM record with missing query bases.

NO_QUERY_QUALITIES `module-attribute` ¶

NO_QUERY_QUALITIES: array = cast(array, qualitystring_to_array(STRING_PLACEHOLDER))

The quality array corresponding to an unavailable query quality string ("*").
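
To see what this array plausibly contains, the sketch below decodes a quality string by hand. It assumes the conventional Phred+33 encoding, under which "*" (ASCII 42) decodes to a single quality of 9; `qualities_from_string` is a hypothetical stand-in, not the pysam function.

```python
from array import array


def qualities_from_string(qualities: str, offset: int = 33) -> array:
    """Decode an ASCII-encoded quality string into an array of unsigned bytes."""
    return array("B", (ord(char) - offset for char in qualities))


placeholder = qualities_from_string("*")
print(list(placeholder))
```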

NO_REF_INDEX `module-attribute` ¶

NO_REF_INDEX: int = -1

The reference index to use to indicate no reference in SAM/BAM.

NO_REF_NAME `module-attribute` ¶

NO_REF_NAME: str = STRING_PLACEHOLDER

The reference name to use to indicate no reference in SAM/BAM.

NO_REF_POS `module-attribute` ¶

NO_REF_POS: int = -1

The reference position to use to indicate no position in SAM/BAM.

STRING_PLACEHOLDER `module-attribute` ¶

STRING_PLACEHOLDER: str = '*'

The value to use when a string field's information is unavailable.

SamPath `module-attribute` ¶

SamPath = IO[Any] | Path | str

The valid base classes for opening a SAM/BAM/CRAM file.

Classes¶

Cigar ¶

Class representing a cigar string.

Attributes:

Name	Type	Description
`elements`	`Tuple[CigarElement, ...]`	zero or more cigar elements

Source code in fgpyo/sam/__init__.py

@attr.s(frozen=True, slots=True, auto_attribs=True)
class Cigar:
    """
    Class representing a cigar string.

    Attributes:
        - elements (Tuple[CigarElement, ...]): zero or more cigar elements
    """

    elements: tuple[CigarElement, ...] = ()

    @classmethod
    def from_cigartuples(cls, cigartuples: list[tuple[int, int]] | None) -> "Cigar":
        """
        Returns a Cigar from a list of tuples returned by pysam.

        Each tuple denotes the operation and length.  See
        [`CigarOp()`][fgpyo.sam.CigarOp] for more information on the
        various operators.  If None is given, returns an empty Cigar.
        """
        if cigartuples is None or cigartuples == []:
            return Cigar()
        try:
            elements = []
            for code, length in cigartuples:
                operator = CigarOp.from_code(code)
                elements.append(CigarElement(length, operator))
            return Cigar(tuple(elements))
        except Exception as ex:
            raise CigarParsingException(f"Malformed cigar tuples: {cigartuples}") from ex

    @classmethod
    def _pretty_cigarstring_exception(cls, cigarstring: str, index: int) -> CigarParsingException:
        """Returns an exception highlighting the malformed character."""
        prefix = cigarstring[:index]
        character = cigarstring[index] if index < len(cigarstring) else ""
        suffix = cigarstring[index + 1 :]
        pretty_cigarstring = f"{prefix}[{character}]{suffix}"
        message = f"Malformed cigar: {pretty_cigarstring}"
        return CigarParsingException(message)

    @classmethod
    def from_cigarstring(cls, cigarstring: str) -> "Cigar":
        """
        Constructs a Cigar from a string returned by pysam.

        If "*" is given, returns an empty Cigar.
        """
        if cigarstring == "*":
            return Cigar()

        cigarstring_length = len(cigarstring)
        if cigarstring_length == 0:
            raise CigarParsingException("Cigar string was empty")

        elements = []
        i = 0
        while i < cigarstring_length:
            if not cigarstring[i].isdigit():
                raise cls._pretty_cigarstring_exception(cigarstring, i)
            length = int(cigarstring[i])
            i += 1
            while i < cigarstring_length and cigarstring[i].isdigit():
                length = (length * 10) + int(cigarstring[i])
                i += 1
            if i == cigarstring_length:
                raise cls._pretty_cigarstring_exception(cigarstring, i)
            try:
                operator = CigarOp.from_character(cigarstring[i])
                elements.append(CigarElement(length, operator))
            except KeyError as ex:
                # cigar operator was not valid
                raise cls._pretty_cigarstring_exception(cigarstring, i) from ex
            except IndexError as ex:
                # missing cigar operator (i == len(cigarstring))
                raise cls._pretty_cigarstring_exception(cigarstring, i) from ex
            i += 1
        return Cigar(tuple(elements))

    def __str__(self) -> str:
        """Returns the CIGAR string, or '*' if empty."""
        if self.elements:
            return "".join([str(e) for e in self.elements])
        else:
            return "*"

    def reversed(self) -> "Cigar":
        """Returns a copy of the Cigar with the elements in reverse order."""
        return Cigar(tuple(reversed(self.elements)))

    def length_on_query(self) -> int:
        """Returns the length of the alignment on the query sequence."""
        return sum([elem.length_on_query for elem in self.elements])

    def length_on_target(self) -> int:
        """Returns the length of the alignment on the target sequence."""
        return sum([elem.length_on_target for elem in self.elements])

    def coalesce(self) -> "Cigar":
        """
        Returns a new Cigar with adjacent elements of the same operator merged.

        For example, ``Cigar.from_cigarstring("10M10M")`` would be coalesced to
        ``Cigar.from_cigarstring("20M")``.

        Returns:
            A new Cigar with adjacent same-operator elements merged, or this Cigar if
            no coalescing is needed.

        Examples:
            >>> str(Cigar.from_cigarstring("10M10M").coalesce())
            '20M'
            >>> str(Cigar.from_cigarstring("10M5I5I10M").coalesce())
            '10M10I10M'
        """
        if len(self.elements) <= 1:
            return self
        result: list[CigarElement] = []
        for elem in self.elements:
            if result and result[-1].operator == elem.operator:
                result[-1] = CigarElement(
                    length=result[-1].length + elem.length,
                    operator=elem.operator,
                )
            else:
                result.append(elem)
        coalesced = tuple(result)
        if coalesced == self.elements:
            return self
        return Cigar(coalesced)

    def query_alignment_offsets(self) -> tuple[int, int]:
        """
        Gets the 0-based, end-exclusive positions of the first and last aligned base in the query.

        The resulting range will contain the range of positions in the SEQ string for
        the bases that are aligned.
        If counting from the end of the query is desired, use
        `cigar.reversed().query_alignment_offsets()`

        Returns:
            A tuple (start, stop) containing the start and stop positions
                of the aligned part of the query. These offsets are 0-based and open-ended, with
                respect to the beginning of the query.

        Raises:
            ValueError: If according to the cigar, there are no aligned query bases.
        """
        start_offset: int = 0
        end_offset: int = 0
        element: CigarElement
        alignment_began = False
        for element in self.elements:
            if element.operator.is_clipping and not alignment_began:
                # We are in the clipping operators preceding the alignment
                # Note: hardclips have length-on-query=0
                start_offset += element.length_on_query
                end_offset += element.length_on_query
            elif not element.operator.is_clipping:
                # We are within the alignment
                alignment_began = True
                end_offset += element.length_on_query
            else:
                # We have exited the alignment and are in the clipping operators after the alignment
                break

        if start_offset == end_offset:
            raise ValueError(f"Cigar {self} has no aligned bases")
        return start_offset, end_offset

    def _truncate(self, length: int, should_count: Callable[[CigarElement], bool]) -> "Cigar":
        """
        Truncates the CIGAR to a specified length based on a predicate.

        This private helper method iterates through CIGAR elements and builds a new CIGAR
        that contains at most `length` bases from elements matching the predicate. Position
        tracking starts at 0 (0-based, Pythonic convention). Elements not matching should_count
        are included without counting. If an element would exceed the limit, it's clipped to
        fit exactly.

        Args:
            length: The maximum number of bases to keep (for elements matching should_count)
            should_count: A function that takes a CigarElement and returns True if its
                         bases should be counted toward the length limit

        Returns:
            A new Cigar truncated to the specified length
        """
        if length < 0:
            raise ValueError(f"length must be >= 0, got {length}")
        remaining = length
        builder: list[CigarElement] = []
        for elem in self.elements:
            if remaining <= 0:
                break
            if should_count(elem):
                take = min(elem.length, remaining)
                builder.append(CigarElement(length=take, operator=elem.operator))
                remaining -= take
            else:
                builder.append(elem)

        return Cigar(tuple(builder))

    def truncate_to_query_length(self, length: int) -> "Cigar":
        """
        Truncates the CIGAR to the specified query sequence length.

        Produces a new CIGAR that includes at most the specified number of bases
        from the query sequence. Only CIGAR operators that consume query bases
        (M, I, S, =, X) are counted toward the length limit.

        Args:
            length: The maximum number of query bases to include

        Returns:
            A new Cigar truncated to the specified query length

        Examples:
            >>> cigar = Cigar.from_cigarstring("10M5I10M")
            >>> str(cigar.truncate_to_query_length(15))
            '10M5I'
            >>> str(cigar.truncate_to_query_length(12))
            '10M2I'
        """
        return self._truncate(length, lambda e: e.operator.consumes_query)

    def truncate_to_target_length(self, length: int) -> "Cigar":
        """
        Truncates the CIGAR to the specified reference/target sequence length.

        Produces a new CIGAR that includes at most the specified number of bases
        from the reference/target sequence. Only CIGAR operators that consume
        reference bases (M, D, N, =, X) are counted toward the length limit.

        Args:
            length: The maximum number of reference/target bases to include

        Returns:
            A new Cigar truncated to the specified target length

        Examples:
            >>> cigar = Cigar.from_cigarstring("10M5D10M")
            >>> str(cigar.truncate_to_target_length(15))
            '10M5D'
            >>> str(cigar.truncate_to_target_length(12))
            '10M2D'
        """
        return self._truncate(length, lambda e: e.operator.consumes_reference)

Functions¶

__str__ ¶

__str__() -> str

Returns the CIGAR string, or '*' if empty.

Source code in fgpyo/sam/__init__.py

def __str__(self) -> str:
    """Returns the CIGAR string, or '*' if empty."""
    if self.elements:
        return "".join([str(e) for e in self.elements])
    else:
        return "*"

coalesce ¶

coalesce() -> Cigar

Returns a new Cigar with adjacent elements of the same operator merged.

For example, Cigar.from_cigarstring("10M10M") would be coalesced to Cigar.from_cigarstring("20M").

Returns:

Type	Description
`Cigar`	A new Cigar with adjacent same-operator elements merged, or this Cigar if no coalescing is needed.

Examples:

>>> str(Cigar.from_cigarstring("10M10M").coalesce())
'20M'
>>> str(Cigar.from_cigarstring("10M5I5I10M").coalesce())
'10M10I10M'
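
The same merging can be expressed compactly with itertools.groupby. This is an alternative sketch over plain (length, operator) tuples, not the implementation used by Cigar; the standalone `coalesce` function here is a hypothetical name.

```python
from itertools import groupby
from typing import List, Tuple


def coalesce(elements: List[Tuple[int, str]]) -> List[Tuple[int, str]]:
    """Merge adjacent (length, operator) pairs that share an operator."""
    return [
        (sum(length for length, _ in group), op)
        # groupby only groups consecutive items, which is exactly "adjacent".
        for op, group in groupby(elements, key=lambda element: element[1])
    ]


print(coalesce([(10, "M"), (5, "I"), (5, "I"), (10, "M")]))
```

Unlike the method documented here, this version always builds a new list, even when nothing needs merging.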

Source code in fgpyo/sam/__init__.py

def coalesce(self) -> "Cigar":
    """
    Returns a new Cigar with adjacent elements of the same operator merged.

    For example, ``Cigar.from_cigarstring("10M10M")`` would be coalesced to
    ``Cigar.from_cigarstring("20M")``.

    Returns:
        A new Cigar with adjacent same-operator elements merged, or this Cigar if
        no coalescing is needed.

    Examples:
        >>> str(Cigar.from_cigarstring("10M10M").coalesce())
        '20M'
        >>> str(Cigar.from_cigarstring("10M5I5I10M").coalesce())
        '10M10I10M'
    """
    if len(self.elements) <= 1:
        return self
    result: list[CigarElement] = []
    for elem in self.elements:
        if result and result[-1].operator == elem.operator:
            result[-1] = CigarElement(
                length=result[-1].length + elem.length,
                operator=elem.operator,
            )
        else:
            result.append(elem)
    coalesced = tuple(result)
    if coalesced == self.elements:
        return self
    return Cigar(coalesced)

from_cigarstring classmethod ¶

from_cigarstring(cigarstring: str) -> Cigar

Constructs a Cigar from a string returned by pysam.

If "*" is given, returns an empty Cigar.

Source code in fgpyo/sam/__init__.py

@classmethod
def from_cigarstring(cls, cigarstring: str) -> "Cigar":
    """
    Constructs a Cigar from a string returned by pysam.

    If "*" is given, returns an empty Cigar.
    """
    if cigarstring == "*":
        return Cigar()

    cigarstring_length = len(cigarstring)
    if cigarstring_length == 0:
        raise CigarParsingException("Cigar string was empty")

    elements = []
    i = 0
    while i < cigarstring_length:
        if not cigarstring[i].isdigit():
            raise cls._pretty_cigarstring_exception(cigarstring, i)
        length = int(cigarstring[i])
        i += 1
        while i < cigarstring_length and cigarstring[i].isdigit():
            length = (length * 10) + int(cigarstring[i])
            i += 1
        if i == cigarstring_length:
            raise cls._pretty_cigarstring_exception(cigarstring, i)
        try:
            operator = CigarOp.from_character(cigarstring[i])
            elements.append(CigarElement(length, operator))
        except KeyError as ex:
            # cigar operator was not valid
            raise cls._pretty_cigarstring_exception(cigarstring, i) from ex
        except IndexError as ex:
            # missing cigar operator (i == len(cigarstring))
            raise cls._pretty_cigarstring_exception(cigarstring, i) from ex
        i += 1
    return Cigar(tuple(elements))

from_cigartuples classmethod ¶

from_cigartuples(cigartuples: list[tuple[int, int]] | None) -> Cigar

Returns a Cigar from a list of tuples returned by pysam.

Each tuple denotes the operation and length. See CigarOp() for more information on the various operators. If None is given, returns an empty Cigar.
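
As a quick illustration of the tuple layout, the sketch below turns (code, length) tuples back into a CIGAR string. It assumes operator codes 0 through 8 map to the characters "MIDNSHP=X" in that order, which matches the CigarOp values; `CODE_TO_CHARACTER` and `cigartuples_to_string` are hypothetical names for this example.

```python
from typing import List, Tuple

# Operator codes 0-8 mapped to their single-character operators.
CODE_TO_CHARACTER = dict(enumerate("MIDNSHP=X"))


def cigartuples_to_string(cigartuples: List[Tuple[int, int]]) -> str:
    """Render (code, length) tuples as a CIGAR string, or '*' if there are none."""
    if not cigartuples:
        return "*"
    # Note the order within each tuple: operator code first, then length.
    return "".join(f"{length}{CODE_TO_CHARACTER[code]}" for code, length in cigartuples)


print(cigartuples_to_string([(0, 50), (2, 2), (0, 5), (4, 10)]))
```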

Source code in fgpyo/sam/__init__.py

@classmethod
def from_cigartuples(cls, cigartuples: list[tuple[int, int]] | None) -> "Cigar":
    """
    Returns a Cigar from a list of tuples returned by pysam.

    Each tuple denotes the operation and length.  See
    [`CigarOp()`][fgpyo.sam.CigarOp] for more information on the
    various operators.  If None is given, returns an empty Cigar.
    """
    if cigartuples is None or cigartuples == []:
        return Cigar()
    try:
        elements = []
        for code, length in cigartuples:
            operator = CigarOp.from_code(code)
            elements.append(CigarElement(length, operator))
        return Cigar(tuple(elements))
    except Exception as ex:
        raise CigarParsingException(f"Malformed cigar tuples: {cigartuples}") from ex

length_on_query ¶

length_on_query() -> int

Returns the length of the alignment on the query sequence.

Source code in fgpyo/sam/__init__.py

def length_on_query(self) -> int:
    """Returns the length of the alignment on the query sequence."""
    return sum([elem.length_on_query for elem in self.elements])

length_on_target ¶

length_on_target() -> int

Returns the length of the alignment on the target sequence.

Source code in fgpyo/sam/__init__.py

def length_on_target(self) -> int:
    """Returns the length of the alignment on the target sequence."""
    return sum([elem.length_on_target for elem in self.elements])

query_alignment_offsets ¶

query_alignment_offsets() -> tuple[int, int]

Gets the 0-based, end-exclusive positions of the first and last aligned base in the query.

The resulting range will contain the range of positions in the SEQ string for the bases that are aligned. If counting from the end of the query is desired, use cigar.reversed().query_alignment_offsets().

Returns:

Type	Description
`tuple[int, int]`	A tuple (start, stop) containing the start and stop positions of the aligned part of the query. These offsets are 0-based and open-ended, with respect to the beginning of the query.

Raises:

Type	Description
`ValueError`	If according to the cigar, there are no aligned query bases.
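
A worked example may help with the half-open convention. The sketch below mirrors the documented behaviour on plain (length, operator) tuples; the standalone function is a hypothetical re-statement for illustration, and it raises a simplified error message.

```python
from typing import List, Tuple


def query_alignment_offsets(elements: List[Tuple[int, str]]) -> Tuple[int, int]:
    """Return the 0-based, end-exclusive span of aligned bases in the query."""
    start = end = 0
    alignment_began = False
    for length, op in elements:
        on_query = length if op in "MIS=X" else 0  # hard clips consume no query bases
        is_clipping = op in "SH"
        if is_clipping and not alignment_began:
            start += on_query
            end += on_query
        elif not is_clipping:
            alignment_began = True
            end += on_query
        else:
            break  # trailing clipping: the alignment has ended
    if start == end:
        raise ValueError("no aligned bases")
    return start, end


# 5H10S50M2I10M5S: 10 soft-clipped bases, then 62 aligned query bases.
elements = [(5, "H"), (10, "S"), (50, "M"), (2, "I"), (10, "M"), (5, "S")]
start, end = query_alignment_offsets(elements)
sequence = "N" * 10 + "A" * 62 + "N" * 5
print(start, end, sequence[start:end] == "A" * 62)
```

Because the offsets are half-open, they can be used directly as a Python slice on the query sequence.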

Source code in fgpyo/sam/__init__.py

def query_alignment_offsets(self) -> tuple[int, int]:
    """
    Gets the 0-based, end-exclusive positions of the first and last aligned base in the query.

    The resulting range will contain the range of positions in the SEQ string for
    the bases that are aligned.
    If counting from the end of the query is desired, use
    `cigar.reversed().query_alignment_offsets()`

    Returns:
        A tuple (start, stop) containing the start and stop positions
            of the aligned part of the query. These offsets are 0-based and open-ended, with
            respect to the beginning of the query.

    Raises:
        ValueError: If according to the cigar, there are no aligned query bases.
    """
    start_offset: int = 0
    end_offset: int = 0
    element: CigarElement
    alignment_began = False
    for element in self.elements:
        if element.operator.is_clipping and not alignment_began:
            # We are in the clipping operators preceding the alignment
            # Note: hardclips have length-on-query=0
            start_offset += element.length_on_query
            end_offset += element.length_on_query
        elif not element.operator.is_clipping:
            # We are within the alignment
            alignment_began = True
            end_offset += element.length_on_query
        else:
            # We have exited the alignment and are in the clipping operators after the alignment
            break

    if start_offset == end_offset:
        raise ValueError(f"Cigar {self} has no aligned bases")
    return start_offset, end_offset

reversed ¶

reversed() -> Cigar

Returns a copy of the Cigar with the elements in reverse order.

Source code in fgpyo/sam/__init__.py

def reversed(self) -> "Cigar":
    """Returns a copy of the Cigar with the elements in reverse order."""
    return Cigar(tuple(reversed(self.elements)))

truncate_to_query_length ¶

truncate_to_query_length(length: int) -> Cigar

Truncates the CIGAR to the specified query sequence length.

Produces a new CIGAR that includes at most the specified number of bases from the query sequence. Only CIGAR operators that consume query bases (M, I, S, =, X) are counted toward the length limit.

Parameters:

Name	Type	Description	Default
`length`	`int`	The maximum number of query bases to include	required

Returns:

Type	Description
`Cigar`	A new Cigar truncated to the specified query length

Examples:

>>> cigar = Cigar.from_cigarstring("10M5I10M")
>>> str(cigar.truncate_to_query_length(15))
'10M5I'
>>> str(cigar.truncate_to_query_length(12))
'10M2I'
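
One detail the examples above do not show is what happens to operators that do not count toward the limit: they are kept as long as the limit has not yet been reached. The sketch below restates the truncation logic over plain (length, operator) tuples to make that visible; `truncate` and its `counted_ops` parameter are hypothetical names for this illustration.

```python
from typing import List, Tuple

Element = Tuple[int, str]


def truncate(elements: List[Element], length: int, counted_ops: str) -> List[Element]:
    """Keep at most `length` bases from operators listed in `counted_ops`."""
    if length < 0:
        raise ValueError(f"length must be >= 0, got {length}")
    remaining = length
    kept: List[Element] = []
    for elem_length, op in elements:
        if remaining <= 0:
            break
        if op in counted_ops:
            take = min(elem_length, remaining)
            kept.append((take, op))
            remaining -= take
        else:
            # Uncounted operators pass through unchanged.
            kept.append((elem_length, op))
    return kept


cigar = [(10, "M"), (5, "D"), (10, "M")]
# The deletion consumes no query bases, so it is kept whole when the limit is 12.
print(truncate(cigar, 12, "MIS=X"))
# With a limit of 10 the loop stops before the deletion is reached.
print(truncate(cigar, 10, "MIS=X"))
```

Passing "MDN=X" as `counted_ops` gives the corresponding truncation by reference length.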

Source code in fgpyo/sam/__init__.py

def truncate_to_query_length(self, length: int) -> "Cigar":
    """
    Truncates the CIGAR to the specified query sequence length.

    Produces a new CIGAR that includes at most the specified number of bases
    from the query sequence. Only CIGAR operators that consume query bases
    (M, I, S, =, X) are counted toward the length limit.

    Args:
        length: The maximum number of query bases to include

    Returns:
        A new Cigar truncated to the specified query length

    Examples:
        >>> cigar = Cigar.from_cigarstring("10M5I10M")
        >>> str(cigar.truncate_to_query_length(15))
        '10M5I'
        >>> str(cigar.truncate_to_query_length(12))
        '10M2I'
    """
    return self._truncate(length, lambda e: e.operator.consumes_query)

truncate_to_target_length ¶

truncate_to_target_length(length: int) -> Cigar

Truncates the CIGAR to the specified reference/target sequence length.

Produces a new CIGAR that includes at most the specified number of bases from the reference/target sequence. Only CIGAR operators that consume reference bases (M, D, N, =, X) are counted toward the length limit.

Parameters:

Name	Type	Description	Default
`length`	`int`	The maximum number of reference/target bases to include	required

Returns:

Type	Description
`Cigar`	A new Cigar truncated to the specified target length

Examples:

>>> cigar = Cigar.from_cigarstring("10M5D10M")
>>> str(cigar.truncate_to_target_length(15))
'10M5D'
>>> str(cigar.truncate_to_target_length(12))
'10M2D'

Source code in fgpyo/sam/__init__.py

def truncate_to_target_length(self, length: int) -> "Cigar":
    """
    Truncates the CIGAR to the specified reference/target sequence length.

    Produces a new CIGAR that includes at most the specified number of bases
    from the reference/target sequence. Only CIGAR operators that consume
    reference bases (M, D, N, =, X) are counted toward the length limit.

    Args:
        length: The maximum number of reference/target bases to include

    Returns:
        A new Cigar truncated to the specified target length

    Examples:
        >>> cigar = Cigar.from_cigarstring("10M5D10M")
        >>> str(cigar.truncate_to_target_length(15))
        '10M5D'
        >>> str(cigar.truncate_to_target_length(12))
        '10M2D'
    """
    return self._truncate(length, lambda e: e.operator.consumes_reference)

CigarElement ¶

Represents an element in a Cigar.

Attributes:

Name	Type	Description
`length`	`int`	the length of the element
`operator`	`CigarOp`	the operator of the element

Source code in fgpyo/sam/__init__.py

@attr.s(frozen=True, slots=True, auto_attribs=True)
class CigarElement:
    """
    Represents an element in a Cigar.

    Attributes:
        - length (int): the length of the element
        - operator (CigarOp): the operator of the element
    """

    length: int
    operator: CigarOp

    def __attrs_post_init__(self) -> None:
        """Validates the length attribute is greater than zero."""
        if self.length <= 0:
            raise ValueError(f"Cigar element must have a length > 0, found {self.length}")

    @property
    def length_on_query(self) -> int:
        """Returns the length of the element on the query sequence."""
        return self.length if self.operator.consumes_query else 0

    @property
    def length_on_target(self) -> int:
        """Returns the length of the element on the target (often reference) sequence."""
        return self.length if self.operator.consumes_reference else 0

    def __str__(self) -> str:
        """Returns the string representation (e.g. '10M')."""
        return f"{self.length}{self.operator.character}"

Attributes¶

length_on_query property ¶

length_on_query: int

Returns the length of the element on the query sequence.

length_on_target property ¶

length_on_target: int

Returns the length of the element on the target (often reference) sequence.

Functions¶

__attrs_post_init__ ¶

__attrs_post_init__() -> None

Validates the length attribute is greater than zero.
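
For readers who do not use attrs, the same validate-on-construction pattern can be written with the standard library. This is a rough analogue only: `Element` is a hypothetical frozen dataclass whose operator is a plain string rather than a CigarOp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A stdlib-only stand-in for a cigar element (hypothetical, for illustration)."""

    length: int
    operator: str

    def __post_init__(self) -> None:
        # Runs after the generated __init__, much like __attrs_post_init__.
        if self.length <= 0:
            raise ValueError(f"Cigar element must have a length > 0, found {self.length}")

    def __str__(self) -> str:
        return f"{self.length}{self.operator}"


print(Element(10, "M"))
```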

Source code in fgpyo/sam/__init__.py

def __attrs_post_init__(self) -> None:
    """Validates the length attribute is greater than zero."""
    if self.length <= 0:
        raise ValueError(f"Cigar element must have a length > 0, found {self.length}")

__str__ ¶

__str__() -> str

Returns the string representation (e.g. '10M').

Source code in fgpyo/sam/__init__.py

def __str__(self) -> str:
    """Returns the string representation (e.g. '10M')."""
    return f"{self.length}{self.operator.character}"

CigarOp ¶

Bases: Enum

Enumeration of operators that can appear in a Cigar string.

Attributes:

Name	Type	Description
`code`	`int`	The pysam cigar operator code.
`character`	`str`	The single character cigar operator.
`consumes_query`	`bool`	True if this operator consumes query bases, False otherwise.
`consumes_reference`	`bool`	True if this operator consumes reference (target) bases, False otherwise.

Source code in fgpyo/sam/__init__.py

@enum.unique
class CigarOp(enum.Enum):
    """
    Enumeration of operators that can appear in a Cigar string.

    Attributes:
        code (int): The pysam cigar operator code.
        character (str): The single character cigar operator.
        consumes_query (bool): True if this operator consumes query bases, False otherwise.
        consumes_reference (bool): True if this operator consumes reference (target) bases,
            False otherwise.
    """

    M = (0, "M", True, True)  #: Match or Mismatch the reference
    I = (1, "I", True, False)  #: Insertion versus the reference  # noqa: E741
    D = (2, "D", False, True)  #: Deletion versus the reference
    N = (3, "N", False, True)  #: Skipped region from the reference
    S = (4, "S", True, False)  #: Soft clip
    H = (5, "H", False, False)  #: Hard clip
    P = (6, "P", False, False)  #: Padding
    EQ = (7, "=", True, True)  #: Matches the reference
    X = (8, "X", True, True)  #: Mismatches the reference

    def __init__(
        self, code: int, character: str, consumes_query: bool, consumes_reference: bool
    ) -> None:
        """Initializes the CIGAR operator with the given code, character, and consumption flags."""
        self.code = code
        self.character = character
        self.consumes_query = consumes_query
        self.consumes_reference = consumes_reference

    @staticmethod
    def from_character(character: str) -> "CigarOp":
        """Returns the operator from the single character."""
        if CigarOp.EQ.character == character:
            return CigarOp.EQ
        else:
            return CigarOp[character]

    @staticmethod
    def from_code(code: int) -> "CigarOp":
        """
        Returns the operator from the given operator code.

        Note: this is mainly used to get the operator from :py:mod:`~pysam`.
        """
        return CigarOp[_CigarOpUtil.CODE_TO_CHARACTER[code]]

    @property
    def is_indel(self) -> bool:
        """Returns true if the operator is an indel, false otherwise."""
        return self == CigarOp.I or self == CigarOp.D

    @property
    def is_clipping(self) -> bool:
        """Returns true if the operator is a soft/hard clip, false otherwise."""
        return self == CigarOp.S or self == CigarOp.H

Attributes¶

is_clipping property ¶

is_clipping: bool

Returns true if the operator is a soft/hard clip, false otherwise.

is_indel property ¶

is_indel: bool

Returns true if the operator is an indel, false otherwise.

Functions¶

__init__ ¶

__init__(code: int, character: str, consumes_query: bool, consumes_reference: bool) -> None

Initializes the CIGAR operator with the given code, character, and consumption flags.

Source code in fgpyo/sam/__init__.py

def __init__(
    self, code: int, character: str, consumes_query: bool, consumes_reference: bool
) -> None:
    """Initializes the CIGAR operator with the given code, character, and consumption flags."""
    self.code = code
    self.character = character
    self.consumes_query = consumes_query
    self.consumes_reference = consumes_reference

from_character staticmethod ¶

from_character(character: str) -> CigarOp

Returns the operator from the single character.
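
The "=" operator needs special handling because enum members are looked up by name, and "=" cannot be a Python identifier. The sketch below demonstrates this with a cut-down, hypothetical `Op` enum, and shows one alternative (a character-to-member table) that avoids the special case; it is not how CigarOp is implemented.

```python
import enum


class Op(enum.Enum):
    """A two-member stand-in for CigarOp (hypothetical, for illustration)."""

    M = (0, "M")
    EQ = (7, "=")  # "=" cannot be a Python identifier, so the member is named EQ

    def __init__(self, code: int, character: str) -> None:
        self.code = code
        self.character = character


# Name-based lookup works when the character equals the member name...
assert Op["M"] is Op.M

# ...but "=" is not a member name, so the same lookup raises KeyError.
try:
    Op["="]
except KeyError:
    print("'=' is not a member name")

# One alternative: build a character-to-member table once and look up by character.
BY_CHARACTER = {op.character: op for op in Op}
print(BY_CHARACTER["="])
```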

Source code in fgpyo/sam/__init__.py

@staticmethod
def from_character(character: str) -> "CigarOp":
    """Returns the operator from the single character."""
    if CigarOp.EQ.character == character:
        return CigarOp.EQ
    else:
        return CigarOp[character]

from_code staticmethod ¶

from_code(code: int) -> CigarOp

Returns the operator from the given operator code.

Note: this is mainly used to get the operator from pysam.

Source code in fgpyo/sam/__init__.py

@staticmethod
def from_code(code: int) -> "CigarOp":
    """
    Returns the operator from the given operator code.

    Note: this is mainly used to get the operator from :py:mod:`~pysam`.
    """
    return CigarOp[_CigarOpUtil.CODE_TO_CHARACTER[code]]

CigarParsingException ¶

Bases: Exception

The exception raised specific to parsing a cigar.

Source code in fgpyo/sam/__init__.py

class CigarParsingException(Exception):  # noqa: N818
    """The exception raised specific to parsing a cigar."""

    pass

PairOrientation ¶

Bases: Enum

Enumerations of read pair orientations.

Source code in fgpyo/sam/__init__.py

@enum.unique
class PairOrientation(enum.Enum):
    """Enumerations of read pair orientations."""

    FR = "FR"
    """A pair orientation for forward-reverse reads ("innie")."""

    RF = "RF"
    """A pair orientation for reverse-forward reads ("outie")."""

    TANDEM = "TANDEM"
    """A pair orientation for tandem (forward-forward or reverse-reverse) reads."""

    @classmethod
    def from_recs(  # noqa: C901  # `from_recs` is too complex (11 > 10)
        cls, rec1: AlignedSegment, rec2: AlignedSegment | None = None
    ) -> "PairOrientation | None":
        """
        Returns the pair orientation if both reads are mapped to the same reference sequence.

        Args:
            rec1: The first record in the pair.
            rec2: The second record in the pair. If None, then mate info on `rec1` will be used.

        See:
            [`htsjdk.samtools.SamPairUtil.getPairOrientation()`](https://github.com/samtools/htsjdk/blob/c31bc92c24bc4e9552b2a913e52286edf8f8ab96/src/main/java/htsjdk/samtools/SamPairUtil.java#L71-L102)
        """
        if rec2 is None:
            rec2_is_unmapped = rec1.mate_is_unmapped
            rec2_reference_id = rec1.next_reference_id
        else:
            rec2_is_unmapped = rec2.is_unmapped
            rec2_reference_id = rec2.reference_id

        if rec1.is_unmapped or rec2_is_unmapped or rec1.reference_id != rec2_reference_id:
            return None

        if rec2 is None:
            rec2_is_forward = rec1.mate_is_forward
            rec2_reference_start = rec1.next_reference_start
        else:
            rec2_is_forward = rec2.is_forward
            rec2_reference_start = rec2.reference_start

        assert rec1.reference_end is not None  # type narrowing

        if rec1.is_forward is rec2_is_forward:
            return PairOrientation.TANDEM
        if rec1.is_forward and rec1.reference_start <= rec2_reference_start:
            return PairOrientation.FR
        if rec1.is_reverse and rec2_reference_start < rec1.reference_end:
            return PairOrientation.FR
        if rec1.is_reverse and rec2_reference_start >= rec1.reference_end:
            return PairOrientation.RF

        if rec2 is None:
            if not rec1.has_tag("MC"):
                raise ValueError('Cannot determine pair orientation without a mate cigar ("MC")!')
            rec2_cigar = Cigar.from_cigarstring(str(rec1.get_tag("MC")))
            rec2_reference_end = rec1.next_reference_start + rec2_cigar.length_on_target()
        else:
            assert rec2.reference_end is not None  # type narrowing
            rec2_reference_end = rec2.reference_end

        if rec1.reference_start < rec2_reference_end:
            return PairOrientation.FR
        else:
            return PairOrientation.RF

Attributes¶

FR class-attribute instance-attribute ¶

FR = 'FR'

A pair orientation for forward-reverse reads ("innie").

RF class-attribute instance-attribute ¶

RF = 'RF'

A pair orientation for reverse-forward reads ("outie").

TANDEM class-attribute instance-attribute ¶

TANDEM = 'TANDEM'

A pair orientation for tandem (forward-forward or reverse-reverse) reads.

Functions¶

from_recs classmethod ¶

from_recs(rec1: AlignedSegment, rec2: AlignedSegment | None = None) -> PairOrientation | None

Returns the pair orientation if both reads are mapped to the same reference sequence.

Parameters:

Name	Type	Description	Default
`rec1`	`AlignedSegment`	The first record in the pair.	required
`rec2`	`AlignedSegment \| None`	The second record in the pair. If None, then mate info on `rec1` will be used.	`None`

See

htsjdk.samtools.SamPairUtil.getPairOrientation()
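
When both records are available, the branches in `from_recs` appear to reduce to one comparison: the start of the forward read against the end of the reverse read. The sketch below restates that reading with a hypothetical `Rec` dataclass in place of `pysam.AlignedSegment`. It assumes both reads are mapped and each spans at least one reference base, and it is not the library's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rec:
    """Hypothetical stand-in for a mapped alignment record."""

    reference_id: int
    reference_start: int  # 0-based inclusive
    reference_end: int  # 0-based exclusive
    is_forward: bool


def orientation(rec1: Rec, rec2: Rec) -> Optional[str]:
    """Condensed restatement of the orientation decision for two mapped records."""
    if rec1.reference_id != rec2.reference_id:
        return None
    if rec1.is_forward is rec2.is_forward:
        return "TANDEM"
    forward, reverse = (rec1, rec2) if rec1.is_forward else (rec2, rec1)
    # "Innie" when the forward read starts before the reverse read ends.
    return "FR" if forward.reference_start < reverse.reference_end else "RF"


print(orientation(Rec(0, 100, 200, True), Rec(0, 250, 350, False)))
print(orientation(Rec(0, 100, 200, False), Rec(0, 300, 400, True)))
```

The first pair points inward and the second points outward, so the two calls differ only in which read is on the forward strand.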

Source code in fgpyo/sam/__init__.py

@classmethod
def from_recs(  # noqa: C901  # `from_recs` is too complex (11 > 10)
    cls, rec1: AlignedSegment, rec2: AlignedSegment | None = None
) -> "PairOrientation | None":
    """
    Returns the pair orientation if both reads are mapped to the same reference sequence.

    Args:
        rec1: The first record in the pair.
        rec2: The second record in the pair. If None, then mate info on `rec1` will be used.

    See:
        [`htsjdk.samtools.SamPairUtil.getPairOrientation()`](https://github.com/samtools/htsjdk/blob/c31bc92c24bc4e9552b2a913e52286edf8f8ab96/src/main/java/htsjdk/samtools/SamPairUtil.java#L71-L102)
    """
    if rec2 is None:
        rec2_is_unmapped = rec1.mate_is_unmapped
        rec2_reference_id = rec1.next_reference_id
    else:
        rec2_is_unmapped = rec2.is_unmapped
        rec2_reference_id = rec2.reference_id

    if rec1.is_unmapped or rec2_is_unmapped or rec1.reference_id != rec2_reference_id:
        return None

    if rec2 is None:
        rec2_is_forward = rec1.mate_is_forward
        rec2_reference_start = rec1.next_reference_start
    else:
        rec2_is_forward = rec2.is_forward
        rec2_reference_start = rec2.reference_start

    assert rec1.reference_end is not None  # type narrowing

    if rec1.is_forward is rec2_is_forward:
        return PairOrientation.TANDEM
    if rec1.is_forward and rec1.reference_start <= rec2_reference_start:
        return PairOrientation.FR
    if rec1.is_reverse and rec2_reference_start < rec1.reference_end:
        return PairOrientation.FR
    if rec1.is_reverse and rec2_reference_start >= rec1.reference_end:
        return PairOrientation.RF

    if rec2 is None:
        if not rec1.has_tag("MC"):
            raise ValueError('Cannot determine pair orientation without a mate cigar ("MC")!')
        rec2_cigar = Cigar.from_cigarstring(str(rec1.get_tag("MC")))
        rec2_reference_end = rec1.next_reference_start + rec2_cigar.length_on_target()
    else:
        assert rec2.reference_end is not None  # type narrowing
        rec2_reference_end = rec2.reference_end

    if rec1.reference_start < rec2_reference_end:
        return PairOrientation.FR
    else:
        return PairOrientation.RF

ReadEditInfo ¶

Counts various stats about how a read compares to a reference sequence.

Attributes:

Name	Type	Description
`matches`	`int`	the number of bases in the read that match the reference
`mismatches`	`int`	the number of mismatches between the read sequence and the reference sequence as dictated by the alignment. As in the SAM NM tag computation, any base in the read other than A/C/G/T is considered a mismatch.
`insertions`	`int`	the number of insertions in the read vs. the reference. I.e. the number of I operators in the CIGAR string.
`inserted_bases`	`int`	the total number of bases contained within insertions in the read
`deletions`	`int`	the number of deletions in the read vs. the reference. I.e. the number of D operators in the CIGAR string.
`deleted_bases`	`int`	the total number of bases that are deleted within the alignment (i.e. bases in the reference but not in the read).
`nm`	`int`	the computed value of the SAM NM tag, calculated as mismatches + inserted_bases + deleted_bases
`md`	`str`	the computed value of the SAM MD tag
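
The difference between event counts (`insertions`, `deletions`) and base counts (`inserted_bases`, `deleted_bases`) is easiest to see on a concrete CIGAR. The following sketch uses a made-up CIGAR string and an assumed mismatch count; it only illustrates the arithmetic from the table and does not call fgpyo.

```python
import re

cigar = "10M2I5M1D20M"  # hypothetical alignment
mismatches = 3  # assumed; in practice this comes from comparing read and reference bases

# Split the CIGAR into (length, operator) elements.
elements = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

insertions = sum(1 for _, op in elements if op == "I")  # number of I operators
inserted_bases = sum(n for n, op in elements if op == "I")  # bases inside insertions
deletions = sum(1 for _, op in elements if op == "D")  # number of D operators
deleted_bases = sum(n for n, op in elements if op == "D")  # reference bases deleted

nm = mismatches + inserted_bases + deleted_bases
print(insertions, inserted_bases, deletions, deleted_bases, nm)  # 1 2 1 1 6
```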

Source code in fgpyo/sam/__init__.py

@attr.s(frozen=True, auto_attribs=True)
class ReadEditInfo:
    """
    Counts various stats about how a read compares to a reference sequence.

    Attributes:
        matches: the number of bases in the read that match the reference
        mismatches: the number of mismatches between the read sequence and the reference sequence
            as dictated by the alignment.  As in the SAM NM tag computation, any base in the read
            other than A/C/G/T is considered a mismatch.
        insertions: the number of insertions in the read vs. the reference.  I.e. the number of I
            operators in the CIGAR string.
        inserted_bases: the total number of bases contained within insertions in the read
        deletions: the number of deletions in the read vs. the reference.  I.e. the number of D
            operators in the CIGAR string.
        deleted_bases: the total number of bases that are deleted within the alignment (i.e. bases
            in the reference but not in the read).
        nm: the computed value of the SAM NM tag, calculated as mismatches + inserted_bases +
            deleted_bases
        md: the computed value of the SAM MD tag
    """

    matches: int
    mismatches: int
    insertions: int
    inserted_bases: int
    deletions: int
    deleted_bases: int
    nm: int
    md: str

SamFileType ¶

Bases: Enum

Enumeration of valid SAM/BAM/CRAM file types.

Attributes:

Name	Type	Description
`mode`	`str`	The additional mode character to add when opening this file type.
`extension`	`str`	The standard file extension for this file type.

Source code in fgpyo/sam/__init__.py

@enum.unique
class SamFileType(enum.Enum):
    """
    Enumeration of valid SAM/BAM/CRAM file types.

    Attributes:
        mode (str): The additional mode character to add when opening this file type.
        extension (str): The standard file extension for this file type.
    """

    def __init__(self, mode: str, ext: str) -> None:
        """Initializes the file type with the given mode and extension."""
        self.mode = mode
        self.extension = ext

    SAM = ("", ".sam")
    BAM = ("b", ".bam")
    CRAM = ("c", ".cram")

    @property
    def indexable(self) -> bool:
        """True if the file type can be indexed, false otherwise."""
        return self is SamFileType.BAM or self is SamFileType.CRAM

    @classmethod
    def from_path(cls, path: Path | str) -> "SamFileType":
        """
        Infers the file type based on the file extension.

        Args:
            path: the path to the SAM/BAM/CRAM to read or write.
        """
        ext = Path(path).suffix
        try:
            return next(iter([tpe for tpe in SamFileType if tpe.extension == ext]))
        except StopIteration as ex:
            raise ValueError(f"Could not infer file type from {path}") from ex

Attributes¶

indexable property ¶

indexable: bool

True if the file type can be indexed, false otherwise.

Functions¶

__init__ ¶

__init__(mode: str, ext: str) -> None

Initializes the file type with the given mode and extension.

Source code in fgpyo/sam/__init__.py

def __init__(self, mode: str, ext: str) -> None:
    """Initializes the file type with the given mode and extension."""
    self.mode = mode
    self.extension = ext

from_path classmethod ¶

from_path(path: Path | str) -> SamFileType

Infers the file type based on the file extension.

Parameters:

Name	Type	Description	Default
`path`	`Path \| str`	the path to the SAM/BAM/CRAM to read or write.	required
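
Because the lookup uses only `Path(path).suffix`, the final extension decides the file type. The sketch below mirrors that behaviour with a hypothetical stand-in enum (`MiniFileType` is not part of fgpyo) so that it runs without pysam installed.

```python
import enum
from pathlib import Path
from typing import Union


class MiniFileType(enum.Enum):
    """Hypothetical stand-in carrying the same (mode, extension) pairs."""

    SAM = ("", ".sam")
    BAM = ("b", ".bam")
    CRAM = ("c", ".cram")

    def __init__(self, mode: str, extension: str) -> None:
        self.mode = mode
        self.extension = extension

    @classmethod
    def from_path(cls, path: Union[Path, str]) -> "MiniFileType":
        ext = Path(path).suffix  # only the last suffix, e.g. ".gz" for "x.sam.gz"
        for tpe in cls:
            if tpe.extension == ext:
                return tpe
        raise ValueError(f"Could not infer file type from {path}")


file_type = MiniFileType.from_path("reads.bam")
print(file_type, "w" + file_type.mode)
```

The `mode` character is meant to be appended to a base mode such as `"r"` or `"w"`, giving `"wb"` for BAM here. A path such as `reads.sam.gz` has the suffix `.gz` and is therefore rejected in this sketch.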

Source code in fgpyo/sam/__init__.py

@classmethod
def from_path(cls, path: Path | str) -> "SamFileType":
    """
    Infers the file type based on the file extension.

    Args:
        path: the path to the SAM/BAM/CRAM to read or write.
    """
    ext = Path(path).suffix
    try:
        return next(iter([tpe for tpe in SamFileType if tpe.extension == ext]))
    except StopIteration as ex:
        raise ValueError(f"Could not infer file type from {path}") from ex

SamOrder ¶

Bases: Enum

Enumerations of possible sort orders for a SAM file.

Source code in fgpyo/sam/__init__.py

class SamOrder(enum.Enum):
    """Enumerations of possible sort orders for a SAM file."""

    Unsorted = "unsorted"  #: the SAM / BAM / CRAM is unsorted
    Coordinate = "coordinate"  #: coordinate sorted
    QueryName = "queryname"  #: queryname sorted
    Unknown = "unknown"  #: unknown SAM / BAM / CRAM sort order

SupplementaryAlignment ¶

Stores a supplementary alignment record produced by BWA and stored in the SA SAM tag.

Attributes:

Name	Type	Description
`reference_name`	`str`	the name of the reference (i.e. contig, chromosome) aligned to
`start`	`int`	the 0-based start position of the alignment
`is_forward`	`bool`	true if the alignment is in the forward strand, false otherwise
`cigar`	`Cigar`	the cigar for the alignment
`mapq`	`int`	the mapping quality
`nm`	`int`	the number of edits

Source code in fgpyo/sam/__init__.py

@attr.s(frozen=True, auto_attribs=True)
class SupplementaryAlignment:
    """
    Stores a supplementary alignment record produced by BWA and stored in the SA SAM tag.

    Attributes:
        reference_name: the name of the reference (i.e. contig, chromosome) aligned to
        start: the 0-based start position of the alignment
        is_forward: true if the alignment is in the forward strand, false otherwise
        cigar: the cigar for the alignment
        mapq: the mapping quality
        nm: the number of edits
    """

    reference_name: str
    start: int
    is_forward: bool
    cigar: Cigar
    mapq: int
    nm: int

    def __str__(self) -> str:
        """Returns the comma-delimited SA tag representation."""
        return ",".join(
            str(item)
            for item in (
                self.reference_name,
                self.start + 1,
                "+" if self.is_forward else "-",
                self.cigar,
                self.mapq,
                self.nm,
            )
        )

    @property
    def end(self) -> int:
        """The 0-based exclusive end position of the alignment."""
        return self.start + self.cigar.length_on_target()

    @staticmethod
    def parse(string: str) -> "SupplementaryAlignment":
        """
        Returns a supplementary alignment parsed from the given string.

        The various fields should be comma-delimited (ex. `chr1,123,-,100M50S,60,4`).
        """
        fields = string.split(",")
        return SupplementaryAlignment(
            reference_name=fields[0],
            start=int(fields[1]) - 1,
            is_forward=fields[2] == "+",
            cigar=Cigar.from_cigarstring(fields[3]),
            mapq=int(fields[4]),
            nm=int(fields[5]),
        )

    @staticmethod
    def parse_sa_tag(tag: str) -> list["SupplementaryAlignment"]:
        """
        Parses an SA tag of supplementary alignments from a BAM file.

        If the tag is empty or contains just a single semi-colon then an empty list will be
        returned.  Otherwise a list containing a SupplementaryAlignment per ;-separated value
        in the tag will be returned.
        """
        return [SupplementaryAlignment.parse(a) for a in tag.split(";") if len(a) > 0]

    @classmethod
    def from_read(cls, read: pysam.AlignedSegment) -> list["SupplementaryAlignment"]:
        """
        Construct a list of SupplementaryAlignments from the SA tag in a pysam.AlignedSegment.

        Args:
            read: An alignment. The presence of the "SA" tag is not required.

        Returns:
            A list of all SupplementaryAlignments present in the SA tag.
            If the SA tag is not present, or it is empty, an empty list will be returned.
        """
        if read.has_tag("SA"):
            sa_tag: str = cast(str, read.get_tag("SA"))
            return cls.parse_sa_tag(sa_tag)
        else:
            return []

Attributes¶

end property ¶

end: int

The 0-based exclusive end position of the alignment.
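
The end position is the start plus the number of reference bases the CIGAR consumes. As a rough sketch of that arithmetic, the hypothetical helper below (`length_on_reference`, not an fgpyo function) sums the operators that consume the reference according to the SAM specification (M, D, N, = and X).

```python
import re

REFERENCE_CONSUMING = set("MDN=X")  # operators that advance along the reference


def length_on_reference(cigar: str) -> int:
    """Sum the lengths of all reference-consuming CIGAR elements."""
    return sum(
        int(length)
        for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in REFERENCE_CONSUMING
    )


start = 122  # 0-based start
end = start + length_on_reference("100M50S")  # the soft clip does not consume reference
print(end)  # 222
```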

Functions¶

__str__ ¶

__str__() -> str

Returns the comma-delimited SA tag representation.

Source code in fgpyo/sam/__init__.py

def __str__(self) -> str:
    """Returns the comma-delimited SA tag representation."""
    return ",".join(
        str(item)
        for item in (
            self.reference_name,
            self.start + 1,
            "+" if self.is_forward else "-",
            self.cigar,
            self.mapq,
            self.nm,
        )
    )

from_read classmethod ¶

from_read(read: AlignedSegment) -> list[SupplementaryAlignment]

Construct a list of SupplementaryAlignments from the SA tag in a pysam.AlignedSegment.

Parameters:

Name	Type	Description	Default
`read`	`AlignedSegment`	An alignment. The presence of the "SA" tag is not required.	required

Returns:

Type	Description
`list[SupplementaryAlignment]`	A list of all SupplementaryAlignments present in the SA tag. If the SA tag is not present, or it is empty, an empty list will be returned.

Source code in fgpyo/sam/__init__.py

@classmethod
def from_read(cls, read: pysam.AlignedSegment) -> list["SupplementaryAlignment"]:
    """
    Construct a list of SupplementaryAlignments from the SA tag in a pysam.AlignedSegment.

    Args:
        read: An alignment. The presence of the "SA" tag is not required.

    Returns:
        A list of all SupplementaryAlignments present in the SA tag.
        If the SA tag is not present, or it is empty, an empty list will be returned.
    """
    if read.has_tag("SA"):
        sa_tag: str = cast(str, read.get_tag("SA"))
        return cls.parse_sa_tag(sa_tag)
    else:
        return []

parse staticmethod ¶

parse(string: str) -> SupplementaryAlignment

Returns a supplementary alignment parsed from the given string.

The various fields should be comma-delimited (ex. chr1,123,-,100M50S,60,4).
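
The position in the SA tag is 1-based, while the stored `start` is 0-based, so parsing subtracts one and formatting adds it back. A minimal round-trip sketch, using a hypothetical `MiniSA` tuple that keeps the CIGAR as a plain string instead of a `Cigar` object:

```python
from typing import NamedTuple


class MiniSA(NamedTuple):
    """Hypothetical stand-in for SupplementaryAlignment."""

    reference_name: str
    start: int  # 0-based
    is_forward: bool
    cigar: str
    mapq: int
    nm: int


def parse(string: str) -> MiniSA:
    fields = string.split(",")
    return MiniSA(
        reference_name=fields[0],
        start=int(fields[1]) - 1,  # 1-based in the tag, 0-based in memory
        is_forward=fields[2] == "+",
        cigar=fields[3],
        mapq=int(fields[4]),
        nm=int(fields[5]),
    )


def to_tag(sa: MiniSA) -> str:
    strand = "+" if sa.is_forward else "-"
    items = (sa.reference_name, sa.start + 1, strand, sa.cigar, sa.mapq, sa.nm)
    return ",".join(str(item) for item in items)


sa = parse("chr1,123,-,100M50S,60,4")
print(sa.start, sa.is_forward)  # 122 False
print(to_tag(sa))  # chr1,123,-,100M50S,60,4
```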

Source code in fgpyo/sam/__init__.py

@staticmethod
def parse(string: str) -> "SupplementaryAlignment":
    """
    Returns a supplementary alignment parsed from the given string.

    The various fields should be comma-delimited (ex. `chr1,123,-,100M50S,60,4`).
    """
    fields = string.split(",")
    return SupplementaryAlignment(
        reference_name=fields[0],
        start=int(fields[1]) - 1,
        is_forward=fields[2] == "+",
        cigar=Cigar.from_cigarstring(fields[3]),
        mapq=int(fields[4]),
        nm=int(fields[5]),
    )

parse_sa_tag staticmethod ¶

parse_sa_tag(tag: str) -> list[SupplementaryAlignment]

Parses an SA tag of supplementary alignments from a BAM file.

If the tag is empty or contains just a single semi-colon then an empty list will be returned. Otherwise a list containing a SupplementaryAlignment per ;-separated value in the tag will be returned.
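
SA tag values conventionally end with a trailing semicolon, which is why empty pieces are dropped after splitting. The sketch below shows only the splitting step with a hypothetical helper (`split_sa_tag`); parsing each entry is left out.

```python
def split_sa_tag(tag: str) -> list[str]:
    """Split an SA tag value into its non-empty, semicolon-separated entries."""
    return [entry for entry in tag.split(";") if len(entry) > 0]


print(split_sa_tag(""))
print(split_sa_tag(";"))
print(split_sa_tag("chr1,123,-,100M50S,60,4;chr2,456,+,50S100M,30,0;"))
```

The first two calls give an empty list, and the third gives two entries despite the trailing semicolon.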

Source code in fgpyo/sam/__init__.py

@staticmethod
def parse_sa_tag(tag: str) -> list["SupplementaryAlignment"]:
    """
    Parses an SA tag of supplementary alignments from a BAM file.

    If the tag is empty or contains just a single semi-colon then an empty list will be
    returned.  Otherwise a list containing a SupplementaryAlignment per ;-separated value
    in the tag will be returned.
    """
    return [SupplementaryAlignment.parse(a) for a in tag.split(";") if len(a) > 0]

Template ¶

A container for alignment records corresponding to a single sequenced template or insert.

It is strongly preferred that new Template instances be created with Template.build(), which ensures that reads are stored in the correct Template property and, by default, runs basic validations of the Template. Users who construct Template instances directly are encouraged to call the validate method afterwards.

In the special case where alignment records are both secondary and supplementary, they are stored in the r1_supplementals and r2_supplementals fields only.

Attributes:

Name	Type	Description
`name`	`str`	the name of the template/query
`r1`	`AlignedSegment \| None`	Primary non-supplementary alignment for read 1, or None if there is none
`r2`	`AlignedSegment \| None`	Primary non-supplementary alignment for read 2, or None if there is none
`r1_supplementals`	`list[AlignedSegment]`	Supplementary alignments for read 1
`r2_supplementals`	`list[AlignedSegment]`	Supplementary alignments for read 2
`r1_secondaries`	`list[AlignedSegment]`	Secondary (non-primary, non-supplementary) alignments for read 1
`r2_secondaries`	`list[AlignedSegment]`	Secondary (non-primary, non-supplementary) alignments for read 2

Source code in fgpyo/sam/__init__.py

@attr.s(frozen=True, auto_attribs=True)
class Template:
    """
    A container for alignment records corresponding to a single sequenced template or insert.

    It is strongly preferred that new Template instances be created with `Template.build()`,
    which ensures that reads are stored in the correct Template property and, by default, runs
    basic validations of the Template.  Users who construct Template instances directly are
    encouraged to call the validate method afterwards.

    In the special case where alignment records are _*both secondary and supplementary*_,
    they are stored in the `r1_supplementals` and `r2_supplementals` fields only.

    Attributes:
        name: the name of the template/query
        r1: Primary non-supplementary alignment for read 1, or None if there is none
        r2: Primary non-supplementary alignment for read 2, or None if there is none
        r1_supplementals: Supplementary alignments for read 1
        r2_supplementals: Supplementary alignments for read 2
        r1_secondaries: Secondary (non-primary, non-supplementary) alignments for read 1
        r2_secondaries: Secondary (non-primary, non-supplementary) alignments for read 2
    """

    name: str
    r1: AlignedSegment | None
    r2: AlignedSegment | None
    r1_supplementals: list[AlignedSegment]
    r2_supplementals: list[AlignedSegment]
    r1_secondaries: list[AlignedSegment]
    r2_secondaries: list[AlignedSegment]

    @staticmethod
    def iterator(alns: Iterator[AlignedSegment]) -> Iterator["Template"]:
        """
        Returns an iterator over templates from queryname-grouped alignments.

        Gathers consecutive runs of records sharing a common query name into templates.
        """
        return TemplateIterator(alns)

    @staticmethod
    def build(recs: Iterable[AlignedSegment], validate: bool = True) -> "Template":
        """Build a template from a set of records all with the same queryname."""
        name = None
        r1 = None
        r2 = None
        r1_supplementals: list[AlignedSegment] = []
        r2_supplementals: list[AlignedSegment] = []
        r1_secondaries: list[AlignedSegment] = []
        r2_secondaries: list[AlignedSegment] = []

        for rec in recs:
            if name is None:
                name = rec.query_name

            is_r1 = not rec.is_paired or rec.is_read1

            if not rec.is_supplementary and not rec.is_secondary:
                if is_r1:
                    assert r1 is None, f"Multiple R1 primary reads found in {recs}"
                    r1 = rec
                else:
                    assert r2 is None, f"Multiple R2 primary reads found in {recs}"
                    r2 = rec
            elif rec.is_supplementary:
                if is_r1:
                    r1_supplementals.append(rec)
                else:
                    r2_supplementals.append(rec)
            elif rec.is_secondary:
                if is_r1:
                    r1_secondaries.append(rec)
                else:
                    r2_secondaries.append(rec)

        assert name is not None, "Cannot construct a template from zero records."

        template = Template(
            name=name,
            r1=r1,
            r2=r2,
            r1_supplementals=r1_supplementals,
            r2_supplementals=r2_supplementals,
            r1_secondaries=r1_secondaries,
            r2_secondaries=r2_secondaries,
        )

        if validate:
            template.validate()

        return template

    def validate(self) -> None:
        """Performs sanity checks that all the records in the Template are as expected."""
        for rec in self.all_recs():
            assert rec.query_name == self.name, f"Name error {self.name} vs. {rec.query_name}"

        if self.r1 is not None:
            assert self.r1.is_read1 or not self.r1.is_paired, "R1 not flagged as R1 or unpaired"
            assert not self.r1.is_supplementary, "R1 primary flagged as supplementary"
            assert not self.r1.is_secondary, "R1 primary flagged as secondary"

        if self.r2 is not None:
            assert self.r2.is_read2, "R2 not flagged as R2"
            assert not self.r2.is_supplementary, "R2 primary flagged as supplementary"
            assert not self.r2.is_secondary, "R2 primary flagged as secondary"

        for rec in self.r1_secondaries:
            assert rec.is_read1 or not rec.is_paired, "R1 secondary not flagged as R1 or unpaired"
            assert rec.is_secondary, "R1 secondary not flagged as secondary"
            assert not rec.is_supplementary, "R1 secondary supplementals belong with supplementals"

        for rec in self.r1_supplementals:
            assert rec.is_read1 or not rec.is_paired, "R1 supp. not flagged as R1 or unpaired"
            assert rec.is_supplementary, "R1 supp. not flagged as supplementary"

        for rec in self.r2_secondaries:
            assert rec.is_read2, "R2 secondary not flagged as R2"
            assert rec.is_secondary, "R2 secondary not flagged as secondary"
            assert not rec.is_supplementary, "R2 secondary supplementals belong with supplementals"

        for rec in self.r2_supplementals:
            assert rec.is_read2, "R2 supp. not flagged as R2"
            assert rec.is_supplementary, "R2 supp. not flagged as supplementary"

    def primary_recs(self) -> Iterator[AlignedSegment]:
        """Returns an iterator over the primary records for the template."""
        return (r for r in (self.r1, self.r2) if r is not None)

    def all_r1s(self) -> Iterator[AlignedSegment]:
        """Yields all R1 alignments of this template including secondary and supplementary."""
        r1_primary = [] if self.r1 is None else [self.r1]
        return chain(r1_primary, self.r1_secondaries, self.r1_supplementals)

    def all_r2s(self) -> Iterator[AlignedSegment]:
        """Yields all R2 alignments of this template including secondary and supplementary."""
        r2_primary = [] if self.r2 is None else [self.r2]
        return chain(r2_primary, self.r2_secondaries, self.r2_supplementals)

    def all_recs(self) -> Iterator[AlignedSegment]:
        """Yields all the records for the template, primary records first."""
        for rec in self.primary_recs():
            yield rec

        for recs in (
            self.r1_supplementals,
            self.r1_secondaries,
            self.r2_supplementals,
            self.r2_secondaries,
        ):
            for rec in recs:
                yield rec

    def set_mate_info(
        self,
        is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair,
        isize: Callable[[AlignedSegment, AlignedSegment], int] = isize,
    ) -> Self:
        """
        Reset all mate information on every alignment in the template.

        Args:
            is_proper_pair: A function that takes two alignments and determines proper pair status.
            isize: A function that takes the two alignments and calculates their isize.
        """
        if self.r1 is not None and self.r2 is not None:
            set_mate_info(self.r1, self.r2, is_proper_pair=is_proper_pair, isize=isize)
        if self.r1 is not None:
            for rec in self.r2_secondaries:
                set_mate_info_on_secondary(secondary=rec, mate_primary=self.r1)
            for rec in self.r2_supplementals:
                set_mate_info_on_supplementary(supp=rec, mate_primary=self.r1)
        if self.r2 is not None:
            for rec in self.r1_secondaries:
                set_mate_info_on_secondary(secondary=rec, mate_primary=self.r2)
            for rec in self.r1_supplementals:
                set_mate_info_on_supplementary(supp=rec, mate_primary=self.r2)
        return self

    def write_to(
        self,
        writer: SamFile,
        primary_only: bool = False,
    ) -> None:
        """
        Write the records associated with the template to file.

        Args:
            writer: An open, writable AlignmentFile.
            primary_only: If True, only write primary alignments.
        """
        if primary_only:
            rec_iter = self.primary_recs()
        else:
            rec_iter = self.all_recs()

        for rec in rec_iter:
            writer.write(rec)

    def set_tag(
        self,
        tag: str,
        value: str | int | float | None,
    ) -> None:
        """
        Add a tag to all records associated with the template.

        Setting a tag to `None` will remove the tag.

        Args:
            tag: The name of the tag.
            value: The value of the tag.
        """
        assert len(tag) == 2, f"Tags must be 2 characters: {tag}."

        for rec in self.all_recs():
            rec.set_tag(tag, value)

Functions¶

all_r1s ¶

all_r1s() -> Iterator[AlignedSegment]

Yields all R1 alignments of this template including secondary and supplementary.

Source code in fgpyo/sam/__init__.py

def all_r1s(self) -> Iterator[AlignedSegment]:
    """Yields all R1 alignments of this template including secondary and supplementary."""
    r1_primary = [] if self.r1 is None else [self.r1]
    return chain(r1_primary, self.r1_secondaries, self.r1_supplementals)

all_r2s ¶

all_r2s() -> Iterator[AlignedSegment]

Yields all R2 alignments of this template including secondary and supplementary.

Source code in fgpyo/sam/__init__.py

def all_r2s(self) -> Iterator[AlignedSegment]:
    """Yields all R2 alignments of this template including secondary and supplementary."""
    r2_primary = [] if self.r2 is None else [self.r2]
    return chain(r2_primary, self.r2_secondaries, self.r2_supplementals)

all_recs ¶

all_recs() -> Iterator[AlignedSegment]

Yields all the records for the template: the primary records first, then the R1 supplementals, R1 secondaries, R2 supplementals, and R2 secondaries.

Source code in fgpyo/sam/__init__.py

def all_recs(self) -> Iterator[AlignedSegment]:
    """Yields all the records for the template, primary records first."""
    for rec in self.primary_recs():
        yield rec

    for recs in (
        self.r1_supplementals,
        self.r1_secondaries,
        self.r2_supplementals,
        self.r2_secondaries,
    ):
        for rec in recs:
            yield rec

build staticmethod ¶

build(recs: Iterable[AlignedSegment], validate: bool = True) -> Template

Build a template from a set of records all with the same queryname.
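
The way `build` routes each record can be summarized as a small decision on four flags. The sketch below restates it with a hypothetical `Flags` dataclass in place of `pysam.AlignedSegment`; the function `bucket` is illustrative and returns the name of the Template field a record would land in.

```python
from dataclasses import dataclass


@dataclass
class Flags:
    """Hypothetical stand-in exposing only the flags that matter for routing."""

    is_paired: bool = True
    is_read1: bool = True
    is_secondary: bool = False
    is_supplementary: bool = False


def bucket(rec: Flags) -> str:
    # Unpaired reads are treated as read 1.
    read = "r1" if (not rec.is_paired or rec.is_read1) else "r2"
    # Supplementary is checked first, so secondary+supplementary goes to supplementals.
    if rec.is_supplementary:
        return f"{read}_supplementals"
    if rec.is_secondary:
        return f"{read}_secondaries"
    return read


print(bucket(Flags()))
print(bucket(Flags(is_read1=False, is_secondary=True, is_supplementary=True)))
```

Unlike this sketch, `build` also asserts that at most one primary record exists per read and that at least one record was supplied.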

Source code in fgpyo/sam/__init__.py

@staticmethod
def build(recs: Iterable[AlignedSegment], validate: bool = True) -> "Template":
    """Build a template from a set of records all with the same queryname."""
    name = None
    r1 = None
    r2 = None
    r1_supplementals: list[AlignedSegment] = []
    r2_supplementals: list[AlignedSegment] = []
    r1_secondaries: list[AlignedSegment] = []
    r2_secondaries: list[AlignedSegment] = []

    for rec in recs:
        if name is None:
            name = rec.query_name

        is_r1 = not rec.is_paired or rec.is_read1

        if not rec.is_supplementary and not rec.is_secondary:
            if is_r1:
                assert r1 is None, f"Multiple R1 primary reads found in {recs}"
                r1 = rec
            else:
                assert r2 is None, f"Multiple R2 primary reads found in {recs}"
                r2 = rec
        elif rec.is_supplementary:
            if is_r1:
                r1_supplementals.append(rec)
            else:
                r2_supplementals.append(rec)
        elif rec.is_secondary:
            if is_r1:
                r1_secondaries.append(rec)
            else:
                r2_secondaries.append(rec)

    assert name is not None, "Cannot construct a template from zero records."

    template = Template(
        name=name,
        r1=r1,
        r2=r2,
        r1_supplementals=r1_supplementals,
        r2_supplementals=r2_supplementals,
        r1_secondaries=r1_secondaries,
        r2_secondaries=r2_secondaries,
    )

    if validate:
        template.validate()

    return template
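
As a rough illustration of how the listing above sorts records into fields, the sketch below uses a hypothetical `Rec` dataclass that exposes only the flags `build` inspects, and a hypothetical helper `slot_for` that names the field a record would land in. Neither is part of fgpyo.

```python
from dataclasses import dataclass


@dataclass
class Rec:
    """Hypothetical stand-in exposing only the flags that `build` inspects."""

    query_name: str
    is_paired: bool = True
    is_read1: bool = True
    is_secondary: bool = False
    is_supplementary: bool = False


def slot_for(rec: Rec) -> str:
    """Name the Template field a record would land in (illustrative helper)."""
    # Unpaired (fragment) reads are treated as R1.
    prefix = "r1" if (not rec.is_paired or rec.is_read1) else "r2"
    if rec.is_supplementary:  # checked before secondary, as in `build`
        return f"{prefix}_supplementals"
    if rec.is_secondary:
        return f"{prefix}_secondaries"
    return prefix


slots = [
    slot_for(Rec("q1")),
    slot_for(Rec("q1", is_read1=False)),
    slot_for(Rec("q1", is_paired=False, is_read1=False)),
    slot_for(Rec("q1", is_read1=False, is_secondary=True)),
    slot_for(Rec("q1", is_secondary=True, is_supplementary=True)),
]
```

Note that a record flagged as both secondary and supplementary is grouped with the supplementals, because that flag is tested first.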

iterator staticmethod ¶

iterator(alns: Iterator[AlignedSegment]) -> Iterator[Template]

Returns an iterator over templates from queryname-grouped alignments.

Gathers consecutive runs of records sharing a common query name into templates.

Source code in fgpyo/sam/__init__.py

@staticmethod
def iterator(alns: Iterator[AlignedSegment]) -> Iterator["Template"]:
    """
    Returns an iterator over templates from queryname-grouped alignments.

    Gathers consecutive runs of records sharing a common query name into templates.
    """
    return TemplateIterator(alns)
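
The grouping behaviour can be sketched with the standard library alone. This example swaps in `itertools.groupby` for the library's own iterator and uses hypothetical `(query_name, label)` tuples instead of alignments; it is meant only to show what "consecutive runs" implies for input that is not queryname-grouped.

```python
from itertools import groupby

# Hypothetical (query_name, label) pairs standing in for alignments.
alns = [("q1", "R1"), ("q1", "R2"), ("q2", "R1"), ("q1", "R1-again")]

# groupby merges only *consecutive* items with equal keys.
groups = [
    (name, [label for _, label in run])
    for name, run in groupby(alns, key=lambda aln: aln[0])
]
```

Here the final `q1` record forms its own group because it is not adjacent to the earlier `q1` records, so input should be queryname-grouped (or queryname-sorted) before templates are built from it.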

primary_recs ¶

primary_recs() -> Iterator[AlignedSegment]

Returns an iterator over all the primary records for the template.

Source code in fgpyo/sam/__init__.py

def primary_recs(self) -> Iterator[AlignedSegment]:
    """Returns an iterator over all the primary records for the template."""
    return (r for r in (self.r1, self.r2) if r is not None)

set_mate_info ¶

set_mate_info(is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair, isize: Callable[[AlignedSegment, AlignedSegment], int] = isize) -> Self

Reset all mate information on every alignment in the template.

Parameters:

Name	Type	Description	Default
`is_proper_pair`	`Callable[[AlignedSegment, AlignedSegment], bool]`	A function that takes two alignments and determines proper pair status.	`is_proper_pair`
`isize`	`Callable[[AlignedSegment, AlignedSegment], int]`	A function that takes the two alignments and calculates their isize.	`isize`

Source code in fgpyo/sam/__init__.py

def set_mate_info(
    self,
    is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair,
    isize: Callable[[AlignedSegment, AlignedSegment], int] = isize,
) -> Self:
    """
    Reset all mate information on every alignment in the template.

    Args:
        is_proper_pair: A function that takes two alignments and determines proper pair status.
        isize: A function that takes the two alignments and calculates their isize.
    """
    if self.r1 is not None and self.r2 is not None:
        set_mate_info(self.r1, self.r2, is_proper_pair=is_proper_pair, isize=isize)
    if self.r1 is not None:
        for rec in self.r2_secondaries:
            set_mate_info_on_secondary(secondary=rec, mate_primary=self.r1)
        for rec in self.r2_supplementals:
            set_mate_info_on_supplementary(supp=rec, mate_primary=self.r1)
    if self.r2 is not None:
        for rec in self.r1_secondaries:
            set_mate_info_on_secondary(secondary=rec, mate_primary=self.r2)
        for rec in self.r1_supplementals:
            set_mate_info_on_supplementary(supp=rec, mate_primary=self.r2)
    return self

set_tag ¶

set_tag(tag: str, value: str | int | float | None) -> None

Add a tag to all records associated with the template.

Setting a tag to None will remove the tag.

Parameters:

Name	Type	Description	Default
`tag`	`str`	The name of the tag.	required
`value`	`str \| int \| float \| None`	The value of the tag.	required

Source code in fgpyo/sam/__init__.py

def set_tag(
    self,
    tag: str,
    value: str | int | float | None,
) -> None:
    """
    Add a tag to all records associated with the template.

    Setting a tag to `None` will remove the tag.

    Args:
        tag: The name of the tag.
        value: The value of the tag.
    """
    assert len(tag) == 2, f"Tags must be 2 characters: {tag}."

    for rec in self.all_recs():
        rec.set_tag(tag, value)

validate ¶

validate() -> None

Performs sanity checks that all the records in the Template are as expected.

Source code in fgpyo/sam/__init__.py

def validate(self) -> None:
    """Performs sanity checks that all the records in the Template are as expected."""
    for rec in self.all_recs():
        assert rec.query_name == self.name, f"Name error {self.name} vs. {rec.query_name}"

    if self.r1 is not None:
        assert self.r1.is_read1 or not self.r1.is_paired, "R1 not flagged as R1 or unpaired"
        assert not self.r1.is_supplementary, "R1 primary flagged as supplementary"
        assert not self.r1.is_secondary, "R1 primary flagged as secondary"

    if self.r2 is not None:
        assert self.r2.is_read2, "R2 not flagged as R2"
        assert not self.r2.is_supplementary, "R2 primary flagged as supplementary"
        assert not self.r2.is_secondary, "R2 primary flagged as secondary"

    for rec in self.r1_secondaries:
        assert rec.is_read1 or not rec.is_paired, "R1 secondary not flagged as R1 or unpaired"
        assert rec.is_secondary, "R1 secondary not flagged as secondary"
        assert not rec.is_supplementary, "R1 secondary supplementals belong with supplementals"

    for rec in self.r1_supplementals:
        assert rec.is_read1 or not rec.is_paired, "R1 supp. not flagged as R1 or unpaired"
        assert rec.is_supplementary, "R1 supp. not flagged as supplementary"

    for rec in self.r2_secondaries:
        assert rec.is_read2, "R2 secondary not flagged as R2"
        assert rec.is_secondary, "R2 secondary not flagged as secondary"
        assert not rec.is_supplementary, "R2 secondary supplementals belong with supplementals"

    for rec in self.r2_supplementals:
        assert rec.is_read2, "R2 supp. not flagged as R2"
        assert rec.is_supplementary, "R2 supp. not flagged as supplementary"

write_to ¶

write_to(writer: AlignmentFile, primary_only: bool = False) -> None

Write the records associated with the template to file.

Parameters:

Name	Type	Description	Default
`writer`	`AlignmentFile`	An open, writable AlignmentFile.	required
`primary_only`	`bool`	If True, only write primary alignments.	`False`

Source code in fgpyo/sam/__init__.py

def write_to(
    self,
    writer: SamFile,
    primary_only: bool = False,
) -> None:
    """
    Write the records associated with the template to file.

    Args:
        writer: An open, writable AlignmentFile.
        primary_only: If True, only write primary alignments.
    """
    if primary_only:
        rec_iter = self.primary_recs()
    else:
        rec_iter = self.all_recs()

    for rec in rec_iter:
        writer.write(rec)

TemplateIterator ¶

Bases: Iterator[Template]

An iterator that converts query-grouped reads into templates.

Source code in fgpyo/sam/__init__.py

class TemplateIterator(Iterator[Template]):
    """An iterator that converts query-grouped reads into templates."""

    def __init__(self, iterator: Iterator[AlignedSegment]) -> None:
        """Initializes the iterator from a query-grouped alignment iterator."""
        self._iter = PeekableIterator(iterator)

    def __iter__(self) -> Iterator[Template]:
        """Returns self as the iterator."""
        return self

    def __next__(self) -> Template:
        """Returns the next Template from the query-grouped iterator."""
        name = self._iter.peek().query_name
        recs = self._iter.takewhile(lambda r: r.query_name == name)
        return Template.build(recs, validate=False)

Functions¶

__init__ ¶

__init__(iterator: Iterator[AlignedSegment]) -> None

Initializes the iterator from a query-grouped alignment iterator.

Source code in fgpyo/sam/__init__.py

def __init__(self, iterator: Iterator[AlignedSegment]) -> None:
    """Initializes the iterator from a query-grouped alignment iterator."""
    self._iter = PeekableIterator(iterator)

__iter__ ¶

__iter__() -> Iterator[Template]

Returns self as the iterator.

Source code in fgpyo/sam/__init__.py

def __iter__(self) -> Iterator[Template]:
    """Returns self as the iterator."""
    return self

__next__ ¶

__next__() -> Template

Returns the next Template from the query-grouped iterator.

Source code in fgpyo/sam/__init__.py

def __next__(self) -> Template:
    """Returns the next Template from the query-grouped iterator."""
    name = self._iter.peek().query_name
    recs = self._iter.takewhile(lambda r: r.query_name == name)
    return Template.build(recs, validate=False)

Functions¶

calculate_edit_info ¶

calculate_edit_info(rec: AlignedSegment, reference_sequence: str, match_htsjdk: bool = False, reference_offset: int | None = None) -> ReadEditInfo | None

Constructs a ReadEditInfo with summary stats about how the read aligns to the reference.

Computes the number of mismatches, indels, indel bases as well as the SAM NM and MD tags.

Calculation of NM and MD tags is based off of htsjdk: https://github.com/samtools/htsjdk/blob/7034b33636b4cb9fec300a2136588e7c12c7ccd5/src/main/java/htsjdk/samtools/util/SequenceUtil.java#L964:L1029

Per the SAM specification (https://samtools.github.io/hts-specs/SAMtags.pdf), the NM tag encapsulates the number of differences between the query read and reference sequence, counting only A, C, G and T bases (case-insensitive). Everything else should be considered a mismatch (e.g., ambiguity codes like R and N). We set the default of match_htsjdk to False to be concordant with the SAM specification. Conversely, htsjdk treats an N->N as a match.

If the read is unmapped or the query sequence contains missing bases (*), returns None, as it is not possible to recalculate the MD and NM tags without access to the query sequence and reference sequence.

The order of the CIGAR operator checks is for performance and modeled after htsjdk's calculateMdAndNmTags.

Parameters:

Name	Type	Description	Default
`rec`	`AlignedSegment`	the read/record for which to calculate values	required
`reference_sequence`	`str`	the reference sequence (or fragment thereof) to which the read is aligned	required
`match_htsjdk`	`bool`	if True, mirror htsjdk `calculateMdAndNmTags` -- only match if the bases are equal, including ambiguity codes (e.g., R->R is counted as a match, but R->A is not a match). If False, follow SAM spec (everything else should be considered a mismatch, including ambiguity codes like R and N). When a deletion extends beyond the available reference sequence, htsjdk will not count the deletion in NM, while samtools will count it; set to False for samtools-style behavior.	`False`
`reference_offset`	`int \| None`	if provided, assume that reference_sequence[reference_offset] is the first base aligned to in reference_sequence, otherwise use rec.reference_start	`None`

Returns:

Type	Description
`ReadEditInfo \| None`	a ReadEditInfo with information about how the read differs from the reference

Source code in fgpyo/sam/__init__.py

def calculate_edit_info(  # noqa: C901 (11 > 10)
    rec: AlignedSegment,
    reference_sequence: str,
    match_htsjdk: bool = False,
    reference_offset: int | None = None,
) -> ReadEditInfo | None:
    """
    Constructs a `ReadEditInfo` with summary stats about how the read aligns to the reference.

    Computes the number of mismatches, indels, indel bases as well as the SAM NM and MD tags.

    Calculation of NM and MD tags is based off of htsjdk:
    https://github.com/samtools/htsjdk/blob/7034b33636b4cb9fec300a2136588e7c12c7ccd5/src/main/java/htsjdk/samtools/util/SequenceUtil.java#L964:L1029

    Per the SAM specification (https://samtools.github.io/hts-specs/SAMtags.pdf), the NM tag
    encapsulates the number of differences between the query read and reference sequence, counting
    only A, C, G and T bases (case-insensitive). Everything else should be considered a mismatch
    (e.g., ambiguity codes like R and N). We set the default of `match_htsjdk` to False to be
    concordant with the SAM specification. Conversely, `htsjdk` treats an N->N as a match.

    If the read is unmapped or the query sequence contains missing bases (`*`), returns None, as it
    is not possible to recalculate the MD and NM tags without access to the query sequence and
    reference sequence.

    The order of the CIGAR operator checks is for performance and modeled after htsjdk's
    `calculateMdAndNmTags`.

    Args:
        rec: the read/record for which to calculate values
        reference_sequence: the reference sequence (or fragment thereof) to which the read is
            aligned
        match_htsjdk: if True, mirror htsjdk `calculateMdAndNmTags` -- only match if the bases are
            equal, including ambiguity codes (e.g., R->R is counted as a match, but R->A is not a
            match). If False, follow SAM spec (everything else should be considered a mismatch,
            including ambiguity codes like R and N). When a deletion extends beyond
            the available reference sequence, htsjdk will not count the deletion in NM, while
            samtools will count it; set to False for samtools-style behavior.
        reference_offset: if provided, assume that reference_sequence[reference_offset] is the
            first base aligned to in reference_sequence, otherwise use rec.reference_start

    Returns:
        a ReadEditInfo with information about how the read differs from the reference
    """
    if rec.is_unmapped or rec.query_sequence is None or rec.query_sequence == NO_QUERY_BASES:
        return None

    query_offset: int = 0
    target_offset: int = reference_offset if reference_offset is not None else rec.reference_start
    cigar: Cigar = Cigar.from_cigartuples(rec.cigartuples)

    matches, mismatches, insertions, ins_bases, deletions, del_bases = 0, 0, 0, 0, 0, 0
    md_edits: list[str] = []
    current_match_count: int = 0
    for elem in cigar.elements:
        op = elem.operator
        # TODO: use match-case statements after we drop Python 3.9 support
        if op in (CigarOp.M, CigarOp.X, CigarOp.EQ):
            for in_block_offset in range(0, elem.length):
                if (target_offset + in_block_offset) >= len(reference_sequence):
                    break  # out of bounds
                query_base: str = rec.query_sequence[query_offset + in_block_offset].upper()
                ref_base: str = reference_sequence[target_offset + in_block_offset].upper()
                is_match = _is_match_base(
                    query_base=query_base, ref_base=ref_base, match_htsjdk=match_htsjdk
                )

                if is_match:
                    matches += 1
                    current_match_count += 1
                else:  # mismatch
                    md_edits.append(str(current_match_count))  # append match count and reset
                    current_match_count = 0
                    md_edits.append(ref_base)  # grab mismatched base from the reference
                    mismatches += 1
            query_offset += elem.length_on_query
            target_offset += elem.length_on_target
        elif op == CigarOp.D:  # consumes ref
            md_edits.append(str(current_match_count))  # append match count and reset
            md_edits.append("^")
            md_edits.append(reference_sequence[target_offset : target_offset + elem.length].upper())
            current_match_count = 0
            # Early break when a deletion starts before its own length into the reference
            # (e.g., "6D4M" at position 0). This matches htsjdk/samtools behavior.
            if target_offset < elem.length:
                if not match_htsjdk:
                    deletions += 1
                    del_bases += elem.length
                break
            target_offset += elem.length
            deletions += 1
            del_bases += elem.length
        elif op in (CigarOp.I, CigarOp.S):  # consumes query
            query_offset += elem.length
            if op == CigarOp.I:
                insertions += 1
                ins_bases += elem.length
        elif op == CigarOp.N:  # skipped region from ref, consumes ref
            target_offset += elem.length
        elif op not in (CigarOp.H, CigarOp.P):  # pragma: not covered
            raise ValueError(f"Invalid CIGAR operation: {op}")

    md_edits.append(f"{current_match_count}")

    return ReadEditInfo(
        matches=matches,
        mismatches=mismatches,
        insertions=insertions,
        inserted_bases=ins_bases,
        deletions=deletions,
        deleted_bases=del_bases,
        nm=mismatches + ins_bases + del_bases,
        md="".join(md_edits),
    )
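
To make the MD and NM bookkeeping concrete, here is a much-reduced sketch that handles only an ungapped alignment (a single `M` block) between two equal-length strings. The helper `md_and_nm` is hypothetical; it assumes the SAM-spec style matching described above, where only identical A, C, G or T bases count as a match, and it ignores insertions, deletions, clipping and reference offsets.

```python
def md_and_nm(query: str, ref: str) -> tuple:
    """Return (MD, NM) for an ungapped alignment of equal-length strings (sketch only)."""
    if len(query) != len(ref):
        raise ValueError("query and ref must have the same length")
    md_parts = []
    run = 0  # length of the current run of matching bases
    mismatches = 0
    for q, r in zip(query.upper(), ref.upper()):
        # SAM-spec style: only identical A/C/G/T bases count as a match.
        if q == r and q in "ACGT":
            run += 1
        else:
            md_parts.append(str(run))  # flush the match run
            md_parts.append(r)  # then record the reference base
            run = 0
            mismatches += 1
    md_parts.append(str(run))  # trailing match run, possibly 0
    # With no indels, NM equals the number of mismatches.
    return "".join(md_parts), mismatches


md, nm = md_and_nm("ACGTACGT", "ACGAACGT")
```

In this example the fourth base differs, so the MD string is three matches, the reference base `A`, then four matches. Under the same assumption an `N` aligned to an `N` is also recorded as a mismatch, which is where htsjdk-style matching would differ.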

is_proper_pair ¶

is_proper_pair(rec1: AlignedSegment, rec2: AlignedSegment | None = None, max_insert_size: int = 1000, orientations: Collection[PairOrientation] = DefaultProperlyPairedOrientations, isize: Callable[[AlignedSegment, AlignedSegment | None], int] = isize) -> bool

Determines if a pair of records are properly paired or not.

Criteria for records in a proper pair are:

- Both records are aligned
- Both records are aligned to the same reference sequence
- The pair orientation of the records is one of the valid pair orientations (default "FR")
- The inferred insert size is not more than a maximum length (default 1000)

Parameters:

Name	Type	Description	Default
`rec1`	`AlignedSegment`	The first record in the pair.	required
`rec2`	`AlignedSegment \| None`	The second record in the pair. If None, then mate info on `rec1` will be used.	`None`
`max_insert_size`	`int`	The maximum insert size to consider a pair "proper".	`1000`
`orientations`	`Collection[PairOrientation]`	The valid set of orientations to consider a pair "proper".	`DefaultProperlyPairedOrientations`
`isize`	`Callable[[AlignedSegment, AlignedSegment \| None], int]`	A function that takes the two alignments and calculates their isize.	`isize`

See

htsjdk.samtools.SamPairUtil.isProperPair()

Source code in fgpyo/sam/__init__.py

def is_proper_pair(
    rec1: AlignedSegment,
    rec2: AlignedSegment | None = None,
    max_insert_size: int = 1000,
    orientations: Collection[PairOrientation] = DefaultProperlyPairedOrientations,
    isize: Callable[[AlignedSegment, AlignedSegment | None], int] = isize,
) -> bool:
    """
    Determines if a pair of records are properly paired or not.

    Criteria for records in a proper pair are:
        - Both records are aligned
        - Both records are aligned to the same reference sequence
        - The pair orientation of the records is one of the valid pair orientations (default "FR")
        - The inferred insert size is not more than a maximum length (default 1000)

    Args:
        rec1: The first record in the pair.
        rec2: The second record in the pair. If None, then mate info on `rec1` will be used.
        max_insert_size: The maximum insert size to consider a pair "proper".
        orientations: The valid set of orientations to consider a pair "proper".
        isize: A function that takes the two alignments and calculates their isize.

    See:
        [`htsjdk.samtools.SamPairUtil.isProperPair()`](https://github.com/samtools/htsjdk/blob/c31bc92c24bc4e9552b2a913e52286edf8f8ab96/src/main/java/htsjdk/samtools/SamPairUtil.java#L106-L125)
    """
    if rec2 is None:
        rec2_is_mapped = rec1.mate_is_mapped
        rec2_reference_id = rec1.next_reference_id
    else:
        rec2_is_mapped = rec2.is_mapped
        rec2_reference_id = rec2.reference_id

    return (
        rec1.is_mapped
        and rec2_is_mapped
        and rec1.reference_id == rec2_reference_id
        and PairOrientation.from_recs(rec1=rec1, rec2=rec2) in orientations
        and 0 < abs(isize(rec1, rec2)) <= max_insert_size
    )
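
The final expression can be read as a conjunction of the four criteria. The sketch below restates it on plain values; `looks_like_proper_pair` and its parameters are hypothetical and stand in for the fields read from the records, with the orientation reduced to a string.

```python
def looks_like_proper_pair(
    both_mapped: bool,
    same_reference: bool,
    orientation: str,
    insert_size: int,
    max_insert_size: int = 1000,
    orientations: frozenset = frozenset({"FR"}),
) -> bool:
    """Combine the four criteria on plain values (illustrative helper, not the fgpyo API)."""
    return (
        both_mapped
        and same_reference
        and orientation in orientations
        # A zero insert size is rejected, as is one above the maximum.
        and 0 < abs(insert_size) <= max_insert_size
    )


ok = looks_like_proper_pair(True, True, "FR", 250)
```

One detail worth noting from the source listing: the insert size is compared by absolute value and must be non-zero, so a negative insert size within the limit still qualifies while an insert size of 0 does not.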

isize ¶

isize(rec1: AlignedSegment, rec2: AlignedSegment | None = None) -> int

Computes the insert size ("template length" or "TLEN") for a pair of records.

Parameters:

Name	Type	Description	Default
`rec1`	`AlignedSegment`	The first record in the pair.	required
`rec2`	`AlignedSegment \| None`	The second record in the pair. If None, then mate info on `rec1` will be used.	`None`

Source code in fgpyo/sam/__init__.py

def isize(rec1: AlignedSegment, rec2: AlignedSegment | None = None) -> int:
    """
    Computes the insert size ("template length" or "TLEN") for a pair of records.

    Args:
        rec1: The first record in the pair.
        rec2: The second record in the pair. If None, then mate info on `rec1` will be used.
    """
    if rec2 is None:
        rec2_is_unmapped = rec1.mate_is_unmapped
        rec2_reference_id = rec1.next_reference_id
    else:
        rec2_is_unmapped = rec2.is_unmapped
        rec2_reference_id = rec2.reference_id

    if rec1.is_unmapped or rec2_is_unmapped or rec1.reference_id != rec2_reference_id:
        return 0

    if rec2 is None:
        rec2_is_forward = rec1.mate_is_forward
        rec2_reference_start = rec1.next_reference_start
    else:
        rec2_is_forward = rec2.is_forward
        rec2_reference_start = rec2.reference_start

    if rec1.is_forward and rec2_is_forward:
        return rec2_reference_start - rec1.reference_start
    if rec1.is_reverse and rec2_is_forward:
        assert rec1.reference_end is not None  # type narrowing
        return rec2_reference_start - rec1.reference_end

    if rec2 is None:
        if not rec1.has_tag("MC"):
            raise ValueError('Cannot determine proper pair status without a mate cigar ("MC")!')
        rec2_cigar = Cigar.from_cigarstring(str(rec1.get_tag("MC")))
        rec2_reference_end = rec1.next_reference_start + rec2_cigar.length_on_target()
    else:
        assert rec2.reference_end is not None  # type narrowing
        rec2_reference_end = rec2.reference_end

    if rec1.is_forward:
        return rec2_reference_end - rec1.reference_start
    else:
        assert rec1.reference_end is not None  # type narrowing
        return rec2_reference_end - rec1.reference_end
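
A worked example may help with the sign and coordinate conventions. The sketch below covers only the case where both records are supplied and mapped to the same reference. `Aln` and `insert_size` are hypothetical stand-ins, and the coordinates are assumed to be a 0-based start and an exclusive end.

```python
from dataclasses import dataclass


@dataclass
class Aln:
    """Hypothetical stand-in: start, end, and strand of a mapped read."""

    reference_start: int
    reference_end: int
    is_forward: bool


def insert_size(rec1: Aln, rec2: Aln) -> int:
    """Mirror the two-record branches above for reads on the same reference."""
    # Forward reads contribute their start; reverse reads contribute their end.
    pos1 = rec1.reference_start if rec1.is_forward else rec1.reference_end
    pos2 = rec2.reference_start if rec2.is_forward else rec2.reference_end
    return pos2 - pos1


r1 = Aln(reference_start=100, reference_end=150, is_forward=True)
r2 = Aln(reference_start=300, reference_end=350, is_forward=False)
forward_first = insert_size(r1, r2)
reverse_first = insert_size(r2, r1)
```

For this forward/reverse pair the value is 350 - 100 = 250 when computed from the forward read, and -250 when the arguments are swapped.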

reader ¶

reader(path: SamPath, file_type: SamFileType | None = None, unmapped: bool = False) -> AlignmentFile

Opens a SAM/BAM/CRAM for reading.

To read from standard input, provide any of "-", "stdin", or "/dev/stdin" as the input path.

Parameters:

Name	Type	Description	Default
`path`	`SamPath`	a file handle or path to the SAM/BAM/CRAM to read.	required
`file_type`	`SamFileType \| None`	the file type to assume when opening the file. If None, then the file type will be auto-detected.	`None`
`unmapped`	`bool`	True if the file is unmapped and has no sequence dictionary, False otherwise.	`False`

Source code in fgpyo/sam/__init__.py

def reader(path: SamPath, file_type: SamFileType | None = None, unmapped: bool = False) -> SamFile:
    """
    Opens a SAM/BAM/CRAM for reading.

    To read from standard input, provide any of `"-"`, `"stdin"`, or `"/dev/stdin"` as the input
    `path`.

    Args:
        path: a file handle or path to the SAM/BAM/CRAM to read.
        file_type: the file type to assume when opening the file.  If None, then the file
            type will be auto-detected.
        unmapped: True if the file is unmapped and has no sequence dictionary, False otherwise.
    """
    return _pysam_open(path=path, open_for_reading=True, file_type=file_type, unmapped=unmapped)

set_mate_info ¶

set_mate_info(rec1: AlignedSegment, rec2: AlignedSegment, is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair, isize: Callable[[AlignedSegment, AlignedSegment], int] = isize) -> None

Resets mate pair information between two primary alignments that share a query name.

Parameters:

Name	Type	Description	Default
`rec1`	`AlignedSegment`	The first record in the pair.	required
`rec2`	`AlignedSegment`	The second record in the pair.	required
`is_proper_pair`	`Callable[[AlignedSegment, AlignedSegment], bool]`	A function that takes the two alignments and determines proper pair status.	`is_proper_pair`
`isize`	`Callable[[AlignedSegment, AlignedSegment], int]`	A function that takes the two alignments and calculates their isize.	`isize`

Raises:

Type	Description
`ValueError`	If rec1 and rec2 are of the same read ordinal.
`ValueError`	If either rec1 or rec2 is secondary or supplementary.
`ValueError`	If rec1 and rec2 do not share the same query name.

Source code in fgpyo/sam/__init__.py

def set_mate_info(
    rec1: AlignedSegment,
    rec2: AlignedSegment,
    is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair,
    isize: Callable[[AlignedSegment, AlignedSegment], int] = isize,
) -> None:
    """
    Resets mate pair information between two primary alignments that share a query name.

    Args:
        rec1: The first record in the pair.
        rec2: The second record in the pair.
        is_proper_pair: A function that takes the two alignments and determines proper pair status.
        isize: A function that takes the two alignments and calculates their isize.

    Raises:
        ValueError: If rec1 and rec2 are of the same read ordinal.
        ValueError: If either rec1 or rec2 is secondary or supplementary.
        ValueError: If rec1 and rec2 do not share the same query name.
    """
    for dest, source in [(rec1, rec2), (rec2, rec1)]:
        _set_common_mate_fields(dest=dest, mate_primary=source)

    template_length = isize(rec1, rec2)
    rec1.template_length = template_length
    rec2.template_length = -template_length

    proper_pair = is_proper_pair(rec1, rec2)
    rec1.is_proper_pair = proper_pair
    rec2.is_proper_pair = proper_pair

set_mate_info_on_secondary ¶

set_mate_info_on_secondary(secondary: AlignedSegment, mate_primary: AlignedSegment) -> None

Set mate info on a secondary alignment from its mate's primary alignment.

Parameters:

Name	Type	Description	Default
`secondary`	`AlignedSegment`	The secondary alignment to set mate information upon.	required
`mate_primary`	`AlignedSegment`	The primary alignment of the secondary's mate.	required

Raises:

Type	Description
`ValueError`	If secondary and mate_primary are of the same read ordinal.
`ValueError`	If secondary and mate_primary do not share the same query name.
`ValueError`	If mate_primary is secondary or supplementary.
`ValueError`	If secondary is not marked as a secondary alignment.

Source code in fgpyo/sam/__init__.py

def set_mate_info_on_secondary(secondary: AlignedSegment, mate_primary: AlignedSegment) -> None:
    """
    Set mate info on a secondary alignment from its mate's primary alignment.

    Args:
        secondary: The secondary alignment to set mate information upon.
        mate_primary: The primary alignment of the secondary's mate.

    Raises:
        ValueError: If secondary and mate_primary are of the same read ordinal.
        ValueError: If secondary and mate_primary do not share the same query name.
        ValueError: If mate_primary is secondary or supplementary.
        ValueError: If secondary is not marked as a secondary alignment.
    """
    if not secondary.is_secondary:
        raise ValueError("Cannot set mate info on an alignment not marked as secondary!")

    _set_common_mate_fields(dest=secondary, mate_primary=mate_primary)

set_mate_info_on_supplementary ¶

set_mate_info_on_supplementary(supp: AlignedSegment, mate_primary: AlignedSegment) -> None

Set mate info on a supplementary alignment from its mate's primary alignment.

Parameters:

Name	Type	Description	Default
`supp`	`AlignedSegment`	The supplementary alignment to set mate information upon.	required
`mate_primary`	`AlignedSegment`	The primary alignment of the supplementary's mate.	required

Raises:

Type	Description
`ValueError`	If supp and mate_primary are of the same read ordinal.
`ValueError`	If supp and mate_primary do not share the same query name.
`ValueError`	If mate_primary is secondary or supplementary.
`ValueError`	If supp is not marked as a supplementary alignment.

Source code in fgpyo/sam/__init__.py

def set_mate_info_on_supplementary(supp: AlignedSegment, mate_primary: AlignedSegment) -> None:
    """
    Set mate info on a supplementary alignment from its mate's primary alignment.

    Args:
        supp: The supplementary alignment to set mate information upon.
        mate_primary: The primary alignment of the supplementary's mate.

    Raises:
        ValueError: If supp and mate_primary are of the same read ordinal.
        ValueError: If supp and mate_primary do not share the same query name.
        ValueError: If mate_primary is secondary or supplementary.
        ValueError: If supp is not marked as a supplementary alignment.
    """
    if not supp.is_supplementary:
        raise ValueError("Cannot set mate info on an alignment not marked as supplementary!")

    _set_common_mate_fields(dest=supp, mate_primary=mate_primary)

    # NB: for a non-secondary supplemental alignment, set the following to the same as the primary.
    if not supp.is_secondary:
        supp.is_proper_pair = mate_primary.is_proper_pair
        supp.template_length = -mate_primary.template_length

set_pair_info ¶

set_pair_info(r1: AlignedSegment, r2: AlignedSegment, proper_pair: bool = True) -> None

Resets mate pair information between reads in a pair.

Can be handed reads that already have pairing flags set up or independent R1 and R2 records that are currently flagged as single-end (SE) reads.

Parameters:

Name	Type	Description	Default
`r1`	`AlignedSegment`	Read 1 (first read in the template).	required
`r2`	`AlignedSegment`	Read 2 with the same query name as r1 (second read in the template).	required
`proper_pair`	`bool`	whether the pair is proper or not.	`True`

Source code in fgpyo/sam/__init__.py

@deprecated("Use `set_mate_info()` instead. Deprecated after fgpyo 0.8.0.")
def set_pair_info(r1: AlignedSegment, r2: AlignedSegment, proper_pair: bool = True) -> None:
    """
    Resets mate pair information between reads in a pair.

    Can be handed reads that already have pairing flags setup or independent R1 and R2 records that
    are currently flagged as SE reads.

    Args:
        r1: Read 1 (first read in the template).
        r2: Read 2 with the same query name as r1 (second read in the template).
        proper_pair: whether the pair is proper or not.
    """
    if r1.query_name != r2.query_name:
        raise ValueError("Cannot set pair info on reads with different query names!")

    for r in [r1, r2]:
        r.is_paired = True

    r1.is_read1 = True
    r1.is_read2 = False
    r2.is_read2 = True
    r2.is_read1 = False

    set_mate_info(rec1=r1, rec2=r2, is_proper_pair=lambda _a, _b: proper_pair)

sum_of_base_qualities ¶

sum_of_base_qualities(rec: AlignedSegment, min_quality_score: int = 15) -> int

Calculate the sum of the base quality scores for an alignment record.

This function is useful for calculating the "mate score" as implemented in samtools fixmate. Consistent with samtools fixmate, this function returns 0 if the record has no base qualities.

Parameters:

Name	Type	Description	Default
`rec`	`AlignedSegment`	The alignment record to calculate the sum of base qualities from.	required
`min_quality_score`	`int`	The minimum base quality score to use for summation.	`15`

Returns:

Type	Description
`int`	The sum of base qualities on the input record. 0 if the record has no base qualities.

See

calc_sum_of_base_qualities()
MD_MIN_QUALITY

Source code in fgpyo/sam/__init__.py

def sum_of_base_qualities(rec: AlignedSegment, min_quality_score: int = 15) -> int:
    """
    Calculate the sum of the base quality scores for an alignment record.

    This function is useful for calculating the "mate score" as implemented in `samtools fixmate`.
    Consistent with `samtools fixmate`, this function returns 0 if the record has no base
    qualities.

    Args:
        rec: The alignment record to calculate the sum of base qualities from.
        min_quality_score: The minimum base quality score to use for summation.

    Returns:
        The sum of base qualities on the input record. 0 if the record has no base qualities.

    See:
        [`calc_sum_of_base_qualities()`](https://github.com/samtools/samtools/blob/4f3a7397a1f841020074c0048c503a01a52d5fa2/bam_mate.c#L227-L238)
        [`MD_MIN_QUALITY`](https://github.com/samtools/samtools/blob/4f3a7397a1f841020074c0048c503a01a52d5fa2/bam_mate.c#L42)
    """
    if rec.query_qualities is None or rec.query_qualities == NO_QUERY_QUALITIES:
        return 0

    score: int = sum(qual for qual in rec.query_qualities if qual >= min_quality_score)
    return score
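The thresholded sum can be illustrated on a plain list of integers, without a pysam record. This is a sketch only: `thresholded_quality_sum` is a hypothetical stand-in that mirrors the logic of the source above, treating a missing or empty quality list as a score of 0.

```python
from typing import Optional, Sequence


def thresholded_quality_sum(
    quals: Optional[Sequence[int]], min_quality_score: int = 15
) -> int:
    """Sum only the qualities that are >= `min_quality_score` (hypothetical helper)."""
    if not quals:
        # No qualities at all: mirror the documented behaviour and return 0.
        return 0
    return sum(q for q in quals if q >= min_quality_score)


quals = [30, 14, 15, 2, 40]
print(thresholded_quality_sum(quals))      # 85: 14 and 2 fall below the default of 15
print(thresholded_quality_sum(quals, 20))  # 70: 15 is now excluded as well
```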

writer ¶

writer(path: SamPath, header: str | dict[str, Any] | AlignmentHeader, file_type: SamFileType | None = None) -> AlignmentFile

Opens a SAM/BAM/CRAM for writing.

To write to standard output, provide any of "-", "stdout", or "/dev/stdout" as the output path. Note: When writing to stdout, the file_type must be given.

Parameters:

Name	Type	Description	Default
`path`	`SamPath`	a file handle or path to the SAM/BAM/CRAM to write.	required
`header`	`str \| dict[str, Any] \| AlignmentHeader`	Either a string to use for the header or a multi-level dictionary. The multi-level dictionary should be given as follows. The first level is the four types (‘HD’, ‘SQ’, ...). The second level is a list of lines, with each line being a list of tag-value pairs. The header is constructed first from all the defined fields, followed by user tags in alphabetical order.	required
`file_type`	`SamFileType \| None`	the file type to assume when opening the file. If `None`, the file type is auto-detected from `path`, which must then be a path-like object. This argument is required when writing to standard output.	`None`

Source code in fgpyo/sam/__init__.py

def writer(
    path: SamPath,
    header: str | dict[str, Any] | SamHeader,
    file_type: SamFileType | None = None,
) -> SamFile:
    """
    Opens a SAM/BAM/CRAM for writing.

    To write to standard output, provide any of `"-"`, `"stdout"`, or `"/dev/stdout"` as the output
    `path`. **Note**: When writing to `stdout`, the `file_type` _must_ be given.

    Args:
        path: a file handle or path to the SAM/BAM/CRAM to write.
        header: Either a string to use for the header or a multi-level dictionary.  The
            multi-level dictionary should be given as follows.  The first level is the four
            types (‘HD’, ‘SQ’, ...). The second level is a list of lines, with each line being
            a list of tag-value pairs. The header is constructed first from all the defined
            fields, followed by user tags in alphabetical order.
        file_type: the file type to assume when opening the file.  If `None`, the file type is
            auto-detected from `path`, which must then be a path-like object. This argument is
            required when writing to standard output.
    """
    # Set the header for pysam's AlignmentFile
    key = "text" if isinstance(header, str) else "header"
    kwargs = {key: header}

    return _pysam_open(
        path=path, open_for_reading=False, file_type=file_type, unmapped=False, **kwargs
    )
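The source above forwards the header under a different keyword depending on its type: a string goes under `text`, anything else under `header`. A minimal sketch of that dispatch in isolation follows; `header_kwargs` is a hypothetical name used only for illustration.

```python
from typing import Any, Dict, Union


def header_kwargs(header: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Pick the keyword under which a header would be forwarded (hypothetical helper)."""
    # A raw SAM header string is passed as `text`; structured headers as `header`.
    key = "text" if isinstance(header, str) else "header"
    return {key: header}


print(header_kwargs("@HD\tVN:1.5\n"))
print(header_kwargs({"HD": {"VN": "1.5"}, "SQ": [{"SN": "chr1", "LN": 1000}]}))
```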

Modules¶

builder ¶

Classes for generating SAM and BAM files and records for testing.¶

This module contains utility classes for the generation of SAM and BAM files and alignment records, for use in testing.

Classes¶

SamBuilder ¶

Builder for constructing one or more SAM records (AlignedSegments in pysam terms).

Provides the ability to manufacture records from minimal arguments, while generating any remaining attributes to ensure a valid record.

A builder is constructed with a handful of defaults including lengths for generated R1s and R2s, the default base quality score to use, a sequence dictionary and a single read group.

Records are then added using the add_pair() method. Once accumulated the records can be accessed in the order in which they were created through the to_unsorted_list() function, or in a list sorted by coordinate order via to_sorted_list(). The latter creates a temporary file to do the sorting and is somewhat slower as a result. Lastly, the records can be written to a temporary file using to_path().
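The difference between creation order and coordinate order can be sketched with plain tuples. This is an illustration under simplifying assumptions, not the builder's implementation: the three-contig sequence dictionary and the `(name, chrom, start)` record shape are hypothetical, and unmapped reads and tie-breaking are ignored.

```python
# Hypothetical sequence dictionary: contig order defines coordinate sort order.
contigs = ["chr1", "chr2", "chr3"]
contig_index = {name: i for i, name in enumerate(contigs)}

# Records as (name, chrom, start), listed in the order they were created.
records = [
    ("q0000", "chr2", 500),
    ("q0001", "chr1", 900),
    ("q0002", "chr1", 100),
]


def coordinate_key(rec):
    """Sort by position of the contig in the dictionary, then by start."""
    _, chrom, start = rec
    return (contig_index[chrom], start)


unsorted_names = [name for name, _, _ in records]
sorted_names = [name for name, _, _ in sorted(records, key=coordinate_key)]
print(unsorted_names)  # ['q0000', 'q0001', 'q0002']
print(sorted_names)    # ['q0002', 'q0001', 'q0000']
```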

Source code in fgpyo/sam/builder.py

class SamBuilder:
    """
    Builder for constructing one or more SAM records (AlignedSegments in pysam terms).

    Provides the ability to manufacture records from minimal arguments, while generating
    any remaining attributes to ensure a valid record.

    A builder is constructed with a handful of defaults including lengths for generated R1s
    and R2s, the default base quality score to use, a sequence dictionary and a single read group.

    Records are then added using the [`add_pair()`][fgpyo.sam.builder.SamBuilder.add_pair]
    method.  Once accumulated the records can be accessed in the order in which they were created
    through the [`to_unsorted_list()`][fgpyo.sam.builder.SamBuilder.to_unsorted_list]
    function, or in a list sorted by coordinate order via
    [`to_sorted_list()`][fgpyo.sam.builder.SamBuilder.to_sorted_list].  The latter creates
    a temporary file to do the sorting and is somewhat slower as a result.  Lastly, the records can
    be written to a temporary file using
    [`to_path()`][fgpyo.sam.builder.SamBuilder.to_path].
    """

    # The default read one length
    DEFAULT_R1_LENGTH: int = 100

    # The default read two length
    DEFAULT_R2_LENGTH: int = 100

    @staticmethod
    def default_sd() -> list[dict[str, Any]]:
        """
        Generates the sequence dictionary that is used by default by SamBuilder.

        Matches the names and lengths of the HG19 reference in use in production.

        Returns:
            A new copy of the sequence dictionary as a list of dictionaries, one per chromosome.
        """
        return [
            {"SN": "chr1", "LN": 249250621},
            {"SN": "chr2", "LN": 243199373},
            {"SN": "chr3", "LN": 198022430},
            {"SN": "chr4", "LN": 191154276},
            {"SN": "chr5", "LN": 180915260},
            {"SN": "chr6", "LN": 171115067},
            {"SN": "chr7", "LN": 159138663},
            {"SN": "chr8", "LN": 146364022},
            {"SN": "chr9", "LN": 141213431},
            {"SN": "chr10", "LN": 135534747},
            {"SN": "chr11", "LN": 135006516},
            {"SN": "chr12", "LN": 133851895},
            {"SN": "chr13", "LN": 115169878},
            {"SN": "chr14", "LN": 107349540},
            {"SN": "chr15", "LN": 102531392},
            {"SN": "chr16", "LN": 90354753},
            {"SN": "chr17", "LN": 81195210},
            {"SN": "chr18", "LN": 78077248},
            {"SN": "chr19", "LN": 59128983},
            {"SN": "chr20", "LN": 63025520},
            {"SN": "chr21", "LN": 48129895},
            {"SN": "chr22", "LN": 51304566},
            {"SN": "chrX", "LN": 155270560},
            {"SN": "chrY", "LN": 59373566},
            {"SN": "chrM", "LN": 16571},
        ]

    @staticmethod
    def default_rg() -> dict[str, str]:
        """Returns the default read group used by the SamBuilder, as a dictionary."""
        return {"ID": "1", "SM": "1_AAAAAA", "LB": "default", "PL": "ILLUMINA", "PU": "xxx.1"}

    def __init__(
        self,
        r1_len: int | None = None,
        r2_len: int | None = None,
        base_quality: int = 30,
        mapping_quality: int = 60,
        sd: list[dict[str, Any]] | None = None,
        rg: dict[str, str] | None = None,
        extra_header: dict[str, Any] | None = None,
        seed: int = 42,
        sort_order: SamOrder = SamOrder.Coordinate,
    ) -> None:
        """
        Initializes a new SamBuilder for generating alignment records and SAM/BAM files.

        Args:
            r1_len: The length of R1s to create unless otherwise specified
            r2_len: The length of R2s to create unless otherwise specified
            base_quality: The base quality of bases to create unless otherwise specified
            mapping_quality: The mapping quality of records to create unless otherwise specified
            sd: a sequence dictionary as a list of dicts; defaults to calling default_sd() if None
            rg: a single read group as a dict; defaults to calling default_rg() if None
            extra_header: a dictionary of extra values to add to the header, None otherwise.  See
                          `pysam.AlignmentHeader` for more details.
            seed: a seed value for random number/string generation
            sort_order: Order to sort records when writing to file, or output of to_sorted_list()
        """
        self.r1_len: int = r1_len if r1_len is not None else self.DEFAULT_R1_LENGTH
        self.r2_len: int = r2_len if r2_len is not None else self.DEFAULT_R2_LENGTH
        self.base_quality: int = base_quality
        self.mapping_quality: int = mapping_quality

        if not isinstance(sort_order, SamOrder):
            raise ValueError(f"sort_order must be a SamOrder, got {type(sort_order)}")
        self._sort_order = sort_order

        self._header: dict[str, Any] = {
            "HD": {"VN": "1.5", "SO": sort_order.value},
            "SQ": (sd if sd is not None else SamBuilder.default_sd()),
            "RG": [(rg if rg is not None else SamBuilder.default_rg())],
        }
        if extra_header is not None:
            self._header = {**self._header, **extra_header}
        self._samheader = AlignmentHeader.from_dict(self._header)
        self._seq_lookup = {s["SN"]: s for s in self._header["SQ"]}

        self._random: Random = Random(seed)
        self._records: list[AlignedSegment] = []
        self._counter: int = 0

    def _next_name(self) -> str:
        """Returns the next available query/template name."""
        n = self._counter
        self._counter += 1
        return f"q{n:>04}"

    def _bases(self, length: int) -> str:
        """Returns a random string of bases of the length requested."""
        return "".join(self._random.choices("ACGT", k=length))

    def _new_rec(
        self,
        name: str,
        chrom: str,
        start: int,
        mapq: int | None,
        attrs: dict[str, Any] | None,
    ) -> AlignedSegment:
        """
        Generates a new AlignedSegment.

        Sets the segment up with the correct header and adds the RG attribute if not
        contained in attrs.

        Args:
            name: the name of the read/template
            chrom: the chromosome to which the read is mapped
            start: the start position of the read on the chromosome
            mapq: an optional mapping quality; use self.mapping_quality if None
            attrs: an optional dictionary of SAM attributes with two-char keys

        Returns:
            AlignedSegment: an aligned segment with name, chrom, pos, attributes, the
                read group, and the unmapped flag all set appropriately.
        """
        if chrom != sam.NO_REF_NAME and chrom not in self._seq_lookup:
            raise ValueError(f"{chrom} is not a valid chromosome name in this builder.")

        rec = AlignedSegment(header=self._samheader)
        rec.query_name = name
        rec.reference_name = chrom
        rec.reference_start = start
        rec.mapping_quality = mapq if mapq is not None else self.mapping_quality

        if chrom == sam.NO_REF_NAME or start == sam.NO_REF_POS:
            rec.is_unmapped = True
            rec.mapping_quality = 0

        # Copy so that adding the RG tag never mutates the caller's dictionary.
        attrs = dict(attrs) if attrs else {}
        if "RG" not in attrs:
            attrs["RG"] = self.rg_id()
        rec.set_tags(list(attrs.items()))
        return rec

    def _set_flags(
        self,
        rec: pysam.AlignedSegment,
        read_num: int | None,
        strand: str,
        secondary: bool = False,
        supplementary: bool = False,
    ) -> None:
        """
        Appropriately sets most flag fields on the given read.

        Args:
            rec: the read to set the flags on
            read_num: Either None for an unpaired read, or 1 or 2
            strand: Either "+" or "-" to indicate strand of the read
            secondary: If True, set the secondary alignment flag
            supplementary: If True, set the supplementary alignment flag
        """
        rec.is_paired = read_num is not None
        rec.is_read1 = read_num == 1
        rec.is_read2 = read_num == 2
        rec.is_qcfail = False
        rec.is_duplicate = False
        rec.is_secondary = secondary
        rec.is_supplementary = supplementary
        if not rec.is_unmapped:
            rec.is_reverse = strand != "+"

    def _set_length_dependent_fields(
        self,
        rec: pysam.AlignedSegment,
        length: int,
        bases: str | None = _UNSET,
        quals: list[int] | None = _UNSET,
        cigar: str | None = None,
    ) -> None:
        """
        Fills in bases, quals and cigar on a record.

        If any of bases, quals or cigar are defined, they must all have the same length/query
        length.  If none are defined then the length parameter is used.  Unspecified values are
        synthesized at the inferred length.  A caller may pass `None` explicitly for `bases` or
        `quals` to produce a record with no sequence or no qualities.

        Args:
            rec: a SAM record
            length: the length to use if none of bases/quals/cigar contribute a length
            bases: a string of bases for the read, or `None` to produce a record with no sequence.
                If omitted, a random sequence is synthesized.
            quals: a list of qualities for the read, or `None` to produce a record with no
                qualities. If omitted, the default base quality is used.
            cigar: an optional cigar string for the read
        """
        # Only concrete values contribute a length.  An explicit `None` is a user request for
        # a missing attribute and contributes nothing, just like an omitted argument.
        lengths = set()
        if bases is not _UNSET and bases is not None:
            lengths.add(len(bases))
        if quals is not _UNSET and quals is not None:
            lengths.add(len(quals))
        if cigar is not None:
            cig = sam.Cigar.from_cigarstring(cigar)
            lengths.add(sum(elem.length_on_query for elem in cig.elements))

        if not lengths:
            lengths.add(length)

        if len(lengths) != 1:
            raise ValueError("Provided bases/quals/cigar are not length compatible.")

        length = lengths.pop()
        resolved_bases = self._bases(length) if bases is _UNSET else bases
        resolved_quals = self._resolve_quals(quals, bases, length)

        # Assign sequence first: pysam resets query_qualities when query_sequence is assigned.
        rec.query_sequence = resolved_bases
        rec.query_qualities = resolved_quals

        if not rec.is_unmapped:
            rec.cigarstring = cigar if cigar else f"{length}M"

    def _resolve_quals(
        self,
        quals: list[int] | None,
        bases: str | None,
        length: int,
    ) -> "array[int] | None":
        """Resolves the value to assign to `query_qualities` from the caller's inputs."""
        if quals is None:
            return None
        if quals is not _UNSET:
            # Qualities without a sequence is explicitly disallowed by the SAM spec.
            if bases is None:
                raise ValueError("Cannot provide qualities when bases is None.")
            return array("B", quals)
        # Argument was omitted: synthesize unless the sequence was explicitly cleared.
        if bases is None:
            return None
        return array("B", [self.base_quality] * length)

    def rg(self) -> dict[str, Any]:
        """Returns the single read group that is defined in the header."""
        # The `RG` field contains a list of read group mappings
        # e.g. `[{"ID": "rg1", "PL": "ILLUMINA"}]`
        rgs = cast(list[dict[str, Any]], self._header["RG"])
        assert len(rgs) == 1, "Header did not contain exactly one read group!"
        return rgs[0]

    def rg_id(self) -> str:
        """Returns the ID of the single read group that is defined in the header."""
        # The read group mapping has mixed types of values (e.g. "PI" is numeric), but the "ID"
        # field is always a string.
        return cast(str, self.rg()["ID"])

    def add_pair(
        self,
        *,
        name: str | None = None,
        bases1: str | None = _UNSET,
        bases2: str | None = _UNSET,
        quals1: list[int] | None = _UNSET,
        quals2: list[int] | None = _UNSET,
        chrom: str | None = None,
        chrom1: str | None = None,
        chrom2: str | None = None,
        start1: int = sam.NO_REF_POS,
        start2: int = sam.NO_REF_POS,
        cigar1: str | None = None,
        cigar2: str | None = None,
        mapq1: int | None = None,
        mapq2: int | None = None,
        strand1: str = "+",
        strand2: str = "-",
        attrs: dict[str, Any] | None = None,
    ) -> tuple[AlignedSegment, AlignedSegment]:
        """
        Generates a new pair of reads, adds them to the internal collection, and returns them.

        Most fields are optional.

        Mapped pairs can be created by specifying both `start1` and `start2` and either `chrom`, for
        pairs where both reads map to the same contig, or both `chrom1` and `chrom2`, for pairs
        where reads map to different contigs. i.e.:

            - `add_pair(chrom, start1, start2)` will create a mapped pair where both reads map to
              the same contig (`chrom`).
            - `add_pair(chrom1, start1, chrom2, start2)` will create a mapped pair where the reads
              map to different contigs (`chrom1` and `chrom2`).

        A pair with only one of the two reads mapped can be created by setting only one start
        position. Flags will automatically be set correctly for the unmapped mate.

            - `add_pair(chrom, start1)`
            - `add_pair(chrom1, start1)`
            - `add_pair(chrom, start2)`
            - `add_pair(chrom2, start2)`

        An unmapped pair can be created by calling the method with no parameters (specifically,
        not setting `chrom`, `chrom1`, `start1`, `chrom2`, or `start2`). If either cigar is
        provided, it will be ignored.

        For a given read (i.e. R1 or R2) the length of the read is determined based on the presence
        or absence of bases, quals, and cigar.  If values are provided for one or more of these
        parameters, the lengths must match, and the length will be used to generate any
        unsupplied values.  If none of bases, quals, and cigar are provided, all three will be
        synthesized based on either the r1_len or r2_len stored on the class as appropriate.

        When synthesizing, bases are always a random sequence of bases, quals are all the default
        base quality (supplied when constructing a SamBuilder) and the cigar is always a single M
        operator of the read length.

        Args:
            name: The name of the template. If None is given a unique name will be auto-generated.
            bases1: The bases for R1. If omitted, a random sequence is generated. Pass `None`
                explicitly to produce a record with no sequence.
            bases2: The bases for R2. If omitted, a random sequence is generated. Pass `None`
                explicitly to produce a record with no sequence.
            quals1: The list of int qualities for R1. If omitted, the default base quality is used.
                Pass `None` explicitly to produce a record with no qualities.
            quals2: The list of int qualities for R2. If omitted, the default base quality is used.
                Pass `None` explicitly to produce a record with no qualities.
            chrom: The chromosome to which both reads are mapped. Defaults to the unmapped value.
            chrom1: The chromosome to which R1 is mapped. If None, `chrom` is used.
            chrom2: The chromosome to which R2 is mapped. If None, `chrom` is used.
            start1: The start position of R1. Defaults to the unmapped value.
            start2: The start position of R2. Defaults to the unmapped value.
            cigar1: The cigar string for R1. Defaults to None for unmapped reads, otherwise all M.
            cigar2: The cigar string for R2. Defaults to None for unmapped reads, otherwise all M.
            mapq1: Mapping quality for R1. Defaults to self.mapping_quality if None.
            mapq2: Mapping quality for R2. Defaults to self.mapping_quality if None.
            strand1: The strand for R1, either "+" or "-". Defaults to "+".
            strand2: The strand for R2, either "+" or "-". Defaults to "-".
            attrs: An optional dictionary of SAM attributes to place on both R1 and R2.

        Raises:
            ValueError: if either strand field is not "+" or "-"
            ValueError: if bases/quals/cigar are set in a way that is not self-consistent

        Returns:
            Tuple[AlignedSegment, AlignedSegment]: The pair of records created, R1 then R2.
        """
        if strand1 not in ["+", "-"]:
            raise ValueError(f"Invalid value for strand1: {strand1}")
        if strand2 not in ["+", "-"]:
            raise ValueError(f"Invalid value for strand2: {strand2}")

        name = name if name is not None else self._next_name()

        # Valid parameterizations for contig mapping (backward compatible):
        # - chrom, start1, start2
        # - chrom, start1
        # - chrom, start2
        # Valid parameterizations for contig mapping (new):
        # - chrom1, start1, chrom2, start2
        # - chrom1, start1
        # - chrom2, start2
        if chrom is not None and (chrom1 is not None or chrom2 is not None):
            raise ValueError("Cannot use chrom in combination with chrom1 or chrom2")

        chrom = sam.NO_REF_NAME if chrom is None else chrom

        if start1 != sam.NO_REF_POS:
            chrom1 = next(c for c in (chrom1, chrom) if c is not None)
        else:
            chrom1 = sam.NO_REF_NAME

        if start2 != sam.NO_REF_POS:
            chrom2 = next(c for c in (chrom2, chrom) if c is not None)
        else:
            chrom2 = sam.NO_REF_NAME

        if chrom1 == sam.NO_REF_NAME and start1 != sam.NO_REF_POS:
            raise ValueError("start1 cannot be used on its own - specify chrom or chrom1")

        if chrom2 == sam.NO_REF_NAME and start2 != sam.NO_REF_POS:
            raise ValueError("start2 cannot be used on its own - specify chrom or chrom2")

        # Setup R1
        r1 = self._new_rec(name=name, chrom=chrom1, start=start1, mapq=mapq1, attrs=attrs)
        self._set_flags(r1, read_num=1, strand=strand1)
        self._set_length_dependent_fields(
            rec=r1, length=self.r1_len, bases=bases1, quals=quals1, cigar=cigar1
        )

        # Setup R2
        r2 = self._new_rec(name=name, chrom=chrom2, start=start2, mapq=mapq2, attrs=attrs)
        self._set_flags(r2, read_num=2, strand=strand2)
        self._set_length_dependent_fields(
            rec=r2, length=self.r2_len, bases=bases2, quals=quals2, cigar=cigar2
        )

        # Sync up mate info and we're done!
        sam.set_mate_info(r1, r2)
        self._records.append(r1)
        self._records.append(r2)
        return r1, r2

    def add_single(
        self,
        *,
        name: str | None = None,
        read_num: int | None = None,
        bases: str | None = _UNSET,
        quals: list[int] | None = _UNSET,
        chrom: str = sam.NO_REF_NAME,
        start: int = sam.NO_REF_POS,
        cigar: str | None = None,
        mapq: int | None = None,
        strand: str = "+",
        secondary: bool = False,
        supplementary: bool = False,
        attrs: dict[str, Any] | None = None,
    ) -> AlignedSegment:
        """
        Generates a new single read, adds it to the internal collection, and returns it.

        Most fields are optional.

        If `read_num` is None (the default) an unpaired read will be created.  If `read_num` is
        set to 1 or 2, the read will have its paired flag set and read number flags set.

        An unmapped read can be created by calling the method with no parameters (specifically,
        not setting chrom or start).  If cigar is provided, it will be ignored.

        A mapped read is created by providing chrom and start.

        The length of the read is determined based on the presence or absence of bases, quals,
        and cigar.  If values are provided for one or more of these parameters, the lengths must
        match, and the length will be used to generate any unsupplied values.  If none of bases,
        quals, and cigar are provided, all three will be synthesized based on either the r1_len
        or r2_len stored on the class as appropriate.

        When synthesizing, bases are always a random sequence of bases, quals are all the default
        base quality (supplied when constructing a SamBuilder) and the cigar is always a single M
        operator of the read length.

        Args:
            name: The name of the template. If None is given a unique name will be auto-generated.
            read_num: Either None, 1 for R1 or 2 for R2
            bases: The bases for the read. If omitted, a random sequence is generated. Pass
                `None` explicitly to produce a record with no sequence.
            quals: The list of qualities for the read. If omitted, the default base quality is
                used. Pass `None` explicitly to produce a record with no qualities.
            chrom: The chromosome to which the read is mapped. Defaults to the unmapped value.
            start: The start position of the read. Defaults to the unmapped value.
            cigar: The cigar string for the read. Defaults to None for unmapped reads, otherwise
                all M.
            mapq: Mapping quality for the read. Defaults to self.mapping_quality if not given.
            strand: The strand for the read, either "+" or "-". Defaults to "+".
            secondary: If true the read will be flagged as secondary
            supplementary: If true the read will be flagged as supplementary
            attrs: An optional dictionary of SAM attributes to place on the read.

        Raises:
            ValueError: if strand field is not "+" or "-"
            ValueError: if read_num is not None, 1 or 2
            ValueError: if bases/quals/cigar are set in a way that is not self-consistent

        Returns:
            AlignedSegment: The record created
        """
        if strand not in ["+", "-"]:
            raise ValueError(f"Invalid value for strand: {strand}")
        if read_num not in [None, 1, 2]:
            raise ValueError(f"Invalid value for read_num: {read_num}")

        name = name if name is not None else self._next_name()

        # Setup the read
        read_len = self.r1_len if read_num != 2 else self.r2_len
        rec = self._new_rec(name=name, chrom=chrom, start=start, mapq=mapq, attrs=attrs)
        self._set_flags(
            rec, read_num=read_num, strand=strand, secondary=secondary, supplementary=supplementary
        )
        self._set_length_dependent_fields(
            rec=rec, length=read_len, bases=bases, quals=quals, cigar=cigar
        )

        self._records.append(rec)
        return rec

    def to_path(  # noqa: C901
        self,
        path: Path | None = None,
        index: bool = True,
        pred: Callable[[AlignedSegment], bool] = lambda _r: True,
        tmp_file_type: sam.SamFileType | None = None,
    ) -> Path:
        """
        Writes the accumulated records to a file, sorts & indexes it, and returns the Path.

        If a path is provided, it will be written to, otherwise a temporary file is created
        and returned.

        If `path` is provided, `tmp_file_type` must not be provided; the file type
        (SAM/BAM/CRAM) is then determined automatically from the file extension of `path`.
        See `~pysam` for more details.

        If `path` is not provided, the file type will default to BAM unless `tmp_file_type` is
        provided.

        Args:
            path: a path at which to write the file, otherwise a temp file is used.
            index: if True and `sort_order` is `Coordinate` and output is a BAM/CRAM file, then
                   an index is generated, otherwise not.
            pred: optional predicate to specify which reads should be output
            tmp_file_type: the file type to output when a path is not provided (default is BAM)

        Returns:
            Path: The path to the sorted (and possibly indexed) file.
        """
        if path is not None:
            # Get the file type if a path was given (in this case, a file type may not be
            # provided too)
            if tmp_file_type is not None:
                raise ValueError("Both `path` and `tmp_file_type` cannot be provided.")
            tmp_file_type = sam.SamFileType.from_path(path)
        elif tmp_file_type is None:
            # Default to BAM when neither a path nor a file type was given
            tmp_file_type = sam.SamFileType.BAM

        # Get the extension, and create a path if none was given
        ext = tmp_file_type.extension
        if path is None:
            with NamedTemporaryFile(suffix=ext, delete=False) as fp:
                path = Path(fp.name)

        with NamedTemporaryFile(suffix=ext, delete=True) as fp:
            file_handle: IO
            if self._sort_order in {SamOrder.Unsorted, SamOrder.Unknown}:
                file_handle = path.open("w")
            else:
                file_handle = fp.file

            with sam.writer(file_handle, header=self._samheader, file_type=tmp_file_type) as writer:
                for rec in self._records:
                    if pred(rec):
                        writer.write(rec)

            samtools_sort_args = ["-o", str(path), fp.name]

            file_handle.close()
            if self._sort_order == SamOrder.QueryName:
                pysam.sort("-n", *samtools_sort_args)
            elif self._sort_order == SamOrder.Coordinate:
                if index and tmp_file_type.indexable:
                    samtools_sort_args.insert(0, "--write-index")
                pysam.sort(*samtools_sort_args)

        return path

    def __len__(self) -> int:
        """Returns the number of records accumulated so far."""
        return len(self._records)

    def to_unsorted_list(self) -> list[pysam.AlignedSegment]:
        """Returns the accumulated records in the order they were created."""
        return list(self._records)

    def to_sorted_list(self) -> list[pysam.AlignedSegment]:
        """Returns the accumulated records in the builder's sort order (coordinate by default)."""
        with NamedTemporaryFile(suffix=".bam", delete=True) as fp:
            filename = fp.name
            path = self.to_path(path=Path(filename), index=False)
            # Close the reader once the records have been materialized.
            with sam.reader(path) as bam:
                return list(bam)

    @property
    def header(self) -> AlignmentHeader:
        """Returns the builder's SAM header."""
        return self._samheader
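The length rules documented for `add_pair()` and `add_single()` distinguish an omitted argument from an explicit `None`, and require every supplied value to agree on one query length. The sketch below reproduces that resolution with the standard library only. The names `resolve_length` and `cigar_query_length` are hypothetical, and the CIGAR parsing is a simplified regular expression rather than fgpyo's `Cigar` class; it relies on the SAM specification's rule that `M`, `I`, `S`, `=` and `X` consume query bases.

```python
import re

_UNSET = object()  # sentinel meaning "argument omitted", distinct from an explicit None

_CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
_CONSUMES_QUERY = set("MIS=X")  # operators that consume query bases per the SAM spec


def cigar_query_length(cigar: str) -> int:
    """Sum the lengths of the CIGAR operators that consume query bases."""
    return sum(int(n) for n, op in _CIGAR_RE.findall(cigar) if op in _CONSUMES_QUERY)


def resolve_length(default: int, bases=_UNSET, quals=_UNSET, cigar=None) -> int:
    """Return the single length implied by the inputs, else `default` (hypothetical)."""
    lengths = set()
    # Only concrete values contribute; omitted and explicit-None values do not.
    if bases is not _UNSET and bases is not None:
        lengths.add(len(bases))
    if quals is not _UNSET and quals is not None:
        lengths.add(len(quals))
    if cigar is not None:
        lengths.add(cigar_query_length(cigar))
    if not lengths:
        lengths.add(default)
    if len(lengths) != 1:
        raise ValueError("Provided bases/quals/cigar are not length compatible.")
    return lengths.pop()


print(resolve_length(100))                     # 100: nothing supplied
print(resolve_length(100, bases="ACGT"))       # 4: taken from the bases
print(resolve_length(100, cigar="5S10M2D3I"))  # 18: 5S + 10M + 3I, the deletion is skipped
```

Using a dedicated sentinel rather than `None` as the default is what lets a caller pass `None` to mean "no sequence" or "no qualities" while still getting synthesized values when the argument is left out.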

Attributes¶

header: AlignmentHeader

Returns the builder's SAM header.

Functions¶

__init__ ¶

__init__(r1_len: int | None = None, r2_len: int | None = None, base_quality: int = 30, mapping_quality: int = 60, sd: list[dict[str, Any]] | None = None, rg: dict[str, str] | None = None, extra_header: dict[str, Any] | None = None, seed: int = 42, sort_order: SamOrder = Coordinate) -> None

Initializes a new SamBuilder for generating alignment records and SAM/BAM files.

Parameters:

Name	Type	Description	Default
`r1_len`	`int \| None`	The length of R1s to create unless otherwise specified	`None`
`r2_len`	`int \| None`	The length of R2s to create unless otherwise specified	`None`
`base_quality`	`int`	The base quality of bases to create unless otherwise specified	`30`
`mapping_quality`	`int`	The mapping quality of records to create unless otherwise specified	`60`
`sd`	`list[dict[str, Any]] \| None`	a sequence dictionary as a list of dicts; defaults to calling default_sd() if None	`None`
`rg`	`dict[str, str] \| None`	a single read group as a dict; defaults to calling default_rg() if None	`None`
`extra_header`	`dict[str, Any] \| None`	a dictionary of extra values to add to the header, None otherwise. See `pysam.AlignmentHeader` for more details.	`None`
`seed`	`int`	a seed value for random number/string generation	`42`
`sort_order`	`SamOrder`	Order to sort records when writing to file, or output of to_sorted_list()	`Coordinate`

Source code in fgpyo/sam/builder.py

def __init__(
    self,
    r1_len: int | None = None,
    r2_len: int | None = None,
    base_quality: int = 30,
    mapping_quality: int = 60,
    sd: list[dict[str, Any]] | None = None,
    rg: dict[str, str] | None = None,
    extra_header: dict[str, Any] | None = None,
    seed: int = 42,
    sort_order: SamOrder = SamOrder.Coordinate,
) -> None:
    """
    Initializes a new SamBuilder for generating alignment records and SAM/BAM files.

    Args:
        r1_len: The length of R1s to create unless otherwise specified
        r2_len: The length of R2s to create unless otherwise specified
        base_quality: The base quality of bases to create unless otherwise specified
        mapping_quality: The mapping quality of records to create unless otherwise specified
        sd: a sequence dictionary as a list of dicts; defaults to calling default_sd() if None
        rg: a single read group as a dict; defaults to calling default_rg() if None
        extra_header: a dictionary of extra values to add to the header, None otherwise.  See
                      `pysam.AlignmentHeader` for more details.
        seed: a seed value for random number/string generation
        sort_order: Order to sort records when writing to file, or output of to_sorted_list()
    """
    self.r1_len: int = r1_len if r1_len is not None else self.DEFAULT_R1_LENGTH
    self.r2_len: int = r2_len if r2_len is not None else self.DEFAULT_R2_LENGTH
    self.base_quality: int = base_quality
    self.mapping_quality: int = mapping_quality

    if not isinstance(sort_order, SamOrder):
        raise ValueError(f"sort_order must be a SamOrder, got {type(sort_order)}")
    self._sort_order = sort_order

    self._header: dict[str, Any] = {
        "HD": {"VN": "1.5", "SO": sort_order.value},
        "SQ": (sd if sd is not None else SamBuilder.default_sd()),
        "RG": [(rg if rg is not None else SamBuilder.default_rg())],
    }
    if extra_header is not None:
        self._header = {**self._header, **extra_header}
    self._samheader = AlignmentHeader.from_dict(self._header)
    self._seq_lookup = dict([(s["SN"], s) for s in self._header["SQ"]])

    self._random: Random = Random(seed)
    self._records: list[AlignedSegment] = []
    self._counter: int = 0
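
The header assembly above is a shallow dictionary merge, so a key in `extra_header` that collides with `HD`, `SQ`, or `RG` replaces that entry wholesale rather than being merged into it. The following is a minimal, stdlib-only sketch of that behaviour. It assumes a hypothetical helper `build_header` (not part of fgpyo), plain dicts in place of `AlignmentHeader`, and a plain string for the sort order:

```python
from __future__ import annotations

from typing import Any


def build_header(
    sd: list[dict[str, Any]],
    rg: dict[str, str],
    sort_order: str = "coordinate",  # assumed string form of the sort order
    extra_header: dict[str, Any] | None = None,
) -> dict[str, Any]:
    """Hypothetical stand-in mirroring the dict merge performed in __init__."""
    header: dict[str, Any] = {
        "HD": {"VN": "1.5", "SO": sort_order},
        "SQ": sd,
        "RG": [rg],
    }
    if extra_header is not None:
        # Shallow merge: colliding top-level keys are replaced, not combined.
        header = {**header, **extra_header}
    return header


sd = [{"SN": "chr1", "LN": 1000}]
rg = {"ID": "1", "SM": "sample"}

# A new top-level key is simply added alongside HD/SQ/RG.
with_pg = build_header(sd, rg, extra_header={"PG": [{"ID": "tool"}]})

# A colliding key replaces the whole entry, so the "SO" field is lost here.
replaced_hd = build_header(sd, rg, extra_header={"HD": {"VN": "1.6"}})
```

If the sketch reflects the real behaviour, callers who override `HD` through `extra_header` would need to supply every field they want to keep, including `SO`.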

__len__ ¶

__len__() -> int

Returns the number of records accumulated so far.

Source code in fgpyo/sam/builder.py

def __len__(self) -> int:
    """Returns the number of records accumulated so far."""
    return len(self._records)

add_pair ¶

add_pair(*, name: str | None = None, bases1: str | None = _UNSET, bases2: str | None = _UNSET, quals1: list[int] | None = _UNSET, quals2: list[int] | None = _UNSET, chrom: str | None = None, chrom1: str | None = None, chrom2: str | None = None, start1: int = NO_REF_POS, start2: int = NO_REF_POS, cigar1: str | None = None, cigar2: str | None = None, mapq1: int | None = None, mapq2: int | None = None, strand1: str = '+', strand2: str = '-', attrs: dict[str, Any] | None = None) -> tuple[AlignedSegment, AlignedSegment]

Generates a new pair of reads, adds them to the internal collection, and returns them.

Most fields are optional.

Mapped pairs can be created by specifying both start1 and start2 and either chrom, for pairs where both reads map to the same contig, or both chrom1 and chrom2, for pairs where reads map to different contigs. i.e.:

- `add_pair(chrom, start1, start2)` will create a mapped pair where both reads map to
  the same contig (`chrom`).
- `add_pair(chrom1, start1, chrom2, start2)` will create a mapped pair where the reads
  map to different contigs (`chrom1` and `chrom2`).

A pair with only one of the two reads mapped can be created by setting only one start position. Flags will automatically be set correctly for the unmapped mate.

- `add_pair(chrom, start1)`
- `add_pair(chrom1, start1)`
- `add_pair(chrom, start2)`
- `add_pair(chrom2, start2)`

An unmapped pair can be created by calling the method with no parameters (specifically, not setting chrom, chrom1, start1, chrom2, or start2). If either cigar is provided, it will be ignored.

For a given read (i.e. R1 or R2) the length of the read is determined based on the presence or absence of bases, quals, and cigar. If values are provided for one or more of these parameters, the lengths must match, and the length will be used to generate any unsupplied values. If none of bases, quals, and cigar are provided, all three will be synthesized based on either the r1_len or r2_len stored on the class as appropriate.

When synthesizing, bases are always a random sequence of bases, quals are all the default base quality (supplied when constructing a SamBuilder) and the cigar is always a single M operator of the read length.
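
The length rule described above can be sketched in plain Python. This is an illustration only: `resolve_read_length` and `cigar_query_length` are hypothetical helpers, not fgpyo functions, and the sketch treats `None` as "not supplied", whereas `add_pair` distinguishes an omitted value from an explicit `None`.

```python
from __future__ import annotations

import re

# CIGAR operators that consume query (read) bases, per the SAM specification.
QUERY_CONSUMING_OPS = set("MIS=X")


def cigar_query_length(cigar: str) -> int:
    """Sums the lengths of the CIGAR operators that consume query bases."""
    return sum(
        int(length)
        for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in QUERY_CONSUMING_OPS
    )


def resolve_read_length(
    default_length: int,
    bases: str | None = None,
    quals: list[int] | None = None,
    cigar: str | None = None,
) -> int:
    """Hypothetical helper: picks the read length implied by the supplied values."""
    lengths: set[int] = set()
    if bases is not None:
        lengths.add(len(bases))
    if quals is not None:
        lengths.add(len(quals))
    if cigar is not None:
        lengths.add(cigar_query_length(cigar))
    if len(lengths) > 1:
        raise ValueError(f"bases/quals/cigar imply different lengths: {sorted(lengths)}")
    # Nothing supplied: fall back to the builder-level default (r1_len or r2_len).
    return lengths.pop() if lengths else default_length
```

For instance, `resolve_read_length(100, bases="ACGT")` yields 4, while supplying four bases together with a `10M` cigar raises `ValueError`.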

Parameters:

Name	Type	Description	Default
`name`	`str \| None`	The name of the template. If None is given a unique name will be auto-generated.	`None`
`bases1`	`str \| None`	The bases for R1. If omitted, a random sequence is generated. Pass `None` explicitly to produce a record with no sequence.	`_UNSET`
`bases2`	`str \| None`	The bases for R2. If omitted, a random sequence is generated. Pass `None` explicitly to produce a record with no sequence.	`_UNSET`
`quals1`	`list[int] \| None`	The list of int qualities for R1. If omitted, the default base quality is used. Pass `None` explicitly to produce a record with no qualities.	`_UNSET`
`quals2`	`list[int] \| None`	The list of int qualities for R2. If omitted, the default base quality is used. Pass `None` explicitly to produce a record with no qualities.	`_UNSET`
`chrom`	`str \| None`	The chromosome to which both reads are mapped. Defaults to the unmapped value.	`None`
`chrom1`	`str \| None`	The chromosome to which R1 is mapped. If None, `chrom` is used.	`None`
`chrom2`	`str \| None`	The chromosome to which R2 is mapped. If None, `chrom` is used.	`None`
`start1`	`int`	The start position of R1. Defaults to the unmapped value.	`NO_REF_POS`
`start2`	`int`	The start position of R2. Defaults to the unmapped value.	`NO_REF_POS`
`cigar1`	`str \| None`	The cigar string for R1. Defaults to None for unmapped reads, otherwise all M.	`None`
`cigar2`	`str \| None`	The cigar string for R2. Defaults to None for unmapped reads, otherwise all M.	`None`
`mapq1`	`int \| None`	Mapping quality for R1. Defaults to self.mapping_quality if None.	`None`
`mapq2`	`int \| None`	Mapping quality for R2. Defaults to self.mapping_quality if None.	`None`
`strand1`	`str`	The strand for R1, either "+" or "-". Defaults to "+".	`'+'`
`strand2`	`str`	The strand for R2, either "+" or "-". Defaults to "-".	`'-'`
`attrs`	`dict[str, Any] \| None`	An optional dictionary of SAM attributes to place on both R1 and R2.	`None`

Raises:

Type	Description
`ValueError`	if either strand field is not "+" or "-"
`ValueError`	if bases/quals/cigar are set in a way that is not self-consistent

Returns:

Type	Description
`tuple[AlignedSegment, AlignedSegment]`	The pair of records created, R1 then R2.

Source code in fgpyo/sam/builder.py

def add_pair(
    self,
    *,
    name: str | None = None,
    bases1: str | None = _UNSET,
    bases2: str | None = _UNSET,
    quals1: list[int] | None = _UNSET,
    quals2: list[int] | None = _UNSET,
    chrom: str | None = None,
    chrom1: str | None = None,
    chrom2: str | None = None,
    start1: int = sam.NO_REF_POS,
    start2: int = sam.NO_REF_POS,
    cigar1: str | None = None,
    cigar2: str | None = None,
    mapq1: int | None = None,
    mapq2: int | None = None,
    strand1: str = "+",
    strand2: str = "-",
    attrs: dict[str, Any] | None = None,
) -> tuple[AlignedSegment, AlignedSegment]:
    """
    Generates a new pair of reads, adds them to the internal collection, and returns them.

    Most fields are optional.

    Mapped pairs can be created by specifying both `start1` and `start2` and either `chrom`, for
    pairs where both reads map to the same contig, or both `chrom1` and `chrom2`, for pairs
    where reads map to different contigs. i.e.:

        - `add_pair(chrom, start1, start2)` will create a mapped pair where both reads map to
          the same contig (`chrom`).
        - `add_pair(chrom1, start1, chrom2, start2)` will create a mapped pair where the reads
          map to different contigs (`chrom1` and `chrom2`).

    A pair with only one of the two reads mapped can be created by setting only one start
    position. Flags will automatically be set correctly for the unmapped mate.

        - `add_pair(chrom, start1)`
        - `add_pair(chrom1, start1)`
        - `add_pair(chrom, start2)`
        - `add_pair(chrom2, start2)`

    An unmapped pair can be created by calling the method with no parameters (specifically,
    not setting `chrom`, `chrom1`, `start1`, `chrom2`, or `start2`). If either cigar is
    provided, it will be ignored.

    For a given read (i.e. R1 or R2) the length of the read is determined based on the presence
    or absence of bases, quals, and cigar.  If values are provided for one or more of these
    parameters, the lengths must match, and the length will be used to generate any
    unsupplied values.  If none of bases, quals, and cigar are provided, all three will be
    synthesized based on either the r1_len or r2_len stored on the class as appropriate.

    When synthesizing, bases are always a random sequence of bases, quals are all the default
    base quality (supplied when constructing a SamBuilder) and the cigar is always a single M
    operator of the read length.

    Args:
        name: The name of the template. If None is given a unique name will be auto-generated.
        bases1: The bases for R1. If omitted, a random sequence is generated. Pass `None`
            explicitly to produce a record with no sequence.
        bases2: The bases for R2. If omitted, a random sequence is generated. Pass `None`
            explicitly to produce a record with no sequence.
        quals1: The list of int qualities for R1. If omitted, the default base quality is used.
            Pass `None` explicitly to produce a record with no qualities.
        quals2: The list of int qualities for R2. If omitted, the default base quality is used.
            Pass `None` explicitly to produce a record with no qualities.
        chrom: The chromosome to which both reads are mapped. Defaults to the unmapped value.
        chrom1: The chromosome to which R1 is mapped. If None, `chrom` is used.
        chrom2: The chromosome to which R2 is mapped. If None, `chrom` is used.
        start1: The start position of R1. Defaults to the unmapped value.
        start2: The start position of R2. Defaults to the unmapped value.
        cigar1: The cigar string for R1. Defaults to None for unmapped reads, otherwise all M.
        cigar2: The cigar string for R2. Defaults to None for unmapped reads, otherwise all M.
        mapq1: Mapping quality for R1. Defaults to self.mapping_quality if None.
        mapq2: Mapping quality for R2. Defaults to self.mapping_quality if None.
        strand1: The strand for R1, either "+" or "-". Defaults to "+".
        strand2: The strand for R2, either "+" or "-". Defaults to "-".
        attrs: An optional dictionary of SAM attributes to place on both R1 and R2.

    Raises:
        ValueError: if either strand field is not "+" or "-"
        ValueError: if bases/quals/cigar are set in a way that is not self-consistent

    Returns:
        Tuple[AlignedSegment, AlignedSegment]: The pair of records created, R1 then R2.
    """
    if strand1 not in ["+", "-"]:
        raise ValueError(f"Invalid value for strand1: {strand1}")
    if strand2 not in ["+", "-"]:
        raise ValueError(f"Invalid value for strand2: {strand2}")

    name = name if name is not None else self._next_name()

    # Valid parameterizations for contig mapping (backward compatible):
    # - chrom, start1, start2
    # - chrom, start1
    # - chrom, start2
    # Valid parameterizations for contig mapping (new):
    # - chrom1, start1, chrom2, start2
    # - chrom1, start1
    # - chrom2, start2
    if chrom is not None and (chrom1 is not None or chrom2 is not None):
        raise ValueError("Cannot use chrom in combination with chrom1 or chrom2")

    chrom = sam.NO_REF_NAME if chrom is None else chrom

    if start1 != sam.NO_REF_POS:
        chrom1 = next(c for c in (chrom1, chrom) if c is not None)
    else:
        chrom1 = sam.NO_REF_NAME

    if start2 != sam.NO_REF_POS:
        chrom2 = next(c for c in (chrom2, chrom) if c is not None)
    else:
        chrom2 = sam.NO_REF_NAME

    if chrom1 == sam.NO_REF_NAME and start1 != sam.NO_REF_POS:
        raise ValueError("start1 cannot be used on its own - specify chrom or chrom1")

    if chrom2 == sam.NO_REF_NAME and start2 != sam.NO_REF_POS:
        raise ValueError("start2 cannot be used on its own - specify chrom or chrom2")

    # Setup R1
    r1 = self._new_rec(name=name, chrom=chrom1, start=start1, mapq=mapq1, attrs=attrs)
    self._set_flags(r1, read_num=1, strand=strand1)
    self._set_length_dependent_fields(
        rec=r1, length=self.r1_len, bases=bases1, quals=quals1, cigar=cigar1
    )

    # Setup R2
    r2 = self._new_rec(name=name, chrom=chrom2, start=start2, mapq=mapq2, attrs=attrs)
    self._set_flags(r2, read_num=2, strand=strand2)
    self._set_length_dependent_fields(
        rec=r2, length=self.r2_len, bases=bases2, quals=quals2, cigar=cigar2
    )

    # Sync up mate info and we're done!
    sam.set_mate_info(r1, r2)
    self._records.append(r1)
    self._records.append(r2)
    return r1, r2
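
The contig-selection rules in the listing above can be isolated into a small standalone sketch. `resolve_contigs` is a hypothetical helper written for illustration, and the sentinel values `"*"` and `-1` are assumptions standing in for `sam.NO_REF_NAME` and `sam.NO_REF_POS`.

```python
from __future__ import annotations

# Assumed sentinel values standing in for sam.NO_REF_NAME and sam.NO_REF_POS.
NO_REF_NAME = "*"
NO_REF_POS = -1


def resolve_contigs(
    chrom: str | None = None,
    chrom1: str | None = None,
    chrom2: str | None = None,
    start1: int = NO_REF_POS,
    start2: int = NO_REF_POS,
) -> tuple[str, str]:
    """Hypothetical helper mirroring how add_pair chooses each read's contig."""
    if chrom is not None and (chrom1 is not None or chrom2 is not None):
        raise ValueError("Cannot use chrom in combination with chrom1 or chrom2")
    shared = NO_REF_NAME if chrom is None else chrom

    def pick(specific: str | None, start: int, label: str) -> str:
        # A read without a start position is unmapped, whatever contig was given.
        if start == NO_REF_POS:
            return NO_REF_NAME
        contig = specific if specific is not None else shared
        if contig == NO_REF_NAME:
            raise ValueError(f"{label} cannot be used on its own")
        return contig

    return pick(chrom1, start1, "start1"), pick(chrom2, start2, "start2")
```

Under these assumptions, `resolve_contigs(chrom="chr1", start1=100)` gives `("chr1", "*")`: R1 is mapped and R2 is left unmapped because no `start2` was supplied.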

add_single ¶

add_single(*, name: str | None = None, read_num: int | None = None, bases: str | None = _UNSET, quals: list[int] | None = _UNSET, chrom: str = NO_REF_NAME, start: int = NO_REF_POS, cigar: str | None = None, mapq: int | None = None, strand: str = '+', secondary: bool = False, supplementary: bool = False, attrs: dict[str, Any] | None = None) -> AlignedSegment

Generates a new single read, adds it to the internal collection, and returns it.

Most fields are optional.

If read_num is None (the default) an unpaired read will be created. If read_num is set to 1 or 2, the read will have its paired flag and read number flags set.

An unmapped read can be created by calling the method with no parameters (specifically, not setting chrom or start). If cigar is provided, it will be ignored.

A mapped read is created by providing chrom and start.

The length of the read is determined based on the presence or absence of bases, quals, and cigar. If values are provided for one or more of these parameters, the lengths must match, and the length will be used to generate any unsupplied values. If none of bases, quals, and cigar are provided, all three will be synthesized based on either the r1_len or r2_len stored on the class as appropriate.

When synthesizing, bases are always a random sequence of bases, quals are all the default base quality (supplied when constructing a SamBuilder) and the cigar is always a single M operator of the read length.

Parameters:

Name	Type	Description	Default
`name`	`str \| None`	The name of the template. If None is given a unique name will be auto-generated.	`None`
`read_num`	`int \| None`	Either None, 1 for R1 or 2 for R2	`None`
`bases`	`str \| None`	The bases for the read. If omitted, a random sequence is generated. Pass `None` explicitly to produce a record with no sequence.	`_UNSET`
`quals`	`list[int] \| None`	The list of qualities for the read. If omitted, the default base quality is used. Pass `None` explicitly to produce a record with no qualities.	`_UNSET`
`chrom`	`str`	The chromosome to which the read is mapped. Defaults to the unmapped value.	`NO_REF_NAME`
`start`	`int`	The start position of the read. Defaults to the unmapped value.	`NO_REF_POS`
`cigar`	`str \| None`	The cigar string for the read. Defaults to None for unmapped reads, otherwise all M.	`None`
`mapq`	`int \| None`	Mapping quality for the read. Defaults to self.mapping_quality if not given.	`None`
`strand`	`str`	The strand for the read, either "+" or "-". Defaults to "+".	`'+'`
`secondary`	`bool`	If true the read will be flagged as secondary	`False`
`supplementary`	`bool`	If true the read will be flagged as supplementary	`False`
`attrs`	`dict[str, Any] \| None`	An optional dictionary of SAM attributes to place on the read.	`None`

Raises:

Type	Description
`ValueError`	if strand field is not "+" or "-"
`ValueError`	if read_num is not None, 1 or 2
`ValueError`	if bases/quals/cigar are set in a way that is not self-consistent

Returns:

Name	Type	Description
`AlignedSegment`	`AlignedSegment`	The record created

Source code in fgpyo/sam/builder.py

def add_single(
    self,
    *,
    name: str | None = None,
    read_num: int | None = None,
    bases: str | None = _UNSET,
    quals: list[int] | None = _UNSET,
    chrom: str = sam.NO_REF_NAME,
    start: int = sam.NO_REF_POS,
    cigar: str | None = None,
    mapq: int | None = None,
    strand: str = "+",
    secondary: bool = False,
    supplementary: bool = False,
    attrs: dict[str, Any] | None = None,
) -> AlignedSegment:
    """
    Generates a new single read, adds it to the internal collection, and returns it.

    Most fields are optional.

    If `read_num` is None (the default) an unpaired read will be created.  If `read_num` is
    set to 1 or 2, the read will have its paired flag and read number flags set.

    An unmapped read can be created by calling the method with no parameters (specifically,
    not setting chrom or start).  If cigar is provided, it will be ignored.

    A mapped read is created by providing chrom and start.

    The length of the read is determined based on the presence or absence of bases, quals,
    and cigar.  If values are provided for one or more of these parameters, the lengths must
    match, and the length will be used to generate any unsupplied values.  If none of bases,
    quals, and cigar are provided, all three will be synthesized based on either the r1_len
    or r2_len stored on the class as appropriate.

    When synthesizing, bases are always a random sequence of bases, quals are all the default
    base quality (supplied when constructing a SamBuilder) and the cigar is always a single M
    operator of the read length.

    Args:
        name: The name of the template. If None is given a unique name will be auto-generated.
        read_num: Either None, 1 for R1 or 2 for R2
        bases: The bases for the read. If omitted, a random sequence is generated. Pass
            `None` explicitly to produce a record with no sequence.
        quals: The list of qualities for the read. If omitted, the default base quality is
            used. Pass `None` explicitly to produce a record with no qualities.
        chrom: The chromosome to which the read is mapped. Defaults to the unmapped value.
        start: The start position of the read. Defaults to the unmapped value.
        cigar: The cigar string for the read. Defaults to None for unmapped reads, otherwise all M.
        mapq: Mapping quality for the read. Defaults to self.mapping_quality if not given.
        strand: The strand for the read, either "+" or "-". Defaults to "+".
        secondary: If true the read will be flagged as secondary
        supplementary: If true the read will be flagged as supplementary
        attrs: An optional dictionary of SAM attributes to place on the read.

    Raises:
        ValueError: if strand field is not "+" or "-"
        ValueError: if read_num is not None, 1 or 2
        ValueError: if bases/quals/cigar are set in a way that is not self-consistent

    Returns:
        AlignedSegment: The record created
    """
    if strand not in ["+", "-"]:
        raise ValueError(f"Invalid value for strand: {strand}")
    if read_num not in [None, 1, 2]:
        raise ValueError(f"Invalid value for read_num: {read_num}")

    name = name if name is not None else self._next_name()

    # Setup the read
    read_len = self.r1_len if read_num != 2 else self.r2_len
    rec = self._new_rec(name=name, chrom=chrom, start=start, mapq=mapq, attrs=attrs)
    self._set_flags(
        rec, read_num=read_num, strand=strand, secondary=secondary, supplementary=supplementary
    )
    self._set_length_dependent_fields(
        rec=rec, length=read_len, bases=bases, quals=quals, cigar=cigar
    )

    self._records.append(rec)
    return rec
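
The way `read_num`, `strand`, `secondary`, and `supplementary` map onto SAM flag bits can be illustrated with a standalone sketch. The bit values come from the SAM specification; `sketch_flags` is a hypothetical helper, and the body of `_set_flags` is not shown here, so this should not be read as its implementation. The unmapped and mate-related bits are deliberately left out.

```python
from __future__ import annotations

# Flag bits as defined in the SAM specification.
PAIRED = 0x1
REVERSE = 0x10
READ1 = 0x40
READ2 = 0x80
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def sketch_flags(
    read_num: int | None = None,
    strand: str = "+",
    secondary: bool = False,
    supplementary: bool = False,
) -> int:
    """Hypothetical helper: combines the described options into a SAM flag value."""
    if strand not in ("+", "-"):
        raise ValueError(f"Invalid value for strand: {strand}")
    if read_num not in (None, 1, 2):
        raise ValueError(f"Invalid value for read_num: {read_num}")

    flag = 0
    if read_num is not None:
        # A read number implies a paired read plus the matching first/second bit.
        flag |= PAIRED
        flag |= READ1 if read_num == 1 else READ2
    if strand == "-":
        flag |= REVERSE
    if secondary:
        flag |= SECONDARY
    if supplementary:
        flag |= SUPPLEMENTARY
    return flag
```

With these bits, an unpaired forward-strand read has flag 0, and a reverse-strand R2 has the paired, reverse, and second-in-pair bits set.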

default_rg staticmethod ¶

default_rg() -> dict[str, str]

Returns the default read group used by the SamBuilder, as a dictionary.

Source code in fgpyo/sam/builder.py

@staticmethod
def default_rg() -> dict[str, str]:
    """Returns the default read group used by the SamBuilder, as a dictionary."""
    return {"ID": "1", "SM": "1_AAAAAA", "LB": "default", "PL": "ILLUMINA", "PU": "xxx.1"}

default_sd staticmethod ¶

default_sd() -> list[dict[str, Any]]

Generates the sequence dictionary that is used by default by SamBuilder.

Matches the names and lengths of the HG19 reference in use in production.

Returns:

Type	Description
`list[dict[str, Any]]`	A new copy of the sequence dictionary as a list of dictionaries, one per chromosome.

Source code in fgpyo/sam/builder.py

@staticmethod
def default_sd() -> list[dict[str, Any]]:
    """
    Generates the sequence dictionary that is used by default by SamBuilder.

    Matches the names and lengths of the HG19 reference in use in production.

    Returns:
        A new copy of the sequence dictionary as a list of dictionaries, one per chromosome.
    """
    return [
        {"SN": "chr1", "LN": 249250621},
        {"SN": "chr2", "LN": 243199373},
        {"SN": "chr3", "LN": 198022430},
        {"SN": "chr4", "LN": 191154276},
        {"SN": "chr5", "LN": 180915260},
        {"SN": "chr6", "LN": 171115067},
        {"SN": "chr7", "LN": 159138663},
        {"SN": "chr8", "LN": 146364022},
        {"SN": "chr9", "LN": 141213431},
        {"SN": "chr10", "LN": 135534747},
        {"SN": "chr11", "LN": 135006516},
        {"SN": "chr12", "LN": 133851895},
        {"SN": "chr13", "LN": 115169878},
        {"SN": "chr14", "LN": 107349540},
        {"SN": "chr15", "LN": 102531392},
        {"SN": "chr16", "LN": 90354753},
        {"SN": "chr17", "LN": 81195210},
        {"SN": "chr18", "LN": 78077248},
        {"SN": "chr19", "LN": 59128983},
        {"SN": "chr20", "LN": 63025520},
        {"SN": "chr21", "LN": 48129895},
        {"SN": "chr22", "LN": 51304566},
        {"SN": "chrX", "LN": 155270560},
        {"SN": "chrY", "LN": 59373566},
        {"SN": "chrM", "LN": 16571},
    ]

rg ¶

rg() -> dict[str, Any]

Returns the single read group that is defined in the header.

Source code in fgpyo/sam/builder.py

def rg(self) -> dict[str, Any]:
    """Returns the single read group that is defined in the header."""
    # The `RG` field contains a list of read group mappings
    # e.g. `[{"ID": "rg1", "PL": "ILLUMINA"}]`
    rgs = cast(list[dict[str, Any]], self._header["RG"])
    assert len(rgs) == 1, "Header did not contain exactly one read group!"
    return rgs[0]

rg_id ¶

rg_id() -> str

Returns the ID of the single read group that is defined in the header.

Source code in fgpyo/sam/builder.py

def rg_id(self) -> str:
    """Returns the ID of the single read group that is defined in the header."""
    # The read group mapping has mixed types of values (e.g. "PI" is numeric), but the "ID"
    # field is always a string.
    return cast(str, self.rg()["ID"])

to_path ¶

to_path(path: Path | None = None, index: bool = True, pred: Callable[[AlignedSegment], bool] = lambda _r: True, tmp_file_type: SamFileType | None = None) -> Path

Writes the accumulated records to a file, sorts & indexes it, and returns the Path.

If a path is provided, it will be written to, otherwise a temporary file is created and returned.

If path is provided, tmp_file_type must not also be provided; the file type (SAM/BAM/CRAM) is then determined automatically from the path's file extension. See pysam for more details.

If path is not provided, the file type will default to BAM unless tmp_file_type is provided.

Parameters:

Name	Type	Description	Default
`path`	`Path \| None`	a path at which to write the file, otherwise a temp file is used.	`None`
`index`	`bool`	if True and `sort_order` is `Coordinate` and output is a BAM/CRAM file, then an index is generated, otherwise not.	`True`
`pred`	`Callable[[AlignedSegment], bool]`	optional predicate to specify which reads should be output	`lambda _r: True`
`tmp_file_type`	`SamFileType \| None`	the file type to output when a path is not provided (default is BAM)	`None`

Returns:

Name	Type	Description
`Path`	`Path`	The path to the sorted (and possibly indexed) file.

Source code in fgpyo/sam/builder.py

def to_path(  # noqa: C901
    self,
    path: Path | None = None,
    index: bool = True,
    pred: Callable[[AlignedSegment], bool] = lambda _r: True,
    tmp_file_type: sam.SamFileType | None = None,
) -> Path:
    """
    Writes the accumulated records to a file, sorts & indexes it, and returns the Path.

    If a path is provided, it will be written to, otherwise a temporary file is created
    and returned.

    If `path` is provided, `tmp_file_type` must not also be provided; the file type
    (SAM/BAM/CRAM) is then determined automatically from the path's file extension.
    See `pysam` for more details.

    If `path` is not provided, the file type will default to BAM unless `tmp_file_type` is
    provided.

    Args:
        path: a path at which to write the file, otherwise a temp file is used.
        index: if True and `sort_order` is `Coordinate` and output is a BAM/CRAM file, then
               an index is generated, otherwise not.
        pred: optional predicate to specify which reads should be output
        tmp_file_type: the file type to output when a path is not provided (default is BAM)

    Returns:
        Path: The path to the sorted (and possibly indexed) file.
    """
    if path is not None:
        # Infer the file type from the path's extension (in this case, a file type may not
        # be provided too)
        if tmp_file_type is not None:
            raise ValueError("Both `path` and `tmp_file_type` cannot be provided.")
        tmp_file_type = sam.SamFileType.from_path(path)
    elif tmp_file_type is None:
        # Neither a path nor a file type was given, so default to BAM
        tmp_file_type = sam.SamFileType.BAM

    # Get the extension, and create a path if none was given
    ext = tmp_file_type.extension
    if path is None:
        with NamedTemporaryFile(suffix=ext, delete=False) as fp:
            path = Path(fp.name)

    with NamedTemporaryFile(suffix=ext, delete=True) as fp:
        file_handle: IO
        if self._sort_order in {SamOrder.Unsorted, SamOrder.Unknown}:
            file_handle = path.open("w")
        else:
            file_handle = fp.file

        with sam.writer(file_handle, header=self._samheader, file_type=tmp_file_type) as writer:
            for rec in self._records:
                if pred(rec):
                    writer.write(rec)

        samtools_sort_args = ["-o", str(path), fp.name]

        file_handle.close()
        if self._sort_order == SamOrder.QueryName:
            pysam.sort("-n", *samtools_sort_args)
        elif self._sort_order == SamOrder.Coordinate:
            if index and tmp_file_type.indexable:
                samtools_sort_args.insert(0, "--write-index")
            pysam.sort(*samtools_sort_args)

    return path
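
The file-type decision at the top of the listing above can be sketched without pysam. `FileType` is a hypothetical stand-in for `SamFileType`, keyed only by file extension, and `resolve_file_type` is an illustrative helper rather than part of fgpyo.

```python
from __future__ import annotations

from enum import Enum
from pathlib import Path


class FileType(Enum):
    """Hypothetical stand-in for SamFileType, keyed by file extension."""

    SAM = ".sam"
    BAM = ".bam"
    CRAM = ".cram"

    @classmethod
    def from_path(cls, path: Path) -> "FileType":
        # Enum lookup by value; raises ValueError for an unknown extension.
        return cls(path.suffix)


def resolve_file_type(
    path: Path | None = None, tmp_file_type: FileType | None = None
) -> FileType:
    """Mirrors the branch order used by to_path to pick the output file type."""
    if path is not None:
        if tmp_file_type is not None:
            raise ValueError("Both `path` and `tmp_file_type` cannot be provided.")
        return FileType.from_path(path)
    # No path: honour an explicit type, otherwise default to BAM.
    return tmp_file_type if tmp_file_type is not None else FileType.BAM
```

Calling `resolve_file_type(path=Path("out.sam"))` selects SAM from the extension, while calling it with no arguments falls back to BAM.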

to_sorted_list ¶

to_sorted_list() -> list[AlignedSegment]

Returns the accumulated records in the builder's sort order (coordinate by default).

Source code in fgpyo/sam/builder.py

def to_sorted_list(self) -> list[pysam.AlignedSegment]:
    """Returns the accumulated records in the builder's sort order (coordinate by default)."""
    with NamedTemporaryFile(suffix=".bam", delete=True) as fp:
        filename = fp.name
        path = self.to_path(path=Path(filename), index=False)
        bam = sam.reader(path)
        return list(bam)

to_unsorted_list ¶

to_unsorted_list() -> list[AlignedSegment]

Returns the accumulated records in the order they were created.

Source code in fgpyo/sam/builder.py

def to_unsorted_list(self) -> list[pysam.AlignedSegment]:
    """Returns the accumulated records in the order they were created."""
    return list(self._records)

clipping ¶

Utility Functions for Soft-Clipping records in SAM/BAM Files.¶

This module contains utility functions for soft-clipping reads. There are four variants that support clipping the beginnings and ends of reads, and specifying the amount to be clipped in terms of query bases or reference bases:

- `softclip_start_of_alignment_by_query()` clips the start of the alignment in terms of query bases
- `softclip_end_of_alignment_by_query()` clips the end of the alignment in terms of query bases
- `softclip_start_of_alignment_by_ref()` clips the start of the alignment in terms of reference bases
- `softclip_end_of_alignment_by_ref()` clips the end of the alignment in terms of reference bases

The difference between query and reference based versions is apparent only when there are insertions or deletions in the read as indels have lengths on either the query (insertions) or reference (deletions) but not both.
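
To make the query/reference distinction concrete, the sketch below tallies how many query bases and how many reference bases a CIGAR string spans, using the operator semantics from the SAM specification. `cigar_lengths` is a hypothetical helper written for illustration and is not part of fgpyo.

```python
from __future__ import annotations

import re

# Operator semantics from the SAM specification.
QUERY_OPS = set("MIS=X")  # consume query (read) bases
REF_OPS = set("MDN=X")  # consume reference bases


def cigar_lengths(cigar: str) -> tuple[int, int]:
    """Returns (query_length, reference_length) for a CIGAR string."""
    query = ref = 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op in QUERY_OPS:
            query += int(length)
        if op in REF_OPS:
            ref += int(length)
    return query, ref


# No indels: both lengths agree.
plain = cigar_lengths("101M")

# One deletion (reference only) and one insertion (query only): the lengths differ.
with_indels = cigar_lengths("10M1D10M5I76M")
```

For `10M1D10M5I76M` the read spans 101 query bases but only 97 reference bases, which is why clipping a fixed number of query bases and clipping the same number of reference bases can give different results.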

Upon clipping, a set of additional SAM tags is removed from reads, as they are likely to be invalid.

For example, to clip the last 10 query bases of all records and reduce the qualities to Q2:

>>> from fgpyo.sam import reader, clipping
>>> with reader("./tests/fgpyo/sam/data/valid.sam") as fh:
...     for rec in fh:
...         before = rec.cigarstring
...         info = clipping.softclip_end_of_alignment_by_query(rec, 10, 2)
...         after = rec.cigarstring
...         print(f"before: {before} after: {after} info: {info}")
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 10M1D10M5I76M after: 10M1D10M5I66M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: None after: None info: ClippingInfo(query_bases_clipped=0, ref_bases_clipped=0)

Note that any clipping potentially invalidates the common SAM tags NM, MD and UQ, as well as other alignment-based SAM tags. Any clipping added to the start of an alignment changes the position (reference_start) of the record. Any reads that have no aligned bases after clipping are set to be unmapped. If the clipped reads are written back to a BAM, note that:

- Mate pairs may have incorrect information about their mates' positions
- Even if the input was coordinate sorted, the output may be out of order

To rectify these problems it is necessary to do the equivalent of:

cat clipped.bam | samtools sort -n | samtools fixmate | samtools sort | samtools calmd

Classes¶

ClippingInfo ¶

Bases: NamedTuple

Named tuple holding the number of bases clipped on the query and reference respectively.

Source code in fgpyo/sam/clipping.py

class ClippingInfo(NamedTuple):
    """Named tuple holding the number of bases clipped on the query and reference respectively."""

    query_bases_clipped: int
    """The number of query bases in the alignment that were clipped."""

    ref_bases_clipped: int
    """The number of reference bases in the alignment that were clipped."""

Attributes¶

query_bases_clipped instance-attribute ¶

query_bases_clipped: int

The number of query bases in the alignment that were clipped.

ref_bases_clipped instance-attribute ¶

ref_bases_clipped: int

The number of reference bases in the alignment that were clipped.

Functions¶

softclip_end_of_alignment_by_query ¶

softclip_end_of_alignment_by_query(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: int | None = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo

Adds soft-clipping to the end of a read's alignment.

Clipping is applied before any existing hard or soft clipping. E.g. a read with cigar 100M5S that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters:

Name	Type	Description	Default
`rec`	`AlignedSegment`	the BAM record to clip	required
`bases_to_clip`	`int`	the number of additional bases of clipping desired in the read/query	required
`clipped_base_quality`	`int \| None`	if not None, set bases in the clipped region to this quality	`None`
`tags_to_invalidate`	`Iterable[str]`	the set of extended attributes to remove upon clipping	`TAGS_TO_INVALIDATE`

Returns:

Name	Type	Description
`ClippingInfo`	`ClippingInfo`	a named tuple containing the number of query/read bases and the number of target/reference bases clipped.

Source code in fgpyo/sam/clipping.py

def softclip_end_of_alignment_by_query(
    rec: AlignedSegment,
    bases_to_clip: int,
    clipped_base_quality: int | None = None,
    tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE,
) -> ClippingInfo:
    """
    Adds soft-clipping to the end of a read's alignment.

    Clipping is applied before any existing hard or soft clipping.  E.g. a read with cigar 100M5S
    that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.

    If the read is unmapped or bases_to_clip < 1 then nothing is done.

    If the read has fewer clippable bases than requested the read will be unmapped.

    Args:
        rec: the BAM record to clip
        bases_to_clip: the number of additional bases of clipping desired in the read/query
        clipped_base_quality: if not None, set bases in the clipped region to this quality
        tags_to_invalidate: the set of extended attributes to remove upon clipping

    Returns:
        ClippingInfo: a named tuple containing the number of query/read bases and the number
            of target/reference bases clipped.
    """
    if rec.is_unmapped or bases_to_clip < 1:
        return ClippingInfo(0, 0)

    # type narrowing; rec.query_qualities is not None if the record is mapped
    assert rec.query_qualities is not None

    num_clippable_bases = rec.query_alignment_length

    if bases_to_clip >= num_clippable_bases:
        return _clip_whole_read(rec, tags_to_invalidate)

    # Reverse the cigar and qualities so we can clip from the start
    cigar = Cigar.from_cigartuples(rec.cigartuples).reversed()
    quals = rec.query_qualities
    quals.reverse()
    new_cigar, clipping_info = _clip(cigar, quals, bases_to_clip, clipped_base_quality)

    # Then reverse everything back again
    quals.reverse()
    rec.query_qualities = quals
    rec.cigarstring = str(new_cigar.reversed())

    _cleanup(rec, tags_to_invalidate)
    return clipping_info
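
The reverse, clip, reverse-back approach used above can be illustrated without pysam. The following is a minimal sketch under simplifying assumptions, not fgpyo's implementation: it only handles CIGARs made of M, S and H operators, it ignores base qualities and tags, and the helper names (`parse_cigar`, `format_cigar`, `clip_start`, `clip_end`) are hypothetical.

```python
import re


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Parse a CIGAR string such as '100M5S' into (length, operator) pairs."""
    return [(int(length), op) for length, op in re.findall(r"(\d+)([MSH])", cigar)]


def format_cigar(ops: list[tuple[int, str]]) -> str:
    """Turn (length, operator) pairs back into a CIGAR string."""
    return "".join(f"{length}{op}" for length, op in ops)


def clip_start(ops: list[tuple[int, str]], bases_to_clip: int) -> list[tuple[int, str]]:
    """Soft-clip `bases_to_clip` aligned bases after any leading hard/soft clipping."""
    i = 0
    hard: list[tuple[int, str]] = []
    while i < len(ops) and ops[i][1] == "H":
        hard.append(ops[i])
        i += 1
    soft = 0
    while i < len(ops) and ops[i][1] == "S":
        soft += ops[i][0]
        i += 1
    # Simplification: the clip must fall strictly inside the first M block.
    if i == len(ops) or ops[i][1] != "M" or bases_to_clip >= ops[i][0]:
        raise ValueError("this sketch only clips inside the first M block")
    match_len = ops[i][0]
    clipped = [(soft + bases_to_clip, "S"), (match_len - bases_to_clip, "M")]
    return hard + clipped + ops[i + 1 :]


def clip_end(ops: list[tuple[int, str]], bases_to_clip: int) -> list[tuple[int, str]]:
    """Clip the end by reversing, clipping the start, and reversing back."""
    return clip_start(ops[::-1], bases_to_clip)[::-1]


print(format_cigar(clip_end(parse_cigar("100M5S"), 10)))
print(format_cigar(clip_start(parse_cigar("5S100M"), 10)))
```

Reusing the start-clipping routine on a reversed CIGAR keeps a single code path for both ends, which is the design choice the listing above makes with `Cigar.reversed()` and `quals.reverse()`.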

softclip_end_of_alignment_by_ref ¶

softclip_end_of_alignment_by_ref(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: int | None = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo

Soft-clips the end of an alignment by bases_to_clip bases on the reference.

Clipping is applied before any existing hard or soft clipping. E.g. a read with cigar 100M5S that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters:

Name	Type	Description	Default
`rec`	`AlignedSegment`	the BAM record to clip	required
`bases_to_clip`	`int`	the number of additional bases of clipping desired on the reference	required
`clipped_base_quality`	`int \| None`	if not None, set bases in the clipped region to this quality	`None`
`tags_to_invalidate`	`Iterable[str]`	the set of extended attributes to remove upon clipping	`TAGS_TO_INVALIDATE`

Returns:

Name	Type	Description
`ClippingInfo`	`ClippingInfo`	a named tuple containing the number of query/read bases and the number of target/reference bases clipped.

Source code in fgpyo/sam/clipping.py

def softclip_end_of_alignment_by_ref(
    rec: AlignedSegment,
    bases_to_clip: int,
    clipped_base_quality: int | None = None,
    tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE,
) -> ClippingInfo:
    """
    Soft-clips the end of an alignment by bases_to_clip bases on the reference.

    Clipping is applied before any existing hard or soft clipping.  E.g. a read with cigar 100M5S
    that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.

    If the read is unmapped or bases_to_clip < 1 then nothing is done.

    If the read has fewer clippable bases than requested the read will be unmapped.

    Args:
        rec: the BAM record to clip
        bases_to_clip: the number of additional bases of clipping desired on the reference
        clipped_base_quality: if not None, set bases in the clipped region to this quality
        tags_to_invalidate: the set of extended attributes to remove upon clipping

    Returns:
        ClippingInfo: a named tuple containing the number of query/read bases and the number
            of target/reference bases clipped.
    """
    assert rec.reference_length is not None  # type narrowing
    if rec.reference_length <= bases_to_clip:
        return _clip_whole_read(rec, tags_to_invalidate)

    assert rec.reference_end is not None  # type narrowing
    new_end = rec.reference_end - bases_to_clip
    new_query_end = _read_pos_at_ref_pos(rec, new_end, previous=False)

    assert new_query_end is not None  # type narrowing
    query_bases_to_clip = rec.query_alignment_end - new_query_end

    return softclip_end_of_alignment_by_query(
        rec, query_bases_to_clip, clipped_base_quality, tags_to_invalidate
    )

softclip_start_of_alignment_by_query ¶

softclip_start_of_alignment_by_query(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: int | None = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo

Adds soft-clipping to the start of a read's alignment.

Clipping is applied after any existing hard or soft clipping. E.g. a read with cigar 5S100M that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters:

Name	Type	Description	Default
`rec`	`AlignedSegment`	the BAM record to clip	required
`bases_to_clip`	`int`	the number of additional bases of clipping desired in the read/query	required
`clipped_base_quality`	`int \| None`	if not None, set bases in the clipped region to this quality	`None`
`tags_to_invalidate`	`Iterable[str]`	the set of extended attributes to remove upon clipping	`TAGS_TO_INVALIDATE`

Returns:

Name	Type	Description
`ClippingInfo`	`ClippingInfo`	a named tuple containing the number of query/read bases and the number of target/reference bases clipped.

Source code in fgpyo/sam/clipping.py

def softclip_start_of_alignment_by_query(
    rec: AlignedSegment,
    bases_to_clip: int,
    clipped_base_quality: int | None = None,
    tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE,
) -> ClippingInfo:
    """
    Adds soft-clipping to the start of a read's alignment.

    Clipping is applied after any existing hard or soft clipping.  E.g. a read with cigar 5S100M
    that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.

    If the read is unmapped or bases_to_clip < 1 then nothing is done.

    If the read has fewer clippable bases than requested the read will be unmapped.

    Args:
        rec: the BAM record to clip
        bases_to_clip: the number of additional bases of clipping desired in the read/query
        clipped_base_quality: if not None, set bases in the clipped region to this quality
        tags_to_invalidate: the set of extended attributes to remove upon clipping

    Returns:
        ClippingInfo: a named tuple containing the number of query/read bases and the number
            of target/reference bases clipped.
    """
    if rec.is_unmapped or bases_to_clip < 1:
        return ClippingInfo(0, 0)

    # type narrowing; rec.query_qualities is not None if the record is mapped
    assert rec.query_qualities is not None

    num_clippable_bases = rec.query_alignment_length

    if bases_to_clip >= num_clippable_bases:
        return _clip_whole_read(rec, tags_to_invalidate)

    cigar = Cigar.from_cigartuples(rec.cigartuples)
    quals = rec.query_qualities
    new_cigar, clipping_info = _clip(cigar, quals, bases_to_clip, clipped_base_quality)
    rec.query_qualities = quals

    rec.reference_start += clipping_info.ref_bases_clipped
    rec.cigarstring = str(new_cigar)
    _cleanup(rec, tags_to_invalidate)
    return clipping_info

softclip_start_of_alignment_by_ref ¶

softclip_start_of_alignment_by_ref(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: int | None = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo

Soft-clips the start of an alignment by bases_to_clip bases on the reference.

Clipping is applied after any existing hard or soft clipping. E.g. a read with cigar 5S100M that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters:

Name	Type	Description	Default
`rec`	`AlignedSegment`	the BAM record to clip	required
`bases_to_clip`	`int`	the number of additional bases of clipping desired on the reference	required
`clipped_base_quality`	`int \| None`	if not None, set bases in the clipped region to this quality	`None`
`tags_to_invalidate`	`Iterable[str]`	the set of extended attributes to remove upon clipping	`TAGS_TO_INVALIDATE`

Returns:

Name	Type	Description
`ClippingInfo`	`ClippingInfo`	a named tuple containing the number of query/read bases and the number of target/reference bases clipped.

Source code in fgpyo/sam/clipping.py

def softclip_start_of_alignment_by_ref(
    rec: AlignedSegment,
    bases_to_clip: int,
    clipped_base_quality: int | None = None,
    tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE,
) -> ClippingInfo:
    """
    Soft-clips the start of an alignment by bases_to_clip bases on the reference.

    Clipping is applied after any existing hard or soft clipping.  E.g. a read with cigar 5S100M
    that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.

    If the read is unmapped or bases_to_clip < 1 then nothing is done.

    If the read has fewer clippable bases than requested the read will be unmapped.

    Args:
        rec: the BAM record to clip
        bases_to_clip: the number of additional bases of clipping desired on the reference
        clipped_base_quality: if not None, set bases in the clipped region to this quality
        tags_to_invalidate: the set of extended attributes to remove upon clipping

    Returns:
        ClippingInfo: a named tuple containing the number of query/read bases and the number
            of target/reference bases clipped.
    """
    assert rec.reference_length is not None  # type narrowing
    if rec.reference_length <= bases_to_clip:
        return _clip_whole_read(rec, tags_to_invalidate)

    new_start = rec.reference_start + bases_to_clip
    new_query_start = _read_pos_at_ref_pos(rec, new_start, previous=False)

    assert new_query_start is not None  # type narrowing
    query_bases_to_clip = new_query_start - rec.query_alignment_start

    return softclip_start_of_alignment_by_query(
        rec, query_bases_to_clip, clipped_base_quality, tags_to_invalidate
    )
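
The by-reference variants first translate a number of reference bases into a number of query bases and then delegate to the by-query functions. As a rough illustration of why the two counts can differ, the sketch below walks a CIGAR string from the start of the alignment. It is not the library's `_read_pos_at_ref_pos` helper: the function name `query_bases_spanning` is hypothetical, and its behaviour exactly at insertion or deletion boundaries may differ from fgpyo's.

```python
import re

# Operators that consume aligned query bases and reference bases (SAM specification).
ALIGNED_QUERY_OPS = set("MI=X")
REF_OPS = set("MDN=X")


def query_bases_spanning(cigar: str, ref_bases: int) -> int:
    """Count query bases consumed while walking `ref_bases` reference bases from the start."""
    ref_left = ref_bases
    query_used = 0
    for length_str, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if ref_left == 0:
            break
        if op in "SH":
            continue  # existing clipping is not part of the alignment
        length = int(length_str)
        if op in REF_OPS:
            step = min(length, ref_left)
            ref_left -= step
            if op in ALIGNED_QUERY_OPS:
                query_used += step
        elif op in ALIGNED_QUERY_OPS:
            query_used += length  # an insertion consumes query bases only
    return query_used


# 12 reference bases span 10M, the whole 5I, and 2 more matched bases.
print(query_bases_spanning("10M5I10M", 12))
# A deletion consumes reference bases without consuming query bases.
print(query_bases_spanning("10M1D10M", 11))
```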

sequence ¶

Utility Functions for Manipulating DNA and RNA sequences.¶

This module contains utility functions for manipulating DNA and RNA sequences.

The levenshtein and hamming functions are included for convenience. If you are performing many distance calculations, a C-based implementation is preferable, e.g. https://pypi.org/project/Distance/

Functions¶

complement ¶

complement(base: str) -> str

Returns the complement of any base.

Source code in fgpyo/sequence.py

def complement(base: str) -> str:
    """Returns the complement of any base."""
    if len(base) != 1:
        raise ValueError(f"complement() may only be called with 1-character strings: {base}")
    else:
        return _COMPLEMENTS[base]

gc_content ¶

gc_content(bases: str) -> float

Calculates the fraction of G and C bases in a sequence.

Source code in fgpyo/sequence.py

def gc_content(bases: str) -> float:
    """Calculates the fraction of G and C bases in a sequence."""
    if len(bases) == 0:
        return 0
    gc_count = sum(1 for base in bases if base in "CGcg")
    return gc_count / len(bases)

hamming ¶

hamming(string1: str, string2: str) -> int

Calculates hamming distance between two strings, case sensitive.

Strings must be of equal lengths.

Parameters:

Name	Type	Description	Default
`string1`	`str`	first string for comparison	required
`string2`	`str`	second string for comparison	required

Raises:

Type	Description
`ValueError`	If strings are of different lengths.

Source code in fgpyo/sequence.py

def hamming(string1: str, string2: str) -> int:
    """
    Calculates hamming distance between two strings, case sensitive.

    Strings must be of equal lengths.

    Args:
        string1: first string for comparison
        string2: second string for comparison

    Raises:
        ValueError: If strings are of different lengths.
    """
    if len(string1) != len(string2):
        raise ValueError(
            "Hamming distance requires two strings of equal lengths. "
            f"Received {string1} and {string2}."
        )
    return sum(c1 != c2 for c1, c2 in zip(string1, string2, strict=True))

levenshtein ¶

levenshtein(string1: str, string2: str) -> int

Calculates levenshtein distance between two strings, case sensitive.

Parameters:

Name	Type	Description	Default
`string1`	`str`	first string for comparison	required
`string2`	`str`	second string for comparison	required

Source code in fgpyo/sequence.py

def levenshtein(string1: str, string2: str) -> int:
    """
    Calculates levenshtein distance between two strings, case sensitive.

    Args:
        string1: first string for comparison
        string2: second string for comparison

    """
    n: int = len(string1)
    m: int = len(string2)
    if n == 0 or m == 0:
        return max(n, m)
    # Initialize n + 1 x m + 1 matrix with final row/column representing the empty string.
    # Fill in initial values for empty string sub-problem comparisons.
    #   A D C "
    # A - - - 3
    # B - - - 2
    # C - - - 1
    # " 3 2 1 0
    matrix: list[list[int]] = [[int()] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        matrix[n][j] = m - j
    for i in range(n + 1):
        matrix[i][m] = n - i
    # Fill in matrix from bottom up using previous sub-problem solutions.
    #   A D C "      A D C "      A D C "      A D C "      A D C "
    # A - - - 3    A - - - 3    A - - 2 3    A - 2 2 3    A 1 2 2 3
    # B - - - 2 -> B - - 1 2 -> B - 1 1 2 -> B 2 1 1 2 -> B 2 1 1 2
    # C - - 0 1    C - 1 0 1    C 2 1 0 1    C 2 1 0 1    C 2 1 0 1
    # " 3 2 1 0    " 3 2 1 0    " 3 2 1 0    " 3 2 1 0    " 3 2 1 0
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if string1[i] == string2[j]:
                matrix[i][j] = matrix[i + 1][j + 1]  # No Operation
            else:
                matrix[i][j] = 1 + min(
                    matrix[i + 1][j],  # Deletion
                    matrix[i][j + 1],  # Insertion
                    matrix[i + 1][j + 1],  # Substitution
                )
    return matrix[0][0]
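
For comparison, the same distance can be computed while keeping only two rows of the table in memory. The sketch below is an alternative, prefix-based formulation rather than the suffix-based full matrix in the listing above; the name `levenshtein_two_rows` is hypothetical and not part of fgpyo.

```python
def levenshtein_two_rows(string1: str, string2: str) -> int:
    """Levenshtein distance using two rows of the dynamic-programming table."""
    # Keep the shorter string along the row to minimise memory use.
    if len(string1) < len(string2):
        string1, string2 = string2, string1
    # previous[j] is the distance between an empty prefix and string2[:j].
    previous = list(range(len(string2) + 1))
    for i, char1 in enumerate(string1, start=1):
        current = [i]  # distance between string1[:i] and the empty prefix
        for j, char2 in enumerate(string2, start=1):
            cost = 0 if char1 == char2 else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


print(levenshtein_two_rows("ABC", "ADC"))
print(levenshtein_two_rows("kitten", "sitting"))
```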

longest_dinucleotide_run_length ¶

longest_dinucleotide_run_length(bases: str) -> int

Number of bases in the longest dinucleotide run in a primer.

A dinucleotide run is when two nucleotides are repeated in tandem. For example, TTGG (length = 4) or AACCAACCAA (length = 10). If there are no such runs, returns 0.

Parameters:

Name	Type	Description	Default
`bases`	`str`	the bases over which to compute	required

Return

the number of bases in the longest dinuc repeat (NOT the number of repeat units)

Source code in fgpyo/sequence.py

def longest_dinucleotide_run_length(bases: str) -> int:
    """
    Number of bases in the longest dinucleotide run in a primer.

    A dinucleotide run is when two nucleotides are repeated in tandem. For example,
    TTGG (length = 4) or AACCAACCAA (length = 10). If there are no such runs, returns 0.

    Args:
        bases: the bases over which to compute

    Return:
        the number of bases in the longest dinuc repeat (NOT the number of repeat units)
    """
    return longest_multinucleotide_run_length(bases=bases, repeat_unit_length=2)

longest_homopolymer_length ¶

longest_homopolymer_length(bases: str) -> int

Calculates the length of the longest homopolymer in the input sequence.

Parameters:

Name	Type	Description	Default
`bases`	`str`	the bases over which to compute	required

Return

the length of the longest homopolymer

Source code in fgpyo/sequence.py

def longest_homopolymer_length(bases: str) -> int:
    """
    Calculates the length of the longest homopolymer in the input sequence.

    Args:
        bases: the bases over which to compute

    Return:
        the length of the longest homopolymer
    """
    cur_length: int = 0
    i = 0
    # NB: if we have found a homopolymer of length `cur_length`, then we do not need
    # to examine the last `cur_length` bases since we'll never find a longer one.
    bases_len = len(bases)
    while i < bases_len - cur_length:
        base = bases[i].upper()
        j = i + 1
        while j < bases_len and bases[j].upper() == base:
            j += 1
        cur_length = max(cur_length, j - i)
        # skip over all the bases in the current homopolymer
        i = j
    return cur_length
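
As a cross-check of the behaviour described above (case-insensitive, 0 for an empty sequence), the same quantity can be sketched with `itertools.groupby`. This is an illustrative alternative, not the library's code, and the name `longest_homopolymer_groupby` is hypothetical. Unlike the listing above, it always scans the whole sequence.

```python
from itertools import groupby


def longest_homopolymer_groupby(bases: str) -> int:
    """Length of the longest run of a single repeated base, ignoring case."""
    # groupby yields one group per run of consecutive equal characters.
    run_lengths = (sum(1 for _ in run) for _, run in groupby(bases.upper()))
    return max(run_lengths, default=0)


print(longest_homopolymer_groupby("AAACCCCGT"))
```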

longest_hp_length ¶

longest_hp_length(bases: str) -> int

Calculates the length of the longest homopolymer in the input sequence.

Parameters:

Name	Type	Description	Default
`bases`	`str`	the bases over which to compute	required

Return

the length of the longest homopolymer

Source code in fgpyo/sequence.py

def longest_hp_length(bases: str) -> int:
    """
    Calculates the length of the longest homopolymer in the input sequence.

    Args:
        bases: the bases over which to compute

    Return:
        the length of the longest homopolymer
    """
    return longest_homopolymer_length(bases=bases)

longest_multinucleotide_run_length ¶

longest_multinucleotide_run_length(bases: str, repeat_unit_length: int) -> int

Number of bases in the longest multi-nucleotide run.

A multi-nucleotide run is when N nucleotides are repeated in tandem. For example, TTGG (length = 4, N=2) or TAGTAGTAG (length = 9, N = 3). If there are no such runs, returns 0.

Parameters:

Name	Type	Description	Default
`bases`	`str`	the bases over which to compute	required
`repeat_unit_length`	`int`	the length of the multi-nucleotide repetitive unit (must be > 0)	required

Returns:

Type	Description
`int`	the number of bases in the longest multinucleotide repeat (NOT the number of repeat units)

Source code in fgpyo/sequence.py

def longest_multinucleotide_run_length(bases: str, repeat_unit_length: int) -> int:
    """
    Number of bases in the longest multi-nucleotide run.

    A multi-nucleotide run is when N nucleotides are repeated in tandem. For example,
    TTGG (length = 4, N=2) or TAGTAGTAG (length = 9, N = 3). If there are no such runs,
    returns 0.

    Args:
        bases: the bases over which to compute
        repeat_unit_length: the length of the multi-nucleotide repetitive unit (must be > 0)

    Returns:
        the number of bases in the longest multinucleotide repeat (NOT the number of repeat units)
    """
    if repeat_unit_length <= 0:
        raise ValueError(f"repeat_unit_length must be > 0, found: {repeat_unit_length}")
    elif len(bases) < repeat_unit_length:
        return 0
    elif len(bases) == repeat_unit_length:
        return repeat_unit_length
    elif repeat_unit_length == 1:
        return longest_homopolymer_length(bases=bases)

    best_length: int = 0
    start = 0  # the start index of the current multi-nucleotide run
    # Note: using `< len(bases) - 1` instead of `< len(bases)` is intentional.
    # The algorithm processes overlapping windows and will capture repeats at the sequence end
    # through the sliding window approach, avoiding potential off-by-one errors.
    while start < len(bases) - 1:
        # get the dinuc bases
        dinuc = bases[start : start + repeat_unit_length].upper()
        # keep going while there are more di-nucs
        end = start + repeat_unit_length
        # The same boundary logic applies here - the sliding window captures all valid repeats
        while end < len(bases) - 1 and dinuc == bases[end : end + repeat_unit_length].upper():
            end += repeat_unit_length
        cur_length = end - start
        # update the longest total run length
        best_length = max(best_length, cur_length)
        # move to the next start
        if cur_length <= repeat_unit_length:  # only one repeat unit found, move the start by 1bp
            start += 1
        else:  # multiple repeats found, skip to the last base of the current run
            start += cur_length - 1

    return best_length

reverse_complement ¶

reverse_complement(bases: str) -> str

Reverse complements a base sequence.

Parameters:

Name	Type	Description	Default
`bases`	`str`	the bases to be reverse complemented.	required

Returns:

Type	Description
`str`	the reverse complement of the provided base string

Source code in fgpyo/sequence.py

def reverse_complement(bases: str) -> str:
    """
    Reverse complements a base sequence.

    Arguments:
        bases: the bases to be reverse complemented.

    Returns:
        the reverse complement of the provided base string
    """
    rev_comp = bases.translate(_COMPLEMENTS_TABLE)[::-1]
    if len(rev_comp) != len(bases):
        # There were invalid characters that weren't translated.
        # Raise KeyError with all the invalid bases.
        bad_bases = "".join({base for base in bases if base not in _COMPLEMENTS})
        raise KeyError(f"Invalid bases found: {bad_bases}")
    return rev_comp
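
The length check above only works if the translation table deletes characters it does not recognise. The sketch below shows one way such a table could be built. It is an assumption about the mechanism rather than fgpyo's actual `_COMPLEMENTS_TABLE`, it covers only A, C, G and T in either case, and all names ending in `_sketch` are hypothetical.

```python
from collections import defaultdict

_PAIRS = {"A": "T", "C": "G", "G": "C", "T": "A"}
# Assumed complement map: upper- and lower-case A/C/G/T only.
_complements_sketch = {**_PAIRS, **{k.lower(): v.lower() for k, v in _PAIRS.items()}}
# str.translate deletes a character when the table maps its code point to None,
# so unknown characters shorten the output instead of passing through unchanged.
_table_sketch = defaultdict(lambda: None, str.maketrans(_complements_sketch))


def reverse_complement_sketch(bases: str) -> str:
    """Reverse complement a sequence, raising KeyError on unrecognised bases."""
    rev_comp = bases.translate(_table_sketch)[::-1]
    if len(rev_comp) != len(bases):
        bad_bases = "".join(sorted({b for b in bases if b not in _complements_sketch}))
        raise KeyError(f"Invalid bases found: {bad_bases}")
    return rev_comp


print(reverse_complement_sketch("AACG"))
```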

util ¶

Modules¶

inspect ¶

Attributes¶

FieldType module-attribute ¶

FieldType: TypeAlias = Field | Attribute

TypeAlias for dataclass Fields or attrs Attributes. It will correspond to the correct type for the corresponding _DataclassesOrAttrClass.

Classes¶

ParserNotFoundException ¶

Bases: Exception

Raised when no parser can be found for a given type.

Source code in fgpyo/util/inspect.py

class ParserNotFoundException(Exception):  # noqa: N818
    """Raised when no parser can be found for a given type."""

Functions¶

attr_from ¶

attr_from(cls: type[_AttrFromType], kwargs: dict[str, str], parsers: dict[type, Callable[[str], Any]] | None = None) -> _AttrFromType

Builds an attr or dataclasses class from key-word arguments.

Parameters:

Name	Type	Description	Default
`cls`	`type[_AttrFromType]`	the attr or dataclasses class to be built	required
`kwargs`	`dict[str, str]`	a dictionary of keyword arguments	required
`parsers`	`dict[type, Callable[[str], Any]] \| None`	a dictionary of parser functions to apply to specific types	`None`

Source code in fgpyo/util/inspect.py

def attr_from(
    cls: type[_AttrFromType],
    kwargs: dict[str, str],
    parsers: dict[type, Callable[[str], Any]] | None = None,
) -> _AttrFromType:
    """
    Builds an attr or dataclasses class from key-word arguments.

    Args:
        cls: the attr or dataclasses class to be built
        kwargs: a dictionary of keyword arguments
        parsers: a dictionary of parser functions to apply to specific types

    """
    return_values: dict[str, Any] = {}
    for attribute in get_fields(cls):  # type: ignore[arg-type]
        return_value: Any
        if attribute.name in kwargs:
            str_value: str = kwargs[attribute.name]
            set_value: bool = False

            # Use the converter if provided
            converter = getattr(attribute, "converter", None)
            if converter is not None:
                return_value = converter(str_value)
                set_value = True

            # try getting a known parser
            if not set_value:
                try:
                    parser = _get_parser(cls=cls, type_=attribute.type, parsers=parsers)
                    return_value = parser(str_value)
                    set_value = True
                except ParserNotFoundException:
                    pass

            # try setting by casting
            # Note that while bools *can* be cast from string, all non-empty strings evaluate to
            # True, because python, so we need to check for that explicitly
            if not set_value and attribute.type is not None and attribute.type is not bool:
                try:
                    return_value = attribute.type(str_value)  # type: ignore[operator]
                    set_value = True
                except (ValueError, TypeError):
                    pass

            # fail otherwise
            assert set_value, (
                f"Do not know how to convert string to {attribute.type} for value: {str_value}"
            )
        else:  # no value, check for a default
            assert attribute.default is not None or _attribute_is_optional(attribute), (
                f"No value given and no default for attribute `{attribute.name}`"
            )
            return_value = attribute.default
            # when the default is attr.NOTHING, just use None
            if return_value in MISSING:
                return_value = None

        return_values[attribute.name] = return_value

    return cls(**return_values)
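
The core idea of building a typed object from string values can be sketched with the standard library alone. The example below is a simplified illustration, not `attr_from` itself: it only casts with the annotated type, it does not support converters, custom parsers, `bool` fields or `default_factory`, and the `Sample` class and `build_from_strings` function are hypothetical.

```python
import dataclasses
from typing import Any


@dataclasses.dataclass(frozen=True)
class Sample:
    """Hypothetical record used only for this example."""

    name: str
    depth: int
    purity: float = 1.0


def build_from_strings(cls: type, kwargs: dict[str, str]) -> Any:
    """Build a dataclass instance, casting each string with the field's annotated type."""
    values: dict[str, Any] = {}
    for field in dataclasses.fields(cls):
        if field.name in kwargs:
            # Simplification: call the annotated type directly (unsuitable for bool,
            # because every non-empty string is truthy).
            values[field.name] = field.type(kwargs[field.name])
        elif field.default is not dataclasses.MISSING:
            values[field.name] = field.default
        else:
            raise ValueError(f"No value given and no default for attribute `{field.name}`")
    return cls(**values)


sample = build_from_strings(Sample, {"name": "s1", "depth": "30"})
print(sample)
```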

dict_parser ¶

dict_parser(cls: type, type_: TypeAnnotation, parsers: dict[type, Callable[[str], Any]] | None = None) -> partial

Returns a function that parses a stringified dict into a Dict of the correct type.

Parameters:

Name	Type	Description	Default
`cls`	`type`	the type of the class object this is being parsed for (used to get default val for parsers)	required
`type_`	`TypeAnnotation`	the type of the attribute to be parsed	required
`parsers`	`dict[type, Callable[[str], Any]] \| None`	an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types)	`None`

Source code in fgpyo/util/inspect.py

def dict_parser(
    cls: type, type_: TypeAnnotation, parsers: dict[type, Callable[[str], Any]] | None = None
) -> partial:
    """
    Returns a function that parses a stringified dict into a `Dict` of the correct type.

    Args:
        cls: the type of the class object this is being parsed for (used to get default val for
            parsers)
        type_: the type of the attribute to be parsed
        parsers: an optional mapping from type to the function to use for parsing that type
            (allows for parsing of more complex types)
    """
    subtypes = typing.get_args(type_)
    assert len(subtypes) == 2, "Dict object must have exactly 2 subtypes per PEP specification!"
    (key_parser, val_parser) = (
        _get_parser(
            cls,
            subtypes[0],
            parsers,
        ),
        _get_parser(
            cls,
            subtypes[1],
            parsers,
        ),
    )

    def dict_parse(dict_string: str) -> dict[Any, Any]:
        """Parses a dictionary value (can do so recursively)."""
        assert dict_string[0] == "{", "Dict val improperly formatted"
        assert dict_string[-1] == "}", "Dict val improperly formatted"
        dict_string = dict_string[1:-1]
        if len(dict_string) == 0:
            return {}
        else:
            outer_splits = split_at_given_level(dict_string, split_delim=",")
            out_dict = {}
            for outer_split in outer_splits:
                inner_splits = split_at_given_level(outer_split, split_delim=";")
                assert len(inner_splits) % 2 == 0, (
                    "Inner splits of dict didn't have matched key val pairs"
                )
                for i in range(0, len(inner_splits), 2):
                    key = key_parser(inner_splits[i])
                    if key in out_dict:
                        raise ValueError("Duplicate key found in dict: {}".format(key))
                    out_dict[key] = val_parser(inner_splits[i + 1])
            return out_dict

    return functools.partial(dict_parse)
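
Judging from the listing above, the expected text form is a brace-wrapped string in which entries are separated by `,` and each key is separated from its value by `;`. The sketch below parses that form for flat (non-nested) dictionaries only. The function name `parse_flat_dict` is hypothetical, and it uses plain `str.split` instead of the level-aware `split_at_given_level`.

```python
from typing import Any, Callable


def parse_flat_dict(
    dict_string: str,
    key_parser: Callable[[str], Any] = str,
    val_parser: Callable[[str], Any] = str,
) -> dict[Any, Any]:
    """Parse a flat '{key;value,key;value}' string into a dict."""
    if not (dict_string.startswith("{") and dict_string.endswith("}")):
        raise ValueError("Dict value improperly formatted")
    body = dict_string[1:-1]
    out: dict[Any, Any] = {}
    if body == "":
        return out
    for pair in body.split(","):
        # Unpacking raises ValueError unless there is exactly one ';' per entry.
        key_string, val_string = pair.split(";")
        key = key_parser(key_string)
        if key in out:
            raise ValueError(f"Duplicate key found in dict: {key}")
        out[key] = val_parser(val_string)
    return out


print(parse_flat_dict("{a;1,b;2}", str, int))
```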

get_attr_fields ¶

get_attr_fields(_cls: type) -> tuple[Field, ...]

Get tuple of fields for attr class. attrs isn't imported so return empty tuple.

Source code in fgpyo/util/inspect.py

def get_attr_fields(_cls: type) -> tuple[dataclasses.Field, ...]:
    """Get tuple of fields for attr class. attrs isn't imported so return empty tuple."""
    return ()

get_attr_fields_dict ¶

get_attr_fields_dict(_cls: type) -> dict[str, Field]

Get dict of name->field for attr class. attrs isn't imported so return empty dict.

Source code in fgpyo/util/inspect.py

def get_attr_fields_dict(_cls: type) -> dict[str, dataclasses.Field]:
    """Get dict of name->field for attr class. attrs isn't imported so return empty dict."""
    return {}

get_fields ¶

get_fields(cls: _DataclassesOrAttrClass | type[_DataclassesOrAttrClass]) -> tuple[FieldType, ...]

Get the fields tuple from either a dataclasses or attr dataclass (or instance).

Source code in fgpyo/util/inspect.py

def get_fields(
    cls: _DataclassesOrAttrClass | type[_DataclassesOrAttrClass],
) -> tuple[FieldType, ...]:
    """Get the fields tuple from either a dataclasses or attr dataclass (or instance)."""
    if is_dataclasses_class(cls):
        return get_dataclasses_fields(cls)
    # Always pass a type to is_attr_class
    cls_type = cls if isinstance(cls, type) else type(cls)
    if is_attr_class(cls_type):
        return get_attr_fields(cls_type)  # type: ignore[no-any-return]
    else:
        raise TypeError("cls must be a dataclasses or attr class")

get_fields_dict ¶

get_fields_dict(cls: _DataclassesOrAttrClass | type[_DataclassesOrAttrClass]) -> Mapping[str, FieldType]

Get the fields dict from either a dataclasses or attr dataclass (or instance).

Source code in fgpyo/util/inspect.py

def get_fields_dict(
    cls: _DataclassesOrAttrClass | type[_DataclassesOrAttrClass],
) -> Mapping[str, FieldType]:
    """Get the fields dict from either a dataclasses or attr dataclass (or instance)."""
    if is_dataclasses_class(cls):
        return _get_dataclasses_fields_dict(cls)
    # Always pass a type to is_attr_class
    cls_type = cls if isinstance(cls, type) else type(cls)
    if is_attr_class(cls_type):
        # attr.fields_dict returns Any, so cast to Mapping[str, FieldType] for type checking
        return typing.cast(Mapping[str, FieldType], get_attr_fields_dict(cls_type))
    else:
        raise TypeError("cls must be a dataclasses or attr class")

is_attr_class ¶

is_attr_class(cls: type) -> TypeGuard[type[AttrsInstance]]

Return True if the class is an attr class, and False otherwise.

Source code in fgpyo/util/inspect.py

def is_attr_class(cls: type) -> TypeGuard[type[AttrsInstance]]:
    """Return True if the class is an attr class, and False otherwise."""
    return hasattr(cls, "__attrs_attrs__")

list_parser ¶

list_parser(cls: type, type_: TypeAnnotation, parsers: dict[type, Callable[[str], Any]] | None = None) -> partial

Returns a function that parses a "stringified" list into a List of the correct type.

Parameters:

Name	Type	Description	Default
`cls`	`type`	the type of the class object this is being parsed for (used to get default val for parsers)	required
`type_`	`TypeAnnotation`	the type of the attribute to be parsed	required
`parsers`	`dict[type, Callable[[str], Any]] \| None`	an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types)	`None`

Source code in fgpyo/util/inspect.py

def list_parser(
    cls: type, type_: TypeAnnotation, parsers: dict[type, Callable[[str], Any]] | None = None
) -> partial:
    """
    Returns a function that parses a "stringified" list into a `List` of the correct type.

    Args:
        cls: the type of the class object this is being parsed for (used to get default val for
            parsers)
        type_: the type of the attribute to be parsed
        parsers: an optional mapping from type to the function to use for parsing that type (allows
            for parsing of more complex types)
    """
    subtypes = typing.get_args(type_)
    assert len(subtypes) == 1, "Lists are allowed only one subtype per PEP specification!"
    subtype_parser = _get_parser(
        cls,
        subtypes[0],
        parsers,
    )
    return functools.partial(
        lambda s: list(
            []
            if s == ""
            else [subtype_parser(item) for item in list(split_at_given_level(s, split_delim=","))]
        )
    )
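
The behavior of the returned parser can be sketched for the simple case of a flat list. This is a standalone approximation, not the library's code: `make_flat_list_parser` is a hypothetical helper, and it uses a plain `str.split` instead of `split_at_given_level`, so it does not handle nested values.

```python
from typing import Any, Callable


def make_flat_list_parser(item_parser: Callable[[str], Any]) -> Callable[[str], list[Any]]:
    """Return a parser for flat, comma-delimited lists (no nested brackets)."""

    def parse(text: str) -> list[Any]:
        # An empty string denotes an empty list, as in the code above.
        if text == "":
            return []
        return [item_parser(item) for item in text.split(",")]

    return parse


parse_ints = make_flat_list_parser(int)
parsed = parse_ints("1,2,3")
```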

set_parser ¶

set_parser(cls: type, type_: TypeAnnotation, parsers: dict[type, Callable[[str], Any]] | None = None) -> partial

Returns a function that parses a stringified set into a Set of the correct type.

Parameters:

Name	Type	Description	Default
`cls`	`type`	the type of the class object this is being parsed for (used to get default val for parsers)	required
`type_`	`TypeAnnotation`	the type of the attribute to be parsed	required
`parsers`	`dict[type, Callable[[str], Any]] \| None`	an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types)	`None`

Source code in fgpyo/util/inspect.py

def set_parser(
    cls: type, type_: TypeAnnotation, parsers: dict[type, Callable[[str], Any]] | None = None
) -> partial:
    """
    Returns a function that parses a stringified set into a `Set` of the correct type.

    Args:
        cls: the type of the class object this is being parsed for (used to get default val for
            parsers)
        type_: the type of the attribute to be parsed
        parsers: an optional mapping from type to the function to use for parsing that type (allows
            for parsing of more complex types)
    """
    subtypes = typing.get_args(type_)
    assert len(subtypes) == 1, "Sets are allowed only one subtype per PEP specification!"
    subtype_parser = _get_parser(
        cls,
        subtypes[0],
        parsers,
    )
    return functools.partial(
        lambda s: set(
            set({})
            if s == "{}"
            else [
                subtype_parser(item) for item in set(split_at_given_level(s[1:-1], split_delim=","))
            ]
        )
    )

split_at_given_level ¶

split_at_given_level(field: str, split_delim: str = ',', increase_depth_chars: Iterable[str] = ('{', '(', '['), decrease_depth_chars: Iterable[str] = ('}', ')', ']')) -> list[str]

Splits a nested field by its outer-most level.

Note that this method may produce incorrect results for fields containing strings that contain unpaired characters that increase or decrease the depth.

Fields enclosed in quotes ('' or "") are not currently handled (TODO).

Source code in fgpyo/util/inspect.py

def split_at_given_level(
    field: str,
    split_delim: str = ",",
    increase_depth_chars: Iterable[str] = ("{", "(", "["),
    decrease_depth_chars: Iterable[str] = ("}", ")", "]"),
) -> list[str]:
    """
    Splits a nested field by its outer-most level.

    Note that this method may produce incorrect results for fields containing strings that
    contain unpaired characters that increase or decrease the depth.

    Fields enclosed in quotes ('' or "") are not currently handled (TODO).
    """
    outer_depth_of_split = 0
    current_outer_splits = []
    out_vals: list[str] = []
    for high_level_split in field.split(split_delim):
        increase_in_depth = 0
        for char in increase_depth_chars:
            increase_in_depth += high_level_split.count(char)

        decrease_in_depth = 0
        for char in decrease_depth_chars:
            decrease_in_depth += high_level_split.count(char)
        outer_depth_of_split += increase_in_depth - decrease_in_depth

        assert outer_depth_of_split >= 0, "Unpaired depth character! Likely incorrect output"

        current_outer_splits.append(high_level_split)
        if outer_depth_of_split == 0:
            out_vals.append(split_delim.join(current_outer_splits))
            current_outer_splits = []
    assert outer_depth_of_split == 0, "Unpaired depth character! Likely incorrect output!"
    return out_vals
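
The same idea can be expressed character by character, which may make the depth tracking easier to follow. The sketch below is an alternative formulation and not the implementation above: `split_top_level` is a hypothetical name, it assumes a single-character delimiter, and it raises `ValueError` where the code above uses assertions.

```python
def split_top_level(field: str, delim: str = ",") -> list[str]:
    """Split `field` on `delim`, ignoring delimiters nested inside brackets."""
    openers, closers = "{([", "})]"
    depth = 0
    current: list[str] = []
    parts: list[str] = []
    for char in field:
        if char in openers:
            depth += 1
        elif char in closers:
            depth -= 1
            if depth < 0:
                raise ValueError(f"Unpaired closing character in {field!r}")
        if char == delim and depth == 0:
            # A delimiter at the outer-most level ends the current part.
            parts.append("".join(current))
            current = []
        else:
            current.append(char)
    if depth != 0:
        raise ValueError(f"Unpaired opening character in {field!r}")
    parts.append("".join(current))
    return parts


parts = split_top_level("1,(2,3),{4,5}")
```

Like the original, this sketch does not treat quoted strings specially, so a bracket inside quotes still changes the depth.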

tuple_parser ¶

tuple_parser(cls: type, type_: TypeAnnotation, parsers: dict[type, Callable[[str], Any]] | None = None) -> partial

Returns a function that parses a stringified tuple into a Tuple of the correct type.

Parameters:

Name	Type	Description	Default
`cls`	`type`	the type of the class object this is being parsed for (used to get default val for parsers)	required
`type_`	`TypeAnnotation`	the type of the attribute to be parsed	required
`parsers`	`dict[type, Callable[[str], Any]] \| None`	an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types)	`None`

Source code in fgpyo/util/inspect.py

def tuple_parser(
    cls: type, type_: TypeAnnotation, parsers: dict[type, Callable[[str], Any]] | None = None
) -> partial:
    """
    Returns a function that parses a stringified tuple into a `Tuple` of the correct type.

    Args:
        cls: the type of the class object this is being parsed for (used to get default val for
            parsers)
        type_: the type of the attribute to be parsed
        parsers: an optional mapping from type to the function to use for parsing that type (allows
            for parsing of more complex types)
    """
    subtype_parsers = [
        _get_parser(
            cls,
            subtype,
            parsers,
        )
        for subtype in typing.get_args(type_)
    ]

    def tuple_parse(tuple_string: str) -> tuple[Any, ...]:
        """
        Parses a tuple value (can do so recursively).

        Note that this tool will fail on tuples containing strings containing
        unpaired '{', '(', '[', '}', ')', or ']' characters.
        """
        assert tuple_string[0] == "(", "Tuple val improperly formatted"
        assert tuple_string[-1] == ")", "Tuple val improperly formatted"
        tuple_string = tuple_string[1:-1]
        if len(tuple_string) == 0:
            return ()
        else:
            val_strings = split_at_given_level(tuple_string, split_delim=",")
            return tuple(
                parser(val_str)
                for parser, val_str in zip(subtype_parsers, val_strings, strict=True)
            )

    return functools.partial(tuple_parse)
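
The positional pairing of parsers and values can be sketched for a flat tuple. This is a simplified stand-in, not the library's code: `parse_flat_tuple` is a hypothetical helper, it splits with `str.split` rather than `split_at_given_level`, and an explicit length check stands in for `zip(..., strict=True)`.

```python
from typing import Any, Callable, Sequence


def parse_flat_tuple(text: str, parsers: Sequence[Callable[[str], Any]]) -> tuple[Any, ...]:
    """Parse "(a,b,c)" by applying the i-th parser to the i-th element."""
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError(f"Tuple value improperly formatted: {text!r}")
    inner = text[1:-1]
    if inner == "":
        return ()
    # Flat tuples only; nested values would need depth-aware splitting.
    values = inner.split(",")
    if len(values) != len(parsers):
        raise ValueError(f"Expected {len(parsers)} values, got {len(values)}")
    return tuple(parser(value) for parser, value in zip(parsers, values))


parsed = parse_flat_tuple("(1,chr1,2.5)", [int, str, float])
```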

Modules¶

logging ¶

Methods for setting up logging for tools.¶

Progress Logging Examples¶

Input data (SAM/BAM/CRAM/VCF) are frequently iterated in genomic coordinate order. When logging progress, it is useful to report not only how many inputs have been consumed but also the genomic coordinate of the most recent one. ProgressLogger() logs progress every fixed number of records, and messages can be written either to a logging.Logger or to a custom print function.

>>> from fgpyo.util.logging import ProgressLogger
>>> logged_lines = []
>>> progress = ProgressLogger(
...     printer=lambda s: logged_lines.append(s),
...     verb="recorded",
...     noun="items",
...     unit=2
... )
>>> progress.record(reference_name="chr1", position=1)  # does not log
False
>>> progress.record(reference_name="chr1", position=2)  # logs
True
>>> progress.record(reference_name="chr1", position=3)  # does not log
False
>>> progress.log_last()  # will log the last recorded item, if not previously logged
True
>>> logged_lines  # show the lines logged
['recorded 2 items: chr1:2', 'recorded 3 items: chr1:3']

Classes¶

ProgressLogger ¶

Bases: AbstractContextManager

A little class to track progress.

This will output a log message each time unit items have been recorded.

Attributes:

Name	Type	Description
`printer`	`Callable[[str], Any]`	either a Logger (in which case progress will be printed at Info) or a lambda that consumes a single string
`noun`	`str`	the noun to use in the log message
`verb`	`str`	the verb to use in the log message
`unit`	`int`	the number of items for every log message
`count`	`int`	the total count of items recorded

Source code in fgpyo/util/logging.py

class ProgressLogger(AbstractContextManager):
    """
    A little class to track progress.

    This will output a log message each time `unit` items have been recorded.

    Attributes:
        printer: either a Logger (in which case progress will be printed at Info) or a lambda
            that consumes a single string
        noun: the noun to use in the log message
        verb: the verb to use in the log message
        unit: the number of items for every log message
        count: the total count of items recorded
    """

    def __init__(
        self,
        printer: Logger | Callable[[str], Any],
        noun: str = "records",
        verb: str = "Read",
        unit: int = 100000,
    ) -> None:
        """Initializes the progress logger with the given printer and settings."""
        self.printer: Callable[[str], Any]
        if isinstance(printer, Logger):
            self.printer = lambda s: printer.info(s)
        else:
            self.printer = printer
        self.noun: str = noun
        self.verb: str = verb
        self.unit: int = unit
        self.count: int = 0
        self._count_mod_unit: int = 0
        self._last_reference_name: str | None = None
        self._last_position: int | None = None

    def __exit__(
        self, ex_type: Any | None, ex_value: Any | None, traceback: Any | None
    ) -> Literal[False]:
        """Logs the final count on exit if no exception occurred."""
        if ex_value is None:
            self.log_last()
        return False

    def record(
        self,
        reference_name: str | None = None,
        position: int | None = None,
    ) -> bool:
        """
        Record an item at a given genomic coordinate.

        Args:
            reference_name: the reference name of the item
            position: the 1-based start position of the item

        Returns:
            true if a message was logged, false otherwise
        """
        self.count += 1
        self._count_mod_unit += 1
        self._last_reference_name = reference_name
        self._last_position = None if position is None or position <= 0 else position
        if self._count_mod_unit == self.unit:
            self._count_mod_unit = 0
            self._log(refname=self._last_reference_name, position=self._last_position)
            return True
        else:
            return False

    def record_alignment(
        self,
        rec: AlignedSegment,
    ) -> bool:
        """
        Correctly record pysam.AlignedSegments (zero-based coordinates).

        Args:
            rec: pysam.AlignedSegment object

        Returns:
            true if a message was logged, false otherwise
        """
        if rec.is_unmapped:
            return self.record(None, None)
        return self.record(rec.reference_name, rec.reference_start + 1)

    def record_alignments(
        self,
        recs: Iterable[AlignedSegment],
    ) -> bool:
        """
        Correctly record multiple pysam.AlignedSegments (zero-based coordinates).

        Args:
            recs: pysam.AlignedSegment objects

        Returns:
            true if a message was logged, false otherwise
        """
        logged_message: bool = False
        for rec in recs:
            logged_message = self.record_alignment(rec) or logged_message
        return logged_message

    def _log(
        self,
        refname: str | None = None,
        position: int | None = None,
    ) -> None:
        """
        Helper method to print the log message.

        Args:
            refname: the name of the reference of the item
            position: the 1-based start position of the item

        Returns:
            None
        """
        coordinate: str
        if refname is None and position is None:
            coordinate = "NA"
        else:
            assert refname is not None and position is not None, f"{refname} {position}"
            coordinate = f"{refname}:{position:,d}"

        self.printer(f"{self.verb} {self.count:,d} {self.noun}: {coordinate}")

        return None

    def log_last(
        self,
    ) -> bool:
        """Force logging the last record, for example when progress has completed."""
        if self._count_mod_unit != 0:
            self._log(refname=self._last_reference_name, position=self._last_position)
            return True
        else:
            return False

Functions¶

__exit__ ¶

__exit__(ex_type: Any | None, ex_value: Any | None, traceback: Any | None) -> Literal[False]

Logs the final count on exit if no exception occurred.

Source code in fgpyo/util/logging.py

def __exit__(
    self, ex_type: Any | None, ex_value: Any | None, traceback: Any | None
) -> Literal[False]:
    """Logs the final count on exit if no exception occurred."""
    if ex_value is None:
        self.log_last()
    return False
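
The "log the remainder on a clean exit" pattern can be sketched with a small stand-in class. `CountingLogger` below is hypothetical and much simpler than `ProgressLogger` (no coordinates, no printer); it only illustrates that the final message is emitted when the `with` block ends without an exception, and that exceptions are never suppressed.

```python
from contextlib import AbstractContextManager
from typing import Any, Literal, Optional


class CountingLogger(AbstractContextManager):
    """Hypothetical stand-in that flushes a final message on clean exit."""

    def __init__(self, unit: int) -> None:
        self.unit = unit
        self.count = 0
        self.lines: list[str] = []

    def record(self) -> bool:
        """Count one item; log when a multiple of `unit` is reached."""
        self.count += 1
        if self.count % self.unit == 0:
            self.lines.append(f"Read {self.count} records")
            return True
        return False

    def __exit__(
        self, ex_type: Optional[Any], ex_value: Optional[Any], traceback: Optional[Any]
    ) -> Literal[False]:
        # Flush the remainder only when the block finished without an exception.
        if ex_value is None and self.count % self.unit != 0:
            self.lines.append(f"Read {self.count} records")
        return False  # never suppress exceptions


with CountingLogger(unit=2) as progress:
    for _ in range(3):
        progress.record()
lines = progress.lines
```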

__init__ ¶

__init__(printer: Logger | Callable[[str], Any], noun: str = 'records', verb: str = 'Read', unit: int = 100000) -> None

Initializes the progress logger with the given printer and settings.

Source code in fgpyo/util/logging.py

def __init__(
    self,
    printer: Logger | Callable[[str], Any],
    noun: str = "records",
    verb: str = "Read",
    unit: int = 100000,
) -> None:
    """Initializes the progress logger with the given printer and settings."""
    self.printer: Callable[[str], Any]
    if isinstance(printer, Logger):
        self.printer = lambda s: printer.info(s)
    else:
        self.printer = printer
    self.noun: str = noun
    self.verb: str = verb
    self.unit: int = unit
    self.count: int = 0
    self._count_mod_unit: int = 0
    self._last_reference_name: str | None = None
    self._last_position: int | None = None

log_last ¶

log_last() -> bool

Force logging the last record, for example when progress has completed.

Source code in fgpyo/util/logging.py

def log_last(
    self,
) -> bool:
    """Force logging the last record, for example when progress has completed."""
    if self._count_mod_unit != 0:
        self._log(refname=self._last_reference_name, position=self._last_position)
        return True
    else:
        return False

record ¶

record(reference_name: str | None = None, position: int | None = None) -> bool

Record an item at a given genomic coordinate.

Parameters:

Name	Type	Description	Default
`reference_name`	`str \| None`	the reference name of the item	`None`
`position`	`int \| None`	the 1-based start position of the item	`None`

Returns:

Type	Description
`bool`	true if a message was logged, false otherwise

Source code in fgpyo/util/logging.py

def record(
    self,
    reference_name: str | None = None,
    position: int | None = None,
) -> bool:
    """
    Record an item at a given genomic coordinate.

    Args:
        reference_name: the reference name of the item
        position: the 1-based start position of the item

    Returns:
        true if a message was logged, false otherwise
    """
    self.count += 1
    self._count_mod_unit += 1
    self._last_reference_name = reference_name
    self._last_position = None if position is None or position <= 0 else position
    if self._count_mod_unit == self.unit:
        self._count_mod_unit = 0
        self._log(refname=self._last_reference_name, position=self._last_position)
        return True
    else:
        return False

record_alignment ¶

record_alignment(rec: AlignedSegment) -> bool

Correctly record pysam.AlignedSegments (zero-based coordinates).

Parameters:

Name	Type	Description	Default
`rec`	`AlignedSegment`	pysam.AlignedSegment object	required

Returns:

Type	Description
`bool`	true if a message was logged, false otherwise

Source code in fgpyo/util/logging.py

def record_alignment(
    self,
    rec: AlignedSegment,
) -> bool:
    """
    Correctly record pysam.AlignedSegments (zero-based coordinates).

    Args:
        rec: pysam.AlignedSegment object

    Returns:
        true if a message was logged, false otherwise
    """
    if rec.is_unmapped:
        return self.record(None, None)
    return self.record(rec.reference_name, rec.reference_start + 1)
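
The coordinate conversion can be illustrated without pysam. In the sketch below, `FakeSegment` is a hypothetical stand-in exposing only the three attributes used above, and `one_based_coordinate` is an illustrative helper that formats the coordinate the way the progress message does.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeSegment:
    """Hypothetical stand-in for the attributes read from an alignment."""

    is_unmapped: bool
    reference_name: Optional[str]
    reference_start: int  # 0-based, as reported by pysam


def one_based_coordinate(rec: FakeSegment) -> str:
    """Format a 1-based coordinate, or "NA" for unmapped records."""
    if rec.is_unmapped:
        return "NA"
    # Shift the 0-based start to a 1-based position before formatting.
    return f"{rec.reference_name}:{rec.reference_start + 1:,d}"


mapped = one_based_coordinate(FakeSegment(False, "chr1", 999))
unmapped = one_based_coordinate(FakeSegment(True, None, -1))
```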

record_alignments ¶

record_alignments(recs: Iterable[AlignedSegment]) -> bool

Correctly record multiple pysam.AlignedSegments (zero-based coordinates).

Parameters:

Name	Type	Description	Default
`recs`	`Iterable[AlignedSegment]`	pysam.AlignedSegment objects	required

Returns:

Type	Description
`bool`	true if a message was logged, false otherwise

Source code in fgpyo/util/logging.py

def record_alignments(
    self,
    recs: Iterable[AlignedSegment],
) -> bool:
    """
    Correctly record multiple pysam.AlignedSegments (zero-based coordinates).

    Args:
        recs: pysam.AlignedSegment objects

    Returns:
        true if a message was logged, false otherwise
    """
    logged_message: bool = False
    for rec in recs:
        logged_message = self.record_alignment(rec) or logged_message
    return logged_message

Functions¶

setup_logging ¶

setup_logging(level: str = 'INFO', name: str = 'fgpyo') -> None

Globally configure logging for all modules.

Configures logging to run at a specific level and output messages to stderr with useful information preceding the actual log message.

Parameters:

Name	Type	Description	Default
`level`	`str`	the default level for the logger	`'INFO'`
`name`	`str`	the name of the logger	`'fgpyo'`

Source code in fgpyo/util/logging.py

def setup_logging(level: str = "INFO", name: str = "fgpyo") -> None:
    """
    Globally configure logging for all modules.

    Configures logging to run at a specific level and output messages to stderr with
    useful information preceding the actual log message.

    Args:
        level: the default level for the logger
        name: the name of the logger
    """
    global __FGPYO_LOGGING_SETUP

    with __LOCK:
        if not __FGPYO_LOGGING_SETUP:
            log_format = (
                f"%(asctime)s {socket.gethostname()} %(name)s:%(funcName)s:%(lineno)s "
                + "[%(levelname)s]: %(message)s"
            )
            handler = logging.StreamHandler()
            handler.setLevel(level)
            handler.setFormatter(logging.Formatter(log_format))

            logger = logging.getLogger(name)
            logger.setLevel(level)
            logger.addHandler(handler)
        else:
            logging.getLogger(__name__).warning("Logging already initialized.")

        __FGPYO_LOGGING_SETUP = True
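
The handler and format wiring can be tried in isolation with the standard library. The sketch below is not `setup_logging()` itself: `configure_logger` is a hypothetical helper, it omits the lock and the run-once flag, and it accepts a `stream` argument (an assumption added here) so the output can be captured instead of going to stderr.

```python
import io
import logging
import socket
from typing import Optional, TextIO


def configure_logger(
    name: str, level: str = "INFO", stream: Optional[TextIO] = None
) -> logging.Logger:
    """Attach one formatted stream handler to the named logger (stderr by default)."""
    log_format = (
        f"%(asctime)s {socket.gethostname()} %(name)s:%(funcName)s:%(lineno)s "
        "[%(levelname)s]: %(message)s"
    )
    handler = logging.StreamHandler(stream)
    handler.setLevel(level)
    handler.setFormatter(logging.Formatter(log_format))

    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
demo_logger = configure_logger("demo_tool", level="INFO", stream=buffer)
demo_logger.debug("hidden")   # below INFO, so dropped
demo_logger.info("started")   # written to the buffer
output = buffer.getvalue()
```

Calling this helper twice for the same name would attach two handlers and duplicate every message, which is the situation the run-once flag in the code above guards against.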

metric ¶

Metrics.¶

Module for storing, reading, and writing metric-like tab-delimited information.

Metric files are tab-delimited, contain a header, and zero or more rows for metric values. This makes it easy for them to be read in languages like R. For example, a row per person, with columns for age, gender, and address.

The Metric() class makes it easy to read, write, and store one or more metrics of the same type, all the while preserving types for each value in a metric. It is an abstract base class whose subclasses are decorated with @dataclass or @attr.s, with attributes storing one or more typed values. If using multiple layers of inheritance, keep in mind that it's not possible to mix these dataclass utils, e.g. a dataclasses class derived from an attr class will not appropriately initialize the values of the attr superclass.

Examples¶

Defining a new metric class:

>>> from fgpyo.util.metric import Metric
>>> import dataclasses
>>> @dataclasses.dataclass(frozen=True)
... class Person(Metric["Person"]):
...     name: str
...     age: int

or using attr:

>>> from fgpyo.util.metric import Metric
>>> import attr
>>> @attr.s(auto_attribs=True, frozen=True)
... class PersonAttr(Metric["PersonAttr"]):
...     name: str
...     age: int
...     address: str | None = None

Getting the attributes for a metric class. These will be used for the header when reading and writing metric files.

>>> Person.header()
['name', 'age']

Getting the values from a metric class instance. The values are in the same order as the header.

>>> list(Person(name="Alice", age=47).values())
['Alice', 47]

Writing a list of metrics to a file:

>>> metrics = [
...     Person(name="Alice", age=47),
...     Person(name="Bob", age=24)
... ]
>>> from pathlib import Path
>>> Person.write(Path("/path/to/metrics.txt"), *metrics)

Then the contents of the written metrics file:

$ column -t /path/to/metrics.txt
name   age
Alice  47
Bob    24

Reading the metrics file back in:

>>> list(Person.read(Path("/path/to/metrics.txt")))  
[Person(name='Alice', age=47), Person(name='Bob', age=24)]
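
Because the file is plain tab-delimited text with a single header line, it can also be inspected without fgpyo. The sketch below reads an in-memory copy of the contents shown above with the standard `csv` module; the variable names are illustrative. Values come back as strings, so the conversion to `int` is explicit here, whereas the `Person.read()` call above returns typed values.

```python
import csv
import io

# In-memory copy of the metrics file written above (tab-delimited, one header line).
text = "name\tage\nAlice\t47\nBob\t24\n"

rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
header = list(rows[0].keys())
people = [(row["name"], int(row["age"])) for row in rows]
```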

Formatting and parsing the values for custom types is supported by overriding the _parsers() and format_value() methods.

>>> @dataclasses.dataclass(frozen=True)
... class Name:
...     first: str
...     last: str
...     @classmethod
...     def parse(cls, value: str) -> "Name":
...          fields = value.split(" ")
...          return Name(first=fields[0], last=fields[1])
>>> from typing import Dict, Callable, Any
>>> @dataclasses.dataclass(frozen=True)
... class PersonWithName(Metric["PersonWithName"]):
...     name: Name
...     age: int
...     @classmethod
...     def _parsers(cls) -> Dict[type, Callable[[str], Any]]:
...         return {Name: lambda value: Name.parse(value=value)}
...     @classmethod
...     def format_value(cls, value: Any) -> str:
...         if isinstance(value, Name):
...             return f"{value.first} {value.last}"
...         else:
...             return super().format_value(value=value)
>>> PersonWithName.parse(fields=["john doe", "42"])
PersonWithName(name=Name(first='john', last='doe'), age=42)
>>> PersonWithName(name=Name(first='john', last='doe'), age=42).formatted_values()
['john doe', '42']

Classes¶

Metric ¶

Bases: ABC, Generic[MetricType]

Abstract base class for all metric-like tab-delimited files.

Metric files are tab-delimited, contain a header, and zero or more rows for metric values. This makes it easy for them to be read in languages like R.

Subclasses of Metric() can support parsing and formatting custom types with _parsers() and format_value().

Source code in fgpyo/util/metric.py

class Metric(ABC, Generic[MetricType]):
    """
    Abstract base class for all metric-like tab-delimited files.

    Metric files are tab-delimited, contain a header, and zero or more rows for metric values.  This
    makes it easy for them to be read in languages like `R`.

    Subclasses of [`Metric()`][fgpyo.util.metric.Metric] can support parsing and
    formatting custom types with `_parsers()` and
    [`format_value()`][fgpyo.util.metric.Metric.format_value].
    """

    @classmethod
    def keys(cls) -> Iterator[str]:
        """An iterator over field names in the same order as the header."""
        for field in inspect.get_fields(cls):  # type: ignore[arg-type]
            yield field.name

    def values(self) -> Iterator[Any]:
        """An iterator over attribute values in the same order as the header."""
        for field in inspect.get_fields(self.__class__):  # type: ignore[arg-type]
            yield getattr(self, field.name)

    def items(self) -> Iterator[tuple[str, Any]]:
        """An iterator over field names and values in the same order as the header."""
        for field in inspect.get_fields(self.__class__):  # type: ignore[arg-type]
            yield (field.name, getattr(self, field.name))

    def formatted_values(self) -> list[str]:
        """A list of formatted attribute values in the same order as the header."""
        return [self.format_value(value) for value in self.values()]

    def formatted_items(self) -> list[tuple[str, str]]:
        """A list of (field name, formatted value) pairs in the same order as the header."""
        return [(key, self.format_value(value)) for key, value in self.items()]

    @classmethod
    def _parsers(cls) -> dict[type, Callable[[str], Any]]:
        """
        Mapping of type to a specific parser for that type.

        The parser must accept a string as a single parameter and return a single value of
        the given type.  Sub-classes may override this method to support custom types.
        """
        return {}

    @classmethod
    def read(
        cls,
        path: Path,
        ignore_extra_fields: bool = True,
        strip_whitespace: bool = False,
        threads: int | None = None,
    ) -> Iterator[Any]:
        """
        Reads in zero or more metrics from the given path.

        The metric file must contain a matching header.

        Columns that are not present in the file but are optional in the metric class will
        take their default values.

        Args:
            path: the path to the metrics file.
            ignore_extra_fields: True to ignore any extra columns, False to raise an exception.
            strip_whitespace: True to strip leading and trailing whitespace from each field,
                               False to keep as-is.
            threads: the number of threads to use when decompressing gzip files
        """
        parsers = cls._parsers()
        with io.to_reader(path, threads=threads) as reader:
            header: list[str] = reader.readline().rstrip("\r\n").split("\t")
            # check the header
            class_fields = set(cls.header())
            file_fields = set(header)
            missing_from_class = file_fields.difference(class_fields)
            missing_from_file = class_fields.difference(file_fields)

            field_name_to_attribute = inspect.get_fields_dict(cls)  # type: ignore[arg-type]

            # ignore class fields that are missing from the file (via header) if they're optional
            # or have a default
            if len(missing_from_file) > 0:
                fields_with_defaults = [
                    field
                    for field in missing_from_file
                    if inspect._attribute_has_default(field_name_to_attribute[field])
                ]
                # remove optional class fields from the fields
                missing_from_file = missing_from_file.difference(fields_with_defaults)

            # raise an exception if there are non-optional class fields missing from the file
            if len(missing_from_file) > 0:
                raise ValueError(
                    f"In file: {path}, fields in class '{cls.__name__}' missing from file: "
                    + ", ".join(missing_from_file)
                )

            # raise an exception if there are fields in the file not in the class, unless they
            # should be ignored.
            if not ignore_extra_fields and len(missing_from_class) > 0:
                raise ValueError(
                    f"In file: {path}, extra fields in file missing from class '{cls.__name__}': "
                    + ", ".join(missing_from_class)
                )

            # read the metric lines
            for lineno, line in enumerate(reader, 2):
                # parse the raw values
                values: list[str] = line.rstrip("\r\n").split("\t")
                if strip_whitespace:
                    values = [v.strip() for v in values]

                # raise an exception if there aren't the same number of values as the header
                if len(header) != len(values):
                    raise ValueError(
                        f"In file: {path}, expected {len(header)} columns, got {len(values)} on "
                        f"line {lineno}: {line}"
                    )

                # build the metric
                instance: Metric[MetricType] = inspect.attr_from(
                    cls=cls, kwargs=dict(zip(header, values, strict=True)), parsers=parsers
                )
                yield instance

    @classmethod
    def parse(cls, fields: list[str]) -> Any:
        """
        Parses the string-representation of this metric.

        One string per attribute should be given.
        """
        parsers = cls._parsers()
        header = cls.header()
        assert len(fields) == len(header)
        return inspect.attr_from(
            cls=cls, kwargs=dict(zip(header, fields, strict=True)), parsers=parsers
        )

    @classmethod
    def write(cls, path: Path, *values: MetricType, threads: int | None = None) -> None:
        """
        Writes zero or more metrics to the given path.

        The header will always be written.

        Args:
            path: Path to the output file.
            values: Zero or more metrics.
            threads: the number of threads to use when compressing gzip files

        """
        with MetricWriter[MetricType](path, metric_class=cls, threads=threads) as writer:
            writer.writeall(values)

    @classmethod
    def header(cls) -> list[str]:
        """The list of header values for the metric."""
        return [a.name for a in inspect.get_fields(cls)]  # type: ignore[arg-type]

    @classmethod
    def format_value(cls, value: Any) -> str:  # noqa: C901
        """
        The default method to format values of a given type.

        By default, this method will comma-delimit `list`, `tuple`, and `set` types, and apply
        `str` to all others.

        Dictionaries / mappings will have keys and vals separated by semicolons, and key val pairs
        delimited by commas.

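        In addition, tuples will be flanked with '()' and sets and dictionaries with '{}'. Lists
        are written without flanking brackets. Empty lists, empty sets, and `None` are formatted
        as the empty string; an empty tuple is formatted as '()' and an empty dictionary as '{}'.

        Enum members are formatted by their value, and floats are rounded to five decimal places.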

        Args:
            value: the value to format.
        """
        if issubclass(type(value), Enum):
            return cls.format_value(value.value)
        if isinstance(value, (tuple)):
            if len(value) == 0:
                return "()"
            else:
                return "(" + ",".join(cls.format_value(v) for v in value) + ")"
        if isinstance(value, (list)):
            if len(value) == 0:
                return ""
            else:
                return ",".join(cls.format_value(v) for v in value)
        if isinstance(value, (set)):
            if len(value) == 0:
                return ""
            else:
                return "{" + ",".join(cls.format_value(v) for v in value) + "}"

        elif isinstance(value, dict):
            if len(value) == 0:
                return "{}"
            else:
                return (
                    "{"
                    + ",".join(
                        f"{cls.format_value(k)};{cls.format_value(v)}" for k, v in value.items()
                    )
                    + "}"
                )
        elif isinstance(value, float):
            return f"{round(value, 5)}"
        elif value is None:
            return ""
        else:
            return f"{value}"

    @classmethod
    def to_list(cls, value: str) -> list[Any]:
        """Returns a list value split on comma delimiter."""
        return [] if value == "" else value.split(",")

    @staticmethod
    def fast_concat(*inputs: Path, output: Path) -> None:
        """Concatenates multiple metric files into one, validating headers match."""
        if len(inputs) == 0:
            raise ValueError("No inputs provided")

        headers = [next(io.read_lines(input_path)) for input_path in inputs]
        assert len(set(headers)) == 1, "Input headers do not match"
        io.write_lines(path=output, lines_to_write=set(headers))

        for input_path in inputs:
            io.write_lines(
                path=output, lines_to_write=list(io.read_lines(input_path))[1:], append=True
            )

    @staticmethod
    def _read_header(
        reader: TextIOWrapper,
        delimiter: str = "\t",
        comment_prefix: str = "#",
    ) -> MetricFileHeader:
        """
        Read the header from an open file.

        The first row after any commented or empty lines will be used as the fieldnames.

        Lines preceding the fieldnames will be returned in the `preamble`. Leading and trailing
        whitespace are removed and ignored.

        Args:
            reader: An open, readable file handle.
            delimiter: The delimiter character used to separate fields in the file.
            comment_prefix: The prefix for comment lines in the file.

        Returns:
            A `MetricFileHeader` containing the field names and any preceding lines.

        Raises:
            ValueError: If the file was empty or contained only comments or empty lines.
        """
        preamble: list[str] = []
        fieldnames: list[str] = []

        for line in reader:
            if line.strip().startswith(comment_prefix) or line.strip() == "":
                # Skip any commented or empty lines before the header
                preamble.append(line.strip())
            else:
                # The first line with any other content is assumed to be the header
                fieldnames = line.strip().split(delimiter)
                break

        return MetricFileHeader(preamble=preamble, fieldnames=fieldnames)

Functions¶

fast_concat staticmethod ¶

fast_concat(*inputs: Path, output: Path) -> None

Concatenates multiple metric files into one, validating headers match.

Source code in fgpyo/util/metric.py

@staticmethod
def fast_concat(*inputs: Path, output: Path) -> None:
    """Concatenates multiple metric files into one, validating headers match."""
    if len(inputs) == 0:
        raise ValueError("No inputs provided")

    headers = [next(io.read_lines(input_path)) for input_path in inputs]
    assert len(set(headers)) == 1, "Input headers do not match"
    io.write_lines(path=output, lines_to_write=set(headers))

    for input_path in inputs:
        io.write_lines(
            path=output, lines_to_write=list(io.read_lines(input_path))[1:], append=True
        )
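The header check and concatenation above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: `concat_with_header_check` and the file names are hypothetical, and it raises `ValueError` on mismatched headers where the source listing uses an `assert`.

```python
import tempfile
from pathlib import Path


def concat_with_header_check(*inputs: Path, output: Path) -> None:
    """Hypothetical stdlib-only analogue of `fast_concat`."""
    if len(inputs) == 0:
        raise ValueError("No inputs provided")

    # Compare only the first line of every input file.
    headers = {path.read_text().splitlines()[0] for path in inputs}
    if len(headers) != 1:
        raise ValueError("Input headers do not match")

    with output.open("w") as out:
        # Write the shared header once.
        out.write(headers.pop() + "\n")
        for path in inputs:
            # Skip each input's header row and copy the remaining rows.
            for line in path.read_text().splitlines()[1:]:
                out.write(line + "\n")


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "a.txt"
    second = Path(tmp) / "b.txt"
    merged = Path(tmp) / "merged.txt"
    first.write_text("name\tcount\nfoo\t1\n")
    second.write_text("name\tcount\nbar\t2\n")
    concat_with_header_check(first, second, output=merged)
    merged_text = merged.read_text()
```

Only the first line of each input is compared, so differences in the data rows are not detected by this kind of check.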

format_value classmethod ¶

format_value(value: Any) -> str

The default method to format values of a given type.

By default, this method will comma-delimit list, tuple, and set types, and apply str to all others.

Dictionaries / mappings will have keys and vals separated by semicolons, and key val pairs delimited by commas.

In addition, tuples will be flanked with '()' and sets and dictionaries with '{}'. Lists are written without flanking brackets.

Parameters:

Name	Type	Description	Default
`value`	`Any`	the value to format.	required

Source code in fgpyo/util/metric.py

@classmethod
def format_value(cls, value: Any) -> str:  # noqa: C901
    """
    The default method to format values of a given type.

    By default, this method will comma-delimit `list`, `tuple`, and `set` types, and apply
    `str` to all others.

    Dictionaries / mappings will have keys and vals separated by semicolons, and key val pairs
    delimited by commas.

    In addition, tuples will be flanked with '()' and sets and dictionaries with '{}'.
    Lists are written without flanking brackets.

    Args:
        value: the value to format.
    """
    if issubclass(type(value), Enum):
        return cls.format_value(value.value)
    if isinstance(value, (tuple)):
        if len(value) == 0:
            return "()"
        else:
            return "(" + ",".join(cls.format_value(v) for v in value) + ")"
    if isinstance(value, (list)):
        if len(value) == 0:
            return ""
        else:
            return ",".join(cls.format_value(v) for v in value)
    if isinstance(value, (set)):
        if len(value) == 0:
            return ""
        else:
            return "{" + ",".join(cls.format_value(v) for v in value) + "}"

    elif isinstance(value, dict):
        if len(value) == 0:
            return "{}"
        else:
            return (
                "{"
                + ",".join(
                    f"{cls.format_value(k)};{cls.format_value(v)}" for k, v in value.items()
                )
                + "}"
            )
    elif isinstance(value, float):
        return f"{round(value, 5)}"
    elif value is None:
        return ""
    else:
        return f"{value}"
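To make the formatting rules concrete, here is a standalone function that mirrors the branches of the listing above so it can run without the library installed. The module-level `format_value` and the `Color` enum are illustrative names, not part of fgpyo.

```python
from enum import Enum
from typing import Any


def format_value(value: Any) -> str:
    """Standalone mirror of the formatting rules in the listing above."""
    if isinstance(value, Enum):
        return format_value(value.value)
    if isinstance(value, tuple):
        return "(" + ",".join(format_value(v) for v in value) + ")"
    if isinstance(value, list):
        # Lists are comma-joined with no surrounding brackets.
        return ",".join(format_value(v) for v in value)
    if isinstance(value, set):
        if len(value) == 0:
            return ""
        return "{" + ",".join(format_value(v) for v in value) + "}"
    if isinstance(value, dict):
        pairs = ",".join(f"{format_value(k)};{format_value(v)}" for k, v in value.items())
        return "{" + pairs + "}"
    if isinstance(value, float):
        # Floats are rounded to five decimal places.
        return f"{round(value, 5)}"
    if value is None:
        return ""
    return f"{value}"


class Color(Enum):
    """Hypothetical enum used only for this example."""

    RED = "red"


examples = {
    "tuple": format_value((1, 2)),
    "list": format_value([1, 2]),
    "dict": format_value({"a": 1}),
    "float": format_value(0.123456789),
    "none": format_value(None),
    "enum": format_value(Color.RED),
}
```

An empty list and `None` both format to the empty string, so the two cannot be told apart in the written file.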

formatted_items ¶

formatted_items() -> list[tuple[str, str]]

A list of (field name, formatted value) pairs in the same order as the header.

Source code in fgpyo/util/metric.py

def formatted_items(self) -> list[tuple[str, str]]:
    """A list of (field name, formatted value) pairs in the same order as the header."""
    return [(key, self.format_value(value)) for key, value in self.items()]

formatted_values ¶

formatted_values() -> list[str]

A list of formatted attribute values in the same order as the header.

Source code in fgpyo/util/metric.py

def formatted_values(self) -> list[str]:
    """A list of formatted attribute values in the same order as the header."""
    return [self.format_value(value) for value in self.values()]

header classmethod ¶

header() -> list[str]

The list of header values for the metric.

Source code in fgpyo/util/metric.py

@classmethod
def header(cls) -> list[str]:
    """The list of header values for the metric."""
    return [a.name for a in inspect.get_fields(cls)]  # type: ignore[arg-type]

items ¶

items() -> Iterator[tuple[str, Any]]

An iterator over field names and values in the same order as the header.

Source code in fgpyo/util/metric.py

def items(self) -> Iterator[tuple[str, Any]]:
    """An iterator over field names and values in the same order as the header."""
    for field in inspect.get_fields(self.__class__):  # type: ignore[arg-type]
        yield (field.name, getattr(self, field.name))

keys classmethod ¶

keys() -> Iterator[str]

An iterator over field names in the same order as the header.

Source code in fgpyo/util/metric.py

@classmethod
def keys(cls) -> Iterator[str]:
    """An iterator over field names in the same order as the header."""
    for field in inspect.get_fields(cls):  # type: ignore[arg-type]
        yield field.name

parse classmethod ¶

parse(fields: list[str]) -> Any

Parses the string-representation of this metric.

One string per attribute should be given.

Source code in fgpyo/util/metric.py

@classmethod
def parse(cls, fields: list[str]) -> Any:
    """
    Parses the string-representation of this metric.

    One string per attribute should be given.
    """
    parsers = cls._parsers()
    header = cls.header()
    assert len(fields) == len(header)
    return inspect.attr_from(
        cls=cls, kwargs=dict(zip(header, fields, strict=True)), parsers=parsers
    )

read classmethod ¶

read(path: Path, ignore_extra_fields: bool = True, strip_whitespace: bool = False, threads: int | None = None) -> Iterator[Any]

Reads in zero or more metrics from the given path.

The metric file must contain a matching header.

Columns that are not present in the file but are optional in the metric class will be set to their default values.

Parameters:

Name	Type	Description	Default
`path`	`Path`	the path to the metrics file.	required
`ignore_extra_fields`	`bool`	True to ignore any extra columns, False to raise an exception.	`True`
`strip_whitespace`	`bool`	True to strip leading and trailing whitespace from each field, False to keep as-is.	`False`
`threads`	`int \| None`	the number of threads to use when decompressing gzip files	`None`

Source code in fgpyo/util/metric.py

@classmethod
def read(
    cls,
    path: Path,
    ignore_extra_fields: bool = True,
    strip_whitespace: bool = False,
    threads: int | None = None,
) -> Iterator[Any]:
    """
    Reads in zero or more metrics from the given path.

    The metric file must contain a matching header.

    Columns that are not present in the file but are optional in the metric class will
    be set to their default values.

    Args:
        path: the path to the metrics file.
        ignore_extra_fields: True to ignore any extra columns, False to raise an exception.
        strip_whitespace: True to strip leading and trailing whitespace from each field,
                           False to keep as-is.
        threads: the number of threads to use when decompressing gzip files
    """
    parsers = cls._parsers()
    with io.to_reader(path, threads=threads) as reader:
        header: list[str] = reader.readline().rstrip("\r\n").split("\t")
        # check the header
        class_fields = set(cls.header())
        file_fields = set(header)
        missing_from_class = file_fields.difference(class_fields)
        missing_from_file = class_fields.difference(file_fields)

        field_name_to_attribute = inspect.get_fields_dict(cls)  # type: ignore[arg-type]

        # ignore class fields that are missing from the file (via header) if they're optional
        # or have a default
        if len(missing_from_file) > 0:
            fields_with_defaults = [
                field
                for field in missing_from_file
                if inspect._attribute_has_default(field_name_to_attribute[field])
            ]
            # remove optional class fields from the fields
            missing_from_file = missing_from_file.difference(fields_with_defaults)

        # raise an exception if there are non-optional class fields missing from the file
        if len(missing_from_file) > 0:
            raise ValueError(
                f"In file: {path}, fields in class '{cls.__name__}' missing from file: "
                + ", ".join(missing_from_file)
            )

        # raise an exception if there are fields in the file not in the class, unless they
        # should be ignored.
        if not ignore_extra_fields and len(missing_from_class) > 0:
            raise ValueError(
                f"In file: {path}, extra fields in file missing from class '{cls.__name__}': "
                + ", ".join(missing_from_class)
            )

        # read the metric lines
        for lineno, line in enumerate(reader, 2):
            # parse the raw values
            values: list[str] = line.rstrip("\r\n").split("\t")
            if strip_whitespace:
                values = [v.strip() for v in values]

            # raise an exception if there aren't the same number of values as the header
            if len(header) != len(values):
                raise ValueError(
                    f"In file: {path}, expected {len(header)} columns, got {len(values)} on "
                    f"line {lineno}: {line}"
                )

            # build the metric
            instance: Metric[MetricType] = inspect.attr_from(
                cls=cls, kwargs=dict(zip(header, values, strict=True)), parsers=parsers
            )
            yield instance
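The header reconciliation that `read` performs can be illustrated with plain dataclasses and set differences. This is a sketch under assumptions: `Example` and `check_header` are hypothetical names, and defaults are detected through the standard `dataclasses` API rather than the library's own inspection helpers.

```python
from dataclasses import MISSING, dataclass, fields


@dataclass(frozen=True)
class Example:
    """Hypothetical record type used only for this illustration."""

    name: str
    count: int
    note: str = "n/a"


def check_header(file_header: list[str], ignore_extra_fields: bool = True) -> list[str]:
    """Returns the class fields that will fall back to defaults, or raises ValueError."""
    class_fields = {f.name for f in fields(Example)}
    file_fields = set(file_header)
    has_default = {
        f.name
        for f in fields(Example)
        if f.default is not MISSING or f.default_factory is not MISSING
    }

    missing_from_file = class_fields - file_fields
    extra_in_file = file_fields - class_fields

    # Class fields absent from the file are acceptable only if they have a default.
    required_missing = missing_from_file - has_default
    if required_missing:
        raise ValueError(
            "required fields missing from file: " + ", ".join(sorted(required_missing))
        )

    # Extra file columns are an error only when the caller asks for strictness.
    if extra_in_file and not ignore_extra_fields:
        raise ValueError("extra fields in file: " + ", ".join(sorted(extra_in_file)))

    return sorted(missing_from_file)


defaulted = check_header(["name", "count"])
```

Here `defaulted` lists the columns that would take their default values, which is the case described by the sentence about optional columns above.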

to_list classmethod ¶

to_list(value: str) -> list[Any]

Returns a list value split on comma delimiter.

Source code in fgpyo/util/metric.py

@classmethod
def to_list(cls, value: str) -> list[Any]:
    """Returns a list value split on comma delimiter."""
    return [] if value == "" else value.split(",")
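The explicit check for the empty string matters because `str.split` never returns an empty list when given a separator. The sketch below repeats the one-line body as a standalone function (the module-level `to_list` is only for illustration) and shows the difference.

```python
def to_list(value: str) -> list[str]:
    """Standalone mirror of the one-line body shown above."""
    return [] if value == "" else value.split(",")


# Without the guard, splitting an empty string yields one empty element.
naive = "".split(",")

# Elements stay strings; converting them is left to the caller.
parsed = to_list("1,2,3")
```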

values ¶

values() -> Iterator[Any]

An iterator over attribute values in the same order as the header.

Source code in fgpyo/util/metric.py

def values(self) -> Iterator[Any]:
    """An iterator over attribute values in the same order as the header."""
    for field in inspect.get_fields(self.__class__):  # type: ignore[arg-type]
        yield getattr(self, field.name)

write classmethod ¶

write(path: Path, *values: MetricType, threads: int | None = None) -> None

Writes zero or more metrics to the given path.

The header will always be written.

Parameters:

Name	Type	Description	Default
`path`	`Path`	Path to the output file.	required
`values`	`MetricType`	Zero or more metrics.	`()`
`threads`	`int \| None`	the number of threads to use when compressing gzip files	`None`

Source code in fgpyo/util/metric.py

@classmethod
def write(cls, path: Path, *values: MetricType, threads: int | None = None) -> None:
    """
    Writes zero or more metrics to the given path.

    The header will always be written.

    Args:
        path: Path to the output file.
        values: Zero or more metrics.
        threads: the number of threads to use when compressing gzip files

    """
    with MetricWriter[MetricType](path, metric_class=cls, threads=threads) as writer:
        writer.writeall(values)

MetricFileHeader dataclass ¶

Header of a file.

A file's header contains an optional preamble, consisting of lines prefixed by a comment character and/or empty lines, and a required row of fieldnames before the data rows begin.

Attributes:

Name	Type	Description
`preamble`	`list[str]`	A list of any lines preceding the fieldnames.
`fieldnames`	`list[str]`	The field names specified in the final line of the header.

Source code in fgpyo/util/metric.py

@dataclass(frozen=True)
class MetricFileHeader:
    """
    Header of a file.

    A file's header contains an optional preamble, consisting of lines prefixed by a comment
    character and/or empty lines, and a required row of fieldnames before the data rows begin.

    Attributes:
        preamble: A list of any lines preceding the fieldnames.
        fieldnames: The field names specified in the final line of the header.
    """

    preamble: list[str]
    fieldnames: list[str]
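A short sketch of how a preamble and a fieldnames row might be separated, following the loop in `_read_header` shown earlier. `FileHeader` and `read_header` are hypothetical stand-ins so that the example runs without the library, and the sample text is made up.

```python
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class FileHeader:
    """Hypothetical stand-in for `MetricFileHeader`."""

    preamble: list[str]
    fieldnames: list[str]


def read_header(
    reader: io.TextIOBase, delimiter: str = "\t", comment_prefix: str = "#"
) -> FileHeader:
    preamble: list[str] = []
    fieldnames: list[str] = []
    for line in reader:
        stripped = line.strip()
        if stripped.startswith(comment_prefix) or stripped == "":
            # Commented and empty lines before the fieldnames form the preamble.
            preamble.append(stripped)
        else:
            # The first line with other content is taken as the fieldnames row.
            fieldnames = stripped.split(delimiter)
            break
    return FileHeader(preamble=preamble, fieldnames=fieldnames)


reader = io.StringIO("# made by a hypothetical tool\n\nname\tcount\nfoo\t1\n")
header = read_header(reader)
remaining = reader.read()
```

After the header is consumed, the reader is positioned at the first data row, so the remaining lines can be passed to a row parser.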

MetricWriter ¶

Bases: Generic[MetricType], AbstractContextManager

Writes Metric instances to a delimited file.

Source code in fgpyo/util/metric.py

class MetricWriter(Generic[MetricType], AbstractContextManager):
    """Writes Metric instances to a delimited file."""

    _metric_class: type[Metric]
    _fieldnames: list[str]
    _fout: TextIOWrapper
    _writer: DictWriter

    def __init__(
        self,
        filename: Path | str,
        metric_class: type[Metric],
        append: bool = False,
        delimiter: str = "\t",
        include_fields: list[str] | None = None,
        exclude_fields: list[str] | None = None,
        lineterminator: str = "\n",
        threads: int | None = None,
    ) -> None:
        r"""
        Initializes the MetricWriter.

        Args:
            filename: Path to the file to write.
            metric_class: Metric class.
            append: If `True`, the file will be appended to. Otherwise, the specified file will be
                overwritten.
            delimiter: The output file delimiter.
            include_fields: If specified, only the listed fieldnames will be included when writing
                records to file. Fields will be written in the order provided.
                May not be used together with `exclude_fields`.
            exclude_fields: If specified, any listed fieldnames will be excluded when writing
                records to file.
                May not be used together with `include_fields`.
            lineterminator: The string used to terminate lines produced by the MetricWriter.
                Default = "\n".
            threads: the number of threads to use when compressing gzip files.

        Raises:
            TypeError: If the provided metric class is not a dataclass- or attr-decorated
                subclass of `Metric`.
            AssertionError: If the provided filepath is not writable.
            AssertionError: If `append=True` and the provided file is not readable. (When appending,
                we check to ensure that the header matches the specified metric class. The file must
                be readable to get the header.)
            ValueError: If `append=True` and the provided file is a FIFO (named pipe).
            ValueError: If `append=True` and the provided file does not include a header.
            ValueError: If `append=True` and the header of the provided file does not match the
                specified metric class and the specified include/exclude fields.
        """
        filepath: Path = Path(filename)
        if (filepath.is_fifo() or filepath.is_char_device()) and append:
            raise ValueError("Cannot append to stdout, stderr, or other named pipe or stream")

        ordered_fieldnames: list[str] = _validate_and_generate_final_output_fieldnames(
            metric_class=metric_class,
            include_fields=include_fields,
            exclude_fields=exclude_fields,
        )

        _assert_is_metric_class(metric_class)
        io.assert_path_is_writable(filepath)
        if append:
            io.assert_path_is_readable(filepath)
            _assert_file_header_matches_metric(
                path=filepath,
                metric_class=metric_class,
                ordered_fieldnames=ordered_fieldnames,
                delimiter=delimiter,
            )

        self._metric_class = metric_class
        self._fieldnames = ordered_fieldnames
        self._fout = io.to_writer(filepath, append=append, threads=threads)
        self._writer = DictWriter(
            f=self._fout,
            fieldnames=self._fieldnames,
            delimiter=delimiter,
            lineterminator=lineterminator,
        )

        # If we aren't appending to an existing file, write the header before any rows
        if not append:
            self._writer.writeheader()

    def __enter__(self) -> "MetricWriter":
        """Returns self for use as a context manager."""
        return self

    def __exit__(
        self,
        exc_type: type[BaseException] | None,
        exc_value: BaseException | None,
        traceback: TracebackType | None,
    ) -> None:
        """Closes the underlying writer on exit."""
        self.close()
        super().__exit__(exc_type, exc_value, traceback)

    def close(self) -> None:
        """Close the underlying file handle."""
        self._fout.close()

    def write(self, metric: MetricType) -> None:
        """
        Write a single Metric instance to file.

        The Metric is converted to a dictionary and then written using the underlying
        `csv.DictWriter`. If the `MetricWriter` was created using the `include_fields` or
        `exclude_fields` arguments, the fields of the Metric are subset and/or reordered
        accordingly before writing.

        Args:
            metric: An instance of the specified Metric.

        Raises:
            TypeError: If the provided `metric` is not an instance of the Metric class used to
                parametrize the writer.
        """
        # Serialize the Metric to a dict for writing by the underlying `DictWriter`
        row = {fieldname: val for fieldname, val in metric.formatted_items()}

        # Filter and/or re-order output fields if necessary
        row = {fieldname: row[fieldname] for fieldname in self._fieldnames}

        self._writer.writerow(row)

    def writeall(self, metrics: Iterable[MetricType]) -> None:
        """
        Write multiple Metric instances to file.

        Each Metric is converted to a dictionary and then written using the underlying
        `csv.DictWriter`. If the `MetricWriter` was created using the `include_fields` or
        `exclude_fields` arguments, the attributes of each Metric are subset and/or reordered
        accordingly before writing.

        Args:
            metrics: A sequence of instances of the specified Metric.
        """
        for metric in metrics:
            self.write(metric)

Functions¶

__enter__ ¶

__enter__() -> MetricWriter

Returns self for use as a context manager.

Source code in fgpyo/util/metric.py

def __enter__(self) -> "MetricWriter":
    """Returns self for use as a context manager."""
    return self

__exit__ ¶

__exit__(exc_type: type[BaseException] | None, exc_value: BaseException | None, traceback: TracebackType | None) -> None

Closes the underlying writer on exit.

Source code in fgpyo/util/metric.py

def __exit__(
    self,
    exc_type: type[BaseException] | None,
    exc_value: BaseException | None,
    traceback: TracebackType | None,
) -> None:
    """Closes the underlying writer on exit."""
    self.close()
    super().__exit__(exc_type, exc_value, traceback)

__init__ ¶

__init__(filename: Path | str, metric_class: type[Metric], append: bool = False, delimiter: str = '\t', include_fields: list[str] | None = None, exclude_fields: list[str] | None = None, lineterminator: str = '\n', threads: int | None = None) -> None

Initializes the MetricWriter.

Parameters:

Name	Type	Description	Default
`filename`	`Path \| str`	Path to the file to write.	required
`metric_class`	`type[Metric]`	Metric class.	required
`append`	`bool`	If `True`, the file will be appended to. Otherwise, the specified file will be overwritten.	`False`
`delimiter`	`str`	The output file delimiter.	`'\t'`
`include_fields`	`list[str] \| None`	If specified, only the listed fieldnames will be included when writing records to file. Fields will be written in the order provided. May not be used together with `exclude_fields`.	`None`
`exclude_fields`	`list[str] \| None`	If specified, any listed fieldnames will be excluded when writing records to file. May not be used together with `include_fields`.	`None`
`lineterminator`	`str`	The string used to terminate lines produced by the MetricWriter. Default = "\n".	`'\n'`
`threads`	`int \| None`	the number of threads to use when compressing gzip files.	`None`

Raises:

Type	Description
`TypeError`	If the provided metric class is not a dataclass- or attr-decorated subclass of `Metric`.
`AssertionError`	If the provided filepath is not writable.
`AssertionError`	If `append=True` and the provided file is not readable. (When appending, we check to ensure that the header matches the specified metric class. The file must be readable to get the header.)
`ValueError`	If `append=True` and the provided file is a FIFO (named pipe).
`ValueError`	If `append=True` and the provided file does not include a header.
`ValueError`	If `append=True` and the header of the provided file does not match the specified metric class and the specified include/exclude fields.

Source code in fgpyo/util/metric.py

def __init__(
    self,
    filename: Path | str,
    metric_class: type[Metric],
    append: bool = False,
    delimiter: str = "\t",
    include_fields: list[str] | None = None,
    exclude_fields: list[str] | None = None,
    lineterminator: str = "\n",
    threads: int | None = None,
) -> None:
    r"""
    Initializes the MetricWriter.

    Args:
        filename: Path to the file to write.
        metric_class: Metric class.
        append: If `True`, the file will be appended to. Otherwise, the specified file will be
            overwritten.
        delimiter: The output file delimiter.
        include_fields: If specified, only the listed fieldnames will be included when writing
            records to file. Fields will be written in the order provided.
            May not be used together with `exclude_fields`.
        exclude_fields: If specified, any listed fieldnames will be excluded when writing
            records to file.
            May not be used together with `include_fields`.
        lineterminator: The string used to terminate lines produced by the MetricWriter.
            Default = "\n".
        threads: the number of threads to use when compressing gzip files.

    Raises:
        TypeError: If the provided metric class is not a dataclass- or attr-decorated
            subclass of `Metric`.
        AssertionError: If the provided filepath is not writable.
        AssertionError: If `append=True` and the provided file is not readable. (When appending,
            we check to ensure that the header matches the specified metric class. The file must
            be readable to get the header.)
        ValueError: If `append=True` and the provided file is a FIFO (named pipe).
        ValueError: If `append=True` and the provided file does not include a header.
        ValueError: If `append=True` and the header of the provided file does not match the
            specified metric class and the specified include/exclude fields.
    """
    filepath: Path = Path(filename)
    if (filepath.is_fifo() or filepath.is_char_device()) and append:
        raise ValueError("Cannot append to stdout, stderr, or other named pipe or stream")

    ordered_fieldnames: list[str] = _validate_and_generate_final_output_fieldnames(
        metric_class=metric_class,
        include_fields=include_fields,
        exclude_fields=exclude_fields,
    )

    _assert_is_metric_class(metric_class)
    io.assert_path_is_writable(filepath)
    if append:
        io.assert_path_is_readable(filepath)
        _assert_file_header_matches_metric(
            path=filepath,
            metric_class=metric_class,
            ordered_fieldnames=ordered_fieldnames,
            delimiter=delimiter,
        )

    self._metric_class = metric_class
    self._fieldnames = ordered_fieldnames
    self._fout = io.to_writer(filepath, append=append, threads=threads)
    self._writer = DictWriter(
        f=self._fout,
        fieldnames=self._fieldnames,
        delimiter=delimiter,
        lineterminator=lineterminator,
    )

    # If we aren't appending to an existing file, write the header before any rows
    if not append:
        self._writer.writeheader()

close ¶

close() -> None

Close the underlying file handle.

Source code in fgpyo/util/metric.py

def close(self) -> None:
    """Close the underlying file handle."""
    self._fout.close()

write ¶

write(metric: MetricType) -> None

Write a single Metric instance to file.

The Metric is converted to a dictionary and then written using the underlying csv.DictWriter. If the MetricWriter was created using the include_fields or exclude_fields arguments, the fields of the Metric are subset and/or reordered accordingly before writing.

Parameters:

Name	Type	Description	Default
`metric`	`MetricType`	An instance of the specified Metric.	required

Raises:

Type	Description
`TypeError`	If the provided `metric` is not an instance of the Metric class used to parametrize the writer.

Source code in fgpyo/util/metric.py

def write(self, metric: MetricType) -> None:
    """
    Write a single Metric instance to file.

    The Metric is converted to a dictionary and then written using the underlying
    `csv.DictWriter`. If the `MetricWriter` was created using the `include_fields` or
    `exclude_fields` arguments, the fields of the Metric are subset and/or reordered
    accordingly before writing.

    Args:
        metric: An instance of the specified Metric.

    Raises:
        TypeError: If the provided `metric` is not an instance of the Metric class used to
            parametrize the writer.
    """
    # Serialize the Metric to a dict for writing by the underlying `DictWriter`
    row = {fieldname: val for fieldname, val in metric.formatted_items()}

    # Filter and/or re-order output fields if necessary
    row = {fieldname: row[fieldname] for fieldname in self._fieldnames}

    self._writer.writerow(row)
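The filter-and-reorder step can be seen in isolation with `csv.DictWriter` and an in-memory buffer. The field names and values below are made up for illustration; only the standard library is used.

```python
import csv
import io

fieldnames = ["count", "name"]  # hypothetical output order, e.g. from include_fields
formatted = {"name": "foo", "count": "1", "note": "skipped"}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n")
writer.writeheader()

# Keep only the requested fields, in the requested order.
row = {fieldname: formatted[fieldname] for fieldname in fieldnames}
writer.writerow(row)
output = buffer.getvalue()

# By default DictWriter rejects rows with keys that are not in `fieldnames`.
try:
    writer.writerow(formatted)
    extra_rejected = False
except ValueError:
    extra_rejected = True
```

Because `DictWriter` raises on unknown keys by default, subsetting the row before `writerow` is what allows excluded fields to be dropped silently.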

writeall ¶

writeall(metrics: Iterable[MetricType]) -> None

Write multiple Metric instances to file.

Each Metric is converted to a dictionary and then written using the underlying csv.DictWriter. If the MetricWriter was created using the include_fields or exclude_fields arguments, the attributes of each Metric are subset and/or reordered accordingly before writing.

Parameters:

Name	Type	Description	Default
`metrics`	`Iterable[MetricType]`	A sequence of instances of the specified Metric.	required

Source code in fgpyo/util/metric.py

def writeall(self, metrics: Iterable[MetricType]) -> None:
    """
    Write multiple Metric instances to file.

    Each Metric is converted to a dictionary and then written using the underlying
    `csv.DictWriter`. If the `MetricWriter` was created using the `include_fields` or
    `exclude_fields` arguments, the attributes of each Metric are subset and/or reordered
    accordingly before writing.

    Args:
        metrics: A sequence of instances of the specified Metric.
    """
    for metric in metrics:
        self.write(metric)

Modules¶

string ¶

Functions¶

column_it ¶

column_it(rows: list[list[str]], delimiter: str = ' ') -> str

A simple version of Unix's column utility. This assumes the table is NxM.

Parameters:

Name	Type	Description	Default
`rows`	`list[list[str]]`	the rows to adjust. Each row must have the same number of delimited fields.	required
`delimiter`	`str`	the delimiter for each field in a row.	`' '`

Source code in fgpyo/util/string.py

def column_it(rows: list[list[str]], delimiter: str = " ") -> str:
    """
    A simple version of Unix's `column` utility.  This assumes the table is NxM.

    Args:
        rows: the rows to adjust.  Each row must have the same number of delimited fields.
        delimiter: the delimiter for each field in a row.
    """
    # get the # of columns
    num_columns = len(rows[0])
    # for each column, find the maximum length of a cell
    max_column_lengths: list[int] = [
        max(len(row[col_i]) for row in rows) for col_i in range(num_columns)
    ]
    # pad each row in the table
    return "\n".join(
        delimiter.join(
            (" " * (max_column_lengths[col_i] - len(row[col_i]))) + row[col_i]
            for col_i in range(num_columns)
        )
        for row in rows
    )
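As the listing shows, each cell is padded on the left, so columns come out right-aligned. A compact equivalent using `str.rjust` makes that visible; `right_align_columns` is a hypothetical name and the rows are made up. Like the original, it indexes `rows[0]`, so an empty `rows` list raises `IndexError`.

```python
def right_align_columns(rows: list[list[str]], delimiter: str = " ") -> str:
    """Hypothetical compact equivalent of `column_it` using `str.rjust`."""
    num_columns = len(rows[0])
    # For each column, find the widest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(num_columns)]
    return "\n".join(
        delimiter.join(row[i].rjust(widths[i]) for i in range(num_columns)) for row in rows
    )


table = right_align_columns([["name", "count"], ["a", "100"], ["longer", "7"]])
lines = table.split("\n")
```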

types ¶

Attributes¶

TypeAnnotation module-attribute ¶

TypeAnnotation: TypeAlias = type | _GenericAlias | UnionType | GenericAlias

A function parameter's type annotation may be any of the following:

1) type, when declaring any of the built-in Python types
2) typing._GenericAlias, when declaring generic collection types or union types using pre-PEP 585 and pre-PEP 604 syntax (e.g. List[int], Optional[int], or Union[int, None])
3) types.UnionType, when declaring union types using PEP 604 syntax (e.g. int | None)
4) types.GenericAlias, when declaring generic collection types using PEP 585 syntax (e.g. list[int])

types.GenericAlias is a subclass of type, but typing._GenericAlias and types.UnionType are not and must be considered explicitly.

Classes¶

InspectException ¶

Bases: Exception

Raised when type inspection or parsing fails.

Source code in fgpyo/util/types.py

class InspectException(Exception):  # noqa: N818
    """Raised when type inspection or parsing fails."""

Functions¶

all_not_none ¶

all_not_none(values: tuple[T | None, ...]) -> TypeGuard[tuple[T, ...]]

all_not_none(values: list[T | None]) -> TypeGuard[list[T]]

all_not_none(values: set[T | None]) -> TypeGuard[set[T]]

all_not_none(values: Sequence[T | None]) -> TypeGuard[Sequence[T]]

all_not_none(values: Collection[T | None]) -> TypeGuard[Collection[T]]

all_not_none(values: Iterable[T | None]) -> bool

Type guard that checks all Optional collection elements are not None.

Parameters:

Name	Type	Description	Default
`values`	`Iterable[T \| None]`	Collection of Optional elements.	required

Returns:

Type	Description
`bool`	True if no elements are None, False otherwise. When True, narrows the collection type from `Container[T \| None]` to `Container[T]`.

Source code in fgpyo/util/types.py

def all_not_none(values: Iterable[T | None]) -> bool:
    """
    Type guard that checks all Optional collection elements are not None.

    Args:
        values: Collection of Optional elements.

    Returns:
        True if no elements are None, False otherwise. When True, narrows the collection type from
        `Container[T | None]` to `Container[T]`.
    """
    return all(v is not None for v in values)

is_constructible_from_str ¶

is_constructible_from_str(type_: TypeAnnotation) -> TypeGuard[type]

Returns true if the provided type is a class constructible from a single str argument.

Source code in fgpyo/util/types.py

def is_constructible_from_str(type_: TypeAnnotation) -> TypeGuard[type]:
    """Returns true if the provided type is a class constructible from a single str argument."""
    if not isinstance(type_, type):
        return False
    try:
        sig = inspect.signature(type_)
        ((argname, _),) = sig.bind(object()).arguments.items()
    except (TypeError, ValueError):
        return False
    return sig.parameters[argname].annotation is str
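The check above can be exercised with a few small classes. This is a sketch only: the function is a local copy of the source shown above (return type simplified to `bool` so it runs without fgpyo), and `SampleId`, `Count`, and `Interval` are hypothetical classes invented for the example.

```python
import inspect
from typing import Any


def is_constructible_from_str(type_: Any) -> bool:
    # Local copy of the logic shown above.
    if not isinstance(type_, type):
        return False
    try:
        sig = inspect.signature(type_)
        # Unpacking succeeds only if a single positional argument binds to exactly one parameter.
        ((argname, _),) = sig.bind(object()).arguments.items()
    except (TypeError, ValueError):
        return False
    return sig.parameters[argname].annotation is str


class SampleId:  # hypothetical: one parameter annotated as str
    def __init__(self, value: str) -> None:
        self.value = value


class Count:  # hypothetical: one parameter, but annotated as int
    def __init__(self, value: int) -> None:
        self.value = value


class Interval:  # hypothetical: two required parameters, so binding one argument fails
    def __init__(self, contig: str, start: int) -> None:
        self.contig = contig
        self.start = start


results = {cls.__name__: is_constructible_from_str(cls) for cls in (SampleId, Count, Interval)}
not_a_class = is_constructible_from_str("SampleId")  # a string is not a class
```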

is_known_str_constructible ¶

is_known_str_constructible(type_: TypeAnnotation) -> TypeGuard[type]

Returns true if type_ is one of the built-in types known to be constructible from a str.

Complements is_constructible_from_str, which detects str-constructibility via constructor signature inspection. This predicate covers types whose constructors aren't annotated for introspection (e.g. int, str, float) or whose subclasses don't all share an annotation (e.g. PurePath).

Source code in fgpyo/util/types.py

def is_known_str_constructible(type_: TypeAnnotation) -> TypeGuard[type]:
    """
    Returns true if `type_` is one of the built-in types known to be constructible from a str.

    Complements `is_constructible_from_str`, which detects str-constructibility via constructor
    signature inspection. This predicate covers types whose constructors aren't annotated for
    introspection (e.g. `int`, `str`, `float`) or whose subclasses don't all share an annotation
    (e.g. `PurePath`).
    """
    return isinstance(type_, type) and (type_ in (str, int, float) or issubclass(type_, PurePath))
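A brief sketch of which types this predicate accepts, again using a local copy of the one-line implementation above so that it runs without fgpyo. The choice of types to probe is arbitrary.

```python
from pathlib import Path, PurePath
from typing import Any


def is_known_str_constructible(type_: Any) -> bool:
    # Local copy of the implementation shown above (return type simplified to bool).
    return isinstance(type_, type) and (type_ in (str, int, float) or issubclass(type_, PurePath))


results = {t.__name__: is_known_str_constructible(t) for t in (str, int, float, Path, bool, list)}
```

The membership test is exact rather than subclass-based, so `bool` is not accepted even though it subclasses `int`; only the `PurePath` branch admits subclasses such as `Path`.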

is_list_like ¶

is_list_like(type_: type) -> bool

Returns true if the value is a list or list like object.

Source code in fgpyo/util/types.py

def is_list_like(type_: type) -> bool:
    """Returns true if the value is a list or list like object."""
    return typing.get_origin(type_) in [list, collections.abc.Iterable, collections.abc.Sequence]
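The sketch below shows which annotations the check above treats as list-like. It uses a local copy of the implementation so it runs without fgpyo; the probed annotations are chosen only for illustration.

```python
import collections.abc
import typing
from typing import Any


def is_list_like(type_: Any) -> bool:
    # Local copy of the implementation shown above.
    return typing.get_origin(type_) in [list, collections.abc.Iterable, collections.abc.Sequence]


results = {
    "list[int]": is_list_like(list[int]),
    "typing.List[int]": is_list_like(typing.List[int]),
    "Sequence[str]": is_list_like(collections.abc.Sequence[str]),
    "Iterable[int]": is_list_like(typing.Iterable[int]),
    # An unparameterized `list` has no origin, so it does not match.
    "list": is_list_like(list),
    "set[int]": is_list_like(set[int]),
    "tuple[int, ...]": is_list_like(tuple[int, ...]),
}
```

Note that the test is on the origin of a parameterized annotation, not on a runtime value, so the bare class `list` is reported as not list-like.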

make_enum_parser ¶

make_enum_parser(enum: type[EnumType]) -> partial

Makes a parser function for enum classes.

Source code in fgpyo/util/types.py

def make_enum_parser(enum: type[EnumType]) -> partial:
    """Makes a parser function for enum classes."""
    return partial(_make_enum_parser_worker, enum)
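The pattern above binds the enum class into a `functools.partial`, leaving a callable that takes only the string to parse. The private worker is not shown on this page, so the sketch below substitutes a hypothetical `_enum_worker` that looks members up by value; that lookup rule, the `Strand` enum, and the local `make_enum_parser` are all assumptions made for illustration, not fgpyo's actual behavior.

```python
from enum import Enum
from functools import partial


class Strand(Enum):  # hypothetical enum for the example
    POSITIVE = "+"
    NEGATIVE = "-"


def _enum_worker(enum: type, value: str) -> Enum:
    # Hypothetical stand-in for the private worker: look the member up by its value.
    return enum(value)


def make_enum_parser(enum: type) -> partial:
    # Same shape as the function above: pre-bind the enum class.
    return partial(_enum_worker, enum)


parse_strand = make_enum_parser(Strand)
parsed = parse_strand("+")
```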

make_literal_parser ¶

make_literal_parser(literal: TypeAnnotation, parsers: Iterable[Callable[[str], LiteralType]]) -> partial

Generates a parser function for a literal type object.

Source code in fgpyo/util/types.py

def make_literal_parser(
    literal: TypeAnnotation, parsers: Iterable[Callable[[str], LiteralType]]
) -> partial:
    """Generates a parser function for a literal type object."""
    return partial(_make_literal_parser_worker, literal, parsers)

make_union_parser ¶

make_union_parser(union: TypeAnnotation, parsers: Iterable[Callable[[str], UnionType]]) -> partial

Generates a parser function for a union type object.

Source code in fgpyo/util/types.py

def make_union_parser(
    union: TypeAnnotation, parsers: Iterable[Callable[[str], UnionType]]
) -> partial:
    """Generates a parser function for a union type object."""
    return partial(_make_union_parser_worker, union, parsers)
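As with the enum parser, the union parser is a `functools.partial` over a private worker that is not shown here. The sketch below assumes a worker that tries each parser in order and returns the first result that does not raise `ValueError`; `_union_worker` and that strategy are hypothetical and may differ from fgpyo's implementation.

```python
from functools import partial
from typing import Any, Callable, Iterable, Union


def _union_worker(union: Any, parsers: Iterable[Callable[[str], Any]], value: str) -> Any:
    # Hypothetical stand-in: return the result of the first parser that accepts the value.
    for parser in parsers:
        try:
            return parser(value)
        except ValueError:
            continue
    raise ValueError(f"{value!r} could not be parsed as {union}")


def make_union_parser(union: Any, parsers: Iterable[Callable[[str], Any]]) -> partial:
    # Same shape as the function above: pre-bind the union and its member parsers.
    return partial(_union_worker, union, parsers)


# A list is passed (not a one-shot generator) because the partial reuses it on every call.
parse_number = make_union_parser(Union[int, float], [int, float])
as_int = parse_number("3")
as_float = parse_number("3.5")
```

Under this assumed strategy the order of the parsers matters: `"3"` is parsed by `int` before `float` is ever tried.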

none_parser ¶

none_parser(value: str) -> Literal[None]

Returns None if the value is the empty string, else raises an error.

Source code in fgpyo/util/types.py

def none_parser(value: str) -> Literal[None]:
    """Returns None if the value is the empty string, else raises an error."""
    if value == "":
        return None
    raise ValueError(f"NoneType not a valid type for {value}")
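A quick check of the behavior, using a local copy of the function body shown above so the snippet runs without fgpyo. According to that source, only the empty string is accepted; the literal text `"None"` is rejected.

```python
def none_parser(value: str) -> None:
    # Local copy of the implementation shown above.
    if value == "":
        return None
    raise ValueError(f"NoneType not a valid type for {value}")


empty_result = none_parser("")

try:
    none_parser("None")
    rejected_literal_none = False
except ValueError:
    rejected_literal_none = True
```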

parse_bool ¶

parse_bool(string: str) -> bool

Parses strings into bools.

Accounts for the many different text representations of bools that can be used.

Source code in fgpyo/util/types.py

def parse_bool(string: str) -> bool:
    """
    Parses strings into bools.

    Accounts for the many different text representations of bools that can be used.
    """
    if string.lower() in ["t", "true", "1"]:
        return True
    elif string.lower() in ["f", "false", "0"]:
        return False
    else:
        raise ValueError("{} is not a valid boolean string".format(string))
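The accepted spellings can be seen in a short sketch. It uses a local copy of the implementation above so it runs without fgpyo; the inputs are arbitrary examples.

```python
def parse_bool(string: str) -> bool:
    # Local copy of the implementation shown above.
    if string.lower() in ["t", "true", "1"]:
        return True
    elif string.lower() in ["f", "false", "0"]:
        return False
    else:
        raise ValueError("{} is not a valid boolean string".format(string))


truthy = [parse_bool(s) for s in ("T", "true", "TRUE", "1")]
falsy = [parse_bool(s) for s in ("f", "False", "0")]

try:
    parse_bool("yes")
    rejected_yes = False
except ValueError:
    # Spellings outside the two lists, such as "yes", are rejected rather than guessed.
    rejected_yes = True
```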

vcf ¶

Classes for generating VCF and records for testing.¶

This module contains utility classes for the generation of VCF files and variant records, for use in testing.

The module contains the following public classes:

VariantBuilder() -- A builder class that allows the accumulation of variant records and access as a list and writing to file.

Examples¶

Typically, we have pysam.VariantRecord records obtained from reading from a VCF file. The VariantBuilder() class builds such records.

Variants are added with the add() method, which returns a pysam.VariantRecord.

>>> import pysam
>>> from fgpyo.vcf.builder import VariantBuilder
>>> builder: VariantBuilder = VariantBuilder()
>>> new_record_1: pysam.VariantRecord = builder.add()  # uses the defaults
>>> new_record_2: pysam.VariantRecord = builder.add(
...     contig="chr2", pos=1001, id="rs1234", ref="C", alts=["T"],
...     qual=40, filter=["PASS"]
... )

VariantBuilder can create sites-only, single-sample, or multi-sample VCF files. If not producing a sites-only VCF file, VariantBuilder must be created by passing a list of sample IDs:

>>> builder: VariantBuilder = VariantBuilder(sample_ids=["sample1", "sample2"])
>>> new_record_1: pysam.VariantRecord = builder.add()  # uses the defaults
>>> new_record_2: pysam.VariantRecord = builder.add(
...     samples={"sample1": {"GT": "0|1"}, "sample2": {"GT": "0|0"}}
... )

The variants stored in the builder can be retrieved as a coordinate sorted VCF file via the to_path() method:

>>> from pathlib import Path
>>> path_to_vcf: Path = builder.to_path()

The variants may also be retrieved in the order they were added via the to_unsorted_list() method and in coordinate sorted order via the to_sorted_list() method.

Functions¶

reader ¶

reader(path: VcfPath) -> Generator[VariantFile, None, None]

Opens the given path for VCF reading.

Parameters:

Name	Type	Description	Default
`path`	`VcfPath`	the path to a VCF, or an open file handle	required

Source code in fgpyo/vcf/__init__.py

@contextmanager
def reader(path: VcfPath) -> Generator[VcfReader, None, None]:
    """
    Opens the given path for VCF reading.

    Args:
        path: the path to a VCF, or an open file handle
    """
    if not isinstance(path, (str, Path, io.IOBase)):
        raise TypeError(f"Cannot open '{type(path)}' for VCF reading.")
    with fgpyo.io.suppress_stderr():
        # to avoid spamming log about index older than vcf, redirect stderr to /dev/null: only
        # when first opening the file
        _reader = VariantFile(path, mode="r")  # type: ignore[arg-type]
    # now stderr is back, so any later stderr messages will go through
    try:
        yield _reader
    finally:
        _reader.close()

writer ¶

writer(path: VcfPath, header: VariantHeader, mode: str = 'w') -> Generator[VariantFile, None, None]

Opens the given path for VCF writing.

Parameters:

Name	Type	Description	Default
`path`	`VcfPath`	the path to a VCF, or an open filehandle	required
`header`	`VariantHeader`	the source for the output VCF header. If you are modifying a VCF file that you are reading from, you can pass reader.header	required
`mode`	`str`	the pysam write mode. The default `"w"` relies on pysam auto-detecting the output format from the filename extension (e.g. `.vcf.gz` → bgzipped). Callers passing an open file handle should supply an explicit mode — `"wz"` for bgzipped VCF, `"wb"` for BCF — since there is no filename to sniff.	`'w'`

Source code in fgpyo/vcf/__init__.py

@contextmanager
def writer(
    path: VcfPath,
    header: VariantHeader,
    mode: str = "w",
) -> Generator[VcfWriter, None, None]:
    """
    Opens the given path for VCF writing.

    Args:
        path: the path to a VCF, or an open filehandle
        header: the source for the output VCF header. If you are modifying a VCF file that you are
                reading from, you can pass reader.header
        mode: the pysam write mode. The default `"w"` relies on pysam auto-detecting the output
                format from the filename extension (e.g. `.vcf.gz` → bgzipped). Callers passing an
                open file handle should supply an explicit mode — `"wz"` for bgzipped VCF, `"wb"`
                for BCF — since there is no filename to sniff.
    """
    if not isinstance(path, (str, Path, io.IOBase)):
        raise TypeError(f"Cannot open '{type(path)}' for VCF writing.")
    # Convert Path to str such that pysam will autodetect to write as a gzipped file if provided
    # with a .vcf.gz suffix.
    if isinstance(path, Path):
        path = str(path)
    # pysam's stubs narrow `mode` and `path` below what's accepted at runtime (e.g. mode="wz").
    _writer = VariantFile(path, header=header, mode=mode)  # type: ignore[arg-type]
    try:
        yield _writer
    finally:
        _writer.close()

Modules¶

builder ¶

Classes for generating VCF and records for testing.¶

Classes¶

VariantBuilder ¶

Builder for constructing one or more variant records (pysam.VariantRecord) for a VCF.

The VCF can be sites-only, single-sample, or multi-sample.

Provides the ability to manufacture variants from minimal arguments, while generating any remaining attributes to ensure a valid variant.

A builder is constructed with a handful of defaults including the sample name and sequence dictionary. If the VCF will not be sites-only, the list of sample IDs ("sample_ids") must be provided to the VariantBuilder constructor.

Variants are then added using the add() method. Once accumulated the variants can be accessed in the order in which they were created through the to_unsorted_list() function, or in a list sorted by coordinate order via to_sorted_list(). Lastly, the records can be written to a temporary file using to_path().

Attributes:

Name	Type	Description
`sample_ids`	`list[str]`	the sample name(s)
`sd`	`dict[str, dict[str, Any]]`	sequence dictionary, implemented as python dict from contig name to dictionary with contig properties. At a minimum, each contig dict in sd must contain "ID" (the same as contig_name) and "length", the contig length. Other values will be added to the VCF header line for that contig.
`seq_idx_lookup`	`dict[str, int]`	dictionary mapping contig name to index of contig in sd
`records`	`list[VariantRecord]`	the list of variant records
`header`	`VariantHeader`	the pysam header
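To make the shape of `sd` concrete, the sketch below builds a small sequence dictionary and derives the `##contig` header lines and the contig index lookup from it. `contig_header_line` is a hypothetical standalone helper that mirrors the contig loop in `_build_header_string` from the class source; the contig names, lengths, and the extra `assembly` key are invented for the example.

```python
from typing import Any


def contig_header_line(entry: dict[str, Any]) -> str:
    # Mirrors the contig loop in VariantBuilder._build_header_string.
    if "ID" not in entry or "length" not in entry:
        raise ValueError("Sequence dictionary must include 'ID' and 'length' for each contig.")
    line = f"##contig=<ID={entry['ID']},length={entry['length']}"
    for key, value in entry.items():
        if key in ("ID", "length"):
            continue
        # Any additional key-value pairs are appended to the contig line.
        line += f",{key}={value}"
    return line + ">"


# Hypothetical sequence dictionary: contig name -> contig properties.
sd: dict[str, dict[str, Any]] = {
    "chr1": {"ID": "chr1", "length": 1000, "assembly": "test"},
    "chr2": {"ID": "chr2", "length": 500},
}

contig_lines = [contig_header_line(entry) for entry in sd.values()]

# Same construction as seq_idx_lookup: contig name -> position within sd.
seq_idx_lookup = {name: i for i, name in enumerate(sd.keys())}
```

Because the lookup follows the insertion order of `sd`, that order also determines how contigs are ranked when records are sorted by coordinate.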

Source code in fgpyo/vcf/builder.py

class VariantBuilder:
    """
    Builder for constructing one or more variant records (pysam.VariantRecord) for a VCF.

    The VCF can be sites-only, single-sample, or multi-sample.

    Provides the ability to manufacture variants from minimal arguments, while generating
    any remaining attributes to ensure a valid variant.

    A builder is constructed with a handful of defaults including the sample name and sequence
    dictionary. If the VCF will not be sites-only, the list of sample IDs ("sample_ids") must be
    provided to the VariantBuilder constructor.

    Variants are then added using the [`add()`][fgpyo.vcf.builder.VariantBuilder.add]
    method.
    Once accumulated the variants can be accessed in the order in which they were created through
    the [`to_unsorted_list()`][fgpyo.vcf.builder.VariantBuilder.to_unsorted_list]
    function, or in a list sorted by coordinate order via
    [`to_sorted_list()`][fgpyo.vcf.builder.VariantBuilder.to_sorted_list]. Lastly, the
    records can be written to a temporary file using
    [`to_path()`][fgpyo.vcf.builder.VariantBuilder.to_path].

    Attributes:
        sample_ids: the sample name(s)
        sd: sequence dictionary, implemented as python dict from contig name to dictionary with
            contig properties. At a minimum, each contig dict in sd must contain "ID" (the same as
            contig_name) and "length", the contig length. Other values will be added to the VCF
            header line for that contig.
        seq_idx_lookup: dictionary mapping contig name to index of contig in sd
        records: the list of variant records
        header: the pysam header
    """

    sample_ids: list[str]
    sd: dict[str, dict[str, Any]]
    seq_idx_lookup: dict[str, int]
    records: list[VariantRecord]
    header: VariantHeader

    def __init__(
        self,
        sample_ids: Iterable[str] | None = None,
        sd: dict[str, dict[str, Any]] | None = None,
    ) -> None:
        """
        Initializes a new VariantBuilder for generating variants and VCF files.

        Args:
            sample_ids: the name of the sample(s)
            sd: optional sequence dictionary
        """
        self.sample_ids: list[str] = list(sample_ids) if sample_ids is not None else []
        self.sd: dict[str, dict[str, Any]] = sd if sd is not None else VariantBuilder.default_sd()
        self.seq_idx_lookup: dict[str, int] = {name: i for i, name in enumerate(self.sd.keys())}
        self.records: list[VariantRecord] = []
        self.header = VariantHeader()
        for line in VariantBuilder._build_header_string(sd=self.sd):
            self.header.add_line(line)
        if sample_ids is not None:
            self.header.add_samples(sample_ids)

    @classmethod
    def default_sd(cls) -> dict[str, dict[str, Any]]:
        """
        Generates the default sequence dictionary for VariantBuilder.

        Re-uses the dictionary from SamBuilder for consistency.

        Returns:
            A new copy of the sequence dictionary as a map of contig name to dictionary, one per
            contig.
        """
        sd: dict[str, dict[str, Any]] = {}
        for sequence in SamBuilder.default_sd():
            contig = sequence["SN"]
            sd[contig] = {"ID": contig, "length": sequence["LN"]}
        return sd

    @classmethod
    def _build_header_string(cls, sd: dict[str, dict[str, Any]] | None = None) -> Iterator[str]:
        """
        Yields the VCF header lines for the given sequence dictionary.

        Args:
            sd: the sequence dictionary mapping the contig name to the key-value pairs for the
                given contig.  Must include "ID" and "length" for each contig.  If no sequence
                dictionary is given, will use the default dictionary.
        """
        if sd is None:
            sd = VariantBuilder.default_sd()
        # add mandatory VCF format
        yield "##fileformat=VCFv4.2"
        # add GT
        yield '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">'
        # add additional common INFO lines
        yield '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">'
        yield (
            '##INFO=<ID=AR,Number=A,Type=Float,Description="Allele Ratio - ratio of AD for allele'
            ' vs. AD for modal allele.">'
        )
        yield '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">'
        # add additional common FORMAT lines
        yield (
            '##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt'
            ' alleles in the order listed">'
        )
        yield '##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">'
        yield '##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Total Depth">'

        for d in sd.values():
            if "ID" not in d or "length" not in d:
                raise ValueError(
                    "Sequence dictionary must include 'ID' and 'length' for each contig."
                )
            contig_id = d["ID"]
            contig_length = d["length"]
            contig_header = f"##contig=<ID={contig_id},length={contig_length}"
            for key, value in d.items():
                if key == "ID" or key == "length":
                    continue
                contig_header += f",{key}={value}"
            contig_header += ">"
            yield contig_header

    @property
    def num_samples(self) -> int:
        """Returns the number of samples in the VCF."""
        return len(self.sample_ids)

    def add(
        self,
        contig: str | None = None,
        pos: int = 1000,
        end: int | None = None,
        id: str = ".",  # noqa: A002  # pysam is already shadowing the built-in
        ref: str = "A",
        alts: str | Iterable[str] | None = (".",),
        qual: int = 60,
        filter: str | Iterable[str] | None = None,  # noqa: A002
        info: dict[str, Any] | None = None,
        samples: dict[str, dict[str, Any]] | None = None,
    ) -> VariantRecord:
        """
        Generates a new variant and adds it to the internal collection.

        Notes:
        * Very little validation is done with respect to INFO and FORMAT keys being defined in the
        header.
        * VCFs are 1-based, but pysam is (mostly) 0-based. We define the function in terms of the
        VCF property "pos", which is 1-based. pysam will also report "pos" as 1-based, so that is
        the property that should be accessed when using the records produced by this function (not
        "start").

        Args:
            contig: the chromosome name. If None, will use the first contig in the sequence
                    dictionary.
            pos: the 1-based position of the variant
            end: an optional 1-based inclusive END position; if not specified a value will be looked
                 for in info["END"], or calculated from the length of the reference allele
            id: the variant id
            ref: the reference allele
            alts: the list of alternate alleles, None if no alternates. If a single string is
                  passed, that will be used as the only alt.
            qual: the variant quality
            filter: the list of filters, None if no filters (ex. PASS). If a single string is
                    passed, that will be used as the only filter.
            info: the dictionary of INFO key-value pairs
            samples: the dictionary from sample name to FORMAT key-value pairs.
                     If a sample property is supplied for any sample but omitted in some, it will
                     be set to missing (".") for samples that don't have that property explicitly
                     assigned. If a sample in the VCF is omitted, all its properties will be set to
                     missing.
        """
        if contig is None:
            contig = next(iter(self.sd.keys()))

        if contig not in self.sd:
            raise ValueError(f"Chromosome `{contig}` not in the sequence dictionary.")
        # because there are a lot of slightly different objects related to samples or called
        # "samples" in this function, we alias samples to sample_formats
        # we still want to keep the API labeled "samples" because that keeps the naming scheme the
        # same as the pysam API
        sample_formats = samples
        if sample_formats is not None:
            unknown_samples = set(sample_formats.keys()).difference(self.sample_ids)
            if len(unknown_samples) > 0:
                raise ValueError("Unknown sample(s) given: " + ", ".join(unknown_samples))

        if isinstance(alts, str):
            alts = (alts,)
        alleles = (ref,) if alts is None else (ref, *alts)
        if isinstance(filter, str):
            filter = (filter,)  # noqa: A001  # pysam already shadows the built-in

        # pysam expects a list of format dicts provided in the same order as the samples in the
        # header (self.sample_ids). (This is despite the fact that it will internally represent the
        # values as a map from sample ID to format values, as we do in this function.)
        # Convert to that form and rename to record_samples; to a) disambiguate from the input
        # values, and b) prevent mypy from complaining about the type changing from dict to list.
        if self.num_samples == 0:
            # this is a sites-only VCF
            record_samples = None
        elif sample_formats is None or len(sample_formats) == 0:
            # not a sites-only VCF, but no FORMAT values were passed. set FORMAT to missing (with
            # no fields)
            record_samples = None
        else:
            # convert to list form that pysam expects, in order pysam expects
            # note: the copy {**format_dict} below is present because pysam actually alters the
            # input values, which would be an unintended side-effect (in fact without this, tests
            # fail because the expected input values are changed)
            record_samples = [
                {**sample_formats.get(sample_id, {})} for sample_id in self.sample_ids
            ]

        variant = self.header.new_record(
            contig=contig,
            start=pos - 1,  # start is 0-based
            stop=self._compute_and_check_end(pos, ref, end, info),
            id=id,
            alleles=alleles,
            qual=qual,
            filter=filter,
            info=info,
            samples=record_samples,
        )

        self.records.append(variant)
        return variant

    def _compute_and_check_end(
        self, pos: int, ref: str, end: int | None, info: dict[str, Any] | None
    ) -> int:
        """
        Derives the END/stop position for a new record.

        Uses the optionally provided `end` parameter, the presence/absence of END in the info
        dictionary and/or the length of the reference allele.

        Also checks that any given or calculated end position is greater than or equal to the
        record's position.

        Args:
            pos: the 1-based position of the record
            ref: the reference allele of the record
            end: the provided 1-based end position if one was given
            info: the info dictionary if one was given
        """
        if end is not None and info is not None and "END" in info:
            raise ValueError(f"Two end positions given; end={end} and info.END={info['END']}")
        elif end is None:
            if info is not None and "END" in info:
                end = int(info["END"])
            else:
                end = pos + len(ref) - 1

        if end < pos:
            raise ValueError(f"Invalid end position, {end}, given for variant as pos {pos}.")

        return end

    def to_path(self, path: Path | None = None) -> Path:
        """
        Returns a path to a VCF for variants added to this builder.

        If the path given ends in ".gz" then the generated file will be bgzipped and
        a tabix index generated for the file with the suffix ".gz.tbi".

        Args:
            path: optional path to the VCF
        """
        # update the path
        path = self._to_vcf_path(path)

        # Create a writer and write to it
        with pysam_writer(path, header=self.header) as writer:
            for variant in self.to_sorted_list():
                writer.write(variant)

        if str(path.suffix) == ".gz":
            pysam.tabix_index(str(path), preset="vcf", force=True)

        return path

    @staticmethod
    def _to_vcf_path(path: Path | None) -> Path:
        """
        Gets the path to a VCF file.

        If path is `None`, then a temporary bgzipped VCF (".vcf.gz") will be created.  Otherwise,
        the given path is simply returned.

        Args:
            path: optionally the path to the VCF.
        """
        if path is None:
            with NamedTemporaryFile(suffix=".vcf.gz", delete=False) as fp:
                path = Path(fp.name)
            assert path.is_file()
        return path

    def to_unsorted_list(self) -> list[VariantRecord]:
        """Returns the accumulated records in the order they were created."""
        return list(self.records)

    def to_sorted_list(self) -> list[VariantRecord]:
        """Returns the accumulated records in coordinate order."""
        return sorted(self.records, key=self._sort_key)

    def _sort_key(self, variant: VariantRecord) -> tuple[int, int, int]:
        return self.seq_idx_lookup[variant.contig], variant.start, variant.stop

    def add_header_line(self, line: str) -> None:
        """Adds a header line to the header."""
        self.header.add_line(line)

    def add_info_header(
        self,
        name: str,
        field_type: VcfFieldType,
        number: int | VcfFieldNumber = 1,
        description: str | None = None,
        source: str | None = None,
        version: str | None = None,
    ) -> None:
        """
        Add an INFO header field to the VCF header.

        Args:
            name: the name of the field
            field_type: the field_type of the field
            number: the number of the field
            description: the description of the field
            source: the source of the field
            version: the version of the field
        """
        if field_type == VcfFieldType.FLAG:
            num = "0"  # FLAGs always have number = 0
        elif isinstance(number, VcfFieldNumber):
            num = number.value
        else:
            num = str(number)

        header_line = f"##INFO=<ID={name},Number={num},Type={field_type.value}"
        if description is not None:
            header_line += f",Description={description}"
        if source is not None:
            header_line += f",Source={source}"
        if version is not None:
            header_line += f",Version={version}"
        header_line += ">"
        self.add_header_line(header_line)

    def add_format_header(
        self,
        name: str,
        field_type: VcfFieldType,
        number: int | VcfFieldNumber = VcfFieldNumber.NUM_GENOTYPES,
        description: str | None = None,
    ) -> None:
        """
        Add a FORMAT header field to the VCF header.

        Args:
            name: the name of the field
            field_type: the field_type of the field
            number: the number of the field
            description: the description of the field
        """
        if isinstance(number, VcfFieldNumber):
            num = number.value
        else:
            num = str(number)

        header_line = f"##FORMAT=<ID={name},Number={num},Type={field_type.value}"
        if description is not None:
            header_line += f",Description={description}"
        header_line += ">"
        self.add_header_line(header_line)

    def add_filter_header(
        self,
        name: str,
        description: str | None = None,
    ) -> None:
        """
        Add a FILTER header field to the VCF header.

        Args:
            name: the name of the field
            description: the description of the field
        """
        header_line = f"##FILTER=<ID={name}"
        if description is not None:
            header_line += f",Description={description}"
        header_line += ">"
        self.add_header_line(header_line)

Attributes¶

num_samples property ¶

num_samples: int

Returns the number of samples in the VCF.

Functions¶

__init__ ¶

__init__(sample_ids: Iterable[str] | None = None, sd: dict[str, dict[str, Any]] | None = None) -> None

Initializes a new VariantBuilder for generating variants and VCF files.

Parameters:

Name	Type	Description	Default
`sample_ids`	`Iterable[str] \| None`	the name of the sample(s)	`None`
`sd`	`dict[str, dict[str, Any]] \| None`	optional sequence dictionary	`None`

Source code in fgpyo/vcf/builder.py

def __init__(
    self,
    sample_ids: Iterable[str] | None = None,
    sd: dict[str, dict[str, Any]] | None = None,
) -> None:
    """
    Initializes a new VariantBuilder for generating variants and VCF files.

    Args:
        sample_ids: the name of the sample(s)
        sd: optional sequence dictionary
    """
    self.sample_ids: list[str] = list(sample_ids) if sample_ids is not None else []
    self.sd: dict[str, dict[str, Any]] = sd if sd is not None else VariantBuilder.default_sd()
    self.seq_idx_lookup: dict[str, int] = {name: i for i, name in enumerate(self.sd.keys())}
    self.records: list[VariantRecord] = []
    self.header = VariantHeader()
    for line in VariantBuilder._build_header_string(sd=self.sd):
        self.header.add_line(line)
    if sample_ids is not None:
        self.header.add_samples(sample_ids)

add ¶

add(contig: str | None = None, pos: int = 1000, end: int | None = None, id: str = '.', ref: str = 'A', alts: str | Iterable[str] | None = ('.',), qual: int = 60, filter: str | Iterable[str] | None = None, info: dict[str, Any] | None = None, samples: dict[str, dict[str, Any]] | None = None) -> VariantRecord

Generates a new variant and adds it to the internal collection.

Notes:

* Very little validation is done with respect to INFO and FORMAT keys being defined in the header.
* VCFs are 1-based, but pysam is (mostly) 0-based. We define the function in terms of the VCF property "pos", which is 1-based. pysam will also report "pos" as 1-based, so that is the property that should be accessed when using the records produced by this function (not "start").

Parameters:

Name	Type	Description	Default
`contig`	`str \| None`	the chromosome name. If None, will use the first contig in the sequence dictionary.	`None`
`pos`	`int`	the 1-based position of the variant	`1000`
`end`	`int \| None`	an optional 1-based inclusive END position; if not specified a value will be looked for in info["END"], or calculated from the length of the reference allele	`None`
`id`	`str`	the variant id	`'.'`
`ref`	`str`	the reference allele	`'A'`
`alts`	`str \| Iterable[str] \| None`	the list of alternate alleles, None if no alternates. If a single string is passed, that will be used as the only alt.	`('.',)`
`qual`	`int`	the variant quality	`60`
`filter`	`str \| Iterable[str] \| None`	the list of filters, None if no filters (ex. PASS). If a single string is passed, that will be used as the only filter.	`None`
`info`	`dict[str, Any] \| None`	the dictionary of INFO key-value pairs	`None`
`samples`	`dict[str, dict[str, Any]] \| None`	the dictionary from sample name to FORMAT key-value pairs. If a sample property is supplied for any sample but omitted in some, it will be set to missing (".") for samples that don't have that property explicitly assigned. If a sample in the VCF is omitted, all its properties will be set to missing.	`None`
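The interaction between `pos`, `ref`, `end`, and `info["END"]` can be sketched without pysam. `compute_end` below is a standalone copy of the logic in `_compute_and_check_end` from the class source, and `to_pysam_coords` is a hypothetical helper showing the `start`/`stop` values that `add()` passes to pysam; both names are invented for this example.

```python
from __future__ import annotations

from typing import Any


def compute_end(
    pos: int, ref: str, end: int | None = None, info: dict[str, Any] | None = None
) -> int:
    # Standalone copy of VariantBuilder._compute_and_check_end.
    if end is not None and info is not None and "END" in info:
        raise ValueError(f"Two end positions given; end={end} and info.END={info['END']}")
    elif end is None:
        if info is not None and "END" in info:
            end = int(info["END"])
        else:
            # Default: the last reference base covered by the variant (1-based, inclusive).
            end = pos + len(ref) - 1
    if end < pos:
        raise ValueError(f"Invalid end position, {end}, given for variant as pos {pos}.")
    return end


def to_pysam_coords(pos: int, end: int) -> tuple[int, int]:
    # add() passes start=pos - 1 (0-based) and stop=end to pysam.
    return pos - 1, end


snv_end = compute_end(pos=1000, ref="A")
deletion_end = compute_end(pos=1000, ref="ACGT")
info_end = compute_end(pos=1000, ref="A", info={"END": 1010})
start, stop = to_pysam_coords(1000, deletion_end)
```

For the four-base reference allele, `stop - start` equals `len(ref)`, which is what the half-open 0-based interval is expected to give.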

Source code in fgpyo/vcf/builder.py

def add(
    self,
    contig: str | None = None,
    pos: int = 1000,
    end: int | None = None,
    id: str = ".",  # noqa: A002  # pysam is already shadowing the built-in
    ref: str = "A",
    alts: str | Iterable[str] | None = (".",),
    qual: int = 60,
    filter: str | Iterable[str] | None = None,  # noqa: A002
    info: dict[str, Any] | None = None,
    samples: dict[str, dict[str, Any]] | None = None,
) -> VariantRecord:
    """
    Generates a new variant and adds it to the internal collection.

    Notes:
    * Very little validation is done with respect to INFO and FORMAT keys being defined in the
    header.
    * VCFs are 1-based, but pysam is (mostly) 0-based. We define the function in terms of the
    VCF property "pos", which is 1-based. pysam will also report "pos" as 1-based, so that is
    the property that should be accessed when using the records produced by this function (not
    "start").

    Args:
        contig: the chromosome name. If None, will use the first contig in the sequence
                dictionary.
        pos: the 1-based position of the variant
        end: an optional 1-based inclusive END position; if not specified a value will be looked
             for in info["END"], or calculated from the length of the reference allele
        id: the variant id
        ref: the reference allele
        alts: the list of alternate alleles, None if no alternates. If a single string is
              passed, that will be used as the only alt.
        qual: the variant quality
        filter: the list of filters, None if no filters (ex. PASS). If a single string is
                passed, that will be used as the only filter.
        info: the dictionary of INFO key-value pairs
        samples: the dictionary from sample name to FORMAT key-value pairs.
                 If a FORMAT key is supplied for some samples but omitted for others, it
                 will be set to missing (".") for the samples that don't have it explicitly
                 assigned. If a sample in the VCF is omitted entirely, all of its FORMAT
                 values will be set to missing.
    """
    if contig is None:
        contig = next(iter(self.sd.keys()))

    if contig not in self.sd:
        raise ValueError(f"Chromosome `{contig}` not in the sequence dictionary.")
    # because there are a lot of slightly different objects related to samples or called
    # "samples" in this function, we alias samples to sample_formats
    # we still want to keep the API labeled "samples" because that keeps the naming scheme the
    # same as the pysam API
    sample_formats = samples
    if sample_formats is not None:
        unknown_samples = set(sample_formats.keys()).difference(self.sample_ids)
        if len(unknown_samples) > 0:
            raise ValueError("Unknown sample(s) given: " + ", ".join(unknown_samples))

    if isinstance(alts, str):
        alts = (alts,)
    alleles = (ref,) if alts is None else (ref, *alts)
    if isinstance(filter, str):
        filter = (filter,)  # noqa: A001  # pysam already shadows the built-in

    # pysam expects a list of format dicts provided in the same order as the samples in the
    # header (self.sample_ids). (This is despite the fact that it will internally represent the
    # values as a map from sample ID to format values, as we do in this function.)
    # Convert to that form and rename to record_samples; to a) disambiguate from the input
    # values, and b) prevent mypy from complaining about the type changing from dict to list.
    if self.num_samples == 0:
        # this is a sites-only VCF
        record_samples = None
    elif sample_formats is None or len(sample_formats) == 0:
        # not a sites-only VCF, but no FORMAT values were passed. set FORMAT to missing (with
        # no fields)
        record_samples = None
    else:
        # convert to list form that pysam expects, in order pysam expects
        # note: each per-sample dict is copied ({**...}) below because pysam actually alters
        # the input values, which would be an unintended side-effect (in fact without this,
        # tests fail because the expected input values are changed)
        record_samples = [
            {**sample_formats.get(sample_id, {})} for sample_id in self.sample_ids
        ]

    variant = self.header.new_record(
        contig=contig,
        start=pos - 1,  # start is 0-based
        stop=self._compute_and_check_end(pos, ref, end, info),
        id=id,
        alleles=alleles,
        qual=qual,
        filter=filter,
        info=info,
        samples=record_samples,
    )

    self.records.append(variant)
    return variant
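
The input handling in `add` can be sketched without pysam. The block below is a minimal, standalone approximation, not the library's code: `normalize_alleles` and `order_sample_formats` mirror the statements shown in the source above, while `resolve_end` is a hypothetical stand-in for `_compute_and_check_end`, whose body is not shown on this page. It assumes only the lookup order described in the docstring (explicit `end`, then `info["END"]`, then the reference allele length) and omits any validation the real helper performs.

```python
from __future__ import annotations

from typing import Any, Iterable


def normalize_alleles(ref: str, alts: str | Iterable[str] | None) -> tuple[str, ...]:
    """Build the alleles tuple; a lone string is treated as the only alt."""
    if isinstance(alts, str):
        alts = (alts,)
    return (ref,) if alts is None else (ref, *alts)


def resolve_end(pos: int, ref: str, end: int | None, info: dict[str, Any] | None) -> int:
    """Hypothetical helper: pick a 1-based inclusive end in the documented order."""
    if end is not None:
        return end
    if info is not None and "END" in info:
        return int(info["END"])
    return pos + len(ref) - 1  # last base covered by the reference allele


def order_sample_formats(
    sample_ids: list[str],
    sample_formats: dict[str, dict[str, Any]] | None,
) -> list[dict[str, Any]] | None:
    """Return one FORMAT dict per sample, in header order, copying each dict."""
    if sample_formats is not None:
        unknown = set(sample_formats).difference(sample_ids)
        if unknown:
            raise ValueError("Unknown sample(s) given: " + ", ".join(sorted(unknown)))
    if not sample_ids or not sample_formats:
        return None  # sites-only VCF, or no FORMAT values supplied
    return [{**sample_formats.get(sample_id, {})} for sample_id in sample_ids]


# Illustrative values only.
sample_ids = ["s1", "s2", "s3"]
formats = {"s3": {"GT": (0, 1)}, "s1": {"GT": (1, 1), "DP": 30}}

alleles = normalize_alleles("ACG", "A")
start = 1000 - 1  # 0-based start from the 1-based pos
stop = resolve_end(1000, "ACG", None, None)
ordered = order_sample_formats(sample_ids, formats)
```

A 1-based inclusive end has the same numeric value as a 0-based exclusive stop, so `stop - start` equals the reference allele length here. The omitted sample `s2` receives an empty dict, and each returned dict is a copy, so later changes to it do not reach the caller's input.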

add_filter_header ¶

add_filter_header(name: str, description: str | None = None) -> None

Add a FILTER header field to the VCF header.

Parameters:

Name	Type	Description	Default
`name`	`str`	the name of the field	required
`description`	`str \| None`	the description of the field	`None`

Source code in fgpyo/vcf/builder.py

def add_filter_header(
    self,
    name: str,
    description: str | None = None,
) -> None:
    """
    Add a FILTER header field to the VCF header.

    Args:
        name: the name of the field
        description: the description of the field
    """
    header_line = f"##FILTER=<ID={name}"
    if description is not None:
        header_line += f",Description={description}"
    header_line += ">"
    self.add_header_line(header_line)
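
To see the text this method produces, the string assembly can be reproduced as a standalone function. This is a sketch only: `filter_header_line` is a hypothetical name, and it returns the line instead of adding it to a header. Because the source above interpolates `description` verbatim, the example passes a description that already contains the double quotes the VCF specification asks for around `Description` values. Whether an unquoted description is accepted downstream is not established here.

```python
from __future__ import annotations


def filter_header_line(name: str, description: str | None = None) -> str:
    """Assemble a ##FILTER line the same way the method above does."""
    header_line = f"##FILTER=<ID={name}"
    if description is not None:
        header_line += f",Description={description}"
    header_line += ">"
    return header_line


# Illustrative filter name and description.
bare = filter_header_line("LowQual")
described = filter_header_line("LowQual", '"Low quality call"')
```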

add_format_header ¶

add_format_header(name: str, field_type: VcfFieldType, number: int | VcfFieldNumber = NUM_GENOTYPES, description: str | None = None) -> None

Add a FORMAT header field to the VCF header.

Parameters:

Name	Type	Description	Default
`name`	`str`	the name of the field	required
`field_type`	`VcfFieldType`	the type of the field's values	required
`number`	`int \| VcfFieldNumber`	the number of values the field holds, as an integer or a `VcfFieldNumber` code	`NUM_GENOTYPES`
`description`	`str \| None`	the description of the field	`None`

Source code in fgpyo/vcf/builder.py

def add_format_header(
    self,
    name: str,
    field_type: VcfFieldType,
    number: int | VcfFieldNumber = VcfFieldNumber.NUM_GENOTYPES,
    description: str | None = None,
) -> None:
    """
    Add a FORMAT header field to the VCF header.

    Args:
        name: the name of the field
        field_type: the field_type of the field
        number: the number of the field
        description: the description of the field
    """
    if isinstance(number, VcfFieldNumber):
        num = number.value
    else:
        num = str(number)

    header_line = f"##FORMAT=<ID={name},Number={num},Type={field_type.value}"
    if description is not None:
        header_line += f",Description={description}"
    header_line += ">"
    self.add_header_line(header_line)

add_header_line ¶

add_header_line(line: str) -> None

Adds a header line to the header.

Source code in fgpyo/vcf/builder.py

def add_header_line(self, line: str) -> None:
    """Adds a header line to the header."""
    self.header.add_line(line)

add_info_header ¶

add_info_header(name: str, field_type: VcfFieldType, number: int | VcfFieldNumber = 1, description: str | None = None, source: str | None = None, version: str | None = None) -> None

Add an INFO header field to the VCF header.

Parameters:

Name	Type	Description	Default
`name`	`str`	the name of the field	required
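`field_type`	`VcfFieldType`	the type of the field's values	required
`number`	`int \| VcfFieldNumber`	the number of values the field holds, as an integer or a `VcfFieldNumber` code; ignored for `FLAG` fields, which always use 0	`1`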
`description`	`str \| None`	the description of the field	`None`
`source`	`str \| None`	the source of the field	`None`
`version`	`str \| None`	the version of the field	`None`

Source code in fgpyo/vcf/builder.py

def add_info_header(
    self,
    name: str,
    field_type: VcfFieldType,
    number: int | VcfFieldNumber = 1,
    description: str | None = None,
    source: str | None = None,
    version: str | None = None,
) -> None:
    """
    Add an INFO header field to the VCF header.

    Args:
        name: the name of the field
        field_type: the field_type of the field
        number: the number of the field
        description: the description of the field
        source: the source of the field
        version: the version of the field
    """
    if field_type == VcfFieldType.FLAG:
        num = "0"  # FLAGs always have number = 0
    elif isinstance(number, VcfFieldNumber):
        num = number.value
    else:
        num = str(number)

    header_line = f"##INFO=<ID={name},Number={num},Type={field_type.value}"
    if description is not None:
        header_line += f",Description={description}"
    if source is not None:
        header_line += f",Source={source}"
    if version is not None:
        header_line += f",Version={version}"
    header_line += ">"
    self.add_header_line(header_line)
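
The `Number` resolution above has three branches, which the following standalone sketch isolates. The two enums are local copies of the definitions documented on this page so that the block runs without fgpyo, and `resolve_number` is a hypothetical helper name, not part of the library.

```python
from __future__ import annotations

from enum import Enum


class VcfFieldNumber(Enum):
    """Local copy of the special Number codes documented on this page."""

    NUM_ALT_ALLELES = "A"
    NUM_ALLELES = "R"
    NUM_GENOTYPES = "G"
    UNKNOWN = "."


class VcfFieldType(Enum):
    """Local copy of the field type codes documented on this page."""

    INTEGER = "Integer"
    FLOAT = "Float"
    FLAG = "Flag"
    CHARACTER = "Character"
    STRING = "String"


def resolve_number(field_type: VcfFieldType, number: int | VcfFieldNumber) -> str:
    """Return the text written after 'Number=' for an INFO field."""
    if field_type == VcfFieldType.FLAG:
        return "0"  # FLAGs always have number = 0, whatever was passed
    if isinstance(number, VcfFieldNumber):
        return number.value
    return str(number)


flag_number = resolve_number(VcfFieldType.FLAG, 5)
per_alt_number = resolve_number(VcfFieldType.INTEGER, VcfFieldNumber.NUM_ALT_ALLELES)
fixed_number = resolve_number(VcfFieldType.FLOAT, 2)
```

Note that the `FLAG` override appears only in `add_info_header`. The `add_format_header` source shown earlier resolves `number` without it.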

default_sd classmethod ¶

default_sd() -> dict[str, dict[str, Any]]

Generates the default sequence dictionary for VariantBuilder.

Re-uses the dictionary from SamBuilder for consistency.

Returns:

Type	Description
`dict[str, dict[str, Any]]`	A new copy of the sequence dictionary as a map of contig name to dictionary, one per contig.

Source code in fgpyo/vcf/builder.py

@classmethod
def default_sd(cls) -> dict[str, dict[str, Any]]:
    """
    Generates the default sequence dictionary for VariantBuilder.

    Re-uses the dictionary from SamBuilder for consistency.

    Returns:
        A new copy of the sequence dictionary as a map of contig name to dictionary, one per
        contig.
    """
    sd: dict[str, dict[str, Any]] = {}
    for sequence in SamBuilder.default_sd():
        contig = sequence["SN"]
        sd[contig] = {"ID": contig, "length": sequence["LN"]}
    return sd
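
The reshaping performed here can be tried on plain dictionaries. In the sketch below, `to_variant_sd` is a hypothetical name, and the input list is made-up data standing in for whatever `SamBuilder.default_sd()` returns. Only the `SN` and `LN` keys used in the source above are assumed.

```python
from __future__ import annotations

from typing import Any


def to_variant_sd(sam_sd: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Turn SAM-style sequence entries into a map of contig name to VCF contig fields."""
    sd: dict[str, dict[str, Any]] = {}
    for sequence in sam_sd:
        contig = sequence["SN"]
        sd[contig] = {"ID": contig, "length": sequence["LN"]}
    return sd


# Made-up sequence entries, not the library's defaults.
example_sam_sd = [{"SN": "chr1", "LN": 2000}, {"SN": "chr2", "LN": 1000}]
variant_sd = to_variant_sd(example_sam_sd)

# add() falls back to the first key when contig is None.
first_contig = next(iter(variant_sd.keys()))
```

Because dictionaries preserve insertion order, the first contig in the input list is also the one `add` would pick by default.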

to_path ¶

to_path(path: Path | None = None) -> Path

Returns a path to a VCF for variants added to this builder.

If the path given ends in ".gz" then the generated file will be bgzipped and a tabix index generated for the file with the suffix ".gz.tbi".

Parameters:

Name	Type	Description	Default
`path`	`Path \| None`	optional path to the VCF	`None`

Source code in fgpyo/vcf/builder.py

def to_path(self, path: Path | None = None) -> Path:
    """
    Returns a path to a VCF for variants added to this builder.

    If the path given ends in ".gz" then the generated file will be bgzipped and
    a tabix index generated for the file with the suffix ".gz.tbi".

    Args:
        path: optional path to the VCF
    """
    # update the path
    path = self._to_vcf_path(path)

    # Create a writer and write to it
    with pysam_writer(path, header=self.header) as writer:
        for variant in self.to_sorted_list():
            writer.write(variant)

    if str(path.suffix) == ".gz":
        pysam.tabix_index(str(path), preset="vcf", force=True)

    return path
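
The indexing decision depends only on the final suffix of the path. The sketch below uses a hypothetical helper, `expected_outputs`, to list the files a call would be expected to leave behind under the rule described above. It does not write anything, and it does not model `_to_vcf_path`, whose body is not shown on this page.

```python
from pathlib import Path


def expected_outputs(path: Path) -> list:
    """List the VCF path plus the tabix index path when the VCF name ends in .gz."""
    outputs = [path]
    if path.suffix == ".gz":  # Path.suffix is the last suffix only
        outputs.append(path.with_name(path.name + ".tbi"))
    return outputs


# Illustrative file names.
compressed = expected_outputs(Path("calls.vcf.gz"))
plain = expected_outputs(Path("calls.vcf"))
```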

to_sorted_list ¶

to_sorted_list() -> list[VariantRecord]

Returns the accumulated records in coordinate order.

Source code in fgpyo/vcf/builder.py

def to_sorted_list(self) -> list[VariantRecord]:
    """Returns the accumulated records in coordinate order."""
    return sorted(self.records, key=self._sort_key)

to_unsorted_list ¶

to_unsorted_list() -> list[VariantRecord]

Returns the accumulated records in the order they were created.

Source code in fgpyo/vcf/builder.py

def to_unsorted_list(self) -> list[VariantRecord]:
    """Returns the accumulated records in the order they were created."""
    return list(self.records)
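
The difference between the two accessors can be illustrated with plain tuples in place of `VariantRecord` objects. The `_sort_key` implementation is not shown on this page, so the key below, which orders by contig position in the sequence dictionary and then by start, is an assumption about what "coordinate order" means here.

```python
# Stand-in records as (contig, start) tuples, in creation order.
records = [("chr2", 100), ("chr1", 500), ("chr1", 20)]

# Assumed contig order, as it might come from a sequence dictionary.
contig_order = {"chr1": 0, "chr2": 1}


def sort_key(record):
    """Assumed coordinate key: contig index first, then start position."""
    contig, start = record
    return (contig_order[contig], start)


sorted_records = sorted(records, key=sort_key)

# list(...) makes a shallow copy, so changing it leaves the original list alone.
unsorted_copy = list(records)
unsorted_copy.append(("chr2", 5))
```

Both accessors return new lists, so callers can reorder or extend the result without changing the builder's internal collection. The records themselves are shared, not copied.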

VcfFieldNumber ¶

Bases: Enum

Special codes for VCF field numbers.

Source code in fgpyo/vcf/builder.py

class VcfFieldNumber(Enum):
    """Special codes for VCF field numbers."""

    NUM_ALT_ALLELES = "A"
    NUM_ALLELES = "R"
    NUM_GENOTYPES = "G"
    UNKNOWN = "."

VcfFieldType ¶

Bases: Enum

Codes for VCF field types.

Source code in fgpyo/vcf/builder.py

class VcfFieldType(Enum):
    """Codes for VCF field types."""

    INTEGER = "Integer"
    FLOAT = "Float"
    FLAG = "Flag"
    CHARACTER = "Character"
    STRING = "String"
