fgpyo
Classes¶
RequirementError ¶
Functions¶
require ¶
Require a condition be satisfied.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `condition` | `bool` | The condition to satisfy. | required |
| `message` | `str \| Callable[[], str] \| None` | An optional message to include with the error when the condition is false. The message may be provided as either a string literal or a function returning a string. The function will not be evaluated unless the condition is false. | `None` |
Raises:
| Type | Description |
|---|---|
| `RequirementError` | If the condition is false. |
Source code in fgpyo/_requirements.py
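The lazy-message behaviour described above can be illustrated with a small stand-alone sketch. The names `require_sketch` and `RequirementErrorSketch` are hypothetical stand-ins and not fgpyo's implementation; the only point shown is that a callable message is invoked solely when the condition fails.

```python
from typing import Callable, List, Union


class RequirementErrorSketch(Exception):
    """Hypothetical stand-in for a requirement failure."""


def require_sketch(
    condition: bool,
    message: Union[str, Callable[[], str], None] = None,
) -> None:
    """Raise if `condition` is false; build the message only in that case."""
    if condition:
        return
    if message is None:
        raise RequirementErrorSketch()
    # A callable message is evaluated here, i.e. only on failure.
    text = message() if callable(message) else message
    raise RequirementErrorSketch(text)


calls: List[str] = []


def expensive_message() -> str:
    calls.append("evaluated")  # records that the message was actually built
    return "value must be positive"


require_sketch(1 > 0, expensive_message)  # passes, so the message is never built
```

Passing a function rather than a pre-formatted string is worthwhile when building the message is costly, for example when it formats a large object.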
Modules¶
collections ¶
Custom Collections and Collection Functions.¶
This module contains classes and functions for working with collections and iterators.
Helpful Functions for Working with Collections¶
To test if an iterable is sorted or not:
>>> from fgpyo.collections import is_sorted
>>> is_sorted([])
True
>>> is_sorted([1])
True
>>> is_sorted([1, 2, 2, 3])
True
>>> is_sorted([1, 2, 4, 3])
False
Examples of a "Peekable" Iterator¶
"Peekable" iterators make it possible to "peek" at the next item in an iterator without consuming it.
This is useful, for example, when consuming items from an iterator while a predicate is true
without also consuming the first item for which the predicate is false. See the
takewhile() and
dropwhile() methods.
Peeking at an empty peekable iterator raises a
StopIteration:
>>> from fgpyo.collections import PeekableIterator
>>> piter = PeekableIterator(iter([]))
>>> piter.peek()
Traceback (most recent call last):
...
StopIteration
A peekable iterator will return the next item before consuming it.
>>> piter = PeekableIterator([1, 2, 3])
>>> piter.peek()
1
>>> next(piter)
1
>>> [j for j in piter]
[2, 3]
The can_peek() method can be used to determine whether
the iterator can be peeked without a
StopIteration being
raised:
>>> piter = PeekableIterator([1])
>>> piter.peek() if piter.can_peek() else -1
1
>>> next(piter)
1
>>> piter.peek() if piter.can_peek() else -1
-1
>>> next(piter)
Traceback (most recent call last):
...
StopIteration
PeekableIterator's constructor supports creation from
iterable objects as well as iterators.
Attributes¶
LessThanOrEqualType
module-attribute
¶
LessThanOrEqualType = TypeVar('LessThanOrEqualType', bound=SupportsLessThanOrEqual)
A type variable for an object that supports less-than-or-equal comparisons.
Classes¶
PeekableIterator ¶
Bases: Generic[IterType], Iterator[IterType]
A peekable iterator wrapping an iterator or iterable.
This allows returning the next item without consuming it.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `Iterator[IterType] \| Iterable[IterType]` | an iterator over the objects | required |
Source code in fgpyo/collections/__init__.py
Functions¶
Initializes the PeekableIterator with the given source.
Source code in fgpyo/collections/__init__.py
dropwhile(pred: Callable[[IterType], bool]) -> PeekableIterator[IterType]
Drops elements from the iterator while the predicate is true.
Updates the iterator to point at the first non-matching element, or exhausts the iterator if all elements match the predicate.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `pred` | `Callable[[IterType], bool]` | a function that takes a value from the iterator and returns true or false. | required |
Returns:
| Type | Description |
|---|---|
| `PeekableIterator[IterType]` | a reference to this iterator, so calls can be chained |
Source code in fgpyo/collections/__init__.py
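peek() -> IterType
Returns the next element without consuming it, or raises StopIteration if the iterator is exhausted.
takewhile(pred: Callable[[IterType], bool]) -> list[IterType]
Consumes from the iterator while pred is true, and returns the result as a list.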
The iterator is left pointing at the first non-matching item, or if all items match then the iterator will be exhausted.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `pred` | `Callable[[IterType], bool]` | a function that takes the next value from the iterator and returns true or false. | required |
Returns:
| Type | Description |
|---|---|
| `list[IterType]` | A list of the values from the iterator, in order, up until and excluding the first value that does not match the predicate. |
Source code in fgpyo/collections/__init__.py
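To make the peek and takewhile semantics concrete, here is a minimal pure-Python sketch. `PeekableSketch` is a hypothetical illustration and not fgpyo's code; it buffers at most one item, which is what lets takewhile stop in front of the first non-matching element instead of swallowing it.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
_SENTINEL = object()  # marks "nothing buffered"


class PeekableSketch(Iterator[T]):
    """Hypothetical minimal peekable iterator (not fgpyo's implementation)."""

    def __init__(self, source: Iterable[T]) -> None:
        self._iter = iter(source)
        self._next = _SENTINEL

    def _fill(self) -> None:
        # Buffer one item if none is buffered; the sentinel stays if exhausted.
        if self._next is _SENTINEL:
            self._next = next(self._iter, _SENTINEL)

    def can_peek(self) -> bool:
        self._fill()
        return self._next is not _SENTINEL

    def peek(self) -> T:
        if not self.can_peek():
            raise StopIteration
        return self._next  # type: ignore[return-value]

    def __next__(self) -> T:
        item = self.peek()  # raises StopIteration when exhausted
        self._next = _SENTINEL
        return item

    def takewhile(self, pred: Callable[[T], bool]) -> List[T]:
        taken: List[T] = []
        # Test the buffered item before consuming it.
        while self.can_peek() and pred(self.peek()):
            taken.append(next(self))
        return taken


it = PeekableSketch([1, 2, 3, 10, 4])
small = it.takewhile(lambda x: x < 5)  # stops in front of 10
```

By contrast, `itertools.takewhile` on a plain iterator consumes the first non-matching item, which is the case a peekable wrapper avoids.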
SupportsLessThanOrEqual ¶
Bases: Protocol
A structural type for objects that support less-than-or-equal comparison.
Source code in fgpyo/collections/__init__.py
Functions¶
is_sorted ¶
is_sorted(iterable: Iterable[LessThanOrEqualType]) -> bool
Tests lazily if an iterable of comparable objects is sorted or not.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `iterable` | `Iterable[LessThanOrEqualType]` | An iterable of comparable objects. | required |
Raises:
| Type | Description |
|---|---|
| `TypeError` | If there is more than 1 element in |
Source code in fgpyo/collections/__init__.py
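A lazy sortedness test can be sketched with the standard library alone. `is_sorted_sketch` below is a hypothetical illustration, not necessarily how fgpyo implements `is_sorted`; it compares each element with its successor and stops at the first out-of-order pair.

```python
from itertools import tee
from typing import Any, Iterable


def is_sorted_sketch(iterable: Iterable[Any]) -> bool:
    """Return True if every element is <= its successor."""
    first, second = tee(iterable)
    next(second, None)  # advance the second copy by one element
    # zip pairs each element with its successor; all() short-circuits
    # on the first pair that is out of order.
    return all(a <= b for a, b in zip(first, second))
```

Because evaluation stops at the first violation, the remainder of a long or unbounded iterable is never read once the answer is known to be False.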
fasta ¶
Modules¶
builder ¶
Classes for generating fasta files and records for testing.¶
This module contains utility classes for creating fasta files, indexed fasta files (.fai), and sequence dictionaries (.dict).
Examples of creating sets of contigs for writing to fasta¶
Writing a FASTA with two contigs each with 100 bases:
>>> from pathlib import Path
>>> from fgpyo.fasta.builder import FastaBuilder
>>> builder = FastaBuilder()
>>> builder.add("chr10").add("AAAAAAAAAA", 10)
<fgpyo.fasta.builder.ContigBuilder object at ...>
>>> builder = builder.add("chr11").add("GGGGGGGGGG", 10)
>>> fasta_path = Path(getfixture("tmp_path")) / "test.fasta"
>>> builder.to_file(path=fasta_path)
Writing a FASTA with one contig with 100 A's and 50 T's:
>>> from fgpyo.fasta.builder import FastaBuilder
>>> builder = FastaBuilder()
>>> builder.add("chr10").add("AAAAAAAAAA", 10).add("TTTTTTTTTT", 5)
<fgpyo.fasta.builder.ContigBuilder object at ...>
>>> builder.to_file(path=fasta_path)
Add bases to existing contig:
>>> from fgpyo.fasta.builder import FastaBuilder
>>> builder = FastaBuilder()
>>> contig_one = builder.add("chr10").add("AAAAAAAAAA", 1)
>>> contig_one.add("NNN", 1)
<fgpyo.fasta.builder.ContigBuilder object at ...>
>>> contig_one.bases
'AAAAAAAAAANNN'
Classes¶
Builder for constructing new contigs, and adding bases to existing contigs.
Existing contigs cannot be overwritten; each contig name in FastaBuilder must be unique. Instances of ContigBuilder should be created using FastaBuilder.add(), where assembly and species are optional parameters that default to FastaBuilder.assembly and FastaBuilder.species.
Attributes:
| Name | Type | Description |
|---|---|---|
| `name` | | Unique contig ID, e.g., "chr10" |
| `assembly` | | Assembly information, if None default is 'testassembly' |
| `species` | | Species information, if None default is 'testspecies' |
| `bases` | | The bases to be added to the contig, e.g., "A" |
Source code in fgpyo/fasta/builder.py
Initializes a ContigBuilder with the given name, assembly, and species.
Source code in fgpyo/fasta/builder.py
add(bases: str, times: int = 1) -> ContigBuilder
Method for adding bases to a new or existing instance of ContigBuilder.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `bases` | `str` | The bases to be added to the contig | required |
| `times` | `int` | The number of times the bases should be repeated | `1` |
Example: add("AAA", 2) results in the following bases -> "AAAAAA"
Source code in fgpyo/fasta/builder.py
Builder for constructing sets of one or more contigs.
Provides the ability to manufacture sets of contigs from minimal input, and automatically generates the information necessary for writing the FASTA file, index, and dictionary.
A builder is constructed from an assembly, species, and line length. All attributes have defaults; however, these can be overridden.
Contigs are added to FastaBuilder using:
add()
Bases are added to existing contigs using:
add()
Once accumulated the contigs can be written to a file using:
to_file()
Calling to_file() will also generate the fasta index (.fai) and sequence dictionary (.dict).
Attributes:
| Name | Type | Description |
|---|---|---|
| `assembly` | `str` | Assembly information, if None default is 'testassembly' |
| `species` | `str` | Species, if None default is 'testspecies' |
| `line_length` | `int` | Desired line length, if None default is 80 |
| `contig_builders` | `dict[str, ContigBuilder]` | Private dictionary of contig names and instances of ContigBuilder |
Source code in fgpyo/fasta/builder.py
__getitem__(key: str) -> ContigBuilder
Initializes a FastaBuilder with the given assembly, species, and line length.
Source code in fgpyo/fasta/builder.py
add(name: str, assembly: str | None = None, species: str | None = None) -> ContigBuilder
Creates and returns a new ContigBuilder for a contig with the provided name.
Contig names must be unique; attempting to create two separate contigs with the same name will result in an error.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Unique contig ID, e.g., "chr10" | required |
| `assembly` | `str \| None` | Assembly information, if None default is 'testassembly' | `None` |
| `species` | `str \| None` | Species information, if None default is 'testspecies' | `None` |
Source code in fgpyo/fasta/builder.py
Writes out the set of accumulated contigs to a FASTA file and returns an open FastaFile.
This is a convenience method that combines to_file() with opening the resulting
file as a pysam.FastaFile.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Path to which to write the FASTA file. | required |
Yields:
| Type | Description |
|---|---|
| `FastaFile` | An open |
Source code in fgpyo/fasta/builder.py
Writes out the set of accumulated contigs to a FASTA file at the path given.
Also generates the accompanying fasta index file (.fa.fai) and sequence
dictionary file (.dict).
Contigs are emitted in the order they were added to the builder. Sequence lines in the FASTA file are wrapped to the line length given when the builder was constructed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Path to write files to. | required |
Example: FastaBuilder.to_file(path = pathlib.Path("my_fasta.fa"))
Source code in fgpyo/fasta/builder.py
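The line wrapping described above can be sketched as follows. `wrap_fasta_sketch` is a hypothetical helper, not part of fgpyo, and it assumes a minimal `>name` header line; the header fields that fgpyo actually writes are not shown in this documentation.

```python
from typing import Dict, Iterator


def wrap_fasta_sketch(contigs: Dict[str, str], line_length: int = 80) -> Iterator[str]:
    """Yield FASTA lines for `contigs` (name -> bases), in insertion order."""
    for name, bases in contigs.items():
        yield f">{name}"  # assumed minimal header
        # Slice the sequence into chunks of at most `line_length` bases.
        for start in range(0, len(bases), line_length):
            yield bases[start : start + line_length]


# One contig with 100 A's followed by 50 T's, wrapped at 80 bases per line.
lines = list(wrap_fasta_sketch({"chr10": "A" * 100 + "T" * 50}, line_length=80))
```

With these inputs the contig occupies one header line, one full 80-base line, and one final 70-base line.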
Functions¶
Calls pysam.dict and writes the sequence dictionary to the provided output path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `assembly` | `str` | Assembly | required |
| `species` | `str` | Species | required |
| `output_path` | `str` | File path to write dictionary to | required |
| `input_path` | `str` | Path to fasta file | required |
Source code in fgpyo/fasta/builder.py
Calls pysam.faidx and writes fasta index in the same file location as the fasta file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `input_path` | `str` | Path to fasta file | required |
sequence_dictionary ¶
Classes for representing sequence dictionaries.¶
Examples of building and using sequence dictionaries¶
Building a sequence dictionary from a pysam.AlignmentHeader:
>>> import pysam
>>> from fgpyo.fasta.sequence_dictionary import SequenceDictionary
>>> sd: SequenceDictionary
>>> with pysam.AlignmentFile("./tests/fgpyo/sam/data/valid.sam") as fh:
... sd = SequenceDictionary.from_sam(fh.header)
>>> print(sd)
@SQ SN:chr1 LN:101
@SQ SN:chr2 LN:101
@SQ SN:chr3 LN:101
@SQ SN:chr4 LN:101
@SQ SN:chr5 LN:101
@SQ SN:chr6 LN:101
@SQ SN:chr7 LN:404
@SQ SN:chr8 LN:202
Query based on index:
Query based on name:
Add, get, and delete attributes:
>>> from fgpyo.fasta.sequence_dictionary import Keys
>>> meta = sd[0]
>>> print(meta)
@SQ SN:chr1 LN:101
>>> meta[Keys.ASSEMBLY] = "hg38"
>>> print(meta)
@SQ SN:chr1 LN:101 AS:hg38
>>> meta.get(Keys.ASSEMBLY)
'hg38'
>>> meta.get(Keys.SPECIES) is None
True
>>> Keys.MD5 in meta
False
>>> del meta[Keys.ASSEMBLY]
>>> print(meta)
@SQ SN:chr1 LN:101
Get a sequence based on one of its aliases
>>> meta[Keys.ALIASES] = "foo,bar,car"
>>> sd = SequenceDictionary(infos=[meta] + sd.infos[1:])
>>> print(sd)
@SQ SN:chr1 LN:101 AN:foo,bar,car
@SQ SN:chr2 LN:101
@SQ SN:chr3 LN:101
@SQ SN:chr4 LN:101
@SQ SN:chr5 LN:101
@SQ SN:chr6 LN:101
@SQ SN:chr7 LN:404
@SQ SN:chr8 LN:202
>>> print(sd["chr1"])
@SQ SN:chr1 LN:101 AN:foo,bar,car
>>> print(sd["bar"])
@SQ SN:chr1 LN:101 AN:foo,bar,car
Create a pysam.AlignmentHeader from a sequence dictionary:
>>> sd.to_sam_header()
<pysam.libcalignmentfile.AlignmentHeader object at ...>
>>> print(sd.to_sam_header())
@HD VN:1.5
@SQ SN:chr1 LN:101 AN:foo,bar,car
@SQ SN:chr2 LN:101
@SQ SN:chr3 LN:101
@SQ SN:chr4 LN:101
@SQ SN:chr5 LN:101
@SQ SN:chr6 LN:101
@SQ SN:chr7 LN:404
@SQ SN:chr8 LN:202
Create a pysam.AlignmentHeader from a sequence dictionary with extra header items:
>>> sd.to_sam_header(
... extra_header={"RG": [{"ID": "A", "LB": "a-library"}, {"ID": "B", "LB": "b-library"}]}
... )
<pysam.libcalignmentfile.AlignmentHeader object at ...>
>>> print(sd.to_sam_header(
... extra_header={"RG": [{"ID": "A", "LB": "a-library"}, {"ID": "B", "LB": "b-library"}]}
... ))
@HD VN:1.5
@SQ SN:chr1 LN:101 AN:foo,bar,car
@SQ SN:chr2 LN:101
@SQ SN:chr3 LN:101
@SQ SN:chr4 LN:101
@SQ SN:chr5 LN:101
@SQ SN:chr6 LN:101
@SQ SN:chr7 LN:404
@SQ SN:chr8 LN:202
@RG ID:A LB:a-library
@RG ID:B LB:b-library
Attributes¶
SEQUENCE_NAME_PATTERN
module-attribute
¶
SEQUENCE_NAME_PATTERN: Pattern = compile('^[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*$')
Regular expression for valid reference sequence names according to the SAM spec.
Classes¶
AlternateLocus
dataclass
¶
Stores an alternate locus for an associated sequence (1-based inclusive).
Source code in fgpyo/fasta/sequence_dictionary.py
Any post initialization validation should go here.
Source code in fgpyo/fasta/sequence_dictionary.py
parse
staticmethod
¶
parse(value: str) -> AlternateLocus
Parse the genomic interval of format: <contig>:<start>-<end>.
Source code in fgpyo/fasta/sequence_dictionary.py
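Parsing the `<contig>:<start>-<end>` format can be sketched as below. `LocusSketch` and `parse_locus_sketch` are hypothetical names, and the validation rules (split on the last colon, require a 1-based inclusive interval) are assumptions made for illustration rather than fgpyo's documented behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocusSketch:
    """Hypothetical stand-in for an alternate locus (1-based inclusive)."""

    name: str
    start: int
    end: int


def parse_locus_sketch(value: str) -> LocusSketch:
    """Parse `<contig>:<start>-<end>` into its three components."""
    # rpartition splits on the last ':' so contig names may contain colons.
    name, colon, interval = value.rpartition(":")
    start_text, dash, end_text = interval.partition("-")
    if not colon or not dash or not name:
        raise ValueError(f"Expected <contig>:<start>-<end>, got: {value!r}")
    start, end = int(start_text), int(end_text)
    if start < 1 or end < start:
        raise ValueError(f"Invalid 1-based inclusive interval: {value!r}")
    return LocusSketch(name=name, start=start, end=end)


locus = parse_locus_sketch("chr1:101-200")
```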
Keys ¶
Bases: StrEnum
Enumeration of tags/attributes available on a sequence record/metadata (SAM @SQ line).
Source code in fgpyo/fasta/sequence_dictionary.py
SequenceDictionary
dataclass
¶
Bases: Mapping[str | int, SequenceMetadata]
Contains an ordered collection of sequences.
A specific SequenceMetadata may be retrieved by name (str) or index (int), either by
using the generic get method or by the correspondingly named by_name and by_index methods.
The latter methods provide faster retrieval when the type is known.
This mapping collection iterates over the keys. To iterate over each SequenceMetadata,
either use the typical values() method or access the metadata directly with infos.
Attributes:
| Name | Type | Description |
|---|---|---|
| `infos` | `list[SequenceMetadata]` | the ordered collection of sequence metadata |
Source code in fgpyo/fasta/sequence_dictionary.py
__getitem__(key: str | int) -> SequenceMetadata
Builds the internal name-to-metadata lookup dictionary.
Source code in fgpyo/fasta/sequence_dictionary.py
by_index(index: int) -> SequenceMetadata
Gets a SequenceMetadata explicitly by index.
Raises:
| Type | Description |
|---|---|
| `IndexError` | if the index is out of bounds. |
by_name(name: str) -> SequenceMetadata
from_sam
staticmethod
¶
from_sam(data: Path) -> SequenceDictionary
from_sam(data: AlignmentFile) -> SequenceDictionary
from_sam(data: AlignmentHeader) -> SequenceDictionary
from_sam(data: list[dict[str, Any]]) -> SequenceDictionary
from_sam(data: Path | AlignmentFile | AlignmentHeader | list[dict[str, Any]]) -> SequenceDictionary
Creates a SequenceDictionary from a SAM file or its header.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `Path \| AlignmentFile \| AlignmentHeader \| list[dict[str, Any]]` | The input may be any of: a path to a SAM file; an open | required |
Returns:
A SequenceDictionary mapping reference names to their metadata.
Source code in fgpyo/fasta/sequence_dictionary.py
get_by_name(name: str) -> SequenceMetadata | None
Gets a SequenceMetadata explicitly by name.
Returns:
| Type | Description |
|---|---|
| `SequenceMetadata \| None` | The corresponding SequenceMetadata, or None if the name does not exist in this dictionary. |
Source code in fgpyo/fasta/sequence_dictionary.py
same_as(other: SequenceDictionary) -> bool
Returns True if all sequences in the two dictionaries are the same.
Sequences are considered the same if they share a common reference name (including aliases), have the same length, and have the same MD5 (if both have MD5s).
Source code in fgpyo/fasta/sequence_dictionary.py
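The sameness rule above (shared name or alias, equal length, equal MD5 when both are present) can be written out as a small sketch. `SeqSketch` and both functions are hypothetical simplifications; in particular, comparing the two dictionaries pairwise and in order is an assumption, not something this documentation states.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SeqSketch:
    """Hypothetical, simplified sequence record."""

    name: str
    length: int
    aliases: List[str] = field(default_factory=list)
    md5: Optional[str] = None


def same_sequence_sketch(a: SeqSketch, b: SeqSketch) -> bool:
    # Names match if the records share any name, primary or alias.
    shared_name = bool({a.name, *a.aliases} & {b.name, *b.aliases})
    # MD5s are compared only when both records carry one.
    md5_ok = a.md5 is None or b.md5 is None or a.md5 == b.md5
    return shared_name and a.length == b.length and md5_ok


def same_dictionary_sketch(xs: List[SeqSketch], ys: List[SeqSketch]) -> bool:
    # Assumption: sequences are compared pairwise, in order.
    return len(xs) == len(ys) and all(
        same_sequence_sketch(x, y) for x, y in zip(xs, ys)
    )
```

Treating a missing MD5 as compatible lets a dictionary built from a bare SAM header compare equal to one built from a fully annotated `.dict` file.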
Converts the sequence dictionary to a pysam.AlignmentHeader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `extra_header` | `dict[str, Any] \| None` | a dictionary of extra values to add to the header, None otherwise. See | `None` |
Source code in fgpyo/fasta/sequence_dictionary.py
SequenceMetadata
dataclass
¶
Bases: MutableMapping[Keys | str, str]
Stores information about a single Sequence (ex. chromosome, contig).
Implements the mutable mapping interface, which provides access to the attributes of this
sequence, including name and length, but not index. When using the mapping interface, for example
getting, setting, deleting, as well as iterating over keys, values, and items, the values will
always be strings (str type). For example, the length will be a str when accessing via
get; access the length directly or use len to return an int. Similarly, use the
alias property to return a List[str] of aliases, use the alternate property to return
an AlternateLocus-typed instance, and the topology property to return a Topology-typed
instance.
All attributes except name and length may be set. Use dataclasses.replace to create a new
copy when the name or length must change.
Important: The len method returns the length of the sequence, not the length of the
attributes. Use len(meta.attributes) for the latter.
Attributes:
| Name | Type | Description |
|---|---|---|
| `name` | `str` | the primary name of the sequence |
| `length` | `int` | the length of the sequence, or zero if unknown |
| `index` | `int` | the index in the sequence dictionary |
| `attributes` | `dict[Keys \| str, str]` | attributes of this sequence |
Source code in fgpyo/fasta/sequence_dictionary.py
property
¶
A list of all names, including the primary name and aliases, in that order.
property
¶
True if there is an alternate locus defined, False otherwise.
__delitem__(key: Keys | str) -> None
Deletes the given attribute key.
Source code in fgpyo/fasta/sequence_dictionary.py
__getitem__(key: Keys | str) -> Any
Returns the value for the given key.
Source code in fgpyo/fasta/sequence_dictionary.py
__iter__() -> Iterator[Keys | str]
Iterates over all keys, starting with name and length.
Any post initialization validation should go here.
Source code in fgpyo/fasta/sequence_dictionary.py
__setitem__(key: Keys | str, value: str) -> None
Sets the value for the given attribute key.
Source code in fgpyo/fasta/sequence_dictionary.py
from_sam
staticmethod
¶
from_sam(meta: dict[Keys | str, Any], index: int) -> SequenceMetadata
Builds a SequenceMetadata from a dictionary.
The keys must include the sequence name (Keys.SEQUENCE_NAME) and length
(Keys.SEQUENCE_LENGTH). All other keys from Keys will be stored in the resulting
attributes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `meta` | `dict[Keys \| str, Any]` | the python dictionary with keys from | required |
| `index` | `int` | the 0-based index to use for this sequence | required |
Source code in fgpyo/fasta/sequence_dictionary.py
same_as(other: SequenceMetadata) -> bool
Returns True if the two sequences are the same.
Sequences are considered the same if they share a common reference name (including aliases), have the same length, and have the same MD5 (if both have MD5s).
Source code in fgpyo/fasta/sequence_dictionary.py
Converts the sequence metadata to a SAM-formatted dictionary.
Equivalent to one item in the list of sequences from
pysam.AlignmentHeader#to_dict()["SQ"].
Source code in fgpyo/fasta/sequence_dictionary.py
Modules¶
fastx ¶
Zipping FASTX Files.¶
Zipping a set of FASTA/FASTQ files into a single stream of data is a common task in bioinformatics
and can be achieved with the FastxZipped() context manager.
The context manager facilitates opening of all input FASTA/FASTQ files and closing them after
iteration is complete. For every iteration of FastxZipped(),
a tuple of the next FASTX records is returned (each of type
pysam.FastxRecord()). An exception will be raised if any of the input
files are malformed or truncated, or if record names are not equivalent and in sync.
Importantly, this context manager is optimized for fast streaming read-only usage and, by default,
any previous records saved while advancing the iterator will not be correct as the underlying
pointer in memory will refer to the most recent record only, and not any past records. To preserve
the state of all previously iterated records, set the parameter persist to True.
>>> from fgpyo.fastx import FastxZipped
>>> with FastxZipped("r1.fq", "r2.fq", persist=False) as zipped:
... for (r1, r2) in zipped:
... print(f"{r1.name}: {r1.sequence}, {r2.name}: {r2.sequence}")
seq1: AAAA, seq1: CCCC
seq2: GGGG, seq2: TTTT
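The zipping and synchronisation checks can be illustrated without pysam. In the sketch below, records are simplified to `(name, sequence)` tuples, `zip_records_sketch` is a hypothetical function, names are compared for exact equality, and `ValueError` is an assumed exception type; none of these details are taken from fgpyo's implementation.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (name, sequence); simplified stand-in for a FASTX record


def zip_records_sketch(*sources: Iterable[Record]) -> Iterator[Tuple[Record, ...]]:
    """Yield one tuple of records per step, checking length and name sync."""
    for group in zip_longest(*sources, fillvalue=None):
        # zip_longest pads exhausted inputs with None, exposing truncation.
        if any(rec is None for rec in group):
            raise ValueError("One or more inputs ended early (truncated?)")
        names = {rec[0] for rec in group}
        if len(names) > 1:
            raise ValueError(f"Record names out of sync: {sorted(names)}")
        yield group


r1 = [("seq1", "AAAA"), ("seq2", "GGGG")]
r2 = [("seq1", "CCCC"), ("seq2", "TTTT")]
pairs = list(zip_records_sketch(r1, r2))
```

Using `zip_longest` instead of `zip` matters here: plain `zip` would silently stop at the shortest input and hide a truncated file.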
Classes¶
FastxZipped ¶
Bases: AbstractContextManager, Iterator[tuple[FastxRecord, ...]]
A context manager that will lazily zip over any number of FASTA/FASTQ files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `paths` | `Path \| str` | Paths to the FASTX files to zip over. | `()` |
| `persist` | `bool` | Whether to persist the state of previous records during iteration. | `False` |
Source code in fgpyo/fastx/__init__.py
Functions¶
__exit__(exc_type: type[BaseException] | None, exc_val: BaseException | None, exc_tb: TracebackType | None) -> bool | None
Exit the FastxZipped context manager by closing all FASTX files.
Source code in fgpyo/fastx/__init__.py
Instantiate a FastxZipped context manager and iterator.
Source code in fgpyo/fastx/__init__.py
Return the next set of FASTX records from the zipped FASTX files.
Source code in fgpyo/fastx/__init__.py
Functions¶
io ¶
Module for reading and writing files.¶
The functions in this module make it easy to:
- check if a file exists and is writable
- check if a file and its parent directories exist and are writable
- check if a file exists and is readable
- check if a path exists and is a directory
- open an appropriate reader or writer based on the file extension
- write items to a file, one per line
- read lines from a file
fgpyo.io Examples:¶
>>> import fgpyo.io as fio
>>> from fgpyo.io import write_lines, read_lines
>>> from pathlib import Path
Assert that a path exists and is readable:
>>> tmp_dir = Path(getfixture("tmp_path"))
>>> path_flat: Path = tmp_dir / "example.txt"
>>> fio.assert_path_is_readable(path_flat)
Traceback (most recent call last):
...
AssertionError: Cannot read non-existent path: ...
Write to and read from path:
>>> path_flat = tmp_dir / "example.txt"
>>> path_compressed = tmp_dir / "example.txt.gz"
>>> write_lines(path=path_flat, lines_to_write=["flat file", 10])
>>> write_lines(path=path_compressed, lines_to_write=["gzip file", 10])
Read lines from a path into a generator:
>>> lines = read_lines(path=path_flat)
>>> next(lines)
'flat file'
>>> next(lines)
'10'
>>> lines = read_lines(path=path_compressed)
>>> next(lines)
'gzip file'
>>> next(lines)
'10'
Functions¶
assert_directory_exists ¶
Asserts that a path exists and is a directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Path to check | required |
Example
assert_directory_exists(path = Path("/example/directory/"))
Source code in fgpyo/io/__init__.py
assert_fasta_indexed ¶
Verify that a FASTA is readable and has the expected index files.
The existence of the FASTA index generated by samtools faidx will always be verified. The
existence of the index files generated by samtools dict and bwa index may be optionally
verified.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `fasta` | `Path` | Path to the FASTA file. | required |
| `dictionary` | `bool` | If True, check for the index file generated by | `False` |
| `bwa` | `bool` | If True, check for the index files generated by | `False` |
Raises:
| Type | Description |
|---|---|
| `AssertionError` | If the FASTA or any of the expected index files are missing or not readable. |
Source code in fgpyo/io/__init__.py
assert_path_is_readable ¶
Checks that file exists and returns True, else raises AssertionError.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | a Path to check | required |
Example
assert_path_is_readable(path = Path("some_file.csv"))
Source code in fgpyo/io/__init__.py
assert_path_is_writable ¶
Assert that a filepath is writable.
Specifically:
- If the file exists then it must also be writable.
- Else if the path is not a file and parent_must_exist is true, then assert that the parent
directory exists and is writable.
- Else if the path is not a directory and parent_must_exist is false, then look at each parent
directory until one is found that exists and is writable.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Path to check | required |
| `parent_must_exist` | `bool` | If True, the file's parent directory must exist. Otherwise, at least one directory in the path's components must exist. | `True` |
Raises:
| Type | Description |
|---|---|
| `AssertionError` | If any of the above conditions are not met. |
Example
assert_path_is_writable(path = Path("example.txt"))
Source code in fgpyo/io/__init__.py
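The three cases listed above can be sketched with `os.access` and `pathlib`. `check_writable_sketch` is a hypothetical function that follows the description literally; it is not fgpyo's code, and its error messages are invented for illustration.

```python
import os
from pathlib import Path


def check_writable_sketch(path: Path, parent_must_exist: bool = True) -> None:
    """Assert that `path` could be written, following the three cases above."""
    if path.is_file():
        # Case 1: an existing file must itself be writable.
        assert os.access(path, os.W_OK), f"File is not writable: {path}"
        return
    if parent_must_exist:
        # Case 2: the immediate parent must exist and be writable.
        parent = path.absolute().parent
        assert parent.is_dir(), f"Parent directory does not exist: {parent}"
        assert os.access(parent, os.W_OK), f"Parent is not writable: {parent}"
        return
    # Case 3: walk upwards until an existing, writable directory is found.
    for ancestor in path.absolute().parents:
        if ancestor.is_dir() and os.access(ancestor, os.W_OK):
            return
    raise AssertionError(f"No writable ancestor directory for: {path}")
```

Setting `parent_must_exist=False` suits callers that intend to create the missing directories themselves before writing.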
assert_path_is_writeable ¶
A deprecated alias for assert_path_is_writable().
Source code in fgpyo/io/__init__.py
read_lines ¶
Reads each line from a path into a generator, removing line terminators.
By default, only line terminators (CR/LF) are stripped. The strip
parameter may be used to strip both leading and trailing whitespace from each line.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Path to read from | required |
| `strip` | `bool` | True to strip lines of all leading and trailing whitespace, False to only remove trailing CR/LF characters. | `False` |
| `threads` | `int \| None` | the number of threads to use when decompressing gzip files | `None` |
Example
import fgpyo.io as fio
read_back = fio.read_lines(path)
Source code in fgpyo/io/__init__.py
redirect_to_dev_null ¶
A context manager that redirects output of file handle to /dev/null.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `file_num` | `int` | number of filehandle to redirect. | required |
Source code in fgpyo/io/__init__.py
suppress_stderr ¶
A context manager that redirects output of stderr to /dev/null.
to_reader ¶
Opens a Path for reading and based on extension uses open() or gzip_ng.open().
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Path to read from | required |
| `threads` | `int \| None` | the number of threads to use when decompressing gzip files | `None` |
Example
from pathlib import Path
import fgpyo.io as fio
reader = fio.to_reader(path=Path("reader.txt"))
lines = reader.readlines()
reader.close()
Source code in fgpyo/io/__init__.py
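Choosing a reader from the file extension can be sketched with the standard library. `open_reader_sketch` is hypothetical; it uses the built-in `gzip` module as a stand-in for the `gzip_ng` reader named above, and the set of suffixes treated as compressed is an assumption.

```python
import gzip
from pathlib import Path
from typing import IO

COMPRESSED_SUFFIXES = {".gz", ".bgz"}  # assumed set of gzip-style extensions


def open_reader_sketch(path: Path) -> IO[str]:
    """Open `path` as text, decompressing when the suffix looks gzip-like."""
    if path.suffix in COMPRESSED_SUFFIXES:
        # Standard-library gzip stands in for the gzip_ng reader.
        return gzip.open(path, mode="rt", encoding="utf-8")
    return path.open(mode="r", encoding="utf-8")
```

Both branches return a text-mode handle, so calling code can iterate over lines without knowing whether the file was compressed.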
to_writer ¶
Opens a Path for writing (or appending) and based on extension uses open() or gzip_ng.open().
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Path to write (or append) to | required |
| `append` | `bool` | open the file for appending | `False` |
| `threads` | `int \| None` | the number of threads to use when compressing gzip files | `None` |
Example
from pathlib import Path
import fgpyo.io as fio
writer = fio.to_writer(path=Path("writer.txt"))
writer.write("something\n")
writer.close()
Source code in fgpyo/io/__init__.py
write_lines ¶
write_lines(path: Path, lines_to_write: Iterable[Any], append: bool = False, threads: int | None = None) -> None
Writes (or appends) a file with one line per item in provided iterable.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Path to write (or append) to | required |
| `lines_to_write` | `Iterable[Any]` | items to write (or append) to file | required |
| `append` | `bool` | open the file for appending | `False` |
| `threads` | `int \| None` | the number of threads to use when compressing gzip files | `None` |
Example
lines: List[Any] = ["things to write", 100]
path_to_write_to: Path = Path("file_to_write_to.txt")
fio.write_lines(path = path_to_write_to, lines_to_write = lines)
Source code in fgpyo/io/__init__.py
platform ¶
Modules¶
illumina ¶
Methods for working with Illumina-specific UMIs in SAM files.
The functions in this module make it easy to:
- check whether a UMI is valid
- extract UMI(s) from an Illumina-style read name
- copy a UMI from an alignment's read name to its
RX SAM tag
Attributes¶
module-attribute
¶
Multiple UMI delimiter, which the SAM specification recommends should be a hyphen; see the specification here: https://samtools.github.io/hts-specs/SAMtags.pdf
Functions¶
copy_umi_from_read_name(rec: AlignedSegment, strict: bool = False, remove_umi: bool = False, read_name_delimiter: str = _ILLUMINA_READ_NAME_DELIMITER, umi_delimiter: str = _ILLUMINA_UMI_DELIMITER) -> bool
Copy a UMI from an alignment's read name to its RX SAM tag.
The UMI will not be copied to RX tag if it is invalid.
strict, read_name_delimiter, and umi_delimiter are forwarded to
extract_umis_from_read_name — see
that function for their semantics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `rec` | `AlignedSegment` | The alignment record to update. | required |
| `strict` | `bool` | If | `False` |
| `remove_umi` | `bool` | If | `False` |
| `read_name_delimiter` | `str` | The delimiter separating the components of the read name. Also used to strip the UMI segment when | `_ILLUMINA_READ_NAME_DELIMITER` |
| `umi_delimiter` | `str` | The delimiter separating multiple UMIs. | `_ILLUMINA_UMI_DELIMITER` |
Returns:
| Type | Description |
|---|---|
| `bool` | |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If the read name does not end with a valid UMI. |
| `ValueError` | If the record already has a populated |
Source code in fgpyo/platform/illumina.py
extract_umis_from_read_name(read_name: str, read_name_delimiter: str = _ILLUMINA_READ_NAME_DELIMITER, umi_delimiter: str = _ILLUMINA_UMI_DELIMITER, strict: bool = False) -> str | None
Extract UMI(s) from an Illumina-style read name.
The UMI is expected to be the final component of the read name, delimited by the
read_name_delimiter. Multiple UMIs may be present, delimited by the umi_delimiter. This
delimiter will be replaced by the SAM-standard -.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
read_name
|
str
|
The read name to extract the UMI from. |
required |
read_name_delimiter
|
str
|
The delimiter separating the components of the read name. |
_ILLUMINA_READ_NAME_DELIMITER
|
umi_delimiter
|
str
|
The delimiter separating multiple UMIs. |
_ILLUMINA_UMI_DELIMITER
|
strict
|
bool
|
If |
False
|
Returns:
| Type | Description |
|---|---|
str | None
|
The UMI extracted from the read name, or None if no UMI was found. Multiple UMIs are returned in a single string, separated by a hyphen (-). |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the read name does not end with a valid UMI. |
Source code in fgpyo/platform/illumina.py
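The extraction rule described above (take the final delimited component of the read name, then rewrite the UMI delimiter as a hyphen) can be sketched in plain Python. This is only an illustration, not the fgpyo implementation: the delimiter values (":" and "+"), the set of valid UMI bases, and the name sketch_extract_umis are assumptions made for the example, and the strict option is not modelled.

```python
from typing import Optional

# Assumed values for this sketch; the reference above names the constants
# (_ILLUMINA_READ_NAME_DELIMITER, _ILLUMINA_UMI_DELIMITER) but not their values.
READ_NAME_DELIMITER = ":"
UMI_DELIMITER = "+"
SAM_UMI_DELIMITER = "-"
VALID_UMI_BASES = frozenset("ACGTN")  # assumed definition of a "valid" UMI


def sketch_extract_umis(read_name: str) -> Optional[str]:
    """Return the trailing UMI field joined with hyphens, or None if absent."""
    fields = read_name.split(READ_NAME_DELIMITER)
    if len(fields) < 2:
        return None  # no delimited final component to treat as a UMI
    umis = fields[-1].split(UMI_DELIMITER)
    for umi in umis:
        # Reject empty UMIs and UMIs containing characters outside the assumed alphabet.
        if not umi or not set(umi) <= VALID_UMI_BASES:
            raise ValueError(f"Invalid UMI {umi!r} in read name {read_name!r}")
    return SAM_UMI_DELIMITER.join(umis)
```

In this sketch a read name without any delimiter yields None, while a final component that is not a valid UMI raises ValueError; the real function's split between those two outcomes may differ.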
read_structure ¶
Classes for representing Read Structures.¶
A Read Structure is a string that describes how the bases in a sequencing run should be
allocated into logical reads. It serves a similar purpose to the --use-bases-mask option in Illumina's
bcl2fastq software, but provides some additional capabilities.
A Read Structure is a sequence of <number><operator> pairs or segments where, optionally, the last
segment in the string is allowed to use + instead of a number for its length. The + translates
to whatever bases are left after the other segments are processed and can be thought of as meaning
[0..infinity].
See more at: https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures
Examples¶
>>> from fgpyo.read_structure import ReadStructure
>>> rs = ReadStructure.from_string("75T8B75T")
>>> [str(segment) for segment in rs]
['75T', '8B', '75T']
>>> rs[0]
ReadSegment(offset=0, length=75, kind=<SegmentType.Template: 'T'>)
>>> rs = rs.with_variable_last_segment()
>>> [str(segment) for segment in rs]
['75T', '8B', '+T']
>>> rs[-1]
ReadSegment(offset=83, length=None, kind=<SegmentType.Template: 'T'>)
>>> rs = ReadStructure.from_string("1B2M+T")
>>> [s.bases for s in rs.extract("A"*6)]
['A', 'AA', 'AAA']
>>> [s.bases for s in rs.extract("A"*5)]
['A', 'AA', 'AA']
>>> [s.bases for s in rs.extract("A"*4)]
['A', 'AA', 'A']
>>> [s.bases for s in rs.extract("A"*3)]
['A', 'AA', '']
>>> rs.template_segments()
(ReadSegment(offset=3, length=None, kind=<SegmentType.Template: 'T'>),)
>>> [str(segment) for segment in rs.template_segments()]
['+T']
>>> try:
...     ReadStructure.from_string("23T2TT23T")
... except ValueError as ex:
...     print(str(ex))
Read structure missing length information: 23T2T[T]23T
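To make the <number><operator> grammar concrete, here is a minimal, self-contained sketch of a parser and extractor written without fgpyo. The function names, the tuple layout (offset, length, kind), and the error messages are hypothetical; the sketch does not validate the operator letter and is not the library's implementation.

```python
import re
from typing import List, Optional, Tuple

# One segment: a number (or "+") followed by a single operator letter.
SEGMENT_PATTERN = re.compile(r"(\d+|\+)([A-Z])")

Segment = Tuple[int, Optional[int], str]  # (offset, length or None for "+", kind)


def sketch_parse_read_structure(read_structure: str) -> List[Segment]:
    """Parse a read structure string into (offset, length, kind) triples."""
    segments: List[Segment] = []
    offset = 0
    position = 0
    while position < len(read_structure):
        match = SEGMENT_PATTERN.match(read_structure, position)
        if match is None:
            raise ValueError(f"Cannot parse {read_structure!r} at index {position}")
        raw_length, kind = match.groups()
        length = None if raw_length == "+" else int(raw_length)
        if length is None and match.end() != len(read_structure):
            raise ValueError("Only the last segment may use '+' as its length")
        segments.append((offset, length, kind))
        offset += length or 0
        position = match.end()
    return segments


def sketch_extract_bases(read_structure: str, bases: str) -> List[str]:
    """Slice the bases belonging to each segment; a '+' segment takes the rest."""
    sub_reads = []
    for offset, length, _kind in sketch_parse_read_structure(read_structure):
        end = len(bases) if length is None else offset + length
        sub_reads.append(bases[offset:end])
    return sub_reads
```

With this sketch, sketch_extract_bases("1B2M+T", "A" * 3) leaves the trailing variable-length segment empty, mirroring the ['A', 'AA', ''] result in the examples above.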
Attributes¶
ANY_LENGTH_CHAR
module-attribute
¶
A character that can be put in place of a number in a read structure to mean "0 or more bases".
Classes¶
ReadSegment ¶
Encapsulates all the information about a segment within a read structure.
A segment can either have a definite length, in which case length must be an int, or an indefinite length (can be any length, 0 or more), in which case length must be None.
Attributes:
| Name | Type | Description |
|---|---|---|
offset |
int
|
The offset of the read segment in the read. |
length |
int | None
|
The length of the segment, or None if it is variable length. |
kind |
SegmentType
|
The kind of read segment. |
Source code in fgpyo/read_structure.py
Attributes¶
property
¶The fixed length of this segment.
Raises:
| Type | Description |
|---|---|
AttributeError
|
If the segment does not have a fixed length. |
Functions¶
Returns the string representation of this segment (e.g. '10T' or '+T').
extract(bases: str) -> SubReadWithoutQuals
Gets the bases associated with this read segment.
extract_with_quals(bases: str, quals: str) -> SubReadWithQuals
Gets the bases and qualities associated with this read segment.
Source code in fgpyo/read_structure.py
ReadStructure ¶
Bases: Iterable[ReadSegment]
Describes the structure of a given read.
A read contains one or more read segments. A read segment describes a contiguous stretch of bases of the same type (ex. template bases) of some length and some offset from the start of the read.
Attributes:
| Name | Type | Description |
|---|---|---|
segments |
tuple[ReadSegment, ...]
|
The segments composing the read structure |
Source code in fgpyo/read_structure.py
Attributes¶
property
¶The fixed length of this read structure.
Raises:
| Type | Description |
|---|---|
AttributeError
|
If the read structure does not have a fixed length. |
property
¶True if the ReadStructure has a fixed (i.e. non-variable) length.
property
¶Length is defined as the number of segments (not bases!) in the read structure.
Functions¶
__getitem__(index: int) -> ReadSegment
__iter__() -> Iterator[ReadSegment]
cell_barcode_segments() -> tuple[ReadSegment, ...]
extract(bases: str) -> tuple[SubReadWithoutQuals, ...]
Splits the given bases into sub-reads, each paired with its associated read segment.
extract_with_quals(bases: str, quals: str) -> tuple[SubReadWithQuals, ...]
Splits the given bases and qualities into sub-reads, each carrying its bases, qualities, and associated read segment.
Source code in fgpyo/read_structure.py
classmethod
¶from_segments(segments: tuple[ReadSegment, ...], reset_offsets: bool = False) -> ReadStructure
Creates a new ReadStructure, optionally resetting the offsets on each of the segments.
Source code in fgpyo/read_structure.py
classmethod
¶from_string(segments: str) -> ReadStructure
Parses a read structure from its string representation.
Source code in fgpyo/read_structure.py
molecular_barcode_segments() -> tuple[ReadSegment, ...]
sample_barcode_segments() -> tuple[ReadSegment, ...]
segments_by_kind(kind: SegmentType) -> tuple[ReadSegment, ...]
skip_segments() -> tuple[ReadSegment, ...]
template_segments() -> tuple[ReadSegment, ...]
with_variable_last_segment() -> ReadStructure
Returns a copy with the last segment changed to undefined length.
Source code in fgpyo/read_structure.py
SegmentType ¶
Bases: Enum
The type of segments that can show up in a read structure.
Source code in fgpyo/read_structure.py
Attributes¶
class-attribute
instance-attribute
¶The segment type for cell barcode bases.
class-attribute
instance-attribute
¶The segment type for molecular barcode bases.
class-attribute
instance-attribute
¶The segment type for sample barcode bases.
class-attribute
instance-attribute
¶The segment type for bases that need to be skipped.
Functions¶
SubReadWithQuals ¶
Contains the bases and qualities that correspond to the given read segment.
Source code in fgpyo/read_structure.py
Attributes¶
instance-attribute
¶The sub-read base qualities that correspond to the given read segment.
instance-attribute
¶segment: ReadSegment
The segment of the read structure that describes this sub-read.
SubReadWithoutQuals ¶
Contains the bases that correspond to the given read segment.
Source code in fgpyo/read_structure.py
Attributes¶
instance-attribute
¶segment: ReadSegment
The segment of the read structure that describes this sub-read.
sam ¶
Utility Classes and Methods for SAM/BAM.¶
This module contains utility classes for working with SAM/BAM files and the data contained within them. This includes i) utilities for opening SAM/BAM files for reading and writing, ii) functions for manipulating supplementary alignments, iii) classes and functions for manipulating CIGAR strings, and iv) a class for building SAM records and files for testing.
Motivation for Reader and Writer methods¶
The following are the reasons for choosing to implement methods to open a SAM/BAM file for
reading and writing, rather than relying on pysam.AlignmentFile directly:
- Provides a centralized place for the implementation of opening a SAM/BAM for reading and writing. This is useful if any additional parameters are added, or changes to standards or defaults are made.
- Makes the requirement to provide a header when opening a file for writing more explicit.
- Adds support for pathlib.Path.
- Removes the reliance on specifying the mode correctly, including specifying the file type (i.e. SAM, BAM, or CRAM), as well as additional options (ex. compression level). This makes the code more explicit and easier to read.
- An explicit check is performed to ensure the file type is specified when writing using a file-like object rather than a path to a file.
Examples of Opening a SAM/BAM for Reading or Writing¶
Opening a SAM/BAM file for reading, auto-recognizing the file-type by the file extension. See
SamFileType() for the supported file types.
>>> from fgpyo.sam import reader
>>> with reader("/path/to/sample.sam") as fh:
...     for record in fh:
...         print(record.query_name)  # do something
>>> with reader("/path/to/sample.bam") as fh:
...     for record in fh:
...         print(record.query_name)  # do something
Opening a SAM/BAM file for reading, explicitly passing the file type.
>>> from fgpyo.sam import SamFileType
>>> with reader(path="/path/to/sample.ext1", file_type=SamFileType.SAM) as fh:
...     for record in fh:
...         print(record.query_name)  # do something
>>> with reader(path="/path/to/sample.ext2", file_type=SamFileType.BAM) as fh:
...     for record in fh:
...         print(record.query_name)  # do something
Opening a SAM/BAM file for reading, using an existing file-like object.
>>> with open("/path/to/sample.bam", "rb") as file_object:
...     with reader(path=file_object, file_type=SamFileType.BAM) as fh:
...         for record in fh:
...             print(record.query_name)  # do something
Opening a SAM/BAM file for writing is similar to using reader(), but a SAM file header object is required.
>>> from typing import Any, Dict
>>> from fgpyo.sam import writer
>>> header: Dict[str, Any] = {
...     "HD": {"VN": "1.5", "SO": "coordinate"},
...     "RG": [{"ID": "1", "SM": "1_AAAAAA", "LB": "lib", "PL": "ILLUMINA", "PU": "xxx.1"}],
...     "SQ": [
...         {"SN": "chr1", "LN": 249250621},
...         {"SN": "chr2", "LN": 243199373}
...     ]
... }
>>> with writer(path="/path/to/sample.bam", header=header) as fh:
...     pass  # do something
Examples of Manipulating Cigars¶
Creating a Cigar from a pysam.AlignedSegment.
>>> from fgpyo.sam import Cigar
>>> with reader("/path/to/sample.sam") as fh:
...     record = next(fh)
...     cigar = Cigar.from_cigartuples(record.cigartuples)
...     print(str(cigar))
50M2D5M10S
Creating a Cigar from a str().
If the cigar string is invalid, the exception message will show you the problem character(s) in square brackets.
>>> cigar = Cigar.from_cigarstring("10M5U")
Traceback (most recent call last):
...
fgpyo.sam.CigarParsingException: Malformed cigar: 10M5[U]
The cigar contains a tuple of CigarElement()s. Each element
contains the cigar operator (CigarOp()) and associated operator
length. A number of useful methods are part of both classes.
The number of bases aligned on the query (i.e. the number of bases consumed by the cigar from the query):
>>> cigar = Cigar.from_cigarstring("50M2D5M2I10S")
>>> [e.length_on_query for e in cigar.elements]
[50, 0, 5, 2, 10]
>>> [e.length_on_target for e in cigar.elements]
[50, 2, 5, 0, 0]
>>> [e.operator.is_indel for e in cigar.elements]
[False, True, False, True, False]
Any particular element can be accessed directly via .elements with its index (and works with
negative indexes and slices):
>>> cigar = Cigar.from_cigarstring("50M2D5M2I10S")
>>> cigar.elements[0].length
50
>>> cigar.elements[1].operator
<CigarOp.D: (2, 'D', False, True)>
>>> cigar.elements[-1].operator
<CigarOp.S: (4, 'S', True, False)>
>>> tuple(x.operator.character for x in cigar.elements[1:3])
('D', 'M')
>>> tuple(x.operator.character for x in cigar.elements[-2:])
('I', 'S')
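The per-element query and target lengths shown above follow directly from which operators consume query bases (M, I, S, =, X) and which consume reference bases (M, D, N, =, X). The following sketch reproduces that bookkeeping on plain (length, operator) tuples; the helper names and the regular expression are illustrative assumptions, not part of fgpyo.

```python
import re
from typing import List, Tuple

Element = Tuple[int, str]  # (length, operator character)

QUERY_CONSUMING = frozenset("MIS=X")
TARGET_CONSUMING = frozenset("MDN=X")
ELEMENT_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")


def sketch_parse_cigar(cigar: str) -> List[Element]:
    """Split a CIGAR string into (length, operator) tuples."""
    elements = [(int(length), op) for length, op in ELEMENT_PATTERN.findall(cigar)]
    # Re-serialise and compare, so that unrecognised characters are rejected.
    if "".join(f"{length}{op}" for length, op in elements) != cigar:
        raise ValueError(f"Malformed cigar: {cigar}")
    return elements


def sketch_length_on_query(element: Element) -> int:
    length, op = element
    return length if op in QUERY_CONSUMING else 0


def sketch_length_on_target(element: Element) -> int:
    length, op = element
    return length if op in TARGET_CONSUMING else 0
```

Applied to "50M2D5M2I10S", the two helpers give the same per-element lengths as the length_on_query and length_on_target examples above.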
Examples of parsing the SA tag and individual supplementary alignments¶
>>> from fgpyo.sam import SupplementaryAlignment
>>> sup = SupplementaryAlignment.parse("chr1,123,+,50S100M,60,0")
>>> sup.reference_name
'chr1'
>>> sup.nm
0
>>> from typing import List
>>> sa_tag = "chr1,123,+,50S100M,60,0;chr2,456,-,75S75M,60,1"
>>> sups: List[SupplementaryAlignment] = SupplementaryAlignment.parse_sa_tag(tag=sa_tag)
>>> len(sups)
2
>>> [str(sup.cigar) for sup in sups]
['50S100M', '75S75M']
Attributes¶
DefaultProperlyPairedOrientations
module-attribute
¶
DefaultProperlyPairedOrientations: set[PairOrientation] = {FR}
The default orientations for properly paired reads.
NO_QUERY_BASES
module-attribute
¶
The string to use for a SAM record with missing query bases.
NO_QUERY_QUALITIES
module-attribute
¶
NO_QUERY_QUALITIES: array = cast(array, qualitystring_to_array(STRING_PLACEHOLDER))
The quality array corresponding to an unavailable query quality string ("*").
NO_REF_INDEX
module-attribute
¶
The reference index to use to indicate no reference in SAM/BAM.
NO_REF_NAME
module-attribute
¶
NO_REF_NAME: str = STRING_PLACEHOLDER
The reference name to use to indicate no reference in SAM/BAM.
NO_REF_POS
module-attribute
¶
The reference position to use to indicate no position in SAM/BAM.
STRING_PLACEHOLDER
module-attribute
¶
The value to use when a string field's information is unavailable.
SamPath
module-attribute
¶
The valid base classes for opening a SAM/BAM/CRAM file.
Classes¶
Cigar ¶
Class representing a cigar string.
Attributes:
| Name | Type | Description |
|---|---|---|
elements |
Tuple[CigarElement, ...]
|
zero or more cigar elements |
Source code in fgpyo/sam/__init__.py
Functions¶
coalesce() -> Cigar
Returns a new Cigar with adjacent elements of the same operator merged.
For example, Cigar.from_cigarstring("10M10M") would be coalesced to
Cigar.from_cigarstring("20M").
Returns:
| Type | Description |
|---|---|
Cigar
|
A new Cigar with adjacent same-operator elements merged, or this Cigar if no coalescing is needed. |
Examples:
>>> str(Cigar.from_cigarstring("10M10M").coalesce())
'20M'
>>> str(Cigar.from_cigarstring("10M5I5I10M").coalesce())
'10M10I10M'
Source code in fgpyo/sam/__init__.py
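Coalescing amounts to merging consecutive runs of the same operator and summing their lengths. One way to sketch that idea, using plain (length, operator) tuples and itertools.groupby rather than fgpyo's own classes (the function name is hypothetical):

```python
from itertools import groupby
from typing import List, Tuple

Element = Tuple[int, str]  # (length, operator character)


def sketch_coalesce(elements: List[Element]) -> List[Element]:
    """Merge adjacent elements that share an operator by summing their lengths."""
    return [
        (sum(length for length, _ in run), operator)
        for operator, run in groupby(elements, key=lambda element: element[1])
    ]
```

Because groupby only groups consecutive items, the two 10M elements in "10M5I5I10M" stay separate while the two 5I elements merge, matching the '10M10I10M' example above.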
classmethod
¶from_cigarstring(cigarstring: str) -> Cigar
Constructs a Cigar from a string returned by pysam.
If "*" is given, returns an empty Cigar.
Source code in fgpyo/sam/__init__.py
classmethod
¶from_cigartuples(cigartuples: list[tuple[int, int]] | None) -> Cigar
Returns a Cigar from a list of tuples returned by pysam.
Each tuple denotes the operation and length. See
CigarOp() for more information on the
various operators. If None is given, returns an empty Cigar.
Source code in fgpyo/sam/__init__.py
Gets the 0-based, end-exclusive positions of the first and last aligned base in the query.
The resulting range will contain the range of positions in the SEQ string for
the bases that are aligned.
If counting from the end of the query is desired, use
cigar.reversed().query_alignment_offsets()
Returns:
| Type | Description |
|---|---|
tuple[int, int]
|
A tuple (start, stop) containing the start and stop positions of the aligned part of the query. These offsets are 0-based and open-ended, with respect to the beginning of the query. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If according to the cigar, there are no aligned query bases. |
Source code in fgpyo/sam/__init__.py
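As a rough illustration of the offsets described above, the sketch below walks a list of (length, operator) tuples: leading soft clips advance the start, and query-consuming operators up to the trailing clip extend the stop. It assumes hard clips do not count toward query positions (they are absent from SEQ); the function name and tuple representation are hypothetical, not fgpyo's API.

```python
from typing import List, Tuple

Element = Tuple[int, str]  # (length, operator character)

CLIPPING = frozenset("SH")
QUERY_CONSUMING = frozenset("MIS=X")


def sketch_query_alignment_offsets(elements: List[Element]) -> Tuple[int, int]:
    """Return 0-based, end-exclusive (start, stop) of the aligned query bases."""
    index = 0
    start = 0
    # Skip leading clips; only soft clips occupy positions in the query.
    while index < len(elements) and elements[index][1] in CLIPPING:
        if elements[index][1] == "S":
            start += elements[index][0]
        index += 1
    stop = start
    # Accumulate query-consuming operators until the trailing clip (if any).
    while index < len(elements) and elements[index][1] not in CLIPPING:
        if elements[index][1] in QUERY_CONSUMING:
            stop += elements[index][0]
        index += 1
    if stop == start:
        raise ValueError("Cigar contains no aligned query bases")
    return start, stop
```

For 5S10M2I3D4S this gives (5, 17): the deletion consumes no query bases, and the trailing soft clip is excluded.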
truncate_to_query_length(length: int) -> Cigar
Truncates the CIGAR to the specified query sequence length.
Produces a new CIGAR that includes at most the specified number of bases from the query sequence. Only CIGAR operators that consume query bases (M, I, S, =, X) are counted toward the length limit.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
length
|
int
|
The maximum number of query bases to include |
required |
Returns:
| Type | Description |
|---|---|
Cigar
|
A new Cigar truncated to the specified query length |
Examples:
>>> cigar = Cigar.from_cigarstring("10M5I10M")
>>> str(cigar.truncate_to_query_length(15))
'10M5I'
>>> str(cigar.truncate_to_query_length(12))
'10M2I'
Source code in fgpyo/sam/__init__.py
truncate_to_target_length(length: int) -> Cigar
Truncates the CIGAR to the specified reference/target sequence length.
Produces a new CIGAR that includes at most the specified number of bases from the reference/target sequence. Only CIGAR operators that consume reference bases (M, D, N, =, X) are counted toward the length limit.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
length
|
int
|
The maximum number of reference/target bases to include |
required |
Returns:
| Type | Description |
|---|---|
Cigar
|
A new Cigar truncated to the specified target length |
Examples:
>>> cigar = Cigar.from_cigarstring("10M5D10M")
>>> str(cigar.truncate_to_target_length(15))
'10M5D'
>>> str(cigar.truncate_to_target_length(12))
'10M2D'
Source code in fgpyo/sam/__init__.py
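Both truncation methods above share one idea: walk the elements, count only the operators that consume the sequence of interest, and shorten the element that crosses the limit. A minimal sketch of that idea on (length, operator) tuples follows; the function name and the choice to stop as soon as the limit is reached are assumptions for illustration, and edge cases may be handled differently by fgpyo.

```python
from typing import FrozenSet, List, Tuple

Element = Tuple[int, str]  # (length, operator character)

QUERY_CONSUMING = frozenset("MIS=X")
TARGET_CONSUMING = frozenset("MDN=X")


def sketch_truncate(
    elements: List[Element], length: int, consuming: FrozenSet[str]
) -> List[Element]:
    """Keep elements until `length` bases of the chosen sequence are consumed."""
    kept: List[Element] = []
    remaining = length
    for element_length, operator in elements:
        if remaining <= 0:
            break  # limit reached; drop everything that follows
        if operator in consuming:
            taken = min(element_length, remaining)
            kept.append((taken, operator))
            remaining -= taken
        else:
            kept.append((element_length, operator))  # does not count toward the limit
    return kept
```

Passing QUERY_CONSUMING mirrors truncate_to_query_length, and passing TARGET_CONSUMING mirrors truncate_to_target_length, for the examples shown above.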
CigarElement ¶
Represents an element in a Cigar.
Attributes:
| Name | Type | Description |
|---|---|---|
length |
int
|
the length of the element |
operator |
CigarOp
|
the operator of the element |
Source code in fgpyo/sam/__init__.py
CigarOp ¶
Bases: Enum
Enumeration of operators that can appear in a Cigar string.
Attributes:
| Name | Type | Description |
|---|---|---|
code |
int
|
The |
character |
str
|
The single character cigar operator. |
consumes_query |
bool
|
True if this operator consumes query bases, False otherwise. |
consumes_target |
bool
|
True if this operator consumes target bases, False otherwise. |
Source code in fgpyo/sam/__init__.py
Attributes¶
property
¶Returns true if the operator is a soft/hard clip, false otherwise.
Functions¶
Initializes the CIGAR operator with the given code, character, and consumption flags.
Source code in fgpyo/sam/__init__.py
staticmethod
¶from_character(character: str) -> CigarOp
Returns the operator from the single character.
CigarParsingException ¶
PairOrientation ¶
Bases: Enum
Enumerations of read pair orientations.
Source code in fgpyo/sam/__init__.py
Attributes¶
class-attribute
instance-attribute
¶A pair orientation for forward-reverse reads ("innie").
class-attribute
instance-attribute
¶A pair orientation for reverse-forward reads ("outie").
class-attribute
instance-attribute
¶A pair orientation for tandem (forward-forward or reverse-reverse) reads.
Functions¶
classmethod
¶from_recs(rec1: AlignedSegment, rec2: AlignedSegment | None = None) -> PairOrientation | None
Returns the pair orientation if both reads are mapped to the same reference sequence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec1
|
AlignedSegment
|
The first record in the pair. |
required |
rec2
|
AlignedSegment | None
|
The second record in the pair. If None, then mate info on |
None
|
Source code in fgpyo/sam/__init__.py
ReadEditInfo ¶
Counts various stats about how a read compares to a reference sequence.
Attributes:
| Name | Type | Description |
|---|---|---|
matches |
int
|
the number of bases in the read that match the reference |
mismatches |
int
|
the number of mismatches between the read sequence and the reference sequence as dictated by the alignment. As defined for the SAM NM tag computation, any base except A/C/G/T in the read is considered a mismatch. |
insertions |
int
|
the number of insertions in the read vs. the reference. I.e. the number of I operators in the CIGAR string. |
inserted_bases |
int
|
the total number of bases contained within insertions in the read |
deletions |
int
|
the number of deletions in the read vs. the reference. I.e. the number of D operators in the CIGAR string. |
deleted_bases |
int
|
the total number of bases that are deleted within the alignment (i.e. bases in the reference but not in the read). |
nm |
int
|
the computed value of the SAM NM tag, calculated as mismatches + inserted_bases + deleted_bases |
md |
str
|
the computed value of the SAM MD tag |
Source code in fgpyo/sam/__init__.py
SamFileType ¶
Bases: Enum
Enumeration of valid SAM/BAM/CRAM file types.
Attributes:
| Name | Type | Description |
|---|---|---|
mode |
str
|
The additional mode character to add when opening this file type. |
ext |
str
|
The standard file extension for this file type. |
Source code in fgpyo/sam/__init__.py
Attributes¶
Functions¶
classmethod
¶from_path(path: Path | str) -> SamFileType
Infers the file type based on the file extension.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
Path | str
|
the path to the SAM/BAM/CRAM to read or write. |
required |
Source code in fgpyo/sam/__init__.py
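Inferring a file type from an extension can be sketched with a small Enum whose members carry a mode character and an extension, as the attribute table above describes. The member values below (empty string, "b", "c" and the .sam/.bam/.cram extensions) are assumptions based on common pysam usage, and SketchFileType is a stand-in, not fgpyo's SamFileType.

```python
from enum import Enum
from pathlib import Path
from typing import Union


class SketchFileType(Enum):
    """Hypothetical stand-in carrying (mode character, file extension)."""

    SAM = ("", ".sam")
    BAM = ("b", ".bam")
    CRAM = ("c", ".cram")

    def __init__(self, mode: str, ext: str) -> None:
        self.mode = mode
        self.ext = ext

    @classmethod
    def from_path(cls, path: Union[Path, str]) -> "SketchFileType":
        """Infer the file type from the path's final extension."""
        ext = Path(path).suffix
        for file_type in cls:
            if file_type.ext == ext:
                return file_type
        raise ValueError(f"Could not infer file type from {path}")
```

A read mode for pysam could then be composed as "r" + file_type.mode, which is one reason to keep the mode character on the enum member.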
SamOrder ¶
Bases: Enum
Enumerations of possible sort orders for a SAM file.
Source code in fgpyo/sam/__init__.py
SupplementaryAlignment ¶
Stores a supplementary alignment record produced by BWA and stored in the SA SAM tag.
Attributes:
| Name | Type | Description |
|---|---|---|
reference_name |
str
|
the name of the reference (i.e. contig, chromosome) aligned to |
start |
int
|
the 0-based start position of the alignment |
is_forward |
bool
|
true if the alignment is on the forward strand, false otherwise |
cigar |
Cigar
|
the cigar for the alignment |
mapq |
int
|
the mapping quality |
nm |
int
|
the number of edits |
Source code in fgpyo/sam/__init__.py
Attributes¶
Functions¶
Returns the comma-delimited SA tag representation.
Source code in fgpyo/sam/__init__.py
classmethod
¶from_read(read: AlignedSegment) -> list[SupplementaryAlignment]
Construct a list of SupplementaryAlignments from the SA tag in a pysam.AlignedSegment.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
read
|
AlignedSegment
|
An alignment. The presence of the "SA" tag is not required. |
required |
Returns:
| Type | Description |
|---|---|
list[SupplementaryAlignment]
|
A list of all SupplementaryAlignments present in the SA tag. If the SA tag is not present, or it is empty, an empty list will be returned. |
Source code in fgpyo/sam/__init__.py
staticmethod
¶parse(string: str) -> SupplementaryAlignment
Returns a supplementary alignment parsed from the given string.
The various fields should be comma-delimited (ex. chr1,123,-,100M50S,60,4).
Source code in fgpyo/sam/__init__.py
staticmethod
¶parse_sa_tag(tag: str) -> list[SupplementaryAlignment]
Parses an SA tag of supplementary alignments from a BAM file.
If the tag is empty or contains just a single semi-colon then an empty list will be returned. Otherwise a list containing a SupplementaryAlignment per ;-separated value in the tag will be returned.
Source code in fgpyo/sam/__init__.py
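The SA tag format described above (semicolon-separated alignments, each with six comma-delimited fields) can be parsed with a few lines of standard-library Python. The sketch below is not fgpyo's implementation: the dataclass and function names are hypothetical, the CIGAR is kept as a plain string, and the conversion from the tag's 1-based position to a 0-based start is an assumption based on the attribute description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SketchSupplementary:
    reference_name: str
    start: int  # 0-based, converted from the 1-based position in the tag (assumed)
    is_forward: bool
    cigar: str
    mapq: int
    nm: int


def sketch_parse(string: str) -> SketchSupplementary:
    """Parse one comma-delimited supplementary alignment."""
    fields = string.split(",")
    if len(fields) != 6:
        raise ValueError(f"Expected 6 comma-delimited fields, found {len(fields)}")
    name, pos, strand, cigar, mapq, nm = fields
    return SketchSupplementary(name, int(pos) - 1, strand == "+", cigar, int(mapq), int(nm))


def sketch_parse_sa_tag(tag: str) -> List[SketchSupplementary]:
    """Parse every non-empty ';'-separated entry; empty tags give an empty list."""
    return [sketch_parse(part) for part in tag.split(";") if part]
```

Filtering out empty parts handles both a trailing semicolon and a tag that consists of a single semicolon.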
Template ¶
A container for alignment records corresponding to a single sequenced template or insert.
It is strongly preferred that new Template instances be created with Template.build(),
which ensures that reads are stored in the correct Template property and, by default, runs
basic validations of the Template. Users who construct Template instances directly
are encouraged to call the validate method after construction.
In the special case where an alignment record is both secondary and supplementary,
it is stored only in the r1_supplementals or r2_supplementals field.
Attributes:
| Name | Type | Description |
|---|---|---|
name |
str
|
the name of the template/query |
r1 |
AlignedSegment | None
|
Primary non-supplementary alignment for read 1, or None if there is none |
r2 |
AlignedSegment | None
|
Primary non-supplementary alignment for read 2, or None if there is none |
r1_supplementals |
list[AlignedSegment]
|
Supplementary alignments for read 1 |
r2_supplementals |
list[AlignedSegment]
|
Supplementary alignments for read 2 |
r1_secondaries |
list[AlignedSegment]
|
Secondary (non-primary, non-supplementary) alignments for read 1 |
r2_secondaries |
list[AlignedSegment]
|
Secondary (non-primary, non-supplementary) alignments for read 2 |
Source code in fgpyo/sam/__init__.py
Functions¶
Yields all R1 alignments of this template including secondary and supplementary.
Source code in fgpyo/sam/__init__.py
Yields all R2 alignments of this template including secondary and supplementary.
Source code in fgpyo/sam/__init__.py
Returns a list with all the records for the template.
Source code in fgpyo/sam/__init__.py
staticmethod
¶build(recs: Iterable[AlignedSegment], validate: bool = True) -> Template
Build a template from a set of records all with the same queryname.
Source code in fgpyo/sam/__init__.py
staticmethod
¶iterator(alns: Iterator[AlignedSegment]) -> Iterator[Template]
Returns an iterator over templates from queryname-grouped alignments.
Gathers consecutive runs of records sharing a common query name into templates.
Source code in fgpyo/sam/__init__.py
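Gathering consecutive runs of records that share a query name is essentially what itertools.groupby does. The sketch below shows the grouping step on simple (query_name, payload) tuples, which are a hypothetical stand-in for alignment records; it does not build Template objects.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (query_name, payload): hypothetical stand-in for a record


def sketch_group_by_query_name(records: Iterable[Record]) -> Iterator[List[Record]]:
    """Yield lists of consecutive records that share the same query name."""
    for _query_name, run in groupby(records, key=lambda record: record[0]):
        yield list(run)
```

Because only consecutive records are grouped, input that is not queryname-grouped yields the same query name in more than one group, which is why the input is expected to be queryname-grouped.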
set_mate_info(is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair, isize: Callable[[AlignedSegment, AlignedSegment], int] = isize) -> Self
Reset all mate information on every alignment in the template.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
is_proper_pair
|
Callable[[AlignedSegment, AlignedSegment], bool]
|
A function that takes two alignments and determines proper pair status. |
is_proper_pair
|
isize
|
Callable[[AlignedSegment, AlignedSegment], int]
|
A function that takes the two alignments and calculates their isize. |
isize
|
Source code in fgpyo/sam/__init__.py
Add a tag to all records associated with the template.
Setting a tag to None will remove the tag.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
tag
|
str
|
The name of the tag. |
required |
value
|
str | int | float | None
|
The value of the tag. |
required |
Source code in fgpyo/sam/__init__.py
Performs sanity checks that all the records in the Template are as expected.
Source code in fgpyo/sam/__init__.py
Write the records associated with the template to file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
writer
|
AlignmentFile
|
An open, writable AlignmentFile. |
required |
primary_only
|
bool
|
If True, only write primary alignments. |
False
|
Source code in fgpyo/sam/__init__.py
TemplateIterator ¶
Bases: Iterator[Template]
An iterator that converts query-grouped reads into templates.
Source code in fgpyo/sam/__init__.py
Functions¶
calculate_edit_info ¶
calculate_edit_info(rec: AlignedSegment, reference_sequence: str, match_htsjdk: bool = False, reference_offset: int | None = None) -> ReadEditInfo | None
Constructs a ReadEditInfo with summary stats about how the read aligns to the reference.
Computes the number of mismatches, indels, indel bases as well as the SAM NM and MD tags.
Calculation of NM and MD tags is based off of htsjdk: https://github.com/samtools/htsjdk/blob/7034b33636b4cb9fec300a2136588e7c12c7ccd5/src/main/java/htsjdk/samtools/util/SequenceUtil.java#L964:L1029
Per the SAM specification (https://samtools.github.io/hts-specs/SAMtags.pdf), the NM tag
encapsulates the number of differences between the query read and reference sequence, counting
only A, C, G and T bases (case-insensitive). Everything else should be considered a mismatch
(e.g., ambiguity codes like R and N). We set the default of match_htsjdk to False to be
concordant with the SAM specification. Conversely, htsjdk treats an N->N as a match.
If the read is unmapped or the query sequence contains missing bases (*), returns None, as it
is not possible to recalculate the MD and NM tags without access to the query sequence and
reference sequence.
The order of the CIGAR operator checks is for performance and modeled after htsjdk's
calculateMdAndNmTags.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec
|
AlignedSegment
|
the read/record for which to calculate values |
required |
reference_sequence
|
str
|
the reference sequence (or fragment thereof) to which the read is aligned |
required |
match_htsjdk
|
bool
|
if True, mirror htsjdk |
False
|
reference_offset
|
int | None
|
if provided, assume that reference_sequence[reference_offset] is the first base aligned to in reference_sequence, otherwise use rec.reference_start |
None
|
Returns:
| Type | Description |
|---|---|
ReadEditInfo | None
|
a ReadEditInfo with information about how the read differs from the reference |
Source code in fgpyo/sam/__init__.py
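The counting rules above can be illustrated with a simplified walk over a CIGAR, given as (length, operator) tuples, alongside the read and reference strings. This sketch follows the SAM-specification behaviour described above (any read base other than A/C/G/T counts as a mismatch); it does not compute the MD tag or model match_htsjdk, and the function name and dictionary layout are hypothetical.

```python
from typing import Dict, List, Tuple

CANONICAL_BASES = frozenset("ACGT")


def sketch_edit_counts(
    read: str,
    reference: str,
    elements: List[Tuple[int, str]],
    reference_offset: int = 0,
) -> Dict[str, int]:
    """Count matches, mismatches and indels, then derive NM from them."""
    counts = {
        "matches": 0, "mismatches": 0,
        "insertions": 0, "inserted_bases": 0,
        "deletions": 0, "deleted_bases": 0,
    }
    query_pos = 0
    target_pos = reference_offset
    for length, operator in elements:
        if operator in ("M", "=", "X"):
            for i in range(length):
                query_base = read[query_pos + i].upper()
                target_base = reference[target_pos + i].upper()
                if query_base in CANONICAL_BASES and query_base == target_base:
                    counts["matches"] += 1
                else:
                    counts["mismatches"] += 1
            query_pos += length
            target_pos += length
        elif operator == "I":
            counts["insertions"] += 1
            counts["inserted_bases"] += length
            query_pos += length
        elif operator == "D":
            counts["deletions"] += 1
            counts["deleted_bases"] += length
            target_pos += length
        elif operator == "S":
            query_pos += length  # soft clips consume query bases only
        elif operator == "N":
            target_pos += length  # skipped reference region
        # H and P consume neither sequence.
    counts["nm"] = counts["mismatches"] + counts["inserted_bases"] + counts["deleted_bases"]
    return counts
```

Note that an N in the read aligned to an N in the reference is counted as a mismatch here, which is the behaviour the text above attributes to the SAM specification rather than to htsjdk.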
is_proper_pair ¶
is_proper_pair(rec1: AlignedSegment, rec2: AlignedSegment | None = None, max_insert_size: int = 1000, orientations: Collection[PairOrientation] = DefaultProperlyPairedOrientations, isize: Callable[[AlignedSegment, AlignedSegment | None], int] = isize) -> bool
Determines if a pair of records are properly paired or not.
Criteria for records in a proper pair are
- Both records are aligned
- Both records are aligned to the same reference sequence
- The pair orientation of the records is one of the valid pair orientations (default "FR")
- The inferred insert size is not more than a maximum length (default 1000)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec1
|
AlignedSegment
|
The first record in the pair. |
required |
rec2
|
AlignedSegment | None
|
The second record in the pair. If None, then mate info on |
None
|
max_insert_size
|
int
|
The maximum insert size to consider a pair "proper". |
1000
|
orientations
|
Collection[PairOrientation]
|
The valid set of orientations to consider a pair "proper". |
DefaultProperlyPairedOrientations
|
isize
|
Callable[[AlignedSegment, AlignedSegment | None], int]
|
A function that takes the two alignments and calculates their isize. |
isize
|
Source code in fgpyo/sam/__init__.py
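The four criteria listed for is_proper_pair can be sketched without pysam as follows. This is an illustration under stated assumptions, not the library's code: `Rec` is a hypothetical stand-in for an alignment record using 0-based, half-open coordinates, the insert size is taken as the distance between the two 5' ends, and the orientation logic only distinguishes "FR", "RF" and "TANDEM".

```python
from dataclasses import dataclass
from typing import Collection, Optional


@dataclass(frozen=True)
class Rec:
    """Hypothetical minimal alignment record (0-based, half-open coordinates)."""
    reference_name: Optional[str]
    start: int
    end: int
    is_reverse: bool
    is_unmapped: bool = False


def five_prime(rec: Rec) -> int:
    """Position of the 5' end of the read on the reference."""
    return rec.end if rec.is_reverse else rec.start


def simple_isize(rec1: Rec, rec2: Rec) -> int:
    """Assumed insert-size convention: signed distance between the 5' ends."""
    return five_prime(rec2) - five_prime(rec1)


def orientation(rec1: Rec, rec2: Rec) -> str:
    """Classify a mapped pair as FR, RF or TANDEM (simplified)."""
    if rec1.is_reverse == rec2.is_reverse:
        return "TANDEM"
    fwd, rev = (rec2, rec1) if rec1.is_reverse else (rec1, rec2)
    return "FR" if fwd.start < rev.end else "RF"


def looks_properly_paired(
    rec1: Rec,
    rec2: Rec,
    max_insert_size: int = 1000,
    orientations: Collection[str] = ("FR",),
) -> bool:
    """Apply the four criteria in order: mapped, same reference, orientation, size."""
    if rec1.is_unmapped or rec2.is_unmapped:
        return False
    if rec1.reference_name != rec2.reference_name:
        return False
    if orientation(rec1, rec2) not in orientations:
        return False
    return abs(simple_isize(rec1, rec2)) <= max_insert_size
```

Keeping the insert-size calculation as a separate function mirrors the `isize` callable parameter in the signature above, which allows a different convention to be swapped in.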
isize ¶
Computes the insert size ("template length" or "TLEN") for a pair of records.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec1
|
AlignedSegment
|
The first record in the pair. |
required |
rec2
|
AlignedSegment | None
|
The second record in the pair. If None, then mate info on |
None
|
Source code in fgpyo/sam/__init__.py
reader ¶
reader(path: SamPath, file_type: SamFileType | None = None, unmapped: bool = False) -> AlignmentFile
Opens a SAM/BAM/CRAM for reading.
To read from standard input, provide any of "-", "stdin", or "/dev/stdin" as the input
path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
SamPath
|
a file handle or path to the SAM/BAM/CRAM to read. |
required |
file_type
|
SamFileType | None
|
the file type to assume when opening the file. If None, then the file type will be auto-detected. |
None
|
unmapped
|
bool
|
True if the file is unmapped and has no sequence dictionary, False otherwise. |
False
|
Source code in fgpyo/sam/__init__.py
set_mate_info ¶
set_mate_info(rec1: AlignedSegment, rec2: AlignedSegment, is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair, isize: Callable[[AlignedSegment, AlignedSegment], int] = isize) -> None
Resets mate pair information between two primary alignments that share a query name.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec1
|
AlignedSegment
|
The first record in the pair. |
required |
rec2
|
AlignedSegment
|
The second record in the pair. |
required |
is_proper_pair
|
Callable[[AlignedSegment, AlignedSegment], bool]
|
A function that takes the two alignments and determines proper pair status. |
is_proper_pair
|
isize
|
Callable[[AlignedSegment, AlignedSegment], int]
|
A function that takes the two alignments and calculates their isize. |
isize
|
Raises:
| Type | Description |
|---|---|
ValueError
|
If rec1 and rec2 are of the same read ordinal. |
ValueError
|
If either rec1 or rec2 is secondary or supplementary. |
ValueError
|
If rec1 and rec2 do not share the same query name. |
Source code in fgpyo/sam/__init__.py
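The three error conditions listed above amount to a short precondition check. The sketch below shows one way such checks could look; `Rec` and `check_can_set_mate_info` are hypothetical names, and the error messages are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rec:
    """Hypothetical minimal record carrying only the fields the checks need."""
    query_name: str
    is_read1: bool
    is_secondary: bool = False
    is_supplementary: bool = False


def check_can_set_mate_info(rec1: Rec, rec2: Rec) -> None:
    """Raise ValueError if the two records cannot be treated as primary mates."""
    if rec1.is_read1 == rec2.is_read1:
        raise ValueError("records are of the same read ordinal")
    for rec in (rec1, rec2):
        if rec.is_secondary or rec.is_supplementary:
            raise ValueError("records must be primary alignments")
    if rec1.query_name != rec2.query_name:
        raise ValueError("records do not share the same query name")
```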
set_mate_info_on_secondary ¶
Set mate info on a secondary alignment from its mate's primary alignment.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
secondary
|
AlignedSegment
|
The secondary alignment to set mate information upon. |
required |
mate_primary
|
AlignedSegment
|
The primary alignment of the secondary's mate. |
required |
Raises:
| Type | Description |
|---|---|
ValueError
|
If secondary and mate_primary are of the same read ordinal. |
ValueError
|
If secondary and mate_primary do not share the same query name. |
ValueError
|
If mate_primary is secondary or supplementary. |
ValueError
|
If secondary is not marked as a secondary alignment. |
Source code in fgpyo/sam/__init__.py
set_mate_info_on_supplementary ¶
Set mate info on a supplementary alignment from its mate's primary alignment.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
supp
|
AlignedSegment
|
The supplementary alignment to set mate information upon. |
required |
mate_primary
|
AlignedSegment
|
The primary alignment of the supplementary's mate. |
required |
Raises:
| Type | Description |
|---|---|
ValueError
|
If supp and mate_primary are of the same read ordinal. |
ValueError
|
If supp and mate_primary do not share the same query name. |
ValueError
|
If mate_primary is secondary or supplementary. |
ValueError
|
If supp is not marked as a supplementary alignment. |
Source code in fgpyo/sam/__init__.py
set_pair_info ¶
Resets mate pair information between reads in a pair.
Can be handed reads that already have pairing flags set up, or independent R1 and R2 records that are currently flagged as single-end (SE) reads.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
r1
|
AlignedSegment
|
Read 1 (first read in the template). |
required |
r2
|
AlignedSegment
|
Read 2 with the same query name as r1 (second read in the template). |
required |
proper_pair
|
bool
|
whether the pair is proper or not. |
True
|
Source code in fgpyo/sam/__init__.py
sum_of_base_qualities ¶
Calculate the sum of base quality scores for an alignment record.
This function is useful for calculating the "mate score" as implemented in samtools fixmate.
Consistent with samtools fixmate, this function returns 0 if the record has no base
qualities.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec
|
AlignedSegment
|
The alignment record to calculate the sum of base qualities from. |
required |
min_quality_score
|
int
|
The minimum base quality score to use for summation. |
15
|
Returns:
| Type | Description |
|---|---|
int
|
The sum of base qualities on the input record. 0 if the record has no base qualities. |
Source code in fgpyo/sam/__init__.py
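As a rough illustration of the calculation, the sketch below sums a plain list of qualities. It assumes that scores equal to the threshold are included in the sum; the function name `sum_of_qualities` and the use of `None` for "no base qualities" are choices made for this example only.

```python
from typing import Optional, Sequence


def sum_of_qualities(quals: Optional[Sequence[int]], min_quality_score: int = 15) -> int:
    """Sum the qualities at or above the threshold; 0 if there are no qualities."""
    if quals is None:
        return 0
    # Assumption: the threshold is inclusive (q >= min_quality_score).
    return sum(q for q in quals if q >= min_quality_score)
```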
writer ¶
writer(path: SamPath, header: str | dict[str, Any] | AlignmentHeader, file_type: SamFileType | None = None) -> AlignmentFile
Opens a SAM/BAM/CRAM for writing.
To write to standard output, provide any of "-", "stdout", or "/dev/stdout" as the output
path. Note: When writing to stdout, the file_type must be given.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
SamPath
|
a file handle or path to the SAM/BAM/CRAM to write. |
required |
header
|
str | dict[str, Any] | AlignmentHeader
|
Either a string to use for the header or a multi-level dictionary. The multi-level dictionary should be given as follows. The first level are the four types (‘HD’, ‘SQ’, ...). The second level are a list of lines, with each line being a list of tag-value pairs. The header is constructed first from all the defined fields, followed by user tags in alphabetical order. |
required |
file_type
|
SamFileType | None
|
the file type to assume when opening the file. If |
None
|
Source code in fgpyo/sam/__init__.py
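To make the multi-level header dictionary more concrete, here is a hedged sketch of one possible layout and a toy renderer that turns it into SAM header text. The layout (a single mapping for `HD`, lists of mappings for `SQ` and `RG`) is an assumption based on the description above, and `render_header` is a hypothetical helper that simply emits fields in insertion order; it does not reproduce the field ordering rules mentioned in the parameter description.

```python
from typing import Any, Dict

# Example header: one HD line, two SQ lines, one RG line (values are made up).
header: Dict[str, Any] = {
    "HD": {"VN": "1.6", "SO": "coordinate"},
    "SQ": [{"SN": "chr1", "LN": 1000}, {"SN": "chr2", "LN": 500}],
    "RG": [{"ID": "A", "SM": "sample1"}],
}


def render_header(header: Dict[str, Any]) -> str:
    """Render the dictionary as tab-separated SAM header lines."""
    lines = []
    for record_type, entries in header.items():
        # A single mapping (e.g. HD) is treated as a one-line list.
        if isinstance(entries, dict):
            entries = [entries]
        for entry in entries:
            fields = "\t".join(f"{tag}:{value}" for tag, value in entry.items())
            lines.append(f"@{record_type}\t{fields}")
    return "\n".join(lines) + "\n"
```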
Modules¶
builder ¶
Classes for generating SAM and BAM files and records for testing.¶
This module contains utility classes for the generation of SAM and BAM files and alignment records, for use in testing.
Classes¶
Builder for constructing one or more SAM records (AlignedSegments in pysam terms).
Provides the ability to manufacture records from minimal arguments, while generating any remaining attributes to ensure a valid record.
A builder is constructed with a handful of defaults including lengths for generated R1s and R2s, the default base quality score to use, a sequence dictionary and a single read group.
Records are then added using the add_pair()
method. Once accumulated the records can be accessed in the order in which they were created
through the to_unsorted_list()
function, or in a list sorted by coordinate order via
to_sorted_list(). The latter creates
a temporary file to do the sorting and is somewhat slower as a result. Lastly, the records can
be written to a temporary file using
to_path().
Source code in fgpyo/sam/builder.py
__init__(r1_len: int | None = None, r2_len: int | None = None, base_quality: int = 30, mapping_quality: int = 60, sd: list[dict[str, Any]] | None = None, rg: dict[str, str] | None = None, extra_header: dict[str, Any] | None = None, seed: int = 42, sort_order: SamOrder = Coordinate) -> None
Initializes a new SamBuilder for generating alignment records and SAM/BAM files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
r1_len
|
int | None
|
The length of R1s to create unless otherwise specified |
None
|
r2_len
|
int | None
|
The length of R2s to create unless otherwise specified |
None
|
base_quality
|
int
|
The base quality of bases to create unless otherwise specified |
30
|
mapping_quality
|
int
|
The mapping quality of records to create unless otherwise specified |
60
|
sd
|
list[dict[str, Any]] | None
|
a sequence dictionary as a list of dicts; defaults to calling default_sd() if None |
None
|
rg
|
dict[str, str] | None
|
a single read group as a dict; defaults to calling default_rg() if None |
None
|
extra_header
|
dict[str, Any] | None
|
a dictionary of extra values to add to the header, None otherwise. See
|
None
|
seed
|
int
|
a seed value for random number/string generation |
42
|
sort_order
|
SamOrder
|
Order to sort records when writing to file, or output of to_sorted_list() |
Coordinate
|
Source code in fgpyo/sam/builder.py
add_pair(*, name: str | None = None, bases1: str | None = _UNSET, bases2: str | None = _UNSET, quals1: list[int] | None = _UNSET, quals2: list[int] | None = _UNSET, chrom: str | None = None, chrom1: str | None = None, chrom2: str | None = None, start1: int = NO_REF_POS, start2: int = NO_REF_POS, cigar1: str | None = None, cigar2: str | None = None, mapq1: int | None = None, mapq2: int | None = None, strand1: str = '+', strand2: str = '-', attrs: dict[str, Any] | None = None) -> tuple[AlignedSegment, AlignedSegment]
Generates a new pair of reads, adds them to the internal collection, and returns them.
Most fields are optional.
Mapped pairs can be created by specifying both start1 and start2 and either chrom, for
pairs where both reads map to the same contig, or both chrom1 and chrom2, for pairs
where reads map to different contigs. i.e.:
- `add_pair(chrom, start1, start2)` will create a mapped pair where both reads map to
the same contig (`chrom`).
- `add_pair(chrom1, start1, chrom2, start2)` will create a mapped pair where the reads
map to different contigs (`chrom1` and `chrom2`).
A pair with only one of the two reads mapped can be created by setting only one start position. Flags will automatically be set correctly for the unmapped mate.
- `add_pair(chrom, start1)`
- `add_pair(chrom1, start1)`
- `add_pair(chrom, start2)`
- `add_pair(chrom2, start2)`
An unmapped pair can be created by calling the method with no parameters (specifically,
not setting chrom, chrom1, start1, chrom2, or start2). If either cigar is
provided, it will be ignored.
For a given read (i.e. R1 or R2) the length of the read is determined based on the presence or absence of bases, quals, and cigar. If values are provided for one or more of these parameters, the lengths must match, and the length will be used to generate any unsupplied values. If none of bases, quals, and cigar are provided, all three will be synthesized based on either the r1_len or r2_len stored on the class as appropriate.
When synthesizing, bases are always a random sequence of bases, quals are all the default base quality (supplied when constructing a SamBuilder) and the cigar is always a single M operator of the read length.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name
|
str | None
|
The name of the template. If None is given a unique name will be auto-generated. |
None
|
bases1
|
str | None
|
The bases for R1. If omitted, a random sequence is generated. Pass |
_UNSET
|
bases2
|
str | None
|
The bases for R2. If omitted, a random sequence is generated. Pass |
_UNSET
|
quals1
|
list[int] | None
|
The list of int qualities for R1. If omitted, the default base quality is used.
Pass |
_UNSET
|
quals2
|
list[int] | None
|
The list of int qualities for R2. If omitted, the default base quality is used.
Pass |
_UNSET
|
chrom
|
str | None
|
The chromosome to which both reads are mapped. Defaults to the unmapped value. |
None
|
chrom1
|
str | None
|
The chromosome to which R1 is mapped. If None, |
None
|
chrom2
|
str | None
|
The chromosome to which R2 is mapped. If None, |
None
|
start1
|
int
|
The start position of R1. Defaults to the unmapped value. |
NO_REF_POS
|
start2
|
int
|
The start position of R2. Defaults to the unmapped value. |
NO_REF_POS
|
cigar1
|
str | None
|
The cigar string for R1. Defaults to None for unmapped reads, otherwise all M. |
None
|
cigar2
|
str | None
|
The cigar string for R2. Defaults to None for unmapped reads, otherwise all M. |
None
|
mapq1
|
int | None
|
Mapping quality for R1. Defaults to self.mapping_quality if None. |
None
|
mapq2
|
int | None
|
Mapping quality for R2. Defaults to self.mapping_quality if None. |
None
|
strand1
|
str
|
The strand for R1, either "+" or "-". Defaults to "+". |
'+'
|
strand2
|
str
|
The strand for R2, either "+" or "-". Defaults to "-". |
'-'
|
attrs
|
dict[str, Any] | None
|
An optional dictionary of SAM attributes to place on both R1 and R2. |
None
|
Raises:
| Type | Description |
|---|---|
ValueError
|
if either strand field is not "+" or "-" |
ValueError
|
if bases/quals/cigar are set in a way that is not self-consistent |
Returns:
| Type | Description |
|---|---|
tuple[AlignedSegment, AlignedSegment]
|
The pair of records created, R1 then R2. |
Source code in fgpyo/sam/builder.py
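The rule for deriving a read's length from bases, quals, and cigar can be sketched as below. This is an illustration only: `resolve_read` and `cigar_query_length` are hypothetical helpers, and for simplicity `None` means "not supplied" here, whereas the real signature uses a separate `_UNSET` default.

```python
import random
import re
from typing import List, Optional, Tuple

_QUERY_CONSUMING = set("MIS=X")  # CIGAR operators that consume query bases


def cigar_query_length(cigar: str) -> int:
    """Number of query bases implied by a CIGAR string."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in _QUERY_CONSUMING)


def resolve_read(
    bases: Optional[str] = None,
    quals: Optional[List[int]] = None,
    cigar: Optional[str] = None,
    default_len: int = 100,
    base_quality: int = 30,
    seed: int = 42,
) -> Tuple[str, List[int], str]:
    """Check that supplied values agree on length, then synthesize the rest."""
    lengths = set()
    if bases is not None:
        lengths.add(len(bases))
    if quals is not None:
        lengths.add(len(quals))
    if cigar is not None:
        lengths.add(cigar_query_length(cigar))
    if len(lengths) > 1:
        raise ValueError("bases, quals and cigar imply different read lengths")
    length = lengths.pop() if lengths else default_len

    rng = random.Random(seed)
    if bases is None:
        bases = "".join(rng.choice("ACGT") for _ in range(length))
    if quals is None:
        quals = [base_quality] * length
    if cigar is None:
        cigar = f"{length}M"  # a single M operator of the read length
    return bases, quals, cigar
```

Supplying only `bases="ACGT"` yields four default qualities and the cigar `4M`; supplying only `cigar="5M2I3M"` yields ten random bases.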
add_single(*, name: str | None = None, read_num: int | None = None, bases: str | None = _UNSET, quals: list[int] | None = _UNSET, chrom: str = NO_REF_NAME, start: int = NO_REF_POS, cigar: str | None = None, mapq: int | None = None, strand: str = '+', secondary: bool = False, supplementary: bool = False, attrs: dict[str, Any] | None = None) -> AlignedSegment
Generates a new single read, adds it to the internal collection, and returns it.
Most fields are optional.
If read_num is None (the default) an unpaired read will be created. If read_num is
set to 1 or 2, the read will have its paired flag and read number flags set.
An unmapped read can be created by calling the method with no parameters (specifically, not setting chrom or start). If cigar is provided, it will be ignored.
A mapped read is created by providing chrom and start.
The length of the read is determined based on the presence or absence of bases, quals, and cigar. If values are provided for one or more of these parameters, the lengths must match, and the length will be used to generate any unsupplied values. If none of bases, quals, and cigar are provided, all three will be synthesized based on either the r1_len or r2_len stored on the class as appropriate.
When synthesizing, bases are always a random sequence of bases, quals are all the default base quality (supplied when constructing a SamBuilder) and the cigar is always a single M operator of the read length.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name
|
str | None
|
The name of the template. If None is given a unique name will be auto-generated. |
None
|
read_num
|
int | None
|
Either None, 1 for R1 or 2 for R2 |
None
|
bases
|
str | None
|
The bases for the read. If omitted, a random sequence is generated. Pass
|
_UNSET
|
quals
|
list[int] | None
|
The list of qualities for the read. If omitted, the default base quality is
used. Pass |
_UNSET
|
chrom
|
str
|
The chromosome to which the read is mapped. Defaults to the unmapped value. |
NO_REF_NAME
|
start
|
int
|
The start position of the read. Defaults to the unmapped value. |
NO_REF_POS
|
cigar
|
str | None
|
The cigar string for the read. Defaults to None for unmapped reads, otherwise all M. |
None
|
mapq
|
int | None
|
Mapping quality for the read. Defaults to self.mapping_quality if not given. |
None
|
strand
|
str
|
The strand for the read, either "+" or "-". Defaults to "+". |
'+'
|
secondary
|
bool
|
If true the read will be flagged as secondary |
False
|
supplementary
|
bool
|
If true the read will be flagged as supplementary |
False
|
attrs
|
dict[str, Any] | None
|
An optional dictionary of SAM attributes to place on the read. |
None
|
Raises:
| Type | Description |
|---|---|
ValueError
|
if strand field is not "+" or "-" |
ValueError
|
if read_num is not None, 1 or 2 |
ValueError
|
if bases/quals/cigar are set in a way that is not self-consistent |
Returns:
| Name | Type | Description |
|---|---|---|
AlignedSegment |
AlignedSegment
|
The record created |
Source code in fgpyo/sam/builder.py
staticmethod ¶
Returns the default read group used by the SamBuilder, as a dictionary.
staticmethod ¶
Generates the sequence dictionary that is used by default by SamBuilder.
Matches the names and lengths of the HG19 reference in use in production.
Returns:
| Type | Description |
|---|---|
list[dict[str, Any]]
|
A new copy of the sequence dictionary as a list of dictionaries, one per chromosome. |
Source code in fgpyo/sam/builder.py
Returns the single read group that is defined in the header.
Source code in fgpyo/sam/builder.py
Returns the ID of the single read group that is defined in the header.
Source code in fgpyo/sam/builder.py
to_path(path: Path | None = None, index: bool = True, pred: Callable[[AlignedSegment], bool] = lambda _r: True, tmp_file_type: SamFileType | None = None) -> Path
Writes the accumulated records to a file, sorts & indexes it, and returns the Path.
If a path is provided, it will be written to, otherwise a temporary file is created and returned.
If path is provided, tmp_file_type must not be provided; the file type (SAM/BAM/CRAM)
is then determined automatically from the file extension of path.
See pysam for more details.
If path is not provided, the file type will default to BAM unless tmp_file_type is
provided.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
Path | None
|
a path at which to write the file, otherwise a temp file is used. |
None
|
index
|
bool
|
if True and |
True
|
pred
|
Callable[[AlignedSegment], bool]
|
optional predicate to specify which reads should be output |
lambda _r: True
|
tmp_file_type
|
SamFileType | None
|
the file type to output when a path is not provided (default is BAM) |
None
|
Returns:
| Name | Type | Description |
|---|---|---|
Path |
Path
|
The path to the sorted (and possibly indexed) file. |
Source code in fgpyo/sam/builder.py
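The interplay between path and tmp_file_type described for to_path can be summarized in a small decision function. The sketch below is hypothetical: `resolve_file_type` is not part of the library, file types are plain strings rather than `SamFileType` values, and raising `ValueError` for the conflicting case is an assumption.

```python
from pathlib import Path
from typing import Optional

_EXTENSION_TO_TYPE = {".sam": "SAM", ".bam": "BAM", ".cram": "CRAM"}


def resolve_file_type(path: Optional[Path] = None, tmp_file_type: Optional[str] = None) -> str:
    """Pick the output file type from the path extension or the temp-file default."""
    if path is not None:
        if tmp_file_type is not None:
            raise ValueError("tmp_file_type may not be given together with path")
        suffix = path.suffix.lower()
        if suffix not in _EXTENSION_TO_TYPE:
            raise ValueError(f"cannot infer a file type from: {path}")
        return _EXTENSION_TO_TYPE[suffix]
    # No path: a temporary file is used, BAM unless told otherwise.
    return tmp_file_type if tmp_file_type is not None else "BAM"
```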
Returns the accumulated records in coordinate order.
Source code in fgpyo/sam/builder.py
clipping ¶
Utility Functions for Soft-Clipping records in SAM/BAM Files.¶
This module contains utility functions for soft-clipping reads. There are four variants that support clipping the beginnings and ends of reads, and specifying the amount to be clipped in terms of query bases or reference bases:
- softclip_start_of_alignment_by_query() clips the start of the alignment in terms of query bases
- softclip_end_of_alignment_by_query() clips the end of the alignment in terms of query bases
- softclip_start_of_alignment_by_ref() clips the start of the alignment in terms of reference bases
- softclip_end_of_alignment_by_ref() clips the end of the alignment in terms of reference bases
The difference between query and reference based versions is apparent only when there are insertions or deletions in the read as indels have lengths on either the query (insertions) or reference (deletions) but not both.
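The query/reference distinction can be seen by measuring a CIGAR string both ways. The following sketch uses a hypothetical helper, `cigar_lengths`, which relies only on the standard SAM convention for which operators consume query bases and which consume reference bases.

```python
import re
from typing import Tuple

_QUERY_OPS = set("MIS=X")  # operators that consume query bases
_REF_OPS = set("MDN=X")    # operators that consume reference bases


def cigar_lengths(cigar: str) -> Tuple[int, int]:
    """Return (query_length, reference_length) implied by a CIGAR string."""
    query = ref = 0
    for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op in _QUERY_OPS:
            query += int(count)
        if op in _REF_OPS:
            ref += int(count)
    return query, ref
```

For a cigar of `10M1D10M5I76M` the two lengths differ (101 query bases versus 97 reference bases), whereas for `101M` they are identical.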
Upon clipping a set of additional SAM tags are removed from reads as they are likely invalid.
For example, to clip the last 10 query bases of all records and reduce the qualities to Q2:
>>> from fgpyo.sam import reader, clipping
>>> with reader("./tests/fgpyo/sam/data/valid.sam") as fh:
... for rec in fh:
... before = rec.cigarstring
... info = clipping.softclip_end_of_alignment_by_query(rec, 10, 2)
... after = rec.cigarstring
... print(f"before: {before} after: {after} info: {info}")
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 101M after: 91M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: 10M1D10M5I76M after: 10M1D10M5I66M10S info: ClippingInfo(query_bases_clipped=10, ref_bases_clipped=10)
before: None after: None info: ClippingInfo(query_bases_clipped=0, ref_bases_clipped=0)
It should be noted that any clipping potentially makes the common SAM tags NM, MD and UQ invalid, as well as potentially other alignment based SAM tags. Any clipping added to the start of an alignment changes the position (reference_start) of the record. Any reads that have no aligned bases after clipping are set to be unmapped. If writing the clipped reads back to a BAM it should be noted that:
- Mate pairs may have incorrect information about their mate's positions
- Even if the input was coordinate sorted, the output may be out of order
To rectify these problems it is necessary to do the equivalent of:
Classes¶
Bases: NamedTuple
Named tuple holding the number of bases clipped on the query and reference respectively.
Source code in fgpyo/sam/clipping.py
Functions¶
softclip_end_of_alignment_by_query(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: int | None = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo
Adds soft-clipping to the end of a read's alignment.
Clipping is applied before any existing hard or soft clipping. E.g. a read with cigar 100M5S that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.
If the read is unmapped or bases_to_clip < 1 then nothing is done.
If the read has fewer clippable bases than requested the read will be unmapped.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec
|
AlignedSegment
|
the BAM record to clip |
required |
bases_to_clip
|
int
|
the number of additional bases of clipping desired in the read/query |
required |
clipped_base_quality
|
int | None
|
if not None, set bases in the clipped region to this quality |
None
|
tags_to_invalidate
|
Iterable[str]
|
the set of extended attributes to remove upon clipping |
TAGS_TO_INVALIDATE
|
Returns:
| Name | Type | Description |
|---|---|---|
ClippingInfo |
ClippingInfo
|
a named tuple containing the number of query/read bases and the number of target/reference bases clipped. |
Source code in fgpyo/sam/clipping.py
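The effect on the CIGAR string can be sketched in pure Python. This is a simplified illustration, not the library's algorithm: it works on a CIGAR string rather than a record, does not touch bases, qualities, or tags, converts any insertion it reaches wholly to soft-clipping, and returns `None` to signal that no aligned bases remain (the "unmapped" case). The name `softclip_end_by_query` is hypothetical.

```python
import re
from typing import List, Optional, Tuple


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operator) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def softclip_end_by_query(cigar: str, bases_to_clip: int) -> Optional[str]:
    """Soft-clip additional query bases at the end of an alignment's CIGAR."""
    if bases_to_clip < 1:
        return cigar
    ops = parse_cigar(cigar)

    # Set aside existing trailing clipping; new clipping is added before it.
    hard = soft = 0
    while ops and ops[-1][1] in "HS":
        length, op = ops.pop()
        if op == "H":
            hard += length
        else:
            soft += length

    remaining = bases_to_clip
    while ops and remaining > 0 and ops[-1][1] not in "HS":
        length, op = ops.pop()
        if op in "M=X":
            take = min(length, remaining)
            remaining -= take
            soft += take
            if length > take:
                ops.append((length - take, op))
        elif op == "I":
            # Simplification: the whole insertion becomes soft-clipping.
            soft += length
            remaining -= min(length, remaining)
        # D, N and P consume no query bases and are simply dropped.

    # The alignment should not end on an indel next to the clipping.
    while ops and ops[-1][1] in "IDNP":
        length, op = ops.pop()
        if op == "I":
            soft += length

    if not any(op in "M=X" for _, op in ops):
        return None  # no aligned bases left
    if soft:
        ops.append((soft, "S"))
    if hard:
        ops.append((hard, "H"))
    return "".join(f"{length}{op}" for length, op in ops)
```

Clipping 10 bases from `100M5S` gives `90M15S`, matching the example in the description above.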
softclip_end_of_alignment_by_ref(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: int | None = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo
Soft-clips the end of an alignment by bases_to_clip bases on the reference.
Clipping is applied before any existing hard or soft clipping. E.g. a read with cigar 100M5S that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.
If the read is unmapped or bases_to_clip < 1 then nothing is done.
If the read has fewer clippable bases than requested the read will be unmapped.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec
|
AlignedSegment
|
the BAM record to clip |
required |
bases_to_clip
|
int
|
the number of additional bases of clipping desired on the reference |
required |
clipped_base_quality
|
int | None
|
if not None, set bases in the clipped region to this quality |
None
|
tags_to_invalidate
|
Iterable[str]
|
the set of extended attributes to remove upon clipping |
TAGS_TO_INVALIDATE
|
Returns:
| Name | Type | Description |
|---|---|---|
ClippingInfo |
ClippingInfo
|
a named tuple containing the number of query/read bases and the number of target/reference bases clipped. |
Source code in fgpyo/sam/clipping.py
softclip_start_of_alignment_by_query(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: int | None = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo
Adds soft-clipping to the start of a read's alignment.
Clipping is applied after any existing hard or soft clipping. E.g. a read with cigar 5S100M that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.
If the read is unmapped or bases_to_clip < 1 then nothing is done.
If the read has fewer clippable bases than requested the read will be unmapped.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec
|
AlignedSegment
|
the BAM record to clip |
required |
bases_to_clip
|
int
|
the number of additional bases of clipping desired in the read/query |
required |
clipped_base_quality
|
int | None
|
if not None, set bases in the clipped region to this quality |
None
|
tags_to_invalidate
|
Iterable[str]
|
the set of extended attributes to remove upon clipping |
TAGS_TO_INVALIDATE
|
Returns:
| Name | Type | Description |
|---|---|---|
ClippingInfo |
ClippingInfo
|
a named tuple containing the number of query/read bases and the number of target/reference bases clipped. |
Source code in fgpyo/sam/clipping.py
softclip_start_of_alignment_by_ref(rec: AlignedSegment, bases_to_clip: int, clipped_base_quality: int | None = None, tags_to_invalidate: Iterable[str] = TAGS_TO_INVALIDATE) -> ClippingInfo
Soft-clips the start of an alignment by bases_to_clip bases on the reference.
Clipping is applied after any existing hard or soft clipping. E.g. a read with cigar 5S100M that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.
If the read is unmapped or bases_to_clip < 1 then nothing is done.
If the read has fewer clippable bases than requested the read will be unmapped.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec
|
AlignedSegment
|
the BAM record to clip |
required |
bases_to_clip
|
int
|
the number of additional bases of clipping desired on the reference |
required |
clipped_base_quality
|
int | None
|
if not None, set bases in the clipped region to this quality |
None
|
tags_to_invalidate
|
Iterable[str]
|
the set of extended attributes to remove upon clipping |
TAGS_TO_INVALIDATE
|
Returns:
| Name | Type | Description |
|---|---|---|
ClippingInfo |
ClippingInfo
|
a named tuple containing the number of query/read bases and the number of target/reference bases clipped. |
Source code in fgpyo/sam/clipping.py
sequence ¶
Utility Functions for Manipulating DNA and RNA sequences.¶
This module contains utility functions for manipulating DNA and RNA sequences.
levenshtein and hamming functions are included for convenience.
If you are performing many distance calculations, a C-based implementation is preferable,
e.g. https://pypi.org/project/Distance/
Functions¶
complement ¶
gc_content ¶
Calculates the fraction of G and C bases in a sequence.
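A minimal sketch of such a calculation is shown below. It is not the library's code: the name `gc_fraction` is hypothetical, and both the case-insensitive counting and the 0.0 result for an empty sequence are assumptions.

```python
def gc_fraction(bases: str) -> float:
    """Fraction of bases that are G or C (case-insensitive); 0.0 if empty."""
    if not bases:
        return 0.0
    upper = bases.upper()
    return (upper.count("G") + upper.count("C")) / len(bases)
```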
hamming ¶
Calculates hamming distance between two strings, case sensitive.
Strings must be of equal lengths.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
string1
|
str
|
first string for comparison |
required |
string2
|
str
|
second string for comparison |
required |
Raises:
| Type | Description |
|---|---|
ValueError
|
If strings are of different lengths. |
Source code in fgpyo/sequence.py
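For reference, a case-sensitive Hamming distance with the length check described above can be written in a few lines. The function name `hamming_distance` is chosen for this sketch and is not the library's.

```python
def hamming_distance(string1: str, string2: str) -> int:
    """Count positions at which two equal-length strings differ (case sensitive)."""
    if len(string1) != len(string2):
        raise ValueError("strings must be of equal length")
    return sum(a != b for a, b in zip(string1, string2))
```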
levenshtein ¶
Calculates levenshtein distance between two strings, case sensitive.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
string1
|
str
|
first string for comparison |
required |
string2
|
str
|
second string for comparison |
required |
Source code in fgpyo/sequence.py
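The standard dynamic-programming formulation of Levenshtein distance is sketched below for illustration; it keeps only two rows of the edit matrix. It is a generic textbook version under the hypothetical name `levenshtein_distance`, not necessarily how the library computes it.

```python
def levenshtein_distance(string1: str, string2: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    # previous[j] holds the distance between string1[:i-1] and string2[:j].
    previous = list(range(len(string2) + 1))
    for i, char1 in enumerate(string1, start=1):
        current = [i]
        for j, char2 in enumerate(string2, start=1):
            cost = 0 if char1 == char2 else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion from string1
                    current[j - 1] + 1,      # insertion into string1
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]
```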
longest_dinucleotide_run_length ¶
Number of bases in the longest dinucleotide run in a primer.
A dinucleotide run is when two nucleotides are repeated in tandem. For example, TTGG (length = 4) or AACCAACCAA (length = 10). If there are no such runs, returns 0.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
bases
|
str
|
the bases over which to compute |
required |
Return
the number of bases in the longest dinuc repeat (NOT the number of repeat units)
Source code in fgpyo/sequence.py
longest_homopolymer_length ¶
Calculates the length of the longest homopolymer in the input sequence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
bases
|
str
|
the bases over which to compute |
required |
Return
the length of the longest homopolymer
Source code in fgpyo/sequence.py
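One compact way to express this calculation is to group consecutive identical characters and take the longest group. This is a sketch under the assumption that comparison is case-sensitive and that an empty input yields 0; `longest_homopolymer_sketch` is a hypothetical name.

```python
from itertools import groupby


def longest_homopolymer_sketch(bases: str) -> int:
    """Length of the longest run of one repeated character in `bases`."""
    # groupby yields one group per run of consecutive identical characters.
    run_lengths = (sum(1 for _ in run) for _, run in groupby(bases))
    return max(run_lengths, default=0)


print(longest_homopolymer_sketch("AACCCGT"))  # the CCC run is the longest
```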
longest_hp_length ¶
Calculates the length of the longest homopolymer in the input sequence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
bases
|
str
|
the bases over which to compute |
required |
Return
the length of the longest homopolymer
Source code in fgpyo/sequence.py
longest_multinucleotide_run_length ¶
Number of bases in the longest multi-nucleotide run.
A multi-nucleotide run is when N nucleotides are repeated in tandem. For example, TTGG (length = 4, N=2) or TAGTAGTAG (length = 9, N = 3). If there are no such runs, returns 0.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
bases
|
str
|
the bases over which to compute |
required |
repeat_unit_length
|
int
|
the length of the multi-nucleotide repetitive unit (must be > 0) |
required |
Returns:
| Type | Description |
|---|---|
int
|
the number of bases in the longest multinucleotide repeat (NOT the number of repeat units) |
Source code in fgpyo/sequence.py
reverse_complement ¶
Reverse complements a base sequence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
bases
|
str
|
the bases to be reverse complemented. |
required |
Returns:
| Type | Description |
|---|---|
str
|
the reverse complement of the provided base string |
Source code in fgpyo/sequence.py
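The operation itself is small enough to sketch with a translation table: complement each base, then reverse the string. The table below covers only A, C, G, and T in either case and passes any other character through unchanged; that choice, and the name `reverse_complement_sketch`, are assumptions for this example.

```python
# Translation table mapping each base to its complement, preserving case.
_COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_sketch(bases: str) -> str:
    """Complement every base, then reverse the resulting string."""
    return bases.translate(_COMPLEMENT_TABLE)[::-1]


print(reverse_complement_sketch("AACG"))
```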
util ¶
Modules¶
inspect ¶
Attributes¶
module-attribute
¶TypeAlias for dataclass Fields or attrs Attributes. It will correspond to the correct type for the corresponding _DataclassesOrAttrClass
Classes¶
Functions¶
attr_from(cls: type[_AttrFromType], kwargs: dict[str, str], parsers: dict[type, Callable[[str], Any]] | None = None) -> _AttrFromType
Builds an attr or dataclasses class from key-word arguments.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
cls
|
type[_AttrFromType]
|
the attr or dataclasses class to be built |
required |
kwargs
|
dict[str, str]
|
a dictionary of keyword arguments |
required |
parsers
|
dict[type, Callable[[str], Any]] | None
|
a dictionary of parser functions to apply to specific types |
None
|
Source code in fgpyo/util/inspect.py
dict_parser(cls: type, type_: TypeAnnotation, parsers: dict[type, Callable[[str], Any]] | None = None) -> partial
Returns a function that parses a stringified dict into a Dict of the correct type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
cls
|
type
|
the type of the class object this is being parsed for (used to get default val for parsers) |
required |
type_
|
TypeAnnotation
|
the type of the attribute to be parsed |
required |
parsers
|
dict[type, Callable[[str], Any]] | None
|
an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types) |
None
|
Source code in fgpyo/util/inspect.py
get_fields(cls: _DataclassesOrAttrClass | type[_DataclassesOrAttrClass]) -> tuple[FieldType, ...]
Get the fields tuple from either a dataclasses or attr dataclass (or instance).
Source code in fgpyo/util/inspect.py
get_fields_dict(cls: _DataclassesOrAttrClass | type[_DataclassesOrAttrClass]) -> Mapping[str, FieldType]
Get the fields dict from either a dataclasses or attr dataclass (or instance).
Source code in fgpyo/util/inspect.py
list_parser(cls: type, type_: TypeAnnotation, parsers: dict[type, Callable[[str], Any]] | None = None) -> partial
Returns a function that parses a "stringified" list into a List of the correct type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
cls
|
type
|
the type of the class object this is being parsed for (used to get default val for parsers) |
required |
type_
|
TypeAnnotation
|
the type of the attribute to be parsed |
required |
parsers
|
dict[type, Callable[[str], Any]] | None
|
an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types) |
None
|
Source code in fgpyo/util/inspect.py
set_parser(cls: type, type_: TypeAnnotation, parsers: dict[type, Callable[[str], Any]] | None = None) -> partial
Returns a function that parses a stringified set into a Set of the correct type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
cls
|
type
|
the type of the class object this is being parsed for (used to get default val for parsers) |
required |
type_
|
TypeAnnotation
|
the type of the attribute to be parsed |
required |
parsers
|
dict[type, Callable[[str], Any]] | None
|
an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types) |
None
|
Source code in fgpyo/util/inspect.py
split_at_given_level(field: str, split_delim: str = ',', increase_depth_chars: Iterable[str] = ('{', '(', '['), decrease_depth_chars: Iterable[str] = ('}', ')', ']')) -> list[str]
Splits a nested field by its outer-most level.
Note that this method may produce incorrect results for fields containing strings with unpaired characters that increase or decrease the depth.
Not currently smart enough to deal with fields enclosed in quotes ('' or "") (TODO).
Source code in fgpyo/util/inspect.py
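The idea of splitting only at the outer-most level can be sketched by tracking nesting depth while scanning the string. The function below mirrors the documented parameters, but it is an illustrative stand-in with a hypothetical name, not the fgpyo source. The last call shows the caveat mentioned above: an unpaired closing character leaves the depth non-zero, so the delimiter is not treated as a split point.

```python
from typing import Iterable, List


def split_outer_level_sketch(
    field: str,
    split_delim: str = ",",
    increase_depth_chars: Iterable[str] = ("{", "(", "["),
    decrease_depth_chars: Iterable[str] = ("}", ")", "]"),
) -> List[str]:
    """Split `field` on `split_delim` only where the nesting depth is zero."""
    openers = set(increase_depth_chars)
    closers = set(decrease_depth_chars)
    parts: List[str] = []
    current: List[str] = []
    depth = 0
    for char in field:
        if char in openers:
            depth += 1
        elif char in closers:
            depth -= 1
        if char == split_delim and depth == 0:
            # Outer-level delimiter: close the current part.
            parts.append("".join(current))
            current = []
        else:
            current.append(char)
    parts.append("".join(current))
    return parts


print(split_outer_level_sketch("1,[2,3],{4;5,6;7}"))
print(split_outer_level_sketch("a),b"))  # unpaired ")" prevents the split
```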
tuple_parser(cls: type, type_: TypeAnnotation, parsers: dict[type, Callable[[str], Any]] | None = None) -> partial
Returns a function that parses a stringified tuple into a Tuple of the correct type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
cls
|
type
|
the type of the class object this is being parsed for (used to get default val for parsers) |
required |
type_
|
TypeAnnotation
|
the type of the attribute to be parsed |
required |
parsers
|
dict[type, Callable[[str], Any]] | None
|
an optional mapping from type to the function to use for parsing that type (allows for parsing of more complex types) |
None
|
Source code in fgpyo/util/inspect.py
Modules¶
logging ¶
Methods for setting up logging for tools.¶
Progress Logging Examples¶
Input data (SAM/BAM/CRAM/VCF) are frequently iterated in genomic coordinate order. When logging
progress, it is useful to log not only how many inputs have been consumed, but also their genomic
coordinate. ProgressLogger() can log progress every
fixed number of records. Messages can be written to a logging.Logger or to a custom print
method.
>>> from fgpyo.util.logging import ProgressLogger
>>> logged_lines = []
>>> progress = ProgressLogger(
... printer=lambda s: logged_lines.append(s),
... verb="recorded",
... noun="items",
... unit=2
... )
>>> progress.record(reference_name="chr1", position=1) # does not log
False
>>> progress.record(reference_name="chr1", position=2) # logs
True
>>> progress.record(reference_name="chr1", position=3) # does not log
False
>>> progress.log_last() # will log the last recorded item, if not previously logged
True
>>> logged_lines # show the lines logged
['recorded 2 items: chr1:2', 'recorded 3 items: chr1:3']
Classes¶
Bases: AbstractContextManager
A little class to track progress.
This will output a log message each time the number of recorded items reaches a multiple of unit.
Attributes:
| Name | Type | Description |
|---|---|---|
printer |
Callable[[str], Any]
|
either a Logger (in which case progress will be printed at Info) or a lambda that consumes a single string |
noun |
str
|
the noun to use in the log message |
verb |
str
|
the verb to use in the log message |
unit |
int
|
the number of items for every log message |
count |
int
|
the total count of items recorded |
Source code in fgpyo/util/logging.py
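The counting behaviour described by these attributes can be sketched with a small stand-in class. `ProgressSketch` is hypothetical: it omits genomic coordinates and the context-manager support, and its message format is an assumption rather than the format fgpyo produces.

```python
from typing import Any, Callable, List


class ProgressSketch:
    """Hypothetical stand-in that logs once every `unit` recorded items."""

    def __init__(
        self,
        printer: Callable[[str], Any],
        noun: str = "records",
        verb: str = "Read",
        unit: int = 100000,
    ) -> None:
        self.printer = printer
        self.noun = noun
        self.verb = verb
        self.unit = unit
        self.count = 0
        self._count_at_last_log = 0

    def _log(self) -> None:
        self.printer(f"{self.verb} {self.count} {self.noun}")
        self._count_at_last_log = self.count

    def record(self) -> bool:
        """Record one item; return True if a message was logged."""
        self.count += 1
        if self.count % self.unit == 0:
            self._log()
            return True
        return False

    def log_last(self) -> bool:
        """Log the final count unless it was already logged."""
        if self.count == self._count_at_last_log:
            return False
        self._log()
        return True


lines: List[str] = []
progress = ProgressSketch(printer=lines.append, verb="recorded", noun="items", unit=2)
results = [progress.record() for _ in range(3)]
flushed = progress.log_last()
print(results, flushed, lines)
```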
Logs the final count on exit if no exception occurred.
__init__(printer: Logger | Callable[[str], Any], noun: str = 'records', verb: str = 'Read', unit: int = 100000) -> None
Initializes the progress logger with the given printer and settings.
Source code in fgpyo/util/logging.py
Force logging the last record, for example when progress has completed.
Source code in fgpyo/util/logging.py
Record an item at a given genomic coordinate.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
reference_name
|
str | None
|
the reference name of the item |
None
|
position
|
int | None
|
the 1-based start position of the item |
None
|
Returns: true if a message was logged, false otherwise
Source code in fgpyo/util/logging.py
Correctly record pysam.AlignedSegments (zero-based coordinates).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rec
|
AlignedSegment
|
pysam.AlignedSegment object |
required |
Returns:
| Type | Description |
|---|---|
bool
|
true if a message was logged, false otherwise |
Source code in fgpyo/util/logging.py
Correctly record multiple pysam.AlignedSegments (zero-based coordinates).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
recs
|
Iterable[AlignedSegment]
|
pysam.AlignedSegment objects |
required |
Returns:
| Type | Description |
|---|---|
bool
|
true if a message was logged, false otherwise |
Source code in fgpyo/util/logging.py
Functions¶
Globally configure logging for all modules.
Configures logging to run at a specific level and output messages to stderr with useful information preceding the actual log message.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
level
|
str
|
the default level for the logger |
'INFO'
|
name
|
str
|
the name of the logger |
'fgpyo'
|
Source code in fgpyo/util/logging.py
metric ¶
Metrics.¶
Module for storing, reading, and writing metric-like tab-delimited information.
Metric files are tab-delimited, contain a header, and zero or more rows for metric values. This
makes it easy for them to be read in languages like R. For example, a row per person, with
columns for age, gender, and address.
The Metric() class makes it easy to read, write, and store
one or more metrics of the same type, all the while preserving types for each value in a metric. It is
an abstract base class decorated by
@dataclass, or
@attr.s, with attributes storing one or more
typed values. If using multiple layers of inheritance, keep in mind that it's not possible to mix
these dataclass utils, e.g. a dataclasses class derived from an attr class will not appropriately
initialize the values of the attr superclass.
Examples¶
Defining a new metric class:
>>> from fgpyo.util.metric import Metric
>>> import dataclasses
>>> @dataclasses.dataclass(frozen=True)
... class Person(Metric["Person"]):
... name: str
... age: int
or using attr:
>>> from fgpyo.util.metric import Metric
>>> import attr
>>> @attr.s(auto_attribs=True, frozen=True)
... class PersonAttr(Metric["PersonAttr"]):
... name: str
... age: int
... address: str | None = None
Getting the attributes for a metric class. These will be used for the header when reading and writing metric files.
Getting the values from a metric class instance. The values are in the same order as the header.
Writing a list of metrics to a file:
>>> metrics = [
... Person(name="Alice", age=47),
... Person(name="Bob", age=24)
... ]
>>> from pathlib import Path
>>> Person.write(Path("/path/to/metrics.txt"), *metrics)
Then the contents of the written metrics file:
Reading the metrics file back in:
>>> list(Person.read(Path("/path/to/metrics.txt")))
[Person(name='Alice', age=47), Person(name='Bob', age=24)]
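To make the file layout concrete without depending on fgpyo, the round trip above can be approximated with the standard library alone. This is a sketch: `PersonRow`, `write_rows`, and `read_rows` are hypothetical names, and the explicit `int(...)` conversion stands in for the type-aware parsing that Metric performs.

```python
import csv
import dataclasses
import tempfile
from pathlib import Path
from typing import List


@dataclasses.dataclass(frozen=True)
class PersonRow:
    name: str
    age: int


def write_rows(path: Path, rows: List[PersonRow]) -> None:
    """Write a header line followed by one tab-delimited line per row."""
    fieldnames = [field.name for field in dataclasses.fields(PersonRow)]
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(fieldnames)
        for row in rows:
            writer.writerow(dataclasses.astuple(row))


def read_rows(path: Path) -> List[PersonRow]:
    """Read rows back, converting each column to its declared type by hand."""
    with path.open(newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [PersonRow(name=r["name"], age=int(r["age"])) for r in reader]


with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "metrics.txt"
    people = [PersonRow("Alice", 47), PersonRow("Bob", 24)]
    write_rows(path, people)
    contents = path.read_text()
    round_tripped = read_rows(path)

print(contents)
```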
Formatting and parsing the values for custom types is supported by overriding the _parsers() and
format_value() methods.
>>> @dataclasses.dataclass(frozen=True)
... class Name:
... first: str
... last: str
... @classmethod
... def parse(cls, value: str) -> "Name":
... fields = value.split(" ")
... return Name(first=fields[0], last=fields[1])
>>> from typing import Dict, Callable, Any
>>> @dataclasses.dataclass(frozen=True)
... class PersonWithName(Metric["PersonWithName"]):
... name: Name
... age: int
... @classmethod
... def _parsers(cls) -> Dict[type, Callable[[str], Any]]:
... return {Name: lambda value: Name.parse(value=value)}
... @classmethod
... def format_value(cls, value: Any) -> str:
... if isinstance(value, Name):
... return f"{value.first} {value.last}"
... else:
... return super().format_value(value=value)
>>> PersonWithName.parse(fields=["john doe", "42"])
PersonWithName(name=Name(first='john', last='doe'), age=42)
>>> PersonWithName(name=Name(first='john', last='doe'), age=42).formatted_values()
['john doe', '42']
Classes¶
Bases: ABC, Generic[MetricType]
Abstract base class for all metric-like tab-delimited files.
Metric files are tab-delimited, contain a header, and zero or more rows for metric values. This
makes it easy for them to be read in languages like R.
Subclasses of Metric() can support parsing and
formatting custom types with _parsers() and
format_value().
Source code in fgpyo/util/metric.py
staticmethod
¶Concatenates multiple metric files into one, validating headers match.
Source code in fgpyo/util/metric.py
classmethod
¶The default method to format values of a given type.
By default, this method will comma-delimit list, tuple, and set types, and apply
str to all others.
Dictionaries / mappings will have keys and values separated by semicolons, and key-value pairs delimited by commas.
In addition, lists will be flanked with '[]', tuples with '()', and sets and dictionaries with '{}'.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
value
|
Any
|
the value to format. |
required |
Source code in fgpyo/util/metric.py
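The formatting rules listed above can be sketched as a small recursive function. The exact output of fgpyo may differ (for example in spacing or in how nested values are handled); `format_value_sketch` is a hypothetical name and the recursion into nested elements is an assumption.

```python
from typing import Any


def format_value_sketch(value: Any) -> str:
    """Format collections with flanking brackets; apply str() to the rest."""
    if isinstance(value, list):
        return "[" + ",".join(format_value_sketch(v) for v in value) + "]"
    if isinstance(value, tuple):
        return "(" + ",".join(format_value_sketch(v) for v in value) + ")"
    if isinstance(value, (set, frozenset)):
        return "{" + ",".join(format_value_sketch(v) for v in value) + "}"
    if isinstance(value, dict):
        # Keys and values are joined by ";", pairs are joined by ",".
        pairs = (
            f"{format_value_sketch(k)};{format_value_sketch(v)}"
            for k, v in value.items()
        )
        return "{" + ",".join(pairs) + "}"
    return str(value)


print(format_value_sketch([1, 2, 3]))
print(format_value_sketch({"a": 1, "b": 2}))
```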
An iterator over formatted attribute values in the same order as the header.
An iterator over formatted attribute values in the same order as the header.
classmethod
¶An iterator over field names and values in the same order as the header.
classmethod
¶An iterator over field names in the same order as the header.
classmethod
¶Parses the string-representation of this metric.
One string per attribute should be given.
Source code in fgpyo/util/metric.py
classmethod
¶read(path: Path, ignore_extra_fields: bool = True, strip_whitespace: bool = False, threads: int | None = None) -> Iterator[Any]
Reads in zero or more metrics from the given path.
The metric file must contain a matching header.
Columns that are not present in the file but are optional in the metric class will be set to their default values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
Path
|
the path to the metrics file. |
required |
ignore_extra_fields
|
bool
|
True to ignore any extra columns, False to raise an exception. |
True
|
strip_whitespace
|
bool
|
True to strip leading and trailing whitespace from each field, False to keep as-is. |
False
|
threads
|
int | None
|
the number of threads to use when decompressing gzip files |
None
|
Source code in fgpyo/util/metric.py
classmethod
¶An iterator over attribute values in the same order as the header.
classmethod
¶Writes zero or more metrics to the given path.
The header will always be written.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
Path
|
Path to the output file. |
required |
values
|
MetricType
|
Zero or more metrics. |
()
|
threads
|
int | None
|
the number of threads to use when compressing gzip files |
None
|
Source code in fgpyo/util/metric.py
dataclass
¶Header of a file.
A file's header contains an optional preamble, consisting of lines prefixed by a comment character and/or empty lines, and a required row of fieldnames before the data rows begin.
Attributes:
| Name | Type | Description |
|---|---|---|
preamble |
list[str]
|
A list of any lines preceding the fieldnames. |
fieldnames |
list[str]
|
The field names specified in the final line of the header. |
Source code in fgpyo/util/metric.py
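The header layout described here (an optional preamble of comment or empty lines, then a row of fieldnames) can be illustrated with a short parser. Everything in this sketch is hypothetical: the class and function names, the "#" comment character, and the tab delimiter are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class HeaderSketch:
    preamble: List[str]
    fieldnames: List[str]


def parse_header_sketch(
    lines: Iterable[str], delimiter: str = "\t", comment: str = "#"
) -> HeaderSketch:
    """Collect preamble lines until the first line that holds fieldnames."""
    preamble: List[str] = []
    for line in lines:
        stripped = line.rstrip("\r\n")
        if stripped.startswith(comment) or stripped.strip() == "":
            # Comment lines and empty lines belong to the preamble.
            preamble.append(stripped)
            continue
        return HeaderSketch(preamble=preamble, fieldnames=stripped.split(delimiter))
    raise ValueError("No fieldnames line was found.")


header = parse_header_sketch(["# made by a tool\n", "\n", "name\tage\n", "Alice\t47\n"])
print(header)
```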
Bases: Generic[MetricType], AbstractContextManager
Writes Metric instances to a delimited file.
Source code in fgpyo/util/metric.py
__enter__() -> MetricWriter
__exit__(exc_type: type[BaseException] | None, exc_value: BaseException | None, traceback: TracebackType | None) -> None
Closes the underlying writer on exit.
Source code in fgpyo/util/metric.py
__init__(filename: Path | str, metric_class: type[Metric], append: bool = False, delimiter: str = '\t', include_fields: list[str] | None = None, exclude_fields: list[str] | None = None, lineterminator: str = '\n', threads: int | None = None) -> None
Initializes the MetricWriter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
filename
|
Path | str
|
Path to the file to write. |
required |
metric_class
|
type[Metric]
|
Metric class. |
required |
append
|
bool
|
If |
False
|
delimiter
|
str
|
The output file delimiter. |
'\t'
|
include_fields
|
list[str] | None
|
If specified, only the listed fieldnames will be included when writing
records to file. Fields will be written in the order provided.
May not be used together with |
None
|
exclude_fields
|
list[str] | None
|
If specified, any listed fieldnames will be excluded when writing
records to file.
May not be used together with |
None
|
lineterminator
|
str
|
The string used to terminate lines produced by the MetricWriter. Default = "\n". |
'\n'
|
threads
|
int | None
|
the number of threads to use when compressing gzip files. |
None
|
Raises:
| Type | Description |
|---|---|
TypeError
|
If the provided metric class is not a dataclass- or attr-decorated
subclass of |
AssertionError
|
If the provided filepath is not writable. |
AssertionError
|
If |
ValueError
|
If |
ValueError
|
If |
ValueError
|
If |
Source code in fgpyo/util/metric.py
Write a single Metric instance to file.
The Metric is converted to a dictionary and then written using the underlying
csv.DictWriter. If the MetricWriter was created using the include_fields or
exclude_fields arguments, the fields of the Metric are subset and/or reordered
accordingly before writing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
metric
|
MetricType
|
An instance of the specified Metric. |
required |
Raises:
| Type | Description |
|---|---|
TypeError
|
If the provided |
Source code in fgpyo/util/metric.py
Write multiple Metric instances to file.
Each Metric is converted to a dictionary and then written using the underlying
csv.DictWriter. If the MetricWriter was created using the include_fields or
exclude_fields arguments, the attributes of each Metric are subset and/or reordered
accordingly before writing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
metrics
|
Iterable[MetricType]
|
A sequence of instances of the specified Metric. |
required |
Source code in fgpyo/util/metric.py
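The effect of include_fields and exclude_fields on the written columns can be sketched with csv.DictWriter from the standard library. This is an approximation, not the MetricWriter source: `PersonRow`, `select_fieldnames`, and `write_rows` are hypothetical, and the validation shown is an assumption about reasonable behaviour.

```python
import csv
import dataclasses
import io
from typing import List, Optional


@dataclasses.dataclass(frozen=True)
class PersonRow:
    name: str
    age: int
    city: str


def select_fieldnames(
    all_fields: List[str],
    include_fields: Optional[List[str]] = None,
    exclude_fields: Optional[List[str]] = None,
) -> List[str]:
    """Subset and/or reorder the fieldnames before writing."""
    if include_fields is not None and exclude_fields is not None:
        raise ValueError("Only one of include_fields and exclude_fields may be given.")
    requested = include_fields if include_fields is not None else exclude_fields
    unknown = [name for name in (requested or []) if name not in all_fields]
    if unknown:
        raise ValueError(f"Unknown fieldnames: {unknown}")
    if include_fields is not None:
        return list(include_fields)  # written in the order provided
    if exclude_fields is not None:
        return [name for name in all_fields if name not in exclude_fields]
    return list(all_fields)


def write_rows(rows: List[PersonRow], **field_options: Optional[List[str]]) -> str:
    all_fields = [field.name for field in dataclasses.fields(PersonRow)]
    fieldnames = select_fieldnames(all_fields, **field_options)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",  # silently drop fields that were not selected
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(dataclasses.asdict(row))
    return buffer.getvalue()


rows = [PersonRow("Alice", 47, "Oslo")]
print(write_rows(rows, include_fields=["age", "name"]))
print(write_rows(rows, exclude_fields=["city"]))
```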
Modules¶
string ¶
Functions¶
A simple version of Unix's column utility. This assumes the table is NxM.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
rows
|
list[list[str]]
|
the rows to adjust. Each row must have the same number of delimited fields. |
required |
delimiter
|
str
|
the delimiter for each field in a row. |
' '
|
Source code in fgpyo/util/string.py
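A column-alignment utility of this kind typically pads every cell to the width of the widest cell in its column. The sketch below shows that idea; the name `align_columns_sketch`, the left-justification, and the newline-joined return value are assumptions, and fgpyo's own output may differ.

```python
from typing import List


def align_columns_sketch(rows: List[List[str]], delimiter: str = " ") -> str:
    """Pad each cell to its column's width and join the cells with `delimiter`."""
    if not rows:
        return ""
    # Every row is assumed to have the same number of fields (an NxM table).
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    lines = []
    for row in rows:
        padded = [cell.ljust(width) for cell, width in zip(row, widths)]
        lines.append(delimiter.join(padded))
    return "\n".join(lines)


print(align_columns_sketch([["a", "bbb"], ["cc", "d"]]))
```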
types ¶
Attributes¶
module-attribute
¶A function parameter's type annotation may be any of the following:
1) type, when declaring any of the built-in Python types
2) typing._GenericAlias, when declaring generic collection types or union types using pre-PEP
585 and pre-PEP 604 syntax (e.g. List[int], Optional[int], or Union[int, None])
3) types.UnionType, when declaring union types using PEP604 syntax (e.g. int | None)
4) types.GenericAlias, when declaring generic collection types using PEP 585 syntax (e.g.
list[int])
types.GenericAlias is a subclass of type, but typing._GenericAlias and types.UnionType are
not and must be considered explicitly.
Classes¶
Functions¶
Type guard that checks all Optional collection elements are not None.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
values
|
Iterable[T | None]
|
Collection of Optional elements. |
required |
Returns:
| Type | Description |
|---|---|
bool
|
True if no elements are None, False otherwise. When True, narrows the collection type from |
bool
|
|
Source code in fgpyo/util/types.py
is_constructible_from_str(type_: TypeAnnotation) -> TypeGuard[type]
Returns true if the provided type is a class constructible from a single str argument.
Source code in fgpyo/util/types.py
is_known_str_constructible(type_: TypeAnnotation) -> TypeGuard[type]
Returns true if type_ is one of the built-in types known to be constructible from a str.
Complements is_constructible_from_str, which detects str-constructibility via constructor
signature inspection. This predicate covers types whose constructors aren't annotated for
introspection (e.g. int, str, float) or whose subclasses don't all share an annotation
(e.g. PurePath).
Source code in fgpyo/util/types.py
make_literal_parser(literal: TypeAnnotation, parsers: Iterable[Callable[[str], LiteralType]]) -> partial
Generates a parser function for a literal type object.
make_union_parser(union: TypeAnnotation, parsers: Iterable[Callable[[str], UnionType]]) -> partial
Generates a parser function for a union type object.
Returns None if the value is 'None', else raises an error.
Parses strings into bools.
Accounts for the many different text representations of bools that can be used.
Source code in fgpyo/util/types.py
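A boolean parser that accepts several spellings might look like the sketch below. The accepted spellings here are illustrative assumptions, not the list fgpyo recognizes, and `parse_bool_sketch` is a hypothetical name.

```python
# Assumption: these spellings are examples only; fgpyo may accept a different set.
_TRUE_STRINGS = {"true", "t", "yes", "y", "1"}
_FALSE_STRINGS = {"false", "f", "no", "n", "0"}


def parse_bool_sketch(string: str) -> bool:
    """Parse a textual boolean, ignoring case and surrounding whitespace."""
    normalized = string.strip().lower()
    if normalized in _TRUE_STRINGS:
        return True
    if normalized in _FALSE_STRINGS:
        return False
    raise ValueError(f"{string!r} is not a recognized boolean value")


print(parse_bool_sketch("True"), parse_bool_sketch("0"))
```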
vcf ¶
Classes for generating VCF and records for testing.¶
This module contains utility classes for the generation of VCF files and variant records, for use in testing.
The module contains the following public classes:
VariantBuilder(): a builder class that allows the accumulation of variant records, access to them as a list, and writing them to a file.
Examples¶
Typically, we have pysam.VariantRecord records obtained from reading
from a VCF file. The VariantBuilder() class builds
such records.
Variants are added with the add() method,
which returns a pysam.VariantRecord.
>>> import pysam
>>> from fgpyo.vcf.builder import VariantBuilder
>>> builder: VariantBuilder = VariantBuilder()
>>> new_record_1: pysam.VariantRecord = builder.add() # uses the defaults
>>> new_record_2: pysam.VariantRecord = builder.add(
... contig="chr2", pos=1001, id="rs1234", ref="C", alts=["T"],
... qual=40, filter=["PASS"]
... )
VariantBuilder can create sites-only, single-sample, or multi-sample VCF files. If not producing a sites-only VCF file, VariantBuilder must be created by passing a list of sample IDs:
>>> builder: VariantBuilder = VariantBuilder(sample_ids=["sample1", "sample2"])
>>> new_record_1: pysam.VariantRecord = builder.add() # uses the defaults
>>> new_record_2: pysam.VariantRecord = builder.add(
... samples={"sample1": {"GT": "0|1"}, "sample2": {"GT": "0|0"}}
... )
The variants stored in the builder can be retrieved as a coordinate sorted VCF file via the
to_path() method:
The variants may also be retrieved in the order they were added via the
to_unsorted_list() method and
in coordinate sorted order via the
to_sorted_list() method.
Functions¶
reader ¶
Opens the given path for VCF reading.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
VcfPath
|
the path to a VCF, or an open file handle |
required |
Source code in fgpyo/vcf/__init__.py
writer ¶
Opens the given path for VCF writing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
VcfPath
|
the path to a VCF, or an open filehandle |
required |
header
|
VariantHeader
|
the source for the output VCF header. If you are modifying a VCF file that you are reading from, you can pass reader.header |
required |
mode
|
str
|
the pysam write mode. The default |
'w'
|
Source code in fgpyo/vcf/__init__.py
Modules¶
builder ¶
Classes for generating VCF and records for testing.¶
Classes¶
Builder for constructing one or more variant records (pysam.VariantRecord) for a VCF.
The VCF can be sites-only, single-sample, or multi-sample.
Provides the ability to manufacture variants from minimal arguments, while generating any remaining attributes to ensure a valid variant.
A builder is constructed with a handful of defaults including the sample name and sequence dictionary. If the VCF will not be sites-only, the list of sample IDs ("sample_ids") must be provided to the VariantBuilder constructor.
Variants are then added using the add()
method.
Once accumulated the variants can be accessed in the order in which they were created through
the to_unsorted_list()
function, or in a list sorted by coordinate order via
to_sorted_list(). Lastly, the
records can be written to a temporary file using
to_path().
Attributes:
| Name | Type | Description |
|---|---|---|
sample_ids |
list[str]
|
the sample name(s) |
sd |
dict[str, dict[str, Any]]
|
sequence dictionary, implemented as python dict from contig name to dictionary with contig properties. At a minimum, each contig dict in sd must contain "ID" (the same as contig_name) and "length", the contig length. Other values will be added to the VCF header line for that contig. |
seq_idx_lookup |
dict[str, int]
|
dictionary mapping contig name to index of contig in sd |
records |
list[VariantRecord]
|
the list of variant records |
header |
VariantHeader
|
the pysam header |
Source code in fgpyo/vcf/builder.py
__init__(sample_ids: Iterable[str] | None = None, sd: dict[str, dict[str, Any]] | None = None) -> None
Initializes a new VariantBuilder for generating variants and VCF files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
sample_ids
|
Iterable[str] | None
|
the name of the sample(s) |
None
|
sd
|
dict[str, dict[str, Any]] | None
|
optional sequence dictionary |
None
|
Source code in fgpyo/vcf/builder.py
add(contig: str | None = None, pos: int = 1000, end: int | None = None, id: str = '.', ref: str = 'A', alts: str | Iterable[str] | None = ('.',), qual: int = 60, filter: str | Iterable[str] | None = None, info: dict[str, Any] | None = None, samples: dict[str, dict[str, Any]] | None = None) -> VariantRecord
Generates a new variant and adds it to the internal collection.
Notes:
* Very little validation is done with respect to INFO and FORMAT keys being defined in the header.
* VCFs are 1-based, but pysam is (mostly) 0-based. We define the function in terms of the VCF property "pos", which is 1-based. pysam will also report "pos" as 1-based, so that is the property that should be accessed when using the records produced by this function (not "start").
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
contig
|
str | None
|
the chromosome name. If None, will use the first contig in the sequence dictionary. |
None
|
pos
|
int
|
the 1-based position of the variant |
1000
|
end
|
int | None
|
an optional 1-based inclusive END position; if not specified a value will be looked for in info["END"], or calculated from the length of the reference allele |
None
|
id
|
str
|
the variant id |
'.'
|
ref
|
str
|
the reference allele |
'A'
|
alts
|
str | Iterable[str] | None
|
the list of alternate alleles, None if no alternates. If a single string is passed, that will be used as the only alt. |
('.',)
|
qual
|
int
|
the variant quality |
60
|
filter
|
str | Iterable[str] | None
|
the list of filters, None if no filters (ex. PASS). If a single string is passed, that will be used as the only filter. |
None
|
info
|
dict[str, Any] | None
|
the dictionary of INFO key-value pairs |
None
|
samples
|
dict[str, dict[str, Any]] | None
|
the dictionary from sample name to FORMAT key-value pairs. If a sample property is supplied for some samples but omitted for others, it will be set to missing (".") for the samples that don't have that property explicitly assigned. If a sample in the VCF is omitted, all its properties will be set to missing. |
None
|
Source code in fgpyo/vcf/builder.py
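The coordinate conventions in the notes and in the pos and end parameters can be worked through with a little arithmetic. This sketch is an assumption about the usual convention (0-based start is pos - 1, and the 1-based inclusive end of a variant without an explicit END is pos + len(ref) - 1); `variant_span_sketch` is a hypothetical helper and does not call pysam.

```python
from typing import Optional, Tuple


def variant_span_sketch(pos: int, ref: str, end: Optional[int] = None) -> Tuple[int, int]:
    """Return (0-based start, 1-based inclusive end) for a variant."""
    start = pos - 1  # convert the 1-based VCF position to a 0-based start
    if end is None:
        # Assumption: without an explicit END, the span covers the reference allele.
        end = pos + len(ref) - 1
    return start, end


print(variant_span_sketch(pos=1000, ref="A"))
print(variant_span_sketch(pos=1000, ref="ACG"))
```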
Add a FILTER header field to the VCF header.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name
|
str
|
the name of the field |
required |
description
|
str | None
|
the description of the field |
None
|
Source code in fgpyo/vcf/builder.py
add_format_header(name: str, field_type: VcfFieldType, number: int | VcfFieldNumber = NUM_GENOTYPES, description: str | None = None) -> None
Add a FORMAT header field to the VCF header.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name
|
str
|
the name of the field |
required |
field_type
|
VcfFieldType
|
the type of the field's values (a VcfFieldType) |
required |
number
|
int | VcfFieldNumber
|
the number of values the field holds (the VCF Number attribute) |
NUM_GENOTYPES
|
description
|
str | None
|
the description of the field |
None
|
Source code in fgpyo/vcf/builder.py
add_info_header(name: str, field_type: VcfFieldType, number: int | VcfFieldNumber = 1, description: str | None = None, source: str | None = None, version: str | None = None) -> None
Add an INFO header field to the VCF header.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name
|
str
|
the name of the field |
required |
field_type
|
VcfFieldType
|
the type of the field's values (a VcfFieldType) |
required |
number
|
int | VcfFieldNumber
|
the number of values the field holds (the VCF Number attribute) |
1
|
description
|
str | None
|
the description of the field |
None
|
source
|
str | None
|
the source of the field |
None
|
version
|
str | None
|
the version of the field |
None
|
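For orientation, the parameters of the header methods above correspond to the attributes of structured meta-information lines in the VCF specification. The sketch below renders such lines as plain strings. It does not call fgpyo or pysam: `header_line` is a hypothetical helper, and rendering a missing description as an empty string is an assumption of this sketch, not documented library behaviour.

```python
from typing import Optional, Union


def header_line(
    kind: str,
    name: str,
    description: Optional[str] = None,
    number: Optional[Union[int, str]] = None,
    field_type: Optional[str] = None,
    source: Optional[str] = None,
    version: Optional[str] = None,
) -> str:
    """Render a VCF meta-information line such as ##INFO=<ID=...,...>."""
    parts = [f"ID={name}"]
    if number is not None:
        parts.append(f"Number={number}")
    if field_type is not None:
        parts.append(f"Type={field_type}")
    # Assumption: a missing description is written as an empty quoted string.
    parts.append(f'Description="{description or ""}"')
    if source is not None:
        parts.append(f'Source="{source}"')
    if version is not None:
        parts.append(f'Version="{version}"')
    return f"##{kind}=<{','.join(parts)}>"


print(header_line("FILTER", "LowQual", "Low quality"))
print(header_line("FORMAT", "GT", "Genotype", number=1, field_type="String"))
print(header_line("INFO", "DP", "Total depth", number=1, field_type="Integer"))
```

FILTER lines carry only an ID and a description, FORMAT lines add Number and Type, and INFO lines may additionally carry Source and Version, which matches the parameter lists of the three methods.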
Source code in fgpyo/vcf/builder.py
classmethod
Generates the default sequence dictionary for VariantBuilder.
Re-uses the dictionary from SamBuilder for consistency.
Returns:
| Type | Description |
|---|---|
dict[str, dict[str, Any]]
|
A new copy of the sequence dictionary as a map of contig name to dictionary, one per contig. |
Source code in fgpyo/vcf/builder.py
Returns a path to a VCF for variants added to this builder.
If the path given ends in ".gz", the generated file will be bgzipped and a tabix index with the suffix ".gz.tbi" will be generated alongside it.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
Path | None
|
optional path to the VCF |
None
|
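The suffix rule described above can be sketched with `pathlib` alone. This only predicts which files to expect for a given path; it does not write anything and is not the library's code. The helper name `expected_outputs` is hypothetical.

```python
from pathlib import Path
from typing import List


def expected_outputs(path: Path) -> List[Path]:
    """List the files expected for a given output path under the '.gz' rule."""
    if path.name.endswith(".gz"):
        # Bgzipped VCF plus a tabix index named by appending ".tbi".
        return [path, path.with_name(path.name + ".tbi")]
    return [path]


print(expected_outputs(Path("calls.vcf")))
print(expected_outputs(Path("out/calls.vcf.gz")))
```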