karyohmm API Reference¶

This page documents the public API of the karyohmm package.

Core HMM Models¶

class karyohmm.karyohmm.MetaHMM(disomy=False, upd=False)[source]¶

Bases: AneuploidyHMM

HMM for whole-chromosome aneuploidy detection in trio PGT data.

Evaluates a joint state space covering nullisomy (0), maternal and paternal monosomy (1m / 1p), disomy (2), maternal and paternal trisomy (3m / 3p), and optionally uniparental disomy (UPD) states. Each hidden state encodes which pair of parental haplotypes were transmitted to the embryo; transitions follow an exponential kernel parameterised by an intra-karyotype recombination rate r and an inter-karyotype switching rate a.

The BAF emission at each site is a Gaussian mixture centred on the expected allele dosage for the current copying state, controlled by pi0 (fraction of sites with no signal) and std_dev (noise). LRR data are incorporated through a Gaussian emission centred on the expected copy-number fold change.

Parameters:

disomy (bool, optional) – Restrict the state space to disomy-only (4 states). Useful for genotyping embryos without calling aneuploidies. Default False.
upd (bool, optional) – Add uniparental disomy states to the full state space. Requires LRR data to distinguish UPD from normal disomy. Default False.

states¶

Active HMM states; each tuple is (m0, m1, p0, p1) where values are haplotype indices (0 or 1) or -1 when that haplotype is absent.

Type:: list of tuple

karyotypes¶

Karyotype label for each state (e.g. "2", "3m", "1p").

Type:: np.ndarray of str

aploid¶

One of "meta", "disomy", or "meta+upd".

Type:: str

__init__(disomy=False, upd=False)[source]¶

Initialize the MetaHMM class for determining chromosomal aneuploidy.

Parameters:

disomy (bool) – Assume disomy to limit the model state space.
upd (bool) – Add uniparental disomy states to the full state space.

forward_algorithm(bafs, lrrs, sigmas, pos, mat_haps, pat_haps, pi0=0.5, std_dev=0.25, r=1e-08, a=0.01, unphased=False, mat_err=0.0, pat_err=0.0)[source]¶

Forward HMM algorithm under a multi-ploidy model.

Parameters:

bafs (np.ndarray) – B-allele frequencies across all m sites.
lrrs (np.ndarray) – m-length array of log-R ratio values.
sigmas (np.ndarray) – m-length array of per-site LRR noise standard deviations.
pos (np.ndarray) – m-length vector of basepair positions for sites.
mat_haps (np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.
pat_haps (np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.
pi0 (float) – Sparsity parameter for B-allele emission model.
std_dev (float) – Standard deviation for B-allele emission model.
r (float) – Intra-karyotype transition rate (recombination).
a (float) – Inter-karyotype transition rate.
unphased (bool) – Run the model in unphased mode.
mat_err (float) – Per-site maternal genotyping error rate.
pat_err (float) – Per-site paternal genotyping error rate.

Returns:

alphas (np.ndarray) – Forward variable from HMM across k states.
scaler (np.ndarray) – m-length array of scale parameters.
states (list) – Tuple representation of states.
karyotypes (np.ndarray) – Array of karyotypes in the MetaHMM model.
loglik (float) – Total log-likelihood of B-allele frequency.

backward_algorithm(bafs, lrrs, sigmas, pos, mat_haps, pat_haps, pi0=0.5, std_dev=0.25, r=1e-08, a=0.01, unphased=False, mat_err=0.0, pat_err=0.0)[source]¶

Backward HMM algorithm under a given statespace model.

Parameters:

bafs (np.ndarray) – B-allele frequencies across all m sites.
lrrs (np.ndarray) – m-length array of log-R ratio values.
sigmas (np.ndarray) – m-length array of per-site LRR noise standard deviations.
pos (np.ndarray) – m-length vector of basepair positions for sites.
mat_haps (np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.
pat_haps (np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.
pi0 (float) – Sparsity parameter for B-allele emission model.
std_dev (float) – Standard deviation for B-allele emission model.
r (float) – Intra-karyotype transition rate (recombination).
a (float) – Inter-karyotype transition rate.
unphased (bool) – Run the model in unphased mode.
mat_err (float) – Per-site maternal genotyping error rate.
pat_err (float) – Per-site paternal genotyping error rate.

Returns:

betas (np.ndarray) – Backward variables from HMM across the k states.
scaler (np.ndarray) – m-length array of scale parameters.
states (list) – Tuple representation of states.
karyotypes (np.ndarray) – Array of karyotypes in the MetaHMM model.
loglik (float) – Total log-likelihood of B-allele frequency.

forward_backward(bafs, lrrs, sigmas, pos, mat_haps, pat_haps, pi0=0.2, std_dev=0.25, r=1e-08, a=0.01, unphased=False, mat_err=0.0, pat_err=0.0)[source]¶

Run the forward-backward algorithm across all states.

Parameters:

bafs (np.ndarray) – B-allele frequencies across all m sites.
lrrs (np.ndarray) – m-length array of log-R ratio values.
sigmas (np.ndarray) – m-length array of per-site LRR noise standard deviations.
pos (np.ndarray) – m-length vector of basepair positions for sites.
mat_haps (np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.
pat_haps (np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.
pi0 (float) – Sparsity parameter for B-allele emission model.
std_dev (float) – Standard deviation for B-allele emission model.
r (float) – Intra-karyotype transition rate (recombination).
a (float) – Inter-karyotype transition rate.
unphased (bool) – Run the model in unphased mode.
mat_err (float) – Per-site maternal genotyping error rate.
pat_err (float) – Per-site paternal genotyping error rate.

Returns:

gammas (np.ndarray) – Log posterior density of being in each of k hidden states.
states (list) – Tuple representation of states.
karyotypes (np.ndarray) – Array of karyotypes in the model.

viterbi_algorithm(bafs, lrrs, sigmas, pos, mat_haps, pat_haps, pi0=0.2, std_dev=0.25, r=1e-08, a=0.01, unphased=False, mat_err=0.0, pat_err=0.0)[source]¶

Implement the Viterbi traceback through karyotypic states.

Parameters:

bafs (np.ndarray) – B-allele frequencies across all m sites.
lrrs (np.ndarray) – m-length array of log-R ratio values.
sigmas (np.ndarray) – m-length array of per-site LRR noise standard deviations.
pos (np.ndarray) – m-length vector of basepair positions for sites.
mat_haps (np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.
pat_haps (np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.
pi0 (float) – Sparsity parameter for B-allele emission model.
std_dev (float) – Standard deviation for B-allele emission model.
r (float) – Intra-karyotype transition rate.
a (float) – Inter-karyotype transition rate.
unphased (bool) – Run the model in unphased mode.
mat_err (float) – Per-site maternal genotyping error rate.
pat_err (float) – Per-site paternal genotyping error rate.

Returns:

path (np.ndarray) – Most likely copying path through k states in the model.
states (list) – Tuple representation of states.
deltas (np.ndarray) – Delta variable (maximum path probability at step m).
psi (np.ndarray) – Storage vector for psi variable.

marginal_posterior_karyotypes(gammas, karyotypes)[source]¶

Obtain the marginal posterior (not logged) probability over karyotypic states.

Parameters:

gammas (np.ndarray) – A k x m array of log-posterior density across sites.
karyotypes (np.ndarray) – Karyotype labels for all k states.

Returns:

gamma_karyo (np.ndarray) – Collapsed posteriors summed per unique karyotype.
uniq_karyo (np.ndarray) – Unique karyotype labels in the order they appear.

posterior_karyotypes(gammas, karyotypes)[source]¶

Obtain full posterior on karyotypes chromosome-wide.

This is the weighted proportion of time spent in each karyotypic state.

Parameters:

gammas (np.ndarray) – A k x m array of log-posterior density across sites.
karyotypes (np.ndarray) – Karyotype labels for all k states.

Returns:

kar_prob – Dictionary mapping karyotype label to chromosome-wide posterior probability.

Return type:

dict

genotype_embryo(bafs, lrrs, sigmas, pos, mat_haps, pat_haps, pi0=0.2, std_dev=0.25, r=1e-08, a=0.01, unphased=False, viterbi=False)[source]¶

Obtain genotype dosages for a putative disomic embryo.

Parameters:

bafs (np.ndarray) – B-allele frequencies across all m sites.
lrrs (np.ndarray) – m-length array of log-R ratio values.
sigmas (np.ndarray) – m-length array of per-site LRR noise standard deviations.
pos (np.ndarray) – m-length vector of basepair positions for sites.
mat_haps (np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.
pat_haps (np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.
pi0 (float) – Sparsity parameter for B-allele emission model.
std_dev (float) – Standard deviation for B-allele emission model.
r (float) – Intra-karyotype transition rate.
a (float) – Inter-karyotype transition rate.
unphased (bool) – Run the model in unphased mode.
viterbi (bool) – Estimate the embryo genotypes using the Viterbi algorithm.

Returns:

dosages – A 3 x m array of genotype probabilities (RR, RA, AA).

Return type:

np.ndarray

flag_parental_genotype_errors(gammas, states, bafs, mat_haps, pat_haps, pi0=0.2, std_dev=0.25)[source]¶

Identify per-site parental genotype calls that are inconsistent with the posterior.

For each site, computes a posterior-weighted log Bayes factor comparing the best alternative parental genotype (fixing the other parent) to the called one. A positive score means the data, given the inferred copying state, prefers a different parental genotype at that site.

Parameters:

gammas (np.ndarray) – k x m log-posterior array from forward_backward.
states (list) – List of state tuples (output of forward_backward).
bafs (np.ndarray) – m-length array of B-allele frequencies.
mat_haps (np.ndarray) – 2 x m array of 0/1 maternal haplotypes.
pat_haps (np.ndarray) – 2 x m array of 0/1 paternal haplotypes.
pi0 (float) – Sparsity parameter for BAF emission.
std_dev (float) – Noise parameter for BAF emission.

Returns:

mat_err_score (np.ndarray) – m-length array; positive values flag likely maternal genotype errors.
pat_err_score (np.ndarray) – m-length array; positive values flag likely paternal genotype errors.

class karyohmm.karyohmm.QuadHMM[source]¶

Bases: AneuploidyHMM

Joint HMM for two sibling embryos used to detect inter-sibling haplotype sharing.

Models the joint copying state of two euploid siblings as a pair of independent disomy states drawn from the same parental haplotypes. The 16-state product space (4 × 4 single-embryo disomy states) is collapsed into four Roach et al. 2010 identity-by-descent classes:

0: maternally haploidentical (siblings share the same maternal haplotype)
1: paternally haploidentical (siblings share the same paternal haplotype)
2: fully identical (same maternal and paternal haplotype)
3: non-identical (different maternal and paternal haplotypes)

Transitions between copying states are governed by a single recombination rate parameter r; the BAF emission for each sibling follows the same Gaussian mixture model as MetaHMM but with per-sibling pi0 and std_dev.

states¶

16 joint copying states, each ((m0, _, p0, _), (m1, _, p1, _)).

Type:: list of tuple of tuple

karyotypes¶

All entries are "2" (disomy assumed for both siblings).

Type:: np.ndarray of str

__init__()[source]¶: Initialize the QuadHMM model.

forward_algorithm(bafs, pos, mat_haps, pat_haps, pi0=(0.7, 0.7), std_dev=(0.15, 0.15), r=1e-08, mat_err=0.0, pat_err=0.0)[source]¶

Implement the forward algorithm for QuadHMM model.

Parameters:

bafs (list) – List of two np.ndarray arrays of B-allele frequencies across m sites for two siblings.
pos (np.ndarray) – m-length vector of basepair positions for sites.
mat_haps (np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.
pat_haps (np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.
pi0 (tuple of float) – Sparsity parameters for B-allele emission model, one per sibling.
std_dev (tuple of float) – Standard deviation parameters for B-allele emission model, one per sibling.
r (float) – Inter-state transition rate.
mat_err (float) – Per-site maternal genotyping error rate.
pat_err (float) – Per-site paternal genotyping error rate.

Returns:

alphas (np.ndarray) – Forward variable from HMM across the joint sibling states.
scaler (np.ndarray) – m-length array of scale parameters.
states (list) – Tuple representation of the 16 joint states.
karyotypes (np.ndarray) – Array of karyotypes (all entries "2").
loglik (float) – Total log-likelihood of joint sibling B-allele frequencies.

backward_algorithm(bafs, pos, mat_haps, pat_haps, pi0=(0.7, 0.7), std_dev=(0.15, 0.15), r=1e-08, mat_err=0.0, pat_err=0.0)[source]¶

Implement the backward algorithm for QuadHMM model.

Parameters:

bafs (list) – List of two np.ndarray arrays of B-allele frequencies across m sites for two siblings.
pos (np.ndarray) – m-length vector of basepair positions for sites.
mat_haps (np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.
pat_haps (np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.
pi0 (tuple of float) – Sparsity parameters for B-allele emission model, one per sibling.
std_dev (tuple of float) – Standard deviation parameters for B-allele emission model, one per sibling.
r (float) – Inter-state transition rate.
mat_err (float) – Per-site maternal genotyping error rate.
pat_err (float) – Per-site paternal genotyping error rate.

Returns:

betas (np.ndarray) – Backward variable from HMM across the joint sibling states.
scaler (np.ndarray) – m-length array of scale parameters.
states (list) – Tuple representation of the 16 joint states.
karyotypes (np.ndarray) – Array of karyotypes (all entries "2").
loglik (float) – Total log-likelihood of joint sibling B-allele frequencies.

forward_backward(bafs, pos, mat_haps, pat_haps, pi0=(0.7, 0.7), std_dev=(0.15, 0.15), r=1e-08, a=0.01, mat_err=0.0, pat_err=0.0)[source]¶

Implement the forward-backward algorithm for the QuadHMM model.

Parameters:

bafs (list) – List of two np.ndarray arrays of B-allele frequencies across m sites for two siblings.
pos (np.ndarray) – m-length vector of basepair positions for sites.
mat_haps (np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.
pat_haps (np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.
pi0 (tuple of float) – Sparsity parameters for B-allele emission model, one per sibling.
std_dev (tuple of float) – Standard deviation parameters for B-allele emission model, one per sibling.
r (float) – Inter-state transition rate.
a (float) – Unused; retained for API compatibility.
mat_err (float) – Per-site maternal genotyping error rate.
pat_err (float) – Per-site paternal genotyping error rate.

Returns:

gammas (np.ndarray) – Log posterior density of being in each of the joint sibling states.
states (list) – Tuple representation of the 16 joint states.
karyotypes (None) – Always returns None for QuadHMM.

viterbi_algorithm(bafs, pos, mat_haps, pat_haps, pi0=(0.7, 0.7), std_dev=(0.15, 0.15), r=1e-08, mat_err=0.0, pat_err=0.0)[source]¶

Viterbi algorithm for the QuadHMM model.

Parameters:

bafs (list) – List of two np.ndarray arrays of B-allele frequencies across m sites for two siblings.
pos (np.ndarray) – m-length vector of basepair positions for sites.
mat_haps (np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.
pat_haps (np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.
pi0 (tuple of float) – Sparsity parameters for B-allele emission model, one per sibling.
std_dev (tuple of float) – Standard deviation parameters for B-allele emission model, one per sibling.
r (float) – Inter-state transition rate.
mat_err (float) – Per-site maternal genotyping error rate.
pat_err (float) – Per-site paternal genotyping error rate.

Returns:

path (np.ndarray) – Most likely copying path through the joint sibling states.
states (list) – Tuple representation of the 16 joint states.
deltas (np.ndarray) – Delta variable (maximum path probability at each site).
psi (np.ndarray) – Storage vector for psi variable.

restrict_path(path)[source]¶

Break down states into the same categories as Roach et al for determining recombinations.

The state definitions used for tracing sibling haplotypes are:

0: maternal-haploidentical
1: paternal-haploidentical
2: identical
3: non-identical

Parameters:: path (np.ndarray) – A sequence of states output from the Viterbi algorithm (16 possible states).
Returns:: refined_path – A sequence of m states in 0, 1, 2, 3 indicating states from Roach et al.
Return type:: np.ndarray

restrict_states()[source]¶

Break down states into the same categories as Roach et al for determining recombinations.

Returns:

maternal_haploidentical (list) – Indexes of maternally haploidentical states.
paternal_haploidentical (list) – Indexes of paternally haploidentical states.
identical (list) – Indexes of identical states (siblings share same haplotypes).
non_identical (list) – Indexes of non-identical states.

viterbi_path(bafs, pos, mat_haps, pat_haps, pi0=(0.7, 0.7), std_dev=(0.15, 0.15), r=1e-08)[source]¶

Obtain the restricted Viterbi path for traceback.

Parameters:

bafs (list) – List of two np.ndarray arrays of B-allele frequencies across m sites for two siblings.
pos (np.ndarray) – m-length vector of basepair positions for sites.
mat_haps (np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.
pat_haps (np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.
pi0 (tuple of float) – Sparsity parameters for B-allele emission model, one per sibling.
std_dev (tuple of float) – Standard deviation parameters for B-allele emission model, one per sibling.
r (float) – Inter-state transition rate.

Returns:

res_path – Most likely copying path restricted to 4 states from Roach et al.

Return type:

np.ndarray

map_path(bafs, pos, mat_haps, pat_haps, pi0=(0.7, 0.7), std_dev=(0.15, 0.15), r=1e-08)[source]¶

Obtain the Maximum A-Posteriori Path across restricted states.

Parameters:

bafs (list) – List of two np.ndarray arrays of B-allele frequencies across m sites for two siblings.
pos (np.ndarray) – m-length vector of basepair positions for sites.
mat_haps (np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.
pat_haps (np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.
pi0 (tuple of float) – Sparsity parameters for B-allele emission model, one per sibling.
std_dev (tuple of float) – Standard deviation parameters for B-allele emission model, one per sibling.
r (float) – Inter-state transition rate.

Returns:

map_path – Max-a-posteriori copying path through 4 sibling copying states.

Return type:

np.ndarray

flag_parental_genotype_errors(gammas, states, bafs, mat_haps, pat_haps, pi0=(0.7, 0.7), std_dev=(0.15, 0.15))[source]¶

Identify per-site parental genotype calls that are inconsistent with the posterior.

For each site, computes a posterior-weighted log Bayes factor comparing the best alternative parental genotype (fixing the other parent) to the called one. Both siblings observe the same parental genotype, so the joint sibling BAF likelihood is used when scoring alternatives. A positive score means the data, given the inferred copying state, prefers a different parental genotype at that site.

Parameters:

gammas (np.ndarray) – k x m log-posterior array from forward_backward.
states (list) – List of paired state tuples (output of forward_backward).
bafs (list) – List of two m-length np.ndarray BAF arrays for the two siblings.
mat_haps (np.ndarray) – 2 x m array of 0/1 maternal haplotypes.
pat_haps (np.ndarray) – 2 x m array of 0/1 paternal haplotypes.
pi0 (tuple) – (pi0_sib0, pi0_sib1) sparsity parameters for BAF emission.
std_dev (tuple) – (std_sib0, std_sib1) noise parameters for BAF emission.

Returns:

mat_err_score (np.ndarray) – m-length array; positive values flag likely maternal genotype errors.
pat_err_score (np.ndarray) – m-length array; positive values flag likely paternal genotype errors.

det_recomb_sex(i, j)[source]¶

Determine the parental origin of the recombination event.

Parameters:

i (int) – State index for the previous state.
j (int) – State index for the current state.

Returns:

m – Sex of haplotype on which the transition occurred (0: maternal, 1: paternal).

Return type:

int

isolate_recomb(path_xy, path_xzs, window=20)[source]¶

Isolate key recombination events from a pair of refined paths.

Parameters:

path_xy (np.ndarray) – Array of path indices for the focal pair of siblings.
path_xzs (list) – List of sibling paths (each a np.ndarray).
window (int) – Number of SNPs within which the closest transition must fall (minimum resolution).

Returns:

mat_recomb_lst (list) – List of maternal recombination SNP indexes.
pat_recomb_lst (list) – List of paternal recombination SNP indexes.
mat_recomb (dict) – Dictionary of SNP-index to sibling-count for maternal recombinations.
pat_recomb (dict) – Dictionary of SNP-index to sibling-count for paternal recombinations.

est_sharing(bafs, pos, mat_haps, pat_haps, pi0=(0.7, 0.7), std_dev=(0.15, 0.15), r=1e-08)[source]¶

Estimate the proportion of the chromosome that is shared identically across siblings.

Parameters:

bafs (list) – List of two np.ndarray arrays of B-allele frequencies across m sites for two siblings.
pos (np.ndarray) – m-length vector of basepair positions for sites.
mat_haps (np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.
pat_haps (np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.
pi0 (tuple of float) – Sparsity parameters for B-allele emission model, one per sibling.
std_dev (tuple of float) – Standard deviation parameters for B-allele emission model, one per sibling.
r (float) – Inter-state transition rate.

Returns:

mat_haplo_len (float) – Total base-pair length of maternally haploidentical regions.
pat_haplo_len (float) – Total base-pair length of paternally haploidentical regions.
both_haplo_len (float) – Total base-pair length of fully identical (both-haplo) regions.
tot_len (float) – Total base-pair length of the typed region on the chromosome.

class karyohmm.karyohmm.PocHMM(disomy=False, upd=False)[source]¶

Bases: MetaHMM

Aneuploidy HMM for products-of-conception with a single observed parent.

Extends MetaHMM to handle mother–child or father–child duos in which only one parent’s haplotypes are available. The unobserved parent’s genotype is marginalised out at each site using population allele frequencies supplied as freqs. When freqs is None the model treats every unobserved-parent site as a 50 / 50 heterozygote.

The forward, backward, and Viterbi algorithms are re-implemented here via a dedicated Cython kernel (forward_algo_duo / backward_algo_duo) that handles the marginalisation efficiently.

Parameters:

disomy (bool, optional) – Restrict the state space to disomy (ignored in current implementation; the full MetaHMM state space is always used). Default False.
upd (bool, optional) – Add uniparental disomy states to the state space. As with MetaHMM, LRR data helps distinguish UPD from normal disomy. In the duo setting, only the UPD states on the observed parent’s side (see maternal in forward_algorithm()) are identifiable from copy-number-neutral signal alone; UPD states on the unobserved side are marginalised over population allele frequencies like any other unobserved-parent state. Default False.

__init__(disomy=False, upd=False)[source]¶: Initialize the PocHMM with the full MetaHMM state space.

est_sigma_pi0(bafs, lrrs, sigmas, pos, haps, freqs=None, maternal=True, algo='L-BFGS-B', pi0_bounds=(0.01, 0.99), sigma_bounds=(0.01, 0.5), **kwargs)[source]¶

Estimate sigma and pi0 under the B-Allele Frequency model using optimization of forward algorithm likelihood.

Parameters:

bafs (np.ndarray) – B-allele frequencies across all m sites.
lrrs (np.ndarray) – m-length array of log-R ratio values.
sigmas (np.ndarray) – m-length array of per-site LRR noise standard deviations.
pos (np.ndarray) – Basepair positions of the SNPs.
haps (np.ndarray) – A 2 x m array of 0/1 parental haplotypes.
freqs (np.ndarray or None) – An m-length array of allele frequencies as priors.
maternal (bool) – If True, haps are maternal haplotypes; otherwise paternal.
algo (str) – One of Nelder-Mead, L-BFGS-B, or Powell algorithms for optimization.
pi0_bounds (tuple) – Bounds for acceptable values of pi0 parameter.
sigma_bounds (tuple) – Bounds for acceptable values for sigma.

Returns:

pi0_est (float) – Estimate of sparsity parameter (pi0) for B-allele emission model.
sigma_est (float) – Estimate of noise parameter (sigma) for B-allele emission model.

forward_algorithm(bafs, lrrs, sigmas, pos, haps, freqs=None, maternal=True, pi0=0.2, std_dev=0.25, r=1e-08, a=0.01, unphased=False, obs_err=0.0)[source]¶

Forward algorithm for duos.

Parameters:

bafs (np.ndarray) – B-allele frequencies across all m sites.
lrrs (np.ndarray) – m-length array of log-R ratio values.
sigmas (np.ndarray) – m-length array of per-site LRR noise standard deviations.
pos (np.ndarray) – m-length vector of basepair positions for sites.
haps (np.ndarray) – A 2 x m array of 0/1 parental haplotypes.
freqs (np.ndarray or None) – An m-length array of population allele frequencies.
maternal (bool) – If True, haps are maternal haplotypes; otherwise paternal.
pi0 (float) – Sparsity parameter for B-allele emission model.
std_dev (float) – Standard deviation for B-allele emission model.
r (float) – Intra-karyotype transition rate (recombination).
a (float) – Inter-karyotype transition rate.
unphased (bool) – Run the model in unphased mode.
obs_err (float) – Per-site genotyping error rate for the observed parent.

Returns:

alphas (np.ndarray) – Forward variable from HMM across k states.
scaler (np.ndarray) – m-length array of scale parameters.
states (list) – Tuple representation of states.
karyotypes (np.ndarray) – Array of karyotypes in the MetaHMM model.
loglik (float) – Total log-likelihood of B-allele frequency.

backward_algorithm(bafs, lrrs, sigmas, pos, haps, freqs=None, maternal=True, pi0=0.2, std_dev=0.25, r=1e-08, a=0.01, unphased=False, obs_err=0.0)[source]¶

Backward algorithm for duos.

Parameters:

bafs (np.ndarray) – B-allele frequencies across all m sites.
lrrs (np.ndarray) – m-length array of log-R ratio values.
sigmas (np.ndarray) – m-length array of per-site LRR noise standard deviations.
pos (np.ndarray) – m-length vector of basepair positions for sites.
haps (np.ndarray) – A 2 x m array of 0/1 parental haplotypes.
freqs (np.ndarray or None) – An m-length array of population allele frequencies.
maternal (bool) – If True, haps are maternal haplotypes; otherwise paternal.
pi0 (float) – Sparsity parameter for B-allele emission model.
std_dev (float) – Standard deviation for B-allele emission model.
r (float) – Intra-karyotype transition rate (recombination).
a (float) – Inter-karyotype transition rate.
unphased (bool) – Run the model in unphased mode.
obs_err (float) – Per-site genotyping error rate for the observed parent.

Returns:

betas (np.ndarray) – Backward variable from HMM across k states.
scaler (np.ndarray) – m-length array of scale parameters.
states (list) – Tuple representation of states.
karyotypes (np.ndarray) – Array of karyotypes in the MetaHMM model.
loglik (float) – Total log-likelihood of B-allele frequency.

forward_backward(bafs, lrrs, sigmas, pos, haps, freqs=None, maternal=True, pi0=0.2, std_dev=0.25, r=1e-08, a=0.01, unphased=False, obs_err=0.0)[source]¶

Forward-backward algorithm for duos.

Parameters:

bafs (np.ndarray) – B-allele frequencies across all m sites.
lrrs (np.ndarray) – m-length array of log-R ratio values.
sigmas (np.ndarray) – m-length array of per-site LRR noise standard deviations.
pos (np.ndarray) – m-length vector of basepair positions for sites.
haps (np.ndarray) – A 2 x m array of 0/1 parental haplotypes.
freqs (np.ndarray or None) – An m-length array of population allele frequencies.
maternal (bool) – If True, haps are maternal haplotypes; otherwise paternal.
pi0 (float) – Sparsity parameter for B-allele emission model.
std_dev (float) – Standard deviation for B-allele emission model.
r (float) – Intra-karyotype transition rate (recombination).
a (float) – Inter-karyotype transition rate.
unphased (bool) – Run the model in unphased mode.
obs_err (float) – Per-site genotyping error rate for the observed parent.

Returns:

gammas (np.ndarray) – Log posterior density of being in each of k hidden states.
states (list) – Tuple representation of states.
karyotypes (np.ndarray) – Array of karyotypes in the MetaHMM model.

genotype_parent(bafs, lrrs, sigmas, haps, gammas, freqs=None, maternal=True, pi0=0.2, std_dev=0.25)[source]¶

Compute posterior genotype dosages for the unobserved parent.

For each site integrates the BAF and LRR likelihood over all four possible unobserved-parent genotypes (AA, AB, BA, BB), weighted by Hardy-Weinberg priors from freqs and the HMM posterior gammas.

Parameters:

bafs (np.ndarray) – m-length array of B-allele frequencies.
lrrs (np.ndarray) – m-length array of log-R ratios.
sigmas (np.ndarray) – m-length array of per-site LRR noise estimates.
haps (np.ndarray) – 2 x m array of 0/1 haplotypes for the observed parent.
gammas (np.ndarray) – k x m log-posterior array from forward_backward().
freqs (np.ndarray, optional) – m-length array of population allele frequencies; defaults to 0.5 at every site.
maternal (bool) – If True (default), haps are the maternal haplotypes and the paternal genotype is imputed; set False for the reverse.
pi0 (float) – Sparsity parameter for the BAF emission model.
std_dev (float) – Noise parameter for the BAF emission model.

Returns:

geno_dosage_rev – 3 x m array of posterior genotype probabilities (log scale) for the unobserved parent; rows correspond to homozygous-ref (0), heterozygous (1), and homozygous-alt (2).

Return type:

np.ndarray

infer_missing_af(bafs, geno, hap_matrix, pos, eps=0.2)[source]¶

Estimate allele frequencies for the unobserved parent using a haplotype reference panel.

Identifies sites of opposite homozygosity — where the observed parent is homozygous and BAF indicates the unobserved parent carries the opposite allele — then, treating adjacent haplotype-row pairs (2i, 2i+1) in hap_matrix as the two haplotypes of reference individual i, tags every reference individual carrying at least one copy of that allele. Frequencies at non-anchor sites are estimated via distance-weighted interpolation between adjacent anchor sites (Eq. 9 in the PocHMM methods document), averaging over both haplotypes of each tagged individual — not just the matching one — so heterozygous carriers contribute their other, non-matching allele too. Falls back to the reference-panel mean at sites with no nearby anchors.

Parameters:

bafs (np.ndarray) – m-length B-allele frequency array.
geno (np.ndarray) – m-length observed-parent genotype dosage array (0, 1, or 2).
hap_matrix (np.ndarray) – (2N, m) haplotype reference panel matrix (0/1 encoded). Rows must be paired as (2i, 2i+1) = the two haplotypes of reference individual i (i.e. an even number of rows), since tagging operates at the individual level.
pos (np.ndarray) – m-length array of genomic positions in basepairs.
eps (float) – BAF threshold for calling opposite homozygosity (default 0.2).

Returns:

freqs – m-length array of estimated allele frequencies for the unobserved parent.

Return type:

np.ndarray

Mosaic and Contamination Estimators¶

class karyohmm.karyohmm.MosaicEst(mat_haps, pat_haps, bafs, pos, lrrs=None, sigmas=None, switch_err=0.01, t_rate=0.0001, **phase_kwargs)[source]¶

Bases: object

Estimator of mosaic copy-number cell fraction using a joint BAF + LRR HMM.

A 5-state HMM over all m genomic sites estimates the mosaic cell fraction cf and identifies the parental origin of the event. The five states and their emission means are:

State 0 (neutral): LRR = 0, phased BAF = 0
State 1 (maternal gain): LRR = log₂(1 + cf/2), phased BAF = +cf/6
State 2 (paternal gain): LRR = log₂(1 + cf/2), phased BAF = −cf/6
State 3 (maternal loss): LRR = log₂(1 − cf/2), phased BAF = −cf/2
State 4 (paternal loss): LRR = log₂(1 − cf/2), phased BAF = +cf/2

The BAF means follow from the discrete mosaic mixture and the phase convention (sign = +1 when the maternal haplotype carries the alt allele):

Maternal gain: extra maternal copy at a mat=alt site gives BAF 2/3, so phased BAF = +cf/6; at mat=ref sites the sign corrects to the same value.
Paternal gain: extra paternal copy reverses the BAF shift — phased BAF = −cf/6 coherently across all het sites.
Maternal loss: losing the maternal copy leaves only paternal alleles, giving phased BAF = −cf/2 (3× larger than gain because BAF moves to 0 or 1).
Paternal loss: symmetrically, phased BAF = +cf/2.

LRR is evaluated at every site; phased BAF is evaluated only at expected-heterozygous sites (one parent hom-ref, the other hom-alt).

All preprocessing (het-site identification, phase assignment, transition matrix construction) happens in __init__, so est_mle_cf can be called immediately after construction.

Parameters:

mat_haps (np.ndarray) – 2 × m array of 0/1 maternal haplotypes.
pat_haps (np.ndarray) – 2 × m array of 0/1 paternal haplotypes.
bafs (np.ndarray) – m-length array of B-allele frequencies.
pos (np.ndarray) – m-length array of base-pair positions.
lrrs (np.ndarray, optional) – m-length array of log-R ratio values. When provided, LRR emission is included at all m sites (strongly recommended).
sigmas (np.ndarray, optional) – m-length array of per-site LRR noise standard deviations. Required when lrrs is provided.
switch_err (float) – Within-type origin-switch probability (maternal↔paternal for the same gain or loss direction; default 0.01).
t_rate (float) – Neutral↔aneuploid transition probability (default 1e-4).

STATE_NAMES = ('neutral', 'maternal-gain', 'paternal-gain', 'maternal-loss', 'paternal-loss')¶: Ordered state labels returned by infer_origin().

__init__(mat_haps, pat_haps, bafs, pos, lrrs=None, sigmas=None, switch_err=0.01, t_rate=0.0001, **phase_kwargs)[source]¶: Validate inputs, identify and phase het sites, build transition matrix.

create_transition_matrix(switch_err=0.01, t_rate=0.0001)[source]¶

Build the 5-state log-transition matrix.

States: 0=neutral, 1=mat-gain, 2=pat-gain, 3=mat-loss, 4=pat-loss.

Transitions allowed: - neutral → each aneuploid state with probability t_rate / 4 - any aneuploid → neutral with probability t_rate - within-gain origin switch (1↔2) with probability switch_err - within-loss origin switch (3↔4) with probability switch_err - cross-type transitions (gain↔loss) are not modelled (log-prob = −∞)

Parameters:

switch_err (float) – Maternal-paternal origin-switch probability within the same gain or loss direction.
t_rate (float) – Neutral-aneuploid entry/exit probability.

forward_algo_full(cf=0.0, std_dev_baf=0.1)[source]¶

Forward algorithm over all m sites with joint BAF + LRR emission.

When lrrs was provided at construction, LRR emission is added at every site. Phased BAF emission is always added at expected-het sites.

State-dependent emission means (see class docstring for derivation):

State 0 (neutral): LRR = 0, phased BAF = 0
State 1 (maternal gain): LRR = log₂(1 + cf/2), phased BAF = +cf/6
State 2 (paternal gain): LRR = log₂(1 + cf/2), phased BAF = −cf/6
State 3 (maternal loss): LRR = log₂(1 − cf/2), phased BAF = −cf/2
State 4 (paternal loss): LRR = log₂(1 − cf/2), phased BAF = +cf/2

Parameters:

cf (float) – Mosaic cell fraction in [0, 1).
std_dev_baf (float) – BAF noise standard deviation at het sites.

Returns:

alphas (np.ndarray) – 5 x m log forward variable.
scaler (np.ndarray) – m-length per-site log normalisation constants.
loglik (float) – Total log-likelihood.

est_mle_cf(std_dev_baf=0.1)[source]¶

Estimate the MLE mosaic cell fraction.

Maximises forward_algo_full() log-likelihood over cf in [0, 1) using Powell’s method. Stores the result in self.mle_cf (nan on optimisation failure).

Parameters:: std_dev_baf (float) – BAF noise standard deviation.

ci_mle_cf(std_dev_baf=0.1, h=0.0001)[source]¶

95% confidence interval for mle_cf via observed Fisher information.

Requires est_mle_cf() to have been called first.

Parameters:

std_dev_baf (float) – BAF noise standard deviation.
h (float) – Finite-difference step size.

Returns:

ci – [lower_95, mle_cf, upper_95], clamped to [0, 1].

Return type:

list

lrt_cf(std_dev_baf=0.1)[source]¶

Likelihood-ratio test statistic for cf > 0 vs cf = 0.

Computes -2 x (l(cf=0) - l(cf_MLE)). Under the null hypothesis of no mosaicism this is approximately chi-squared with 1 degree of freedom. Calls est_mle_cf() if it has not been run yet.

Parameters:: std_dev_baf (float) – BAF noise standard deviation.
Returns:: lrt – LRT statistic, or nan on failure.
Return type:: float

infer_origin(std_dev_baf=0.1)[source]¶

Identify the most likely parental origin of the mosaic event.

Runs the forward algorithm at mle_cf and computes the chromosome-wide log-evidence for each of the four aneuploid states (marginalised over sites via logsumexp of the forward variable). The aneuploid state with the highest evidence is returned as the inferred origin.

Requires est_mle_cf() to have been called first. Returns 'neutral' when mle_cf is below 0.01.

Parameters:: std_dev_baf (float) – BAF noise standard deviation.
Returns:: origin – One of MosaicEst.STATE_NAMES.
Return type:: str

class karyohmm.karyohmm.MccEst[source]¶

Bases: object

Maximum-likelihood estimator of maternal-cell contamination (MCC) from BAF data.

Maternal-cell contamination occurs when maternal DNA co-purifies with the embryo sample, biasing B-allele frequencies towards the maternal genotype. This class provides log-likelihood functions and MLE routines for estimating the contamination fraction c (the proportion of signal attributable to the mother) under two data configurations:

Trio (_trio suffix): both maternal and paternal haplotypes are known, so the paternal genotype at each site is observed directly.
POC / duo (_poc suffix): only maternal haplotypes are available; the paternal genotype is marginalised using Hardy–Weinberg allele frequencies.

Each configuration is further available in an unphased form (sites modelled independently) and a phase-aware form (a 2-state HMM tracks which maternal haplotype was transmitted to the embryo).

Genome-wide inference across multiple chromosomes is supported through loglik_mcc_genome_* and est_mcc_genome_* methods; per-chromosome estimates — useful for identifying aneuploid chromosomes before a genome-wide fit — are provided by est_mcc_per_chrom_* methods.

All estimation methods jointly optimise the contamination fraction c and the BAF noise standard deviation std_dev.

__init__()[source]¶: Initialize the MCC estimator (stateless; all state is passed per call).

loglik_mcc_poc(bafs, mat_haps, freqs, c=0.0, std_dev=0.1)[source]¶

Log-likelihood of MCC for a single chromosome in a mother-child duo (POC).

At each site the paternal genotype is unknown and is marginalised over Hardy-Weinberg frequencies derived from freqs. The contamination fraction c shifts the expected BAF mean towards the maternal genotype.

Parameters:

bafs (np.ndarray) – m-length array of B-allele frequencies in [0, 1].
mat_haps (np.ndarray) – 2 x m array of 0/1 maternal haplotypes.
freqs (np.ndarray) – m-length array of population allele frequencies in [0, 1].
c (float) – Maternal contamination fraction in [0, 0.5].
std_dev (float) – BAF noise standard deviation (> 0).

Returns:

logll – Total log-likelihood summed across all m sites.

Return type:

float

loglik_mcc_trio(bafs, mat_haps, pat_haps, c=0.0, std_dev=0.1)[source]¶

Log-likelihood of MCC for a single chromosome in a full trio.

With both parents phased the paternal genotype at every site is known, giving a simpler per-site likelihood with no HWE marginalisation.

Parameters:

bafs (np.ndarray) – m-length array of B-allele frequencies in [0, 1].
mat_haps (np.ndarray) – 2 x m array of 0/1 maternal haplotypes.
pat_haps (np.ndarray) – 2 x m array of 0/1 paternal haplotypes.
c (float) – Maternal contamination fraction in [0, 0.5].
std_dev (float) – BAF noise standard deviation (> 0).

Returns:

logll – Total log-likelihood summed across all m sites.

Return type:

float

est_mcc_poc(bafs, mat_haps, freqs, algo='Nelder-Mead', **kwargs)[source]¶

Estimate MCC by MLE for a single chromosome in a mother-child duo.

Jointly optimises contamination fraction c and noise std_dev by maximising loglik_mcc_poc().

Parameters:

bafs (np.ndarray) – m-length array of B-allele frequencies.
mat_haps (np.ndarray) – 2 x m array of 0/1 maternal haplotypes.
freqs (np.ndarray) – m-length array of population allele frequencies.
algo (str) – Scipy minimisation algorithm; default "Nelder-Mead".
**kwargs – Additional keyword arguments forwarded to scipy.optimize.minimize.

Returns:

c_est (float) – MLE of the contamination fraction.
sigma_est (float) – MLE of the BAF noise standard deviation.

est_mcc_trio(bafs, mat_haps, pat_haps, algo='Nelder-Mead', **kwargs)[source]¶

Estimate MCC by MLE for a single chromosome in a full trio.

Jointly optimises contamination fraction c and noise std_dev by maximising loglik_mcc_trio().

Parameters:

bafs (np.ndarray) – m-length array of B-allele frequencies.
mat_haps (np.ndarray) – 2 x m array of 0/1 maternal haplotypes.
pat_haps (np.ndarray) – 2 x m array of 0/1 paternal haplotypes.
algo (str) – Scipy minimisation algorithm; default "Nelder-Mead".
**kwargs – Additional keyword arguments forwarded to scipy.optimize.minimize.

Returns:

c_est (float) – MLE of the contamination fraction.
sigma_est (float) – MLE of the BAF noise standard deviation.

mcc_ci_poc(bafs, mat_haps, freqs, c_hat=0.0, std_dev=0.1, alpha=0.95, df=1)[source]¶

Profile-likelihood confidence interval for MCC in the POC model.

Uses Wilks’ theorem: the CI boundary is the set of c values where twice the log-likelihood drop from the MLE equals the alpha quantile of a chi-squared distribution with df degrees of freedom.

Parameters:

bafs (np.ndarray) – m-length array of B-allele frequencies.
mat_haps (np.ndarray) – 2 x m array of 0/1 maternal haplotypes.
freqs (np.ndarray) – m-length array of population allele frequencies.
c_hat (float) – MLE contamination fraction (the point estimate).
std_dev (float) – MLE BAF noise standard deviation (held fixed during profile).
alpha (float) – Confidence level in (0, 1); default 0.95.
df (int) – Degrees of freedom for chi-squared quantile; default 1.

Returns:

lower_CI (float) – Lower confidence bound (falls back to 0 if optimiser fails).
c_hat (float) – The supplied point estimate.
upper_CI (float) – Upper confidence bound (falls back to 0.5 if optimiser fails).

mcc_ci_trio(bafs, mat_haps, pat_haps, c_hat=0.0, std_dev=0.1, h=1e-05, alpha=0.95, df=1)[source]¶

Profile-likelihood confidence interval for MCC in the trio model.

Parameters:

bafs (np.ndarray) – m-length array of B-allele frequencies.
mat_haps (np.ndarray) – 2 x m array of 0/1 maternal haplotypes.
pat_haps (np.ndarray) – 2 x m array of 0/1 paternal haplotypes.
c_hat (float) – MLE contamination fraction (the point estimate).
std_dev (float) – MLE BAF noise standard deviation (held fixed during profile).
h (float) – Finite-difference step (unused; retained for API compatibility).
alpha (float) – Confidence level in (0, 1); default 0.95.
df (int) – Degrees of freedom for chi-squared quantile; default 1.

Returns:

lower_CI (float) – Lower confidence bound (falls back to 0 if optimiser fails).
c_hat (float) – The supplied point estimate.
upper_CI (float) – Upper confidence bound (falls back to 0.5 if optimiser fails).

loglik_mcc_phased_trio(bafs, mat_haps, pat_haps, pos, c=0.0, std_dev=0.1, r=1e-08)[source]¶

Phase-aware log-likelihood of MCC for a single chromosome in a trio.

A 2-state HMM tracks which maternal haplotype (0 or 1) was transmitted to the embryo. At each site the emission is a Gaussian centred on the BAF expected for the copied maternal allele plus the contamination contribution. Transitions follow an exponential recombination kernel: rho = 1 - exp(-r * d) where d is the inter-site distance in base pairs.

Parameters:

bafs (np.ndarray) – m-length array of B-allele frequencies in [0, 1].
mat_haps (np.ndarray) – 2 x m array of 0/1 maternal haplotypes.
pat_haps (np.ndarray) – 2 x m array of 0/1 paternal haplotypes.
pos (np.ndarray) – m-length strictly increasing array of base-pair positions.
c (float) – Maternal contamination fraction in [0, 0.5].
std_dev (float) – BAF noise standard deviation (> 0).
r (float) – Per-base-pair recombination rate (> 0); default 1e-8.

Returns:

loglik – Total forward-algorithm log-likelihood across m sites.

Return type:

float

loglik_mcc_phased_poc(bafs, mat_haps, freqs, pos, c=0.0, std_dev=0.1, r=1e-08)[source]¶

Phase-aware log-likelihood of MCC for a single chromosome in a duo (POC).

Identical to loglik_mcc_phased_trio() but marginalises over the unobserved paternal genotype at each site using Hardy-Weinberg weights derived from freqs.

Parameters:

bafs (np.ndarray) – m-length array of B-allele frequencies in [0, 1].
mat_haps (np.ndarray) – 2 x m array of 0/1 maternal haplotypes.
freqs (np.ndarray) – m-length array of population allele frequencies in [0, 1].
pos (np.ndarray) – m-length strictly increasing array of base-pair positions.
c (float) – Maternal contamination fraction in [0, 0.5].
std_dev (float) – BAF noise standard deviation (> 0).
r (float) – Per-base-pair recombination rate (> 0); default 1e-8.

Returns:

loglik – Total forward-algorithm log-likelihood across m sites.

Return type:

float

est_mcc_phased_trio(bafs, mat_haps, pat_haps, pos, r=1e-08, algo='Nelder-Mead', **kwargs)[source]¶

Estimate MCC by MLE using the phase-aware HMM for a single trio chromosome.

Jointly optimises c and std_dev by maximising loglik_mcc_phased_trio().

Parameters:

bafs (np.ndarray) – m-length array of B-allele frequencies.
mat_haps (np.ndarray) – 2 x m array of 0/1 maternal haplotypes.
pat_haps (np.ndarray) – 2 x m array of 0/1 paternal haplotypes.
pos (np.ndarray) – m-length strictly increasing array of base-pair positions.
r (float) – Per-base-pair recombination rate; default 1e-8.
algo (str) – Scipy minimisation algorithm; default "Nelder-Mead".
**kwargs – Additional keyword arguments forwarded to scipy.optimize.minimize.

Returns:

c_est (float) – MLE of the contamination fraction.
sigma_est (float) – MLE of the BAF noise standard deviation.

est_mcc_phased_poc(bafs, mat_haps, freqs, pos, r=1e-08, algo='Nelder-Mead', **kwargs)[source]¶

Estimate MCC by MLE using the phase-aware HMM for a single duo (POC) chromosome.

Jointly optimises c and std_dev by maximising loglik_mcc_phased_poc().

Parameters:

bafs (np.ndarray) – m-length array of B-allele frequencies.
mat_haps (np.ndarray) – 2 x m array of 0/1 maternal haplotypes.
freqs (np.ndarray) – m-length array of population allele frequencies.
pos (np.ndarray) – m-length strictly increasing array of base-pair positions.
r (float) – Per-base-pair recombination rate; default 1e-8.
algo (str) – Scipy minimisation algorithm; default "Nelder-Mead".
**kwargs – Additional keyword arguments forwarded to scipy.optimize.minimize.

Returns:

c_est (float) – MLE of the contamination fraction.
sigma_est (float) – MLE of the BAF noise standard deviation.

mcc_ci_phased_trio(bafs, mat_haps, pat_haps, pos, c_hat=0.0, std_dev=0.1, r=1e-08, alpha=0.95, df=1)[source]¶

Profile-likelihood CI for MCC using the phase-aware trio HMM.

Parameters:

bafs (np.ndarray) – m-length array of B-allele frequencies.
mat_haps (np.ndarray) – 2 x m array of 0/1 maternal haplotypes.
pat_haps (np.ndarray) – 2 x m array of 0/1 paternal haplotypes.
pos (np.ndarray) – m-length strictly increasing array of base-pair positions.
c_hat (float) – MLE contamination fraction (the point estimate).
std_dev (float) – MLE BAF noise standard deviation (held fixed during profile).
r (float) – Per-base-pair recombination rate; default 1e-8.
alpha (float) – Confidence level in (0, 1); default 0.95.
df (int) – Degrees of freedom for chi-squared quantile; default 1.

Returns:

lower_CI (float) – Lower confidence bound (falls back to 0 if optimiser fails).
c_hat (float) – The supplied point estimate.
upper_CI (float) – Upper confidence bound (falls back to 0.5 if optimiser fails).

mcc_ci_phased_poc(bafs, mat_haps, freqs, pos, c_hat=0.0, std_dev=0.1, r=1e-08, alpha=0.95, df=1)[source]¶

Profile-likelihood CI for MCC using the phase-aware POC (duo) HMM.

Parameters:

bafs (np.ndarray) – m-length array of B-allele frequencies.
mat_haps (np.ndarray) – 2 x m array of 0/1 maternal haplotypes.
freqs (np.ndarray) – m-length array of population allele frequencies.
pos (np.ndarray) – m-length strictly increasing array of base-pair positions.
c_hat (float) – MLE contamination fraction (the point estimate).
std_dev (float) – MLE BAF noise standard deviation (held fixed during profile).
r (float) – Per-base-pair recombination rate; default 1e-8.
alpha (float) – Confidence level in (0, 1); default 0.95.
df (int) – Degrees of freedom for chi-squared quantile; default 1.

Returns:

lower_CI (float) – Lower confidence bound (falls back to 0 if optimiser fails).
c_hat (float) – The supplied point estimate.
upper_CI (float) – Upper confidence bound (falls back to 0.5 if optimiser fails).

loglik_mcc_genome_poc(baf_list, mat_haps_list, freqs_list, c=0.0, std_dev=0.1)[source]¶

Genome-wide log-likelihood of MCC summed across all chromosomes (POC model).

Chromosomes are conditionally independent given c and std_dev, so the genome-wide log-likelihood is the sum of per-chromosome likelihoods from loglik_mcc_poc(). Sex chromosomes should be excluded before calling.

Implemented by concatenating all chromosome arrays into a single call to loglik_mcc_poc(), which is equivalent to summing independent per-chromosome log-likelihoods but avoids repeated Python function-call overhead and assertion checks during optimisation.

Parameters:

baf_list (list of np.ndarray) – Per-chromosome BAF arrays.
mat_haps_list (list of np.ndarray) – Per-chromosome 2 x m maternal haplotype arrays.
freqs_list (list of np.ndarray) – Per-chromosome population allele frequency arrays.
c (float) – Maternal contamination fraction in [0, 0.5].
std_dev (float) – BAF noise standard deviation (> 0).

Returns:

loglik – Total log-likelihood summed across all chromosomes.

Return type:

float

loglik_mcc_genome_trio(baf_list, mat_haps_list, pat_haps_list, c=0.0, std_dev=0.1)[source]¶

Genome-wide log-likelihood of MCC summed across all chromosomes (trio model).

Implemented by concatenating all chromosome arrays into a single call to loglik_mcc_trio() (valid because sites are independent across chromosomes in the unphased model).

Parameters:

baf_list (list of np.ndarray) – Per-chromosome BAF arrays.
mat_haps_list (list of np.ndarray) – Per-chromosome 2 x m maternal haplotype arrays.
pat_haps_list (list of np.ndarray) – Per-chromosome 2 x m paternal haplotype arrays.
c (float) – Maternal contamination fraction in [0, 0.5].
std_dev (float) – BAF noise standard deviation (> 0).

Returns:

loglik – Total log-likelihood summed across all chromosomes.

Return type:

float

loglik_mcc_genome_phased_poc(baf_list, mat_haps_list, freqs_list, pos_list, c=0.0, std_dev=0.1, r=1e-08)[source]¶

Genome-wide phase-aware log-likelihood of MCC summed across all chromosomes (POC model).

Parameters:

baf_list (list of np.ndarray) – Per-chromosome BAF arrays.
mat_haps_list (list of np.ndarray) – Per-chromosome 2 x m maternal haplotype arrays.
freqs_list (list of np.ndarray) – Per-chromosome population allele frequency arrays.
pos_list (list of np.ndarray) – Per-chromosome strictly increasing position arrays.
c (float) – Maternal contamination fraction in [0, 0.5].
std_dev (float) – BAF noise standard deviation (> 0).
r (float) – Per-base-pair recombination rate; default 1e-8.

Returns:

loglik – Total forward-algorithm log-likelihood summed across all chromosomes.

Return type:

float

loglik_mcc_genome_phased_trio(baf_list, mat_haps_list, pat_haps_list, pos_list, c=0.0, std_dev=0.1, r=1e-08)[source]¶

Genome-wide phase-aware log-likelihood of MCC summed across all chromosomes (trio model).

Parameters:

baf_list (list of np.ndarray) – Per-chromosome BAF arrays.
mat_haps_list (list of np.ndarray) – Per-chromosome 2 x m maternal haplotype arrays.
pat_haps_list (list of np.ndarray) – Per-chromosome 2 x m paternal haplotype arrays.
pos_list (list of np.ndarray) – Per-chromosome strictly increasing position arrays.
c (float) – Maternal contamination fraction in [0, 0.5].
std_dev (float) – BAF noise standard deviation (> 0).
r (float) – Per-base-pair recombination rate; default 1e-8.

Returns:

loglik – Total forward-algorithm log-likelihood summed across all chromosomes.

Return type:

float

est_mcc_genome_poc(baf_list, mat_haps_list, freqs_list, algo='Nelder-Mead', **kwargs)[source]¶

Estimate MCC by MLE using all chromosomes jointly (POC model).

Maximises loglik_mcc_genome_poc() to obtain a single estimate of c and std_dev shared across all supplied chromosomes. Aneuploid chromosomes should be excluded from baf_list before calling; use est_mcc_per_chrom_poc() to identify outliers first.

Parameters:

baf_list (list of np.ndarray) – Per-chromosome BAF arrays.
mat_haps_list (list of np.ndarray) – Per-chromosome 2 x m maternal haplotype arrays.
freqs_list (list of np.ndarray) – Per-chromosome population allele frequency arrays.
algo (str) – Scipy minimisation algorithm; default "Nelder-Mead".
**kwargs – Additional keyword arguments forwarded to scipy.optimize.minimize.

Returns:

c_est (float) – MLE of the contamination fraction.
sigma_est (float) – MLE of the BAF noise standard deviation.

est_mcc_genome_trio(baf_list, mat_haps_list, pat_haps_list, algo='Nelder-Mead', **kwargs)[source]¶

Estimate MCC by MLE using all chromosomes jointly (trio model).

Maximises loglik_mcc_genome_trio() to obtain a single estimate of c and std_dev shared across all supplied chromosomes.

Parameters:

baf_list (list of np.ndarray) – Per-chromosome BAF arrays.
mat_haps_list (list of np.ndarray) – Per-chromosome 2 x m maternal haplotype arrays.
pat_haps_list (list of np.ndarray) – Per-chromosome 2 x m paternal haplotype arrays.
algo (str) – Scipy minimisation algorithm; default "Nelder-Mead".
**kwargs – Additional keyword arguments forwarded to scipy.optimize.minimize.

Returns:

c_est (float) – MLE of the contamination fraction.
sigma_est (float) – MLE of the BAF noise standard deviation.

est_mcc_genome_phased_poc(baf_list, mat_haps_list, freqs_list, pos_list, r=1e-08, algo='Nelder-Mead', **kwargs)[source]¶

Estimate MCC by MLE using all chromosomes jointly (phase-aware POC model).

Maximises loglik_mcc_genome_phased_poc().

Parameters:

baf_list (list of np.ndarray) – Per-chromosome BAF arrays.
mat_haps_list (list of np.ndarray) – Per-chromosome 2 x m maternal haplotype arrays.
freqs_list (list of np.ndarray) – Per-chromosome population allele frequency arrays.
pos_list (list of np.ndarray) – Per-chromosome strictly increasing position arrays.
r (float) – Per-base-pair recombination rate; default 1e-8.
algo (str) – Scipy minimisation algorithm; default "Nelder-Mead".
**kwargs – Additional keyword arguments forwarded to scipy.optimize.minimize.

Returns:

c_est (float) – MLE of the contamination fraction.
sigma_est (float) – MLE of the BAF noise standard deviation.

est_mcc_genome_phased_trio(baf_list, mat_haps_list, pat_haps_list, pos_list, r=1e-08, algo='Nelder-Mead', **kwargs)[source]¶

Estimate MCC by MLE using all chromosomes jointly (phase-aware trio model).

Maximises loglik_mcc_genome_phased_trio().

Parameters:

baf_list (list of np.ndarray) – Per-chromosome BAF arrays.
mat_haps_list (list of np.ndarray) – Per-chromosome 2 x m maternal haplotype arrays.
pat_haps_list (list of np.ndarray) – Per-chromosome 2 x m paternal haplotype arrays.
pos_list (list of np.ndarray) – Per-chromosome strictly increasing position arrays.
r (float) – Per-base-pair recombination rate; default 1e-8.
algo (str) – Scipy minimisation algorithm; default "Nelder-Mead".
**kwargs – Additional keyword arguments forwarded to scipy.optimize.minimize.

Returns:

c_est (float) – MLE of the contamination fraction.
sigma_est (float) – MLE of the BAF noise standard deviation.

est_mcc_per_chrom_poc(baf_list, mat_haps_list, freqs_list, algo='Nelder-Mead', **kwargs)[source]¶

Estimate MCC independently per chromosome (POC model).

Runs est_mcc_poc() separately on each chromosome and returns a per-chromosome estimate. Chromosomes whose c estimate is an outlier relative to the genome-wide median are candidates for aneuploidy and should be excluded before a joint genome-wide fit with est_mcc_genome_poc().

Parameters:

baf_list (list of np.ndarray) – Per-chromosome BAF arrays.
mat_haps_list (list of np.ndarray) – Per-chromosome 2 x m maternal haplotype arrays.
freqs_list (list of np.ndarray) – Per-chromosome population allele frequency arrays.
algo (str) – Scipy minimisation algorithm; default "Nelder-Mead".
**kwargs – Additional keyword arguments forwarded to scipy.optimize.minimize.

Returns:

results – List of (c_hat, std_dev_hat) tuples, one per chromosome.

Return type:

list of tuple

est_mcc_per_chrom_trio(baf_list, mat_haps_list, pat_haps_list, algo='Nelder-Mead', **kwargs)[source]¶

Estimate MCC independently per chromosome (trio model).

Runs est_mcc_trio() separately on each chromosome.

Parameters:

baf_list (list of np.ndarray) – Per-chromosome BAF arrays.
mat_haps_list (list of np.ndarray) – Per-chromosome 2 x m maternal haplotype arrays.
pat_haps_list (list of np.ndarray) – Per-chromosome 2 x m paternal haplotype arrays.
algo (str) – Scipy minimisation algorithm; default "Nelder-Mead".
**kwargs – Additional keyword arguments forwarded to scipy.optimize.minimize.

Returns:

results – List of (c_hat, std_dev_hat) tuples, one per chromosome.

Return type:

list of tuple

est_mcc_per_chrom_phased_poc(baf_list, mat_haps_list, freqs_list, pos_list, r=1e-08, algo='Nelder-Mead', **kwargs)[source]¶

Estimate MCC independently per chromosome (phase-aware POC model).

Runs est_mcc_phased_poc() separately on each chromosome.

Parameters:

baf_list (list of np.ndarray) – Per-chromosome BAF arrays.
mat_haps_list (list of np.ndarray) – Per-chromosome 2 x m maternal haplotype arrays.
freqs_list (list of np.ndarray) – Per-chromosome population allele frequency arrays.
pos_list (list of np.ndarray) – Per-chromosome strictly increasing position arrays.
r (float) – Per-base-pair recombination rate; default 1e-8.
algo (str) – Scipy minimisation algorithm; default "Nelder-Mead".
**kwargs – Additional keyword arguments forwarded to scipy.optimize.minimize.

Returns:

results – List of (c_hat, std_dev_hat) tuples, one per chromosome.

Return type:

list of tuple

est_mcc_per_chrom_phased_trio(baf_list, mat_haps_list, pat_haps_list, pos_list, r=1e-08, algo='Nelder-Mead', **kwargs)[source]¶

Estimate MCC independently per chromosome (phase-aware trio model).

Runs est_mcc_phased_trio() separately on each chromosome.

Parameters:

baf_list (list of np.ndarray) – Per-chromosome BAF arrays.
mat_haps_list (list of np.ndarray) – Per-chromosome 2 x m maternal haplotype arrays.
pat_haps_list (list of np.ndarray) – Per-chromosome 2 x m paternal haplotype arrays.
pos_list (list of np.ndarray) – Per-chromosome strictly increasing position arrays.
r (float) – Per-base-pair recombination rate; default 1e-8.
algo (str) – Scipy minimisation algorithm; default "Nelder-Mead".
**kwargs – Additional keyword arguments forwarded to scipy.optimize.minimize.

Returns:

results – List of (c_hat, std_dev_hat) tuples, one per chromosome.

Return type:

list of tuple

Recombination and Phase Correction¶

class karyohmm.karyohmm.RecombEst(**kwargs)[source]¶

Bases: PhaseCorrect

Crossover detection via allele-transmission comparison across sibling embryos.

Implements the simplified crossover-detection approach of Coop et al. (2007), extended with QuadHMM-based interval refinement. For each pair of (template, non-template) sibling embryos, the log-likelihood ratio of copying the same vs different parental allele is computed at informative markers (one parent heterozygous, the other homozygous). Crossover locations are identified as positions where the sign of this ratio changes consistently across sibling pairs.

Inherits the embryo storage, noise-parameter estimation, and phase-correction infrastructure from PhaseCorrect.

__init__(**kwargs)[source]¶

Initialize RecombEst inheriting from PhaseCorrect.

Parameters:: **kwargs – Forwarded to PhaseCorrect.__init__() (mat_haps, pat_haps, pos).

informative_markers(maternal=True)[source]¶

Return a boolean mask of allele-transmission-informative sites.

A site is informative for the maternal haplotype when the mother is heterozygous (genotype 1) and the father is homozygous (genotype 0 or 2), so the transmitted maternal allele is unambiguously determined by the embryo BAF. The paternal case is the mirror image.

Parameters:: maternal (bool) – If True (default) return sites informative for maternal crossovers; if False return sites informative for paternal crossovers.
Returns:: info_idx – m-length boolean mask; True at informative sites.
Return type:: np.ndarray

refine_recomb_events(potential_switches, npad=5)[source]¶

Refine recombination estimation using the switch-clusters approach of Coop et al 2007.

Parameters:

potential_switches (np.ndarray) – Array of potential switch indices at informative markers.
npad (int) – Half-width (in informative SNPs) of the switch cluster window.

Returns:

subset_co – Isolated crossover indices surviving the switch-cluster filter.

Return type:

list

isolate_recomb_events(template_embryo=0, maternal=True, ll_thresh=0, npad=5)[source]¶

Isolate specific recombination events.

Parameters:

template_embryo (int) – Index of the template embryo.
maternal (bool) – If True, estimate maternal crossovers; if False, estimate paternal crossovers.
ll_thresh (float) – Log-likelihood ratio threshold for transmission assignment.
npad (int) – Half-width (in informative SNPs) of the switch cluster window.

Returns:

Z (np.ndarray) – n_sib x n_info array of switch indicators for allele copying in non-template embryos.
llr_z (np.ndarray) – n_sib x n_info raw LLR values for allele-copying in non-template embryos.
potential_switches_filt (np.ndarray) – Majority-vote filtered switch indices at informative markers.
switch_cnts_filt (np.ndarray) – Support counts for each filtered switch.

identify_switch_intervals(Z, npad=5)[source]¶

Identify sign-change intervals in transmission indicator matrix Z.

Missing or uncertain sites (Z = 0) are imputed by forward-propagating the last non-zero sign before applying refine_recomb_events() to each row.

Parameters:

Z (np.ndarray) – n_sib x n_info matrix of transmission indicators (+1 same allele, -1 different allele, 0 uncertain).
npad (int) – Half-width of switch cluster used by refine_recomb_events().

Returns:

isolated_switches – Per-sibling lists of refined switch indices.

Return type:

list of list

second_refine_recomb(template_embryo=0, maternal=True, start=None, end=None)[source]¶

Refine a crossover interval using QuadHMM on the bracketing SNP window.

Uses QuadHMM pairwise Viterbi paths within the interval [start, end] to narrow the crossover location to the tightest flanking heterozygous-site pair that is consistent across all sibling pairs.

Parameters:

template_embryo (int) – Index of the reference embryo; default 0.
maternal (bool) – If True refine maternal crossovers; default True.
start (int) – Left SNP index (inclusive) of the coarse interval.
end (int) – Right SNP index (inclusive) of the coarse interval.

Returns:

p1 (float) – Refined left boundary position in base pairs.
p2 (float) – Refined right boundary position in base pairs.

finalize_recomb_events(potential_switches, template_embryo=0, maternal=True)[source]¶

Localise potential crossovers to the tightest flanking informative-marker interval.

For each entry in potential_switches calls second_refine_recomb() on the flanking informative-marker pair to obtain a refined base-pair interval.

Parameters:

potential_switches (list) – List of informative-marker indices at which a crossover is suspected (output of isolate_recomb_events()).
template_embryo (int) – Index of the reference embryo; default 0.
maternal (bool) – If True refine maternal crossovers; default True.

Returns:

rec_locations – List of (p1, p2) base-pair intervals bounding each crossover; empty list if potential_switches is empty.

Return type:

list of tuple

estimate_crossovers(template_embryo=0, ll_thresh=0, maternal=True, npad=5)[source]¶

Full crossover detection pipeline: isolation, majority-rule filtering, and refinement.

Calls isolate_recomb_events() to identify candidate switch positions, applies majority-rule filtering, then passes surviving candidates to finalize_recomb_events() for QuadHMM-based interval refinement.

Requires add_baf(), est_sigma_pi0s() to have been called first.

Parameters:

template_embryo (int) – Index of the reference embryo; default 0.
ll_thresh (float) – Log-likelihood ratio threshold separating “same” from “different” allele transmission; default 0 (LLR sign flip).
maternal (bool) – If True detect maternal crossovers; default True.
npad (int) – Half-width of the switch-cluster filter; default 5.

Returns:

rec_loc (list of tuple) – Refined base-pair intervals (p1, p2) bounding each crossover.
switch_cnts (np.ndarray) – Support counts for each filtered switch.

Notes

To obtain the raw transmission-indicator matrix and LLR values call isolate_recomb_events() directly.

class karyohmm.karyohmm.PhaseCorrect(mat_haps, pat_haps, pos)[source]¶

Bases: object

Module for implementing Mendelian phase correction using BAF data.

Implements the key methods for phasing parental haplotypes using Mendelian phasing via a LOD-score

__init__(mat_haps, pat_haps, pos)[source]¶

Initialize the class for phase correction.

Parameters:

mat_haps (np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.
pat_haps (np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.
pos (np.ndarray) – An m-length array of base-pair positions.

add_true_haps(true_mat_haps, true_pat_haps)[source]¶

Add in true haplotypes if available from a simulation.

Parameters:

true_mat_haps (np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.
true_pat_haps (np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.

add_baf(embryo_bafs=[])[source]¶

Add in BAF estimates for each embryo.

Parameters:: embryo_bafs (list) – A list of per-embryo BAF arrays from the same parental individuals.

est_sigma_pi0s(**kwargs)[source]¶

Estimate per-embryo BAF noise parameters under a disomy assumption.

Runs MetaHMM.est_sigma_pi0() (disomy model) on each embryo BAF array and stores the results in self.embryo_pi0s and self.embryo_sigmas. Must be called before viterbi_phase_correct() or flag_parental_genotype_errors().

Parameters:: **kwargs – Forwarded to MetaHMM.est_sigma_pi0() (e.g. pi0_bounds, sigma_bounds, algo).

correct_haps_viterbi(haps, paths)[source]¶

Correct haplotype phase using majority-rule consensus across copying paths.

At each inter-site transition the number of siblings showing a haplotype switch is counted. If a majority of siblings switch, the haplotype orientations prior to that site are flipped for both strands.

Parameters:

haps (np.ndarray) – 2 x m array of 0/1 haplotypes to be corrected.
paths (np.ndarray) – n_sib x m boolean array where entry [j, i] is 1 if sibling j copies haplotype 0 at site i (output of Viterbi decoding).

Returns:

n_mis (np.ndarray) – (m-1)-length array of per-interval mismatch counts.
fixed_haps (np.ndarray) – 2 x m phase-corrected haplotype array.

viterbi_phase_correct(niter=5, r=1e-08)[source]¶

Use the Viterbi decoding for phase correction in multiple iterations.

Parameters:

niter (int) – Number of iterations for Viterbi-based phase correction.
r (float) – Recombination rate per base pair.

Returns:

mat_haps (np.ndarray) – Phase-corrected 2 x m maternal haplotype array.
pat_haps (np.ndarray) – Phase-corrected 2 x m paternal haplotype array.
n_mis_mat_tot (np.ndarray) – niter-length array of total maternal mismatch counts per iteration.
n_mis_pat_tot (np.ndarray) – niter-length array of total paternal mismatch counts per iteration.

flag_parental_genotype_errors(use_fixed=False, r=1e-08)[source]¶

Flag potential parental genotype errors using multiple euploid siblings.

Runs forward-backward under disomy for each sibling embryo and accumulates the posterior-weighted log Bayes factor error scores across all siblings. Genuine genotype errors are consistently flagged across siblings (same parental genotype), while per-embryo noise is uncorrelated and averages out.

Parameters:

use_fixed (bool) – Use phase-corrected haplotypes if available (default: False).
r (float) – Recombination rate per base pair.

Returns:

mat_err_scores (np.ndarray) – m-length array of summed maternal error scores.
pat_err_scores (np.ndarray) – m-length array of summed paternal error scores.

estimate_switch_err_true(maternal=True, fixed=False)[source]¶

Estimate the switch error from true and inferred haplotypes.

The switch error is defined as consecutive heterozygotes that are in the incorrect orientation.

Parameters:

maternal (bool) – Apply the function to the maternal haplotypes (default: True).
fixed (bool) – Apply the function to the fixed haplotypes (default: False).

Returns:

n_switches (int) – Number of switches between consecutive heterozygotes.
n_consecutive_hets (int) – Number of consecutive heterozygotes.
switch_err_rate (float) – Number of switches per consecutive heterozygote.
switch_idx (list) – SNP pairs where the variant is out of phase with its predecessor.
het_idx (np.ndarray) – Locations/indices of the heterozygotes.
lods (None) – LOD-scores for the estimated switches (currently always None).

Base Class¶

class karyohmm.karyohmm.AneuploidyHMM[source]¶

Bases: object

Abstract base class shared by all karyohmm HMM models.

Defines the common interface (forward_algorithm, backward_algorithm, forward_backward, viterbi_algorithm) and shared parameter-estimation helpers. Concrete subclasses (MetaHMM, QuadHMM, PocHMM) implement the state space and emission model appropriate for their data type.

ploidy¶

Expected ploidy of the embryo (0 = meta/unknown, 2 = disomy).

Type:: int

aploid¶

String tag identifying the active model variant (e.g. "meta", "disomy", "meta+upd").

Type:: str or None

__init__()[source]¶: Initialize the base aneuploidy HMM class.

get_state_str(state)[source]¶

Obtain a string representation of the HMM state.

Parameters:: state (tuple) – A tuple of integers (typically length 4) encoding haplotype indices.
Returns:: state_str – String representation of the state.
Return type:: str

est_sigma_pi0(bafs, pos, mat_haps, pat_haps, algo='L-BFGS-B', pi0_bounds=(0.01, 0.99), sigma_bounds=(0.01, 1.0), opt_tol=0.0001, opt_options=None, **kwargs)[source]¶

Estimate sigma and pi0 under the B-Allele Frequency model using optimization of forward algorithm likelihood.

Parameters:

bafs (np.ndarray) – B-allele frequencies across all m sites.
pos (np.ndarray) – Basepair positions of the SNPs.
mat_haps (np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.
pat_haps (np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.
algo (str) – One of Nelder-Mead, L-BFGS-B, or Powell algorithms for optimization.
opt_tol (float) – Master tolerance passed to scipy minimize (default 1e-4).
opt_options (dict or None) – Algorithm-specific options passed to scipy minimize; merged over per-algorithm defaults, so only keys you want to override are needed.

Returns:

pi0_est (float) – Estimate of sparsity parameter (pi0) for B-allele emission model.
sigma_est (float) – Estimate of noise parameter (sigma) for B-allele emission model.

full_param_inference(bafs, pos, mat_haps, pat_haps, algo='L-BFGS-B', pi0_bounds=(0.01, 0.99), sigma_bounds=(0.01, 1.0))[source]¶

Full parameter inference under the B-allele frequency model using naive optimization of the forward algorithm.

Parameters:

bafs (np.ndarray) – B-allele frequencies across all m sites.
pos (np.ndarray) – Basepair positions of the SNPs.
mat_haps (np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.
pat_haps (np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.
algo (str) – One of Nelder-Mead, L-BFGS-B, or Powell algorithms for optimization.

Notes

Not implemented yet.

Simulators¶

class karyohmm.simulator.PGTSim[source]¶

Bases: PGTSimBase

Simulation of meiotic-origin whole chromosome ploidy changes.

__init__()[source]¶

Initialize the full Aneuploidy Simulator.

Returns: a PGTSim object

full_ploidy_sim(afs=None, reads=False, ploidy=2, m=10000, upd=None, length=10000000.0, rec_rate=0.0001, mat_skew=0.5, std_dev=0.15, mix_prop=0.3, alpha=1.0, coverage=5.0, a=10.0, b=10.0, switch_err_rate=0.01, err_rate=0.001, seed=42)[source]¶: Simulate a single embryo with a given ploidy status.

sibling_euploid_sim(afs=None, reads=False, ploidy=2, m=10000, length=10000000.0, nsibs=3, rec_rate=1e-08, std_dev=0.15, mix_prop=0.3, alpha=1.0, coverage=5.0, a=10.0, b=10.0, switch_err_rate=0.01, err_rate=0.001, seed=42)[source]¶: Simulate euploid sibling embryos.

sim_from_haps(mat_haps, pat_haps, pos, reads=False, afs=None, ploidy=2, rec_rate=1e-08, mat_skew=0.5, std_dev=0.15, mix_prop=0.3, alpha=0.5, coverage=5.0, a=10.0, b=10.0, switch_err_rate=0.01, err_rate=0.001, seed=42)[source]¶: Simulate data from pre-existing haplotype data.

sim_cell_contamination(baf, haps, fraction=0.01, seed=42)[source]¶: Simulate parental cell contamination (typically maternal)

class karyohmm.simulator.PGTSimMosaic[source]¶

Bases: PGTSimBase

Simulator for mosaic aneuploidies.

__init__()[source]¶

Initialize the mosaic aneuploidy simulator.

Returns: a PGTSim object

mixed_ploidy_sim(afs=None, ploidies=array([0, 1, 2, 3]), props=array([0., 0., 1., 0.]), ncells=10, m=10000, length=10000000.0, rec_rate=1e-08, mat_skew=0.5, std_dev=0.15, mix_prop=0.3, alpha=1.0, switch_err_rate=0.01, err_rate=0.001, seed=42)[source]¶: Simulate BAF from a mixture of ploidies.

sim_from_haps(mat_haps, pat_haps, pos, afs=None, ploidies=array([0, 1, 2, 3]), props=array([0., 0., 1., 0.]), ncells=10, rec_rate=1e-08, mat_skew=0.5, std_dev=0.15, mix_prop=0.3, alpha=1.0, switch_err_rate=0.01, err_rate=0.001, seed=42)[source]¶: Simulate mosaic aneuploidies from known parental haplotypes.

class karyohmm.simulator.PGTSimSegmental[source]¶

Bases: PGTSimBase

Simulator for segmental aneuploidies.

__init__()[source]¶: Initialize simulation of segmental data.

seg_aneuploidy(m=100, mean_size=10, ploidy=3, mat_skew=0.5, seed=42)[source]¶: Choose the position and type of the segmental aneuploidy.

sim_haplotype_paths_segmental(mat_haps, pat_haps, pos, start, end, aneu_type='3m', rec_rate=0.0001, seed=42)[source]¶

Simulate haplotypes and segmental aneuploidies.

This simulates a local segmental aneuploidy in an otherwise disomic background.

sim_b_allele_freq_segmental(mat_hap, pat_hap, ploidies, std_dev=0.1, mix_prop=0.6, seed=42)[source]¶: Simulate B-allele frequency conditional on parental haplotypes.

sim_read_counts_segmental(mat_hap, pat_hap, ploidies, coverage=1.0, a=10.0, b=10.0, eps=0.01, seed=42)[source]¶

Simulate read counts for segmental embryo biopsies.

Parameters:

mat_hap (np.array) – 1 x M numpy array for maternal haplotype.
pat_hap (np.array) – 1 x M numpy array for paternal haplotype.
ploidies (np.array) – ploidy count of number of chromosomes being simulated.
coverage (float) – mean coverage for a standard diploid individual.
a (float) – shape parameter for beta-binomial.
b (float) – scale parameter for beta-binomial.
eps (float) – small noise parameter.
seed (int) – random number seed.

Output:: true_geno (np.array): the true genotype of the embryo. alt_reads (np.array): the number of reads with alt. allele. ref_reads (np.array): the number of reads with ref. allele.

full_segmental_sim(reads=False, afs=None, ploidy=2, m=10000, length=50000000.0, mat_skew=0.5, rec_rate=1e-08, std_dev=0.1, mix_prop=0.7, mean_size=10, coverage=5.0, a=10.0, b=10.0, switch_err_rate=0.01, err_rate=0.001, seed=42)[source]¶: Conduct a full simulation of segmental aneuploidies conditional on parental haplotypes.

sim_from_haps(mat_haps, pat_haps, pos, reads=False, afs=None, rec_rate=0.0001, std_dev=0.1, mix_prop=0.7, mean_size=10, coverage=5.0, a=10.0, b=10.0, switch_err_rate=0.01, err_rate=0.001, seed=42)[source]¶: Simulate a segmental aneuploidy from known haplotypes.

class karyohmm.simulator.PGTSimVCF[source]¶

Bases: PGTSim

Implements PGT-simulation from parental haplotypes.

__init__()[source]¶: Initialize the class.

gen_parental_haplotypes(vcf_fp, maternal_id=None, paternal_id=None, **kwargs)[source]¶

Generate parental haplotypes from an actual VCF.

Parameters:

vcf_fp (str) – path to an input VCF file.
maternal_id (str) – ID of maternal individual.
paternal_id (str) – ID of paternal individual.

Output:: mat_haps (np.array): paternal haplotypes. pat_haps (np.array): maternal haplotypes. pos (np.array): basepair positions of variants. afs (np.array): alternative allele frequency. df (pd.DataFrame): pandas dataframe of cleaned options.

class karyohmm.simulator.PGTSimBase[source]¶

Bases: object

Base-class for simulation of PGT-A data.

__init__()[source]¶: Initialize the PGT-A Simulator.

draw_parental_genotypes(afs=None, m=100, seed=42)[source]¶

Draw parental genotypes from a beta distribution.

Parameters:

afs (np.array) – realized allele frequencies.
m (int) – number of variants to simulate.
seed (int) – random number seed.

Output:: maternal_haps (np.array): maternal haplotypes paternal_haps (np.array): paternal haplotypes ps (np.array): allele frequency of variants

create_genotyping_errors(haps, err_rate=0.001, seed=42)[source]¶

Create genotyping errors in haplotypes sampled.

Parameters:

haps (np.array) – haplotype array from either maternal or paternal
err_rate (float) – site-level error rate in genotyping
seed (int) – random seed

Output:: err_idx (np.array): m-length array indicating simulated errors err_haps (np.array): haplotypes which reflect newly simulated errors

create_switch_errors_help(haps, err_rate=0.001, seed=42)[source]¶

Revised method to create switch errors.

Parameters:

haps (np.array) – 2 x M numpy array of haplotypes for bi-allelic variants.
err_rate (float) – switch error rate between heterozygotes.
seed (int) – random number seed.

Output:: haps_prime (np.array): switched haplotypes. switches (np.array): snp-indices of haplotype switches.

create_switch_errors(mat_haps, pat_haps, err_rate=0.001, seed=42)[source]¶

Create switch errors to evaluate impact of poor phasing.

Parameters:

mat_haps (np.array) – 2 x M numpy array for maternal haplotypes.
pat_haps (np.array) – 2 x M numpy array for paternal haplotypes.
err_rate (float) – switch error rate between heterozygotes.
seed (int) – random number seed.

Output:: mat_haps_prime (np.array): switched maternal haplotypes. pat_haps_prime (np.array): switched paternal haplotypes. m_switches (np.array): snp-indices of maternal haplotype switches. p_switches (np.array): snp-indices of maternal haplotype switches.

sim_haplotype_paths(mat_haps, pat_haps, pos, ploidy=2, rec_rate=1e-08, mat_skew=0.5, seed=42)[source]¶

Simulate copying paths through the maternal and paternal haplotypes.

Parameters:

mat_haps (np.array) – 2 x M numpy array for maternal haplotypes.
pat_haps (np.array) – 2 x M numpy array for paternal haplotypes.
pos (np.array) – position of individual variants.
ploidy (int) – integer of number of chromosomes being simulated.
rec_rate (float) – uniform recombination rate per basepair.
mat_skew (float) – probability of maternal-origin aneuploidy.
seed (int) – random number seed.

Output:: zs_maternal (np.array): copying path through maternal haplotypes. zs_paternal (np.array): copying path through paternal haplotypes. mat_real_hap (np.array): simulated embryo maternal haplotype(s). pat_real_hap (np.array): simulated embryo paternal haplotype(s). aploid (str): indicator of aneuploidy status.

sim_haplotype_paths_upd(mat_haps, pat_haps, pos, ploidy=2, aploid='2p0', rec_rate=1e-08, mat_skew=0.5, seed=42)[source]¶: Extension to simulate haplotype-copying paths under uniparental disomy.

sim_b_allele_freq(mat_hap, pat_hap, ploidy=2, std_dev=0.2, mix_prop=0.3, seed=42)[source]¶: Simulate of B-allele frequency.

sim_logR_ratio(mat_hap, pat_hap, ploidy=2, alpha=0.5, seed=42)[source]¶

Simulate logR-ratio conditional on ploidy.

Alpha is the degree to which the variance is increased for the LRR.

sim_read_counts(mat_hap, pat_hap, ploidy, coverage=1.0, a=10.0, b=10.0, eps=0.01, seed=42)[source]¶

Simulate read counts for embryo biopsies.

Parameters:

mat_hap (np.array) – 1 x M numpy array for maternal haplotype.
pat_hap (np.array) – 1 x M numpy array for paternal haplotype.
pos (np.array) – position of individual variants.
ploidy (int) – integer of number of chromosomes being simulated.
coverage (float) – mean coverage for a standard diploid individual.
a (float) – shape parameter for beta-binomial.
b (float) – scale parameter for beta-binomial.
eps (float) – small noise parameter.
seed (int) – random number seed.

Output:: true_geno (np.array): the true genotype of the embryo. alt_reads (np.array): the number of reads with alt. allele. ref_reads (np.array): the number of reads with ref. allele.

sim_joint_het(switch=False, nsibs=1, meta_seed=42, **kwargs)[source]¶: Simulate a joint heterozygote and potential switch error.

sim_haplotype_ref_panel(haps, pos, panel_size=10, seed=42, **kwargs)[source]¶: Simulate a haplotype reference panel from haplotypes.

I/O Utilities¶

Classes for reading in data for karyoHMM.

Implements methods for both MetaHMM and PoCHMM-based datasets.

class karyohmm.io.DataReader(mode='Meta', duo_maternal=None)[source]¶

Bases: object

Input / output class for reading in data for karyohmm.

__init__(mode='Meta', duo_maternal=None)[source]¶: Initialize the DataReader class for a given mode.

read_data_np(input_fp)[source]¶

Read data from an .npy or npz file and reformat for karyohmm.

Parameters:: input_fp (str) – path to input NPZ or NPY file.

Output:: df (pl.DataFrame): polars dataframe of cleaned options.

read_data_df(input_fp)[source]¶

Read in data from a pre-existing text-based dataset.

Parameters:: input_fp (str) – path to input TSV/CSV/TXT file.

Output:: df (pl.DataFrame): polars dataframe of cleaned options.

read_data(input_fp)[source]¶

Read in data in either pandas/numpy format.

Parameters:: input_fp (str) – path to input file.

Output:: df (pd.DataFrame): pandas dataframe of cleaned options.