karyohmm API Reference¶
This page documents the public API of the karyohmm package.
Core HMM Models¶
- class karyohmm.karyohmm.MetaHMM(disomy=False, upd=False)[source]¶
Bases:
AneuploidyHMMHMM for whole-chromosome aneuploidy detection in trio PGT data.
Evaluates a joint state space covering nullisomy (0), maternal and paternal monosomy (1m / 1p), disomy (2), maternal and paternal trisomy (3m / 3p), and optionally uniparental disomy (UPD) states. Each hidden state encodes which pair of parental haplotypes were transmitted to the embryo; transitions follow an exponential kernel parameterised by an intra-karyotype recombination rate
rand an inter-karyotype switching ratea.The BAF emission at each site is a Gaussian mixture centred on the expected allele dosage for the current copying state, controlled by
pi0(fraction of sites with no signal) andstd_dev(noise). LRR data are incorporated through a Gaussian emission centred on the expected copy-number fold change.- Parameters:
disomy (
bool, optional) – Restrict the state space to disomy-only (4 states). Useful for genotyping embryos without calling aneuploidies. DefaultFalse.upd (
bool, optional) – Add uniparental disomy states to the full state space. Requires LRR data to distinguish UPD from normal disomy. DefaultFalse.
- states¶
Active HMM states; each tuple is
(m0, m1, p0, p1)where values are haplotype indices (0 or 1) or -1 when that haplotype is absent.
- __init__(disomy=False, upd=False)[source]¶
Initialize the MetaHMM class for determining chromosomal aneuploidy.
- forward_algorithm(bafs, lrrs, sigmas, pos, mat_haps, pat_haps, pi0=0.5, std_dev=0.25, r=1e-08, a=0.01, unphased=False, mat_err=0.0, pat_err=0.0)[source]¶
Forward HMM algorithm under a multi-ploidy model.
- Parameters:
bafs (
np.ndarray) – B-allele frequencies across all m sites.lrrs (
np.ndarray) – m-length array of log-R ratio values.sigmas (
np.ndarray) – m-length array of per-site LRR noise standard deviations.pos (
np.ndarray) – m-length vector of basepair positions for sites.mat_haps (
np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.pat_haps (
np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.pi0 (
float) – Sparsity parameter for B-allele emission model.std_dev (
float) – Standard deviation for B-allele emission model.r (
float) – Intra-karyotype transition rate (recombination).a (
float) – Inter-karyotype transition rate.unphased (
bool) – Run the model in unphased mode.mat_err (
float) – Per-site maternal genotyping error rate.pat_err (
float) – Per-site paternal genotyping error rate.
- Returns:
- backward_algorithm(bafs, lrrs, sigmas, pos, mat_haps, pat_haps, pi0=0.5, std_dev=0.25, r=1e-08, a=0.01, unphased=False, mat_err=0.0, pat_err=0.0)[source]¶
Backward HMM algorithm under a given statespace model.
- Parameters:
bafs (
np.ndarray) – B-allele frequencies across all m sites.lrrs (
np.ndarray) – m-length array of log-R ratio values.sigmas (
np.ndarray) – m-length array of per-site LRR noise standard deviations.pos (
np.ndarray) – m-length vector of basepair positions for sites.mat_haps (
np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.pat_haps (
np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.pi0 (
float) – Sparsity parameter for B-allele emission model.std_dev (
float) – Standard deviation for B-allele emission model.r (
float) – Intra-karyotype transition rate (recombination).a (
float) – Inter-karyotype transition rate.unphased (
bool) – Run the model in unphased mode.mat_err (
float) – Per-site maternal genotyping error rate.pat_err (
float) – Per-site paternal genotyping error rate.
- Returns:
betas (
np.ndarray) – Backward variables from HMM across the k states.scaler (
np.ndarray) – m-length array of scale parameters.states (
list) – Tuple representation of states.karyotypes (
np.ndarray) – Array of karyotypes in the MetaHMM model.loglik (
float) – Total log-likelihood of B-allele frequency.
- forward_backward(bafs, lrrs, sigmas, pos, mat_haps, pat_haps, pi0=0.2, std_dev=0.25, r=1e-08, a=0.01, unphased=False, mat_err=0.0, pat_err=0.0)[source]¶
Run the forward-backward algorithm across all states.
- Parameters:
bafs (
np.ndarray) – B-allele frequencies across all m sites.lrrs (
np.ndarray) – m-length array of log-R ratio values.sigmas (
np.ndarray) – m-length array of per-site LRR noise standard deviations.pos (
np.ndarray) – m-length vector of basepair positions for sites.mat_haps (
np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.pat_haps (
np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.pi0 (
float) – Sparsity parameter for B-allele emission model.std_dev (
float) – Standard deviation for B-allele emission model.r (
float) – Intra-karyotype transition rate (recombination).a (
float) – Inter-karyotype transition rate.unphased (
bool) – Run the model in unphased mode.mat_err (
float) – Per-site maternal genotyping error rate.pat_err (
float) – Per-site paternal genotyping error rate.
- Returns:
gammas (
np.ndarray) – Log posterior density of being in each of k hidden states.states (
list) – Tuple representation of states.karyotypes (
np.ndarray) – Array of karyotypes in the model.
- viterbi_algorithm(bafs, lrrs, sigmas, pos, mat_haps, pat_haps, pi0=0.2, std_dev=0.25, r=1e-08, a=0.01, unphased=False, mat_err=0.0, pat_err=0.0)[source]¶
Implement the Viterbi traceback through karyotypic states.
- Parameters:
bafs (
np.ndarray) – B-allele frequencies across all m sites.lrrs (
np.ndarray) – m-length array of log-R ratio values.sigmas (
np.ndarray) – m-length array of per-site LRR noise standard deviations.pos (
np.ndarray) – m-length vector of basepair positions for sites.mat_haps (
np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.pat_haps (
np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.pi0 (
float) – Sparsity parameter for B-allele emission model.std_dev (
float) – Standard deviation for B-allele emission model.r (
float) – Intra-karyotype transition rate.a (
float) – Inter-karyotype transition rate.unphased (
bool) – Run the model in unphased mode.mat_err (
float) – Per-site maternal genotyping error rate.pat_err (
float) – Per-site paternal genotyping error rate.
- Returns:
path (
np.ndarray) – Most likely copying path through k states in the model.states (
list) – Tuple representation of states.deltas (
np.ndarray) – Delta variable (maximum path probability at step m).psi (
np.ndarray) – Storage vector for psi variable.
- marginal_posterior_karyotypes(gammas, karyotypes)[source]¶
Obtain the marginal posterior (not logged) probability over karyotypic states.
- Parameters:
gammas (
np.ndarray) – A k x m array of log-posterior density across sites.karyotypes (
np.ndarray) – Karyotype labels for all k states.
- Returns:
gamma_karyo (
np.ndarray) – Collapsed posteriors summed per unique karyotype.uniq_karyo (
np.ndarray) – Unique karyotype labels in the order they appear.
- posterior_karyotypes(gammas, karyotypes)[source]¶
Obtain full posterior on karyotypes chromosome-wide.
This is the weighted proportion of time spent in each karyotypic state.
- Parameters:
gammas (
np.ndarray) – A k x m array of log-posterior density across sites.karyotypes (
np.ndarray) – Karyotype labels for all k states.
- Returns:
kar_prob – Dictionary mapping karyotype label to chromosome-wide posterior probability.
- Return type:
- genotype_embryo(bafs, lrrs, sigmas, pos, mat_haps, pat_haps, pi0=0.2, std_dev=0.25, r=1e-08, a=0.01, unphased=False, viterbi=False)[source]¶
Obtain genotype dosages for a putative disomic embryo.
- Parameters:
bafs (
np.ndarray) – B-allele frequencies across all m sites.lrrs (
np.ndarray) – m-length array of log-R ratio values.sigmas (
np.ndarray) – m-length array of per-site LRR noise standard deviations.pos (
np.ndarray) – m-length vector of basepair positions for sites.mat_haps (
np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.pat_haps (
np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.pi0 (
float) – Sparsity parameter for B-allele emission model.std_dev (
float) – Standard deviation for B-allele emission model.r (
float) – Intra-karyotype transition rate.a (
float) – Inter-karyotype transition rate.unphased (
bool) – Run the model in unphased mode.viterbi (
bool) – Estimate the embryo genotypes using the Viterbi algorithm.
- Returns:
dosages – A 3 x m array of genotype probabilities (RR, RA, AA).
- Return type:
np.ndarray
- flag_parental_genotype_errors(gammas, states, bafs, mat_haps, pat_haps, pi0=0.2, std_dev=0.25)[source]¶
Identify per-site parental genotype calls that are inconsistent with the posterior.
For each site, computes a posterior-weighted log Bayes factor comparing the best alternative parental genotype (fixing the other parent) to the called one. A positive score means the data, given the inferred copying state, prefers a different parental genotype at that site.
- Parameters:
gammas (
np.ndarray) – k x m log-posterior array from forward_backward.states (
list) – List of state tuples (output of forward_backward).bafs (
np.ndarray) – m-length array of B-allele frequencies.mat_haps (
np.ndarray) – 2 x m array of 0/1 maternal haplotypes.pat_haps (
np.ndarray) – 2 x m array of 0/1 paternal haplotypes.pi0 (
float) – Sparsity parameter for BAF emission.std_dev (
float) – Noise parameter for BAF emission.
- Returns:
mat_err_score (
np.ndarray) – m-length array; positive values flag likely maternal genotype errors.pat_err_score (
np.ndarray) – m-length array; positive values flag likely paternal genotype errors.
- class karyohmm.karyohmm.QuadHMM[source]¶
Bases:
AneuploidyHMMJoint HMM for two sibling embryos used to detect inter-sibling haplotype sharing.
Models the joint copying state of two euploid siblings as a pair of independent disomy states drawn from the same parental haplotypes. The 16-state product space (4 × 4 single-embryo disomy states) is collapsed into four Roach et al. 2010 identity-by-descent classes:
0: maternally haploidentical (siblings share the same maternal haplotype)
1: paternally haploidentical (siblings share the same paternal haplotype)
2: fully identical (same maternal and paternal haplotype)
3: non-identical (different maternal and paternal haplotypes)
Transitions between copying states are governed by a single recombination rate parameter
r; the BAF emission for each sibling follows the same Gaussian mixture model asMetaHMMbut with per-siblingpi0andstd_dev.- states¶
16 joint copying states, each
((m0, _, p0, _), (m1, _, p1, _)).
- forward_algorithm(bafs, pos, mat_haps, pat_haps, pi0=(0.7, 0.7), std_dev=(0.15, 0.15), r=1e-08, mat_err=0.0, pat_err=0.0)[source]¶
Implement the forward algorithm for QuadHMM model.
- Parameters:
bafs (
list) – List of two np.ndarray arrays of B-allele frequencies across m sites for two siblings.pos (
np.ndarray) – m-length vector of basepair positions for sites.mat_haps (
np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.pat_haps (
np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.pi0 (
tupleoffloat) – Sparsity parameters for B-allele emission model, one per sibling.std_dev (
tupleoffloat) – Standard deviation parameters for B-allele emission model, one per sibling.r (
float) – Inter-state transition rate.mat_err (
float) – Per-site maternal genotyping error rate.pat_err (
float) – Per-site paternal genotyping error rate.
- Returns:
alphas (
np.ndarray) – Forward variable from HMM across the joint sibling states.scaler (
np.ndarray) – m-length array of scale parameters.states (
list) – Tuple representation of the 16 joint states.karyotypes (
np.ndarray) – Array of karyotypes (all entries"2").loglik (
float) – Total log-likelihood of joint sibling B-allele frequencies.
- backward_algorithm(bafs, pos, mat_haps, pat_haps, pi0=(0.7, 0.7), std_dev=(0.15, 0.15), r=1e-08, mat_err=0.0, pat_err=0.0)[source]¶
Implement the backward algorithm for QuadHMM model.
- Parameters:
bafs (
list) – List of two np.ndarray arrays of B-allele frequencies across m sites for two siblings.pos (
np.ndarray) – m-length vector of basepair positions for sites.mat_haps (
np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.pat_haps (
np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.pi0 (
tupleoffloat) – Sparsity parameters for B-allele emission model, one per sibling.std_dev (
tupleoffloat) – Standard deviation parameters for B-allele emission model, one per sibling.r (
float) – Inter-state transition rate.mat_err (
float) – Per-site maternal genotyping error rate.pat_err (
float) – Per-site paternal genotyping error rate.
- Returns:
betas (
np.ndarray) – Backward variable from HMM across the joint sibling states.scaler (
np.ndarray) – m-length array of scale parameters.states (
list) – Tuple representation of the 16 joint states.karyotypes (
np.ndarray) – Array of karyotypes (all entries"2").loglik (
float) – Total log-likelihood of joint sibling B-allele frequencies.
- forward_backward(bafs, pos, mat_haps, pat_haps, pi0=(0.7, 0.7), std_dev=(0.15, 0.15), r=1e-08, a=0.01, mat_err=0.0, pat_err=0.0)[source]¶
Implement the forward-backward algorithm for the QuadHMM model.
- Parameters:
bafs (
list) – List of two np.ndarray arrays of B-allele frequencies across m sites for two siblings.pos (
np.ndarray) – m-length vector of basepair positions for sites.mat_haps (
np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.pat_haps (
np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.pi0 (
tupleoffloat) – Sparsity parameters for B-allele emission model, one per sibling.std_dev (
tupleoffloat) – Standard deviation parameters for B-allele emission model, one per sibling.r (
float) – Inter-state transition rate.a (
float) – Unused; retained for API compatibility.mat_err (
float) – Per-site maternal genotyping error rate.pat_err (
float) – Per-site paternal genotyping error rate.
- Returns:
- viterbi_algorithm(bafs, pos, mat_haps, pat_haps, pi0=(0.7, 0.7), std_dev=(0.15, 0.15), r=1e-08, mat_err=0.0, pat_err=0.0)[source]¶
Viterbi algorithm for the QuadHMM model.
- Parameters:
bafs (
list) – List of two np.ndarray arrays of B-allele frequencies across m sites for two siblings.pos (
np.ndarray) – m-length vector of basepair positions for sites.mat_haps (
np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.pat_haps (
np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.pi0 (
tupleoffloat) – Sparsity parameters for B-allele emission model, one per sibling.std_dev (
tupleoffloat) – Standard deviation parameters for B-allele emission model, one per sibling.r (
float) – Inter-state transition rate.mat_err (
float) – Per-site maternal genotyping error rate.pat_err (
float) – Per-site paternal genotyping error rate.
- Returns:
path (
np.ndarray) – Most likely copying path through the joint sibling states.states (
list) – Tuple representation of the 16 joint states.deltas (
np.ndarray) – Delta variable (maximum path probability at each site).psi (
np.ndarray) – Storage vector for psi variable.
- restrict_path(path)[source]¶
Break down states into the same categories as Roach et al for determining recombinations.
The state definitions used for tracing sibling haplotypes are:
0: maternal-haploidentical
1: paternal-haploidentical
2: identical
3: non-identical
- Parameters:
path (
np.ndarray) – A sequence of states output from the Viterbi algorithm (16 possible states).- Returns:
refined_path – A sequence of m states in 0, 1, 2, 3 indicating states from Roach et al.
- Return type:
np.ndarray
- restrict_states()[source]¶
Break down states into the same categories as Roach et al for determining recombinations.
- Returns:
- viterbi_path(bafs, pos, mat_haps, pat_haps, pi0=(0.7, 0.7), std_dev=(0.15, 0.15), r=1e-08)[source]¶
Obtain the restricted Viterbi path for traceback.
- Parameters:
bafs (
list) – List of two np.ndarray arrays of B-allele frequencies across m sites for two siblings.pos (
np.ndarray) – m-length vector of basepair positions for sites.mat_haps (
np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.pat_haps (
np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.pi0 (
tupleoffloat) – Sparsity parameters for B-allele emission model, one per sibling.std_dev (
tupleoffloat) – Standard deviation parameters for B-allele emission model, one per sibling.r (
float) – Inter-state transition rate.
- Returns:
res_path – Most likely copying path restricted to 4 states from Roach et al.
- Return type:
np.ndarray
- map_path(bafs, pos, mat_haps, pat_haps, pi0=(0.7, 0.7), std_dev=(0.15, 0.15), r=1e-08)[source]¶
Obtain the Maximum A-Posteriori Path across restricted states.
- Parameters:
bafs (
list) – List of two np.ndarray arrays of B-allele frequencies across m sites for two siblings.pos (
np.ndarray) – m-length vector of basepair positions for sites.mat_haps (
np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.pat_haps (
np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.pi0 (
tupleoffloat) – Sparsity parameters for B-allele emission model, one per sibling.std_dev (
tupleoffloat) – Standard deviation parameters for B-allele emission model, one per sibling.r (
float) – Inter-state transition rate.
- Returns:
map_path – Max-a-posteriori copying path through 4 sibling copying states.
- Return type:
np.ndarray
- flag_parental_genotype_errors(gammas, states, bafs, mat_haps, pat_haps, pi0=(0.7, 0.7), std_dev=(0.15, 0.15))[source]¶
Identify per-site parental genotype calls that are inconsistent with the posterior.
For each site, computes a posterior-weighted log Bayes factor comparing the best alternative parental genotype (fixing the other parent) to the called one. Both siblings observe the same parental genotype, so the joint sibling BAF likelihood is used when scoring alternatives. A positive score means the data, given the inferred copying state, prefers a different parental genotype at that site.
- Parameters:
gammas (
np.ndarray) – k x m log-posterior array from forward_backward.states (
list) – List of paired state tuples (output of forward_backward).bafs (
list) – List of two m-length np.ndarray BAF arrays for the two siblings.mat_haps (
np.ndarray) – 2 x m array of 0/1 maternal haplotypes.pat_haps (
np.ndarray) – 2 x m array of 0/1 paternal haplotypes.pi0 (
tuple) – (pi0_sib0, pi0_sib1) sparsity parameters for BAF emission.std_dev (
tuple) – (std_sib0, std_sib1) noise parameters for BAF emission.
- Returns:
mat_err_score (
np.ndarray) – m-length array; positive values flag likely maternal genotype errors.pat_err_score (
np.ndarray) – m-length array; positive values flag likely paternal genotype errors.
- isolate_recomb(path_xy, path_xzs, window=20)[source]¶
Isolate key recombination events from a pair of refined paths.
- Parameters:
- Returns:
mat_recomb_lst (
list) – List of maternal recombination SNP indexes.pat_recomb_lst (
list) – List of paternal recombination SNP indexes.mat_recomb (
dict) – Dictionary of SNP-index to sibling-count for maternal recombinations.pat_recomb (
dict) – Dictionary of SNP-index to sibling-count for paternal recombinations.
- est_sharing(bafs, pos, mat_haps, pat_haps, pi0=(0.7, 0.7), std_dev=(0.15, 0.15), r=1e-08)[source]¶
Estimate the proportion of the chromosome that is shared identically across siblings.
- Parameters:
bafs (
list) – List of two np.ndarray arrays of B-allele frequencies across m sites for two siblings.pos (
np.ndarray) – m-length vector of basepair positions for sites.mat_haps (
np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.pat_haps (
np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.pi0 (
tupleoffloat) – Sparsity parameters for B-allele emission model, one per sibling.std_dev (
tupleoffloat) – Standard deviation parameters for B-allele emission model, one per sibling.r (
float) – Inter-state transition rate.
- Returns:
mat_haplo_len (
float) – Total base-pair length of maternally haploidentical regions.pat_haplo_len (
float) – Total base-pair length of paternally haploidentical regions.both_haplo_len (
float) – Total base-pair length of fully identical (both-haplo) regions.tot_len (
float) – Total base-pair length of the typed region on the chromosome.
- class karyohmm.karyohmm.PocHMM(disomy=False)[source]¶
Bases:
MetaHMMAneuploidy HMM for products-of-conception with a single observed parent.
Extends
MetaHMMto handle mother–child or father–child duos in which only one parent’s haplotypes are available. The unobserved parent’s genotype is marginalised out at each site using population allele frequencies supplied asfreqs. WhenfreqsisNonethe model treats every unobserved-parent site as a 50 / 50 heterozygote.The forward, backward, and Viterbi algorithms are re-implemented here via a dedicated Cython kernel (
forward_algo_duo/backward_algo_duo) that handles the marginalisation efficiently.- Parameters:
disomy (
bool, optional) – Restrict the state space to disomy (ignored in current implementation; the full MetaHMM state space is always used). DefaultFalse.
- est_sigma_pi0(bafs, lrrs, sigmas, pos, haps, freqs=None, maternal=True, algo='L-BFGS-B', pi0_bounds=(0.01, 0.99), sigma_bounds=(0.01, 0.5), **kwargs)[source]¶
Estimate sigma and pi0 under the B-Allele Frequency model using optimization of forward algorithm likelihood.
- Parameters:
bafs (
np.ndarray) – B-allele frequencies across all m sites.lrrs (
np.ndarray) – m-length array of log-R ratio values.sigmas (
np.ndarray) – m-length array of per-site LRR noise standard deviations.pos (
np.ndarray) – Basepair positions of the SNPs.haps (
np.ndarray) – A 2 x m array of 0/1 parental haplotypes.freqs (
np.ndarrayorNone) – An m-length array of allele frequencies as priors.maternal (
bool) – If True, haps are maternal haplotypes; otherwise paternal.algo (
str) – One of Nelder-Mead, L-BFGS-B, or Powell algorithms for optimization.pi0_bounds (
tuple) – Bounds for acceptable values of pi0 parameter.sigma_bounds (
tuple) – Bounds for acceptable values for sigma.
- Returns:
- forward_algorithm(bafs, lrrs, sigmas, pos, haps, freqs=None, maternal=True, pi0=0.2, std_dev=0.25, r=1e-08, a=0.01, unphased=False, obs_err=0.0)[source]¶
Forward algorithm for duos.
- Parameters:
bafs (
np.ndarray) – B-allele frequencies across all m sites.lrrs (
np.ndarray) – m-length array of log-R ratio values.sigmas (
np.ndarray) – m-length array of per-site LRR noise standard deviations.pos (
np.ndarray) – m-length vector of basepair positions for sites.haps (
np.ndarray) – A 2 x m array of 0/1 parental haplotypes.freqs (
np.ndarrayorNone) – An m-length array of population allele frequencies.maternal (
bool) – If True, haps are maternal haplotypes; otherwise paternal.pi0 (
float) – Sparsity parameter for B-allele emission model.std_dev (
float) – Standard deviation for B-allele emission model.r (
float) – Intra-karyotype transition rate (recombination).a (
float) – Inter-karyotype transition rate.unphased (
bool) – Run the model in unphased mode.obs_err (
float) – Per-site genotyping error rate for the observed parent.
- Returns:
- backward_algorithm(bafs, lrrs, sigmas, pos, haps, freqs=None, maternal=True, pi0=0.2, std_dev=0.25, r=1e-08, a=0.01, unphased=False, obs_err=0.0)[source]¶
Backward algorithm for duos.
- Parameters:
bafs (
np.ndarray) – B-allele frequencies across all m sites.lrrs (
np.ndarray) – m-length array of log-R ratio values.sigmas (
np.ndarray) – m-length array of per-site LRR noise standard deviations.pos (
np.ndarray) – m-length vector of basepair positions for sites.haps (
np.ndarray) – A 2 x m array of 0/1 parental haplotypes.freqs (
np.ndarrayorNone) – An m-length array of population allele frequencies.maternal (
bool) – If True, haps are maternal haplotypes; otherwise paternal.pi0 (
float) – Sparsity parameter for B-allele emission model.std_dev (
float) – Standard deviation for B-allele emission model.r (
float) – Intra-karyotype transition rate (recombination).a (
float) – Inter-karyotype transition rate.unphased (
bool) – Run the model in unphased mode.obs_err (
float) – Per-site genotyping error rate for the observed parent.
- Returns:
- forward_backward(bafs, lrrs, sigmas, pos, haps, freqs=None, maternal=True, pi0=0.2, std_dev=0.25, r=1e-08, a=0.01, unphased=False, obs_err=0.0)[source]¶
Forward-backward algorithm for duos.
- Parameters:
bafs (
np.ndarray) – B-allele frequencies across all m sites.lrrs (
np.ndarray) – m-length array of log-R ratio values.sigmas (
np.ndarray) – m-length array of per-site LRR noise standard deviations.pos (
np.ndarray) – m-length vector of basepair positions for sites.haps (
np.ndarray) – A 2 x m array of 0/1 parental haplotypes.freqs (
np.ndarrayorNone) – An m-length array of population allele frequencies.maternal (
bool) – If True, haps are maternal haplotypes; otherwise paternal.pi0 (
float) – Sparsity parameter for B-allele emission model.std_dev (
float) – Standard deviation for B-allele emission model.r (
float) – Intra-karyotype transition rate (recombination).a (
float) – Inter-karyotype transition rate.unphased (
bool) – Run the model in unphased mode.obs_err (
float) – Per-site genotyping error rate for the observed parent.
- Returns:
gammas (
np.ndarray) – Log posterior density of being in each of k hidden states.states (
list) – Tuple representation of states.karyotypes (
np.ndarray) – Array of karyotypes in the MetaHMM model.
- genotype_parent(bafs, lrrs, sigmas, haps, gammas, freqs=None, maternal=True, pi0=0.2, std_dev=0.25)[source]¶
Compute posterior genotype dosages for the unobserved parent.
For each site integrates the BAF and LRR likelihood over all four possible unobserved-parent genotypes (AA, AB, BA, BB), weighted by Hardy-Weinberg priors from
freqsand the HMM posteriorgammas.- Parameters:
bafs (
np.ndarray) – m-length array of B-allele frequencies.lrrs (
np.ndarray) – m-length array of log-R ratios.sigmas (
np.ndarray) – m-length array of per-site LRR noise estimates.haps (
np.ndarray) – 2 x m array of 0/1 haplotypes for the observed parent.gammas (
np.ndarray) – k x m log-posterior array fromforward_backward().freqs (
np.ndarray, optional) – m-length array of population allele frequencies; defaults to 0.5 at every site.maternal (
bool) – If True (default), haps are the maternal haplotypes and the paternal genotype is imputed; set False for the reverse.pi0 (
float) – Sparsity parameter for the BAF emission model.std_dev (
float) – Noise parameter for the BAF emission model.
- Returns:
geno_dosage_rev – 3 x m array of posterior genotype probabilities (log scale) for the unobserved parent; rows correspond to homozygous-ref (0), heterozygous (1), and homozygous-alt (2).
- Return type:
np.ndarray
- infer_missing_af(bafs, geno, hap_matrix, pos, eps=0.2)[source]¶
Estimate allele frequencies for the unobserved parent using a haplotype reference panel.
Identifies sites of opposite homozygosity — where the observed parent is homozygous and BAF indicates the unobserved parent carries the opposite allele — then tags the corresponding haplotypes in the reference panel. Frequencies at non-anchor sites are estimated via distance-weighted interpolation between adjacent anchor sites (Eq. 9 in the PocHMM methods document). Falls back to the reference-panel mean at sites with no nearby anchors.
- Parameters:
bafs (
np.ndarray) – m-length B-allele frequency array.geno (
np.ndarray) – m-length observed-parent genotype dosage array (0, 1, or 2).hap_matrix (
np.ndarray) – (2N, m) haplotype reference panel matrix (0/1 encoded).pos (
np.ndarray) – m-length array of genomic positions in basepairs.eps (
float) – BAF threshold for calling opposite homozygosity (default 0.2).
- Returns:
freqs – m-length array of estimated allele frequencies for the unobserved parent.
- Return type:
np.ndarray
Mosaic and Contamination Estimators¶
- class karyohmm.karyohmm.MosaicEst(mat_haps, pat_haps, bafs, pos, lrrs=None, sigmas=None, switch_err=0.01, t_rate=0.0001, **phase_kwargs)[source]¶
Bases:
objectEstimator of mosaic copy-number cell fraction using a joint BAF + LRR HMM.
A 5-state HMM over all m genomic sites estimates the mosaic cell fraction
cfand identifies the parental origin of the event. The five states and their emission means are:State 0 (neutral): LRR = 0, phased BAF = 0
State 1 (maternal gain): LRR = log₂(1 + cf/2), phased BAF = +cf/6
State 2 (paternal gain): LRR = log₂(1 + cf/2), phased BAF = −cf/6
State 3 (maternal loss): LRR = log₂(1 − cf/2), phased BAF = −cf/2
State 4 (paternal loss): LRR = log₂(1 − cf/2), phased BAF = +cf/2
The BAF means follow from the discrete mosaic mixture and the phase convention (sign = +1 when the maternal haplotype carries the alt allele):
Maternal gain: extra maternal copy at a mat=alt site gives BAF 2/3, so phased BAF = +cf/6; at mat=ref sites the sign corrects to the same value.
Paternal gain: extra paternal copy reverses the BAF shift — phased BAF = −cf/6 coherently across all het sites.
Maternal loss: losing the maternal copy leaves only paternal alleles, giving phased BAF = −cf/2 (3× larger than gain because BAF moves to 0 or 1).
Paternal loss: symmetrically, phased BAF = +cf/2.
LRR is evaluated at every site; phased BAF is evaluated only at expected-heterozygous sites (one parent hom-ref, the other hom-alt).
All preprocessing (het-site identification, phase assignment, transition matrix construction) happens in
__init__, soest_mle_cfcan be called immediately after construction.- Parameters:
mat_haps (
np.ndarray) – 2 × m array of 0/1 maternal haplotypes.pat_haps (
np.ndarray) – 2 × m array of 0/1 paternal haplotypes.bafs (
np.ndarray) – m-length array of B-allele frequencies.pos (
np.ndarray) – m-length array of base-pair positions.lrrs (
np.ndarray, optional) – m-length array of log-R ratio values. When provided, LRR emission is included at all m sites (strongly recommended).sigmas (
np.ndarray, optional) – m-length array of per-site LRR noise standard deviations. Required whenlrrsis provided.switch_err (
float) – Within-type origin-switch probability (maternal↔paternal for the same gain or loss direction; default 0.01).t_rate (
float) – Neutral↔aneuploid transition probability (default 1e-4).
- STATE_NAMES = ('neutral', 'maternal-gain', 'paternal-gain', 'maternal-loss', 'paternal-loss')¶
Ordered state labels returned by
infer_origin().
- __init__(mat_haps, pat_haps, bafs, pos, lrrs=None, sigmas=None, switch_err=0.01, t_rate=0.0001, **phase_kwargs)[source]¶
Validate inputs, identify and phase het sites, build transition matrix.
- create_transition_matrix(switch_err=0.01, t_rate=0.0001)[source]¶
Build the 5-state log-transition matrix.
States: 0=neutral, 1=mat-gain, 2=pat-gain, 3=mat-loss, 4=pat-loss.
Transitions allowed: - neutral → each aneuploid state with probability
t_rate / 4- any aneuploid → neutral with probabilityt_rate- within-gain origin switch (1↔2) with probabilityswitch_err- within-loss origin switch (3↔4) with probabilityswitch_err- cross-type transitions (gain↔loss) are not modelled (log-prob = −∞)
- forward_algo_full(cf=0.0, std_dev_baf=0.1)[source]¶
Forward algorithm over all m sites with joint BAF + LRR emission.
When
lrrswas provided at construction, LRR emission is added at every site. Phased BAF emission is always added at expected-het sites.State-dependent emission means (see class docstring for derivation):
State 0 (neutral): LRR = 0, phased BAF = 0
State 1 (maternal gain): LRR = log₂(1 + cf/2), phased BAF = +cf/6
State 2 (paternal gain): LRR = log₂(1 + cf/2), phased BAF = −cf/6
State 3 (maternal loss): LRR = log₂(1 − cf/2), phased BAF = −cf/2
State 4 (paternal loss): LRR = log₂(1 − cf/2), phased BAF = +cf/2
- est_mle_cf(std_dev_baf=0.1)[source]¶
Estimate the MLE mosaic cell fraction.
Maximises
forward_algo_full()log-likelihood over cf in [0, 1) using Powell’s method. Stores the result inself.mle_cf(nanon optimisation failure).- Parameters:
std_dev_baf (
float) – BAF noise standard deviation.
- ci_mle_cf(std_dev_baf=0.1, h=0.0001)[source]¶
95% confidence interval for
mle_cfvia observed Fisher information.Requires
est_mle_cf()to have been called first.
- lrt_cf(std_dev_baf=0.1)[source]¶
Likelihood-ratio test statistic for cf > 0 vs cf = 0.
Computes -2 x (l(cf=0) - l(cf_MLE)). Under the null hypothesis of no mosaicism this is approximately chi-squared with 1 degree of freedom. Calls
est_mle_cf()if it has not been run yet.
- infer_origin(std_dev_baf=0.1)[source]¶
Identify the most likely parental origin of the mosaic event.
Runs the forward algorithm at
mle_cfand computes the chromosome-wide log-evidence for each of the four aneuploid states (marginalised over sites via logsumexp of the forward variable). The aneuploid state with the highest evidence is returned as the inferred origin.Requires
est_mle_cf()to have been called first. Returns'neutral'whenmle_cfis below 0.01.
- class karyohmm.karyohmm.MccEst[source]¶
Bases:
objectMaximum-likelihood estimator of maternal-cell contamination (MCC) from BAF data.
Maternal-cell contamination occurs when maternal DNA co-purifies with the embryo sample, biasing B-allele frequencies towards the maternal genotype. This class provides log-likelihood functions and MLE routines for estimating the contamination fraction
c(the proportion of signal attributable to the mother) under two data configurations:Trio (
_triosuffix): both maternal and paternal haplotypes are known, so the paternal genotype at each site is observed directly.POC / duo (
_pocsuffix): only maternal haplotypes are available; the paternal genotype is marginalised using Hardy–Weinberg allele frequencies.
Each configuration is further available in an unphased form (sites modelled independently) and a phase-aware form (a 2-state HMM tracks which maternal haplotype was transmitted to the embryo).
Genome-wide inference across multiple chromosomes is supported through
loglik_mcc_genome_*andest_mcc_genome_*methods; per-chromosome estimates — useful for identifying aneuploid chromosomes before a genome-wide fit — are provided byest_mcc_per_chrom_*methods.All estimation methods jointly optimise the contamination fraction
cand the BAF noise standard deviationstd_dev.- loglik_mcc_poc(bafs, mat_haps, freqs, c=0.0, std_dev=0.1)[source]¶
Log-likelihood of MCC for a single chromosome in a mother-child duo (POC).
At each site the paternal genotype is unknown and is marginalised over Hardy-Weinberg frequencies derived from
freqs. The contamination fractioncshifts the expected BAF mean towards the maternal genotype.- Parameters:
bafs (
np.ndarray) – m-length array of B-allele frequencies in [0, 1].mat_haps (
np.ndarray) – 2 x m array of 0/1 maternal haplotypes.freqs (
np.ndarray) – m-length array of population allele frequencies in [0, 1].c (
float) – Maternal contamination fraction in [0, 0.5].std_dev (
float) – BAF noise standard deviation (> 0).
- Returns:
logll – Total log-likelihood summed across all m sites.
- Return type:
- loglik_mcc_trio(bafs, mat_haps, pat_haps, c=0.0, std_dev=0.1)[source]¶
Log-likelihood of MCC for a single chromosome in a full trio.
With both parents phased the paternal genotype at every site is known, giving a simpler per-site likelihood with no HWE marginalisation.
- Parameters:
bafs (
np.ndarray) – m-length array of B-allele frequencies in [0, 1].mat_haps (
np.ndarray) – 2 x m array of 0/1 maternal haplotypes.pat_haps (
np.ndarray) – 2 x m array of 0/1 paternal haplotypes.c (
float) – Maternal contamination fraction in [0, 0.5].std_dev (
float) – BAF noise standard deviation (> 0).
- Returns:
logll – Total log-likelihood summed across all m sites.
- Return type:
- est_mcc_poc(bafs, mat_haps, freqs, algo='Nelder-Mead', **kwargs)[source]¶
Estimate MCC by MLE for a single chromosome in a mother-child duo.
Jointly optimises contamination fraction
cand noisestd_devby maximisingloglik_mcc_poc().- Parameters:
bafs (
np.ndarray) – m-length array of B-allele frequencies.mat_haps (
np.ndarray) – 2 x m array of 0/1 maternal haplotypes.freqs (
np.ndarray) – m-length array of population allele frequencies.algo (
str) – Scipy minimisation algorithm; default"Nelder-Mead".**kwargs – Additional keyword arguments forwarded to
scipy.optimize.minimize.
- Returns:
- est_mcc_trio(bafs, mat_haps, pat_haps, algo='Nelder-Mead', **kwargs)[source]¶
Estimate MCC by MLE for a single chromosome in a full trio.
Jointly optimises contamination fraction
cand noisestd_devby maximisingloglik_mcc_trio().- Parameters:
bafs (
np.ndarray) – m-length array of B-allele frequencies.mat_haps (
np.ndarray) – 2 x m array of 0/1 maternal haplotypes.pat_haps (
np.ndarray) – 2 x m array of 0/1 paternal haplotypes.algo (
str) – Scipy minimisation algorithm; default"Nelder-Mead".**kwargs – Additional keyword arguments forwarded to
scipy.optimize.minimize.
- Returns:
- mcc_ci_poc(bafs, mat_haps, freqs, c_hat=0.0, std_dev=0.1, alpha=0.95, df=1)[source]¶
Profile-likelihood confidence interval for MCC in the POC model.
Uses Wilks’ theorem: the CI boundary is the set of
cvalues where twice the log-likelihood drop from the MLE equals thealphaquantile of a chi-squared distribution withdfdegrees of freedom.- Parameters:
bafs (
np.ndarray) – m-length array of B-allele frequencies.mat_haps (
np.ndarray) – 2 x m array of 0/1 maternal haplotypes.freqs (
np.ndarray) – m-length array of population allele frequencies.c_hat (
float) – MLE contamination fraction (the point estimate).std_dev (
float) – MLE BAF noise standard deviation (held fixed during profile).alpha (
float) – Confidence level in (0, 1); default 0.95.df (
int) – Degrees of freedom for chi-squared quantile; default 1.
- Returns:
- mcc_ci_trio(bafs, mat_haps, pat_haps, c_hat=0.0, std_dev=0.1, h=1e-05, alpha=0.95, df=1)[source]¶
Profile-likelihood confidence interval for MCC in the trio model.
Uses Wilks’ theorem: the CI boundary is the set of
cvalues where twice the log-likelihood drop from the MLE equals thealphaquantile of a chi-squared distribution withdfdegrees of freedom.- Parameters:
bafs (
np.ndarray) – m-length array of B-allele frequencies.mat_haps (
np.ndarray) – 2 x m array of 0/1 maternal haplotypes.pat_haps (
np.ndarray) – 2 x m array of 0/1 paternal haplotypes.c_hat (
float) – MLE contamination fraction (the point estimate).std_dev (
float) – MLE BAF noise standard deviation (held fixed during profile).h (
float) – Finite-difference step (unused; retained for API compatibility).alpha (
float) – Confidence level in (0, 1); default 0.95.df (
int) – Degrees of freedom for chi-squared quantile; default 1.
- Returns:
- loglik_mcc_phased_trio(bafs, mat_haps, pat_haps, pos, c=0.0, std_dev=0.1, r=1e-08)[source]¶
Phase-aware log-likelihood of MCC for a single chromosome in a trio.
A 2-state HMM tracks which maternal haplotype (0 or 1) was transmitted to the embryo. At each site the emission is a Gaussian centred on the BAF expected for the copied maternal allele plus the contamination contribution. Transitions follow an exponential recombination kernel:
rho = 1 - exp(-r * d)wheredis the inter-site distance in base pairs.- Parameters:
bafs (
np.ndarray) – m-length array of B-allele frequencies in [0, 1].mat_haps (
np.ndarray) – 2 x m array of 0/1 maternal haplotypes.pat_haps (
np.ndarray) – 2 x m array of 0/1 paternal haplotypes.pos (
np.ndarray) – m-length strictly increasing array of base-pair positions.c (
float) – Maternal contamination fraction in [0, 0.5].std_dev (
float) – BAF noise standard deviation (> 0).r (
float) – Per-base-pair recombination rate (> 0); default 1e-8.
- Returns:
loglik – Total forward-algorithm log-likelihood across m sites.
- Return type:
- loglik_mcc_phased_poc(bafs, mat_haps, freqs, pos, c=0.0, std_dev=0.1, r=1e-08)[source]¶
Phase-aware log-likelihood of MCC for a single chromosome in a duo (POC).
Identical to
loglik_mcc_phased_trio()but marginalises over the unobserved paternal genotype at each site using Hardy-Weinberg weights derived fromfreqs.- Parameters:
bafs (
np.ndarray) – m-length array of B-allele frequencies in [0, 1].mat_haps (
np.ndarray) – 2 x m array of 0/1 maternal haplotypes.freqs (
np.ndarray) – m-length array of population allele frequencies in [0, 1].pos (
np.ndarray) – m-length strictly increasing array of base-pair positions.c (
float) – Maternal contamination fraction in [0, 0.5].std_dev (
float) – BAF noise standard deviation (> 0).r (
float) – Per-base-pair recombination rate (> 0); default 1e-8.
- Returns:
loglik – Total forward-algorithm log-likelihood across m sites.
- Return type:
- est_mcc_phased_trio(bafs, mat_haps, pat_haps, pos, r=1e-08, algo='Nelder-Mead', **kwargs)[source]¶
Estimate MCC by MLE using the phase-aware HMM for a single trio chromosome.
Jointly optimises
candstd_devby maximisingloglik_mcc_phased_trio().- Parameters:
bafs (
np.ndarray) – m-length array of B-allele frequencies.mat_haps (
np.ndarray) – 2 x m array of 0/1 maternal haplotypes.pat_haps (
np.ndarray) – 2 x m array of 0/1 paternal haplotypes.pos (
np.ndarray) – m-length strictly increasing array of base-pair positions.r (
float) – Per-base-pair recombination rate; default 1e-8.algo (
str) – Scipy minimisation algorithm; default"Nelder-Mead".**kwargs – Additional keyword arguments forwarded to
scipy.optimize.minimize.
- Returns:
- est_mcc_phased_poc(bafs, mat_haps, freqs, pos, r=1e-08, algo='Nelder-Mead', **kwargs)[source]¶
Estimate MCC by MLE using the phase-aware HMM for a single duo (POC) chromosome.
Jointly optimises
candstd_devby maximisingloglik_mcc_phased_poc().- Parameters:
bafs (
np.ndarray) – m-length array of B-allele frequencies.mat_haps (
np.ndarray) – 2 x m array of 0/1 maternal haplotypes.freqs (
np.ndarray) – m-length array of population allele frequencies.pos (
np.ndarray) – m-length strictly increasing array of base-pair positions.r (
float) – Per-base-pair recombination rate; default 1e-8.algo (
str) – Scipy minimisation algorithm; default"Nelder-Mead".**kwargs – Additional keyword arguments forwarded to
scipy.optimize.minimize.
- Returns:
- mcc_ci_phased_trio(bafs, mat_haps, pat_haps, pos, c_hat=0.0, std_dev=0.1, r=1e-08, alpha=0.95, df=1)[source]¶
Profile-likelihood CI for MCC using the phase-aware trio HMM.
- Parameters:
bafs (
np.ndarray) – m-length array of B-allele frequencies.mat_haps (
np.ndarray) – 2 x m array of 0/1 maternal haplotypes.pat_haps (
np.ndarray) – 2 x m array of 0/1 paternal haplotypes.pos (
np.ndarray) – m-length strictly increasing array of base-pair positions.c_hat (
float) – MLE contamination fraction (the point estimate).std_dev (
float) – MLE BAF noise standard deviation (held fixed during profile).r (
float) – Per-base-pair recombination rate; default 1e-8.alpha (
float) – Confidence level in (0, 1); default 0.95.df (
int) – Degrees of freedom for chi-squared quantile; default 1.
- Returns:
- mcc_ci_phased_poc(bafs, mat_haps, freqs, pos, c_hat=0.0, std_dev=0.1, r=1e-08, alpha=0.95, df=1)[source]¶
Profile-likelihood CI for MCC using the phase-aware POC (duo) HMM.
- Parameters:
bafs (
np.ndarray) – m-length array of B-allele frequencies.mat_haps (
np.ndarray) – 2 x m array of 0/1 maternal haplotypes.freqs (
np.ndarray) – m-length array of population allele frequencies.pos (
np.ndarray) – m-length strictly increasing array of base-pair positions.c_hat (
float) – MLE contamination fraction (the point estimate).std_dev (
float) – MLE BAF noise standard deviation (held fixed during profile).r (
float) – Per-base-pair recombination rate; default 1e-8.alpha (
float) – Confidence level in (0, 1); default 0.95.df (
int) – Degrees of freedom for chi-squared quantile; default 1.
- Returns:
- loglik_mcc_genome_poc(baf_list, mat_haps_list, freqs_list, c=0.0, std_dev=0.1)[source]¶
Genome-wide log-likelihood of MCC summed across all chromosomes (POC model).
Chromosomes are conditionally independent given
candstd_dev, so the genome-wide log-likelihood is the sum of per-chromosome likelihoods fromloglik_mcc_poc(). Sex chromosomes should be excluded before calling.Implemented by concatenating all chromosome arrays into a single call to
loglik_mcc_poc(), which is equivalent to summing independent per-chromosome log-likelihoods but avoids repeated Python function-call overhead and assertion checks during optimisation.- Parameters:
baf_list (
listofnp.ndarray) – Per-chromosome BAF arrays.mat_haps_list (
listofnp.ndarray) – Per-chromosome 2 x m maternal haplotype arrays.freqs_list (
listofnp.ndarray) – Per-chromosome population allele frequency arrays.c (
float) – Maternal contamination fraction in [0, 0.5].std_dev (
float) – BAF noise standard deviation (> 0).
- Returns:
loglik – Total log-likelihood summed across all chromosomes.
- Return type:
- loglik_mcc_genome_trio(baf_list, mat_haps_list, pat_haps_list, c=0.0, std_dev=0.1)[source]¶
Genome-wide log-likelihood of MCC summed across all chromosomes (trio model).
Implemented by concatenating all chromosome arrays into a single call to
loglik_mcc_trio()(valid because sites are independent across chromosomes in the unphased model).- Parameters:
baf_list (
listofnp.ndarray) – Per-chromosome BAF arrays.mat_haps_list (
listofnp.ndarray) – Per-chromosome 2 x m maternal haplotype arrays.pat_haps_list (
listofnp.ndarray) – Per-chromosome 2 x m paternal haplotype arrays.c (
float) – Maternal contamination fraction in [0, 0.5].std_dev (
float) – BAF noise standard deviation (> 0).
- Returns:
loglik – Total log-likelihood summed across all chromosomes.
- Return type:
- loglik_mcc_genome_phased_poc(baf_list, mat_haps_list, freqs_list, pos_list, c=0.0, std_dev=0.1, r=1e-08)[source]¶
Genome-wide phase-aware log-likelihood of MCC summed across all chromosomes (POC model).
- Parameters:
baf_list (
listofnp.ndarray) – Per-chromosome BAF arrays.mat_haps_list (
listofnp.ndarray) – Per-chromosome 2 x m maternal haplotype arrays.freqs_list (
listofnp.ndarray) – Per-chromosome population allele frequency arrays.pos_list (
listofnp.ndarray) – Per-chromosome strictly increasing position arrays.c (
float) – Maternal contamination fraction in [0, 0.5].std_dev (
float) – BAF noise standard deviation (> 0).r (
float) – Per-base-pair recombination rate; default 1e-8.
- Returns:
loglik – Total forward-algorithm log-likelihood summed across all chromosomes.
- Return type:
- loglik_mcc_genome_phased_trio(baf_list, mat_haps_list, pat_haps_list, pos_list, c=0.0, std_dev=0.1, r=1e-08)[source]¶
Genome-wide phase-aware log-likelihood of MCC summed across all chromosomes (trio model).
- Parameters:
baf_list (
listofnp.ndarray) – Per-chromosome BAF arrays.mat_haps_list (
listofnp.ndarray) – Per-chromosome 2 x m maternal haplotype arrays.pat_haps_list (
listofnp.ndarray) – Per-chromosome 2 x m paternal haplotype arrays.pos_list (
listofnp.ndarray) – Per-chromosome strictly increasing position arrays.c (
float) – Maternal contamination fraction in [0, 0.5].std_dev (
float) – BAF noise standard deviation (> 0).r (
float) – Per-base-pair recombination rate; default 1e-8.
- Returns:
loglik – Total forward-algorithm log-likelihood summed across all chromosomes.
- Return type:
- est_mcc_genome_poc(baf_list, mat_haps_list, freqs_list, algo='Nelder-Mead', **kwargs)[source]¶
Estimate MCC by MLE using all chromosomes jointly (POC model).
Maximises
loglik_mcc_genome_poc()to obtain a single estimate ofcandstd_devshared across all supplied chromosomes. Aneuploid chromosomes should be excluded frombaf_listbefore calling; useest_mcc_per_chrom_poc()to identify outliers first.- Parameters:
baf_list (
listofnp.ndarray) – Per-chromosome BAF arrays.mat_haps_list (
listofnp.ndarray) – Per-chromosome 2 x m maternal haplotype arrays.freqs_list (
listofnp.ndarray) – Per-chromosome population allele frequency arrays.algo (
str) – Scipy minimisation algorithm; default"Nelder-Mead".**kwargs – Additional keyword arguments forwarded to
scipy.optimize.minimize.
- Returns:
- est_mcc_genome_trio(baf_list, mat_haps_list, pat_haps_list, algo='Nelder-Mead', **kwargs)[source]¶
Estimate MCC by MLE using all chromosomes jointly (trio model).
Maximises
loglik_mcc_genome_trio()to obtain a single estimate ofcandstd_devshared across all supplied chromosomes.- Parameters:
baf_list (
listofnp.ndarray) – Per-chromosome BAF arrays.mat_haps_list (
listofnp.ndarray) – Per-chromosome 2 x m maternal haplotype arrays.pat_haps_list (
listofnp.ndarray) – Per-chromosome 2 x m paternal haplotype arrays.algo (
str) – Scipy minimisation algorithm; default"Nelder-Mead".**kwargs – Additional keyword arguments forwarded to
scipy.optimize.minimize.
- Returns:
- est_mcc_genome_phased_poc(baf_list, mat_haps_list, freqs_list, pos_list, r=1e-08, algo='Nelder-Mead', **kwargs)[source]¶
Estimate MCC by MLE using all chromosomes jointly (phase-aware POC model).
Maximises
loglik_mcc_genome_phased_poc().- Parameters:
baf_list (
listofnp.ndarray) – Per-chromosome BAF arrays.mat_haps_list (
listofnp.ndarray) – Per-chromosome 2 x m maternal haplotype arrays.freqs_list (
listofnp.ndarray) – Per-chromosome population allele frequency arrays.pos_list (
listofnp.ndarray) – Per-chromosome strictly increasing position arrays.r (
float) – Per-base-pair recombination rate; default 1e-8.algo (
str) – Scipy minimisation algorithm; default"Nelder-Mead".**kwargs – Additional keyword arguments forwarded to
scipy.optimize.minimize.
- Returns:
- est_mcc_genome_phased_trio(baf_list, mat_haps_list, pat_haps_list, pos_list, r=1e-08, algo='Nelder-Mead', **kwargs)[source]¶
Estimate MCC by MLE using all chromosomes jointly (phase-aware trio model).
Maximises
loglik_mcc_genome_phased_trio().- Parameters:
baf_list (
listofnp.ndarray) – Per-chromosome BAF arrays.mat_haps_list (
listofnp.ndarray) – Per-chromosome 2 x m maternal haplotype arrays.pat_haps_list (
listofnp.ndarray) – Per-chromosome 2 x m paternal haplotype arrays.pos_list (
listofnp.ndarray) – Per-chromosome strictly increasing position arrays.r (
float) – Per-base-pair recombination rate; default 1e-8.algo (
str) – Scipy minimisation algorithm; default"Nelder-Mead".**kwargs – Additional keyword arguments forwarded to
scipy.optimize.minimize.
- Returns:
- est_mcc_per_chrom_poc(baf_list, mat_haps_list, freqs_list, algo='Nelder-Mead', **kwargs)[source]¶
Estimate MCC independently per chromosome (POC model).
Runs
est_mcc_poc()separately on each chromosome and returns a per-chromosome estimate. Chromosomes whosecestimate is an outlier relative to the genome-wide median are candidates for aneuploidy and should be excluded before a joint genome-wide fit withest_mcc_genome_poc().- Parameters:
baf_list (
listofnp.ndarray) – Per-chromosome BAF arrays.mat_haps_list (
listofnp.ndarray) – Per-chromosome 2 x m maternal haplotype arrays.freqs_list (
listofnp.ndarray) – Per-chromosome population allele frequency arrays.algo (
str) – Scipy minimisation algorithm; default"Nelder-Mead".**kwargs – Additional keyword arguments forwarded to
scipy.optimize.minimize.
- Returns:
results – List of
(c_hat, std_dev_hat)tuples, one per chromosome.- Return type:
- est_mcc_per_chrom_trio(baf_list, mat_haps_list, pat_haps_list, algo='Nelder-Mead', **kwargs)[source]¶
Estimate MCC independently per chromosome (trio model).
Runs
est_mcc_trio()separately on each chromosome.- Parameters:
baf_list (
listofnp.ndarray) – Per-chromosome BAF arrays.mat_haps_list (
listofnp.ndarray) – Per-chromosome 2 x m maternal haplotype arrays.pat_haps_list (
listofnp.ndarray) – Per-chromosome 2 x m paternal haplotype arrays.algo (
str) – Scipy minimisation algorithm; default"Nelder-Mead".**kwargs – Additional keyword arguments forwarded to
scipy.optimize.minimize.
- Returns:
results – List of
(c_hat, std_dev_hat)tuples, one per chromosome.- Return type:
- est_mcc_per_chrom_phased_poc(baf_list, mat_haps_list, freqs_list, pos_list, r=1e-08, algo='Nelder-Mead', **kwargs)[source]¶
Estimate MCC independently per chromosome (phase-aware POC model).
Runs
est_mcc_phased_poc()separately on each chromosome.- Parameters:
baf_list (
listofnp.ndarray) – Per-chromosome BAF arrays.mat_haps_list (
listofnp.ndarray) – Per-chromosome 2 x m maternal haplotype arrays.freqs_list (
listofnp.ndarray) – Per-chromosome population allele frequency arrays.pos_list (
listofnp.ndarray) – Per-chromosome strictly increasing position arrays.r (
float) – Per-base-pair recombination rate; default 1e-8.algo (
str) – Scipy minimisation algorithm; default"Nelder-Mead".**kwargs – Additional keyword arguments forwarded to
scipy.optimize.minimize.
- Returns:
results – List of
(c_hat, std_dev_hat)tuples, one per chromosome.- Return type:
- est_mcc_per_chrom_phased_trio(baf_list, mat_haps_list, pat_haps_list, pos_list, r=1e-08, algo='Nelder-Mead', **kwargs)[source]¶
Estimate MCC independently per chromosome (phase-aware trio model).
Runs
est_mcc_phased_trio()separately on each chromosome.- Parameters:
baf_list (
listofnp.ndarray) – Per-chromosome BAF arrays.mat_haps_list (
listofnp.ndarray) – Per-chromosome 2 x m maternal haplotype arrays.pat_haps_list (
listofnp.ndarray) – Per-chromosome 2 x m paternal haplotype arrays.pos_list (
listofnp.ndarray) – Per-chromosome strictly increasing position arrays.r (
float) – Per-base-pair recombination rate; default 1e-8.algo (
str) – Scipy minimisation algorithm; default"Nelder-Mead".**kwargs – Additional keyword arguments forwarded to
scipy.optimize.minimize.
- Returns:
results – List of
(c_hat, std_dev_hat)tuples, one per chromosome.- Return type:
Recombination and Phase Correction¶
- class karyohmm.karyohmm.RecombEst(**kwargs)[source]¶
Bases:
PhaseCorrectCrossover detection via allele-transmission comparison across sibling embryos.
Implements the simplified crossover-detection approach of Coop et al. (2007), extended with
QuadHMM-based interval refinement. For each pair of (template, non-template) sibling embryos, the log-likelihood ratio of copying the same vs different parental allele is computed at informative markers (one parent heterozygous, the other homozygous). Crossover locations are identified as positions where the sign of this ratio changes consistently across sibling pairs.Inherits the embryo storage, noise-parameter estimation, and phase-correction infrastructure from
PhaseCorrect.- __init__(**kwargs)[source]¶
Initialize RecombEst inheriting from PhaseCorrect.
- Parameters:
**kwargs – Forwarded to
PhaseCorrect.__init__()(mat_haps,pat_haps,pos).
- informative_markers(maternal=True)[source]¶
Return a boolean mask of allele-transmission-informative sites.
A site is informative for the maternal haplotype when the mother is heterozygous (genotype 1) and the father is homozygous (genotype 0 or 2), so the transmitted maternal allele is unambiguously determined by the embryo BAF. The paternal case is the mirror image.
- Parameters:
maternal (
bool) – If True (default) return sites informative for maternal crossovers; if False return sites informative for paternal crossovers.- Returns:
info_idx – m-length boolean mask; True at informative sites.
- Return type:
np.ndarray
- refine_recomb_events(potential_switches, npad=5)[source]¶
Refine recombination estimation using the switch-clusters approach of Coop et al 2007.
- isolate_recomb_events(template_embryo=0, maternal=True, ll_thresh=0, npad=5)[source]¶
Isolate specific recombination events.
- Parameters:
template_embryo (
int) – Index of the template embryo.maternal (
bool) – If True, estimate maternal crossovers; if False, estimate paternal crossovers.ll_thresh (
float) – Log-likelihood ratio threshold for transmission assignment.npad (
int) – Half-width (in informative SNPs) of the switch cluster window.
- Returns:
Z (
np.ndarray) – n_sib x n_info array of switch indicators for allele copying in non-template embryos.llr_z (
np.ndarray) – n_sib x n_info raw LLR values for allele-copying in non-template embryos.potential_switches_filt (
np.ndarray) – Majority-vote filtered switch indices at informative markers.switch_cnts_filt (
np.ndarray) – Support counts for each filtered switch.
- identify_switch_intervals(Z, npad=5)[source]¶
Identify sign-change intervals in transmission indicator matrix Z.
Missing or uncertain sites (Z = 0) are imputed by forward-propagating the last non-zero sign before applying
refine_recomb_events()to each row.- Parameters:
Z (
np.ndarray) – n_sib x n_info matrix of transmission indicators (+1 same allele, -1 different allele, 0 uncertain).npad (
int) – Half-width of switch cluster used byrefine_recomb_events().
- Returns:
isolated_switches – Per-sibling lists of refined switch indices.
- Return type:
- second_refine_recomb(template_embryo=0, maternal=True, start=None, end=None)[source]¶
Refine a crossover interval using QuadHMM on the bracketing SNP window.
Uses
QuadHMMpairwise Viterbi paths within the interval [start, end] to narrow the crossover location to the tightest flanking heterozygous-site pair that is consistent across all sibling pairs.- Parameters:
- Returns:
- finalize_recomb_events(potential_switches, template_embryo=0, maternal=True)[source]¶
Localise potential crossovers to the tightest flanking informative-marker interval.
For each entry in
potential_switchescallssecond_refine_recomb()on the flanking informative-marker pair to obtain a refined base-pair interval.- Parameters:
potential_switches (
list) – List of informative-marker indices at which a crossover is suspected (output ofisolate_recomb_events()).template_embryo (
int) – Index of the reference embryo; default 0.maternal (
bool) – If True refine maternal crossovers; default True.
- Returns:
rec_locations – List of
(p1, p2)base-pair intervals bounding each crossover; empty list ifpotential_switchesis empty.- Return type:
- estimate_crossovers(template_embryo=0, ll_thresh=0, maternal=True, npad=5)[source]¶
Full crossover detection pipeline: isolation, majority-rule filtering, and refinement.
Calls
isolate_recomb_events()to identify candidate switch positions, applies majority-rule filtering, then passes surviving candidates tofinalize_recomb_events()for QuadHMM-based interval refinement.Requires
add_baf(),est_sigma_pi0s()to have been called first.- Parameters:
template_embryo (
int) – Index of the reference embryo; default 0.ll_thresh (
float) – Log-likelihood ratio threshold separating “same” from “different” allele transmission; default 0 (LLR sign flip).maternal (
bool) – If True detect maternal crossovers; default True.npad (
int) – Half-width of the switch-cluster filter; default 5.
- Returns:
Notes
To obtain the raw transmission-indicator matrix and LLR values call
isolate_recomb_events()directly.
- class karyohmm.karyohmm.PhaseCorrect(mat_haps, pat_haps, pos)[source]¶
Bases:
objectModule for implementing Mendelian phase correction using BAF data.
Implements the key methods for phasing parental haplotypes using Mendelian phasing via a LOD-score
- __init__(mat_haps, pat_haps, pos)[source]¶
Initialize the class for phase correction.
- Parameters:
mat_haps (
np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.pat_haps (
np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.pos (
np.ndarray) – An m-length array of base-pair positions.
- add_true_haps(true_mat_haps, true_pat_haps)[source]¶
Add in true haplotypes if available from a simulation.
- Parameters:
true_mat_haps (
np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.true_pat_haps (
np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.
- add_baf(embryo_bafs=[])[source]¶
Add in BAF estimates for each embryo.
- Parameters:
embryo_bafs (
list) – A list of per-embryo BAF arrays from the same parental individuals.
- est_sigma_pi0s(**kwargs)[source]¶
Estimate per-embryo BAF noise parameters under a disomy assumption.
Runs
MetaHMM.est_sigma_pi0()(disomy model) on each embryo BAF array and stores the results inself.embryo_pi0sandself.embryo_sigmas. Must be called beforeviterbi_phase_correct()orflag_parental_genotype_errors().- Parameters:
**kwargs – Forwarded to
MetaHMM.est_sigma_pi0()(e.g.pi0_bounds,sigma_bounds,algo).
- correct_haps_viterbi(haps, paths)[source]¶
Correct haplotype phase using majority-rule consensus across copying paths.
At each inter-site transition the number of siblings showing a haplotype switch is counted. If a majority of siblings switch, the haplotype orientations prior to that site are flipped for both strands.
- Parameters:
haps (
np.ndarray) – 2 x m array of 0/1 haplotypes to be corrected.paths (
np.ndarray) – n_sib x m boolean array where entry[j, i]is 1 if sibling j copies haplotype 0 at site i (output of Viterbi decoding).
- Returns:
n_mis (
np.ndarray) – (m-1)-length array of per-interval mismatch counts.fixed_haps (
np.ndarray) – 2 x m phase-corrected haplotype array.
- viterbi_phase_correct(niter=5, r=1e-08)[source]¶
Use the Viterbi decoding for phase correction in multiple iterations.
- Parameters:
- Returns:
mat_haps (
np.ndarray) – Phase-corrected 2 x m maternal haplotype array.pat_haps (
np.ndarray) – Phase-corrected 2 x m paternal haplotype array.n_mis_mat_tot (
np.ndarray) – niter-length array of total maternal mismatch counts per iteration.n_mis_pat_tot (
np.ndarray) – niter-length array of total paternal mismatch counts per iteration.
- flag_parental_genotype_errors(use_fixed=False, r=1e-08)[source]¶
Flag potential parental genotype errors using multiple euploid siblings.
Runs forward-backward under disomy for each sibling embryo and accumulates the posterior-weighted log Bayes factor error scores across all siblings. Genuine genotype errors are consistently flagged across siblings (same parental genotype), while per-embryo noise is uncorrelated and averages out.
- Parameters:
- Returns:
mat_err_scores (
np.ndarray) – m-length array of summed maternal error scores.pat_err_scores (
np.ndarray) – m-length array of summed paternal error scores.
- estimate_switch_err_true(maternal=True, fixed=False)[source]¶
Estimate the switch error from true and inferred haplotypes.
The switch error is defined as consecutive heterozygotes that are in the incorrect orientation.
- Parameters:
- Returns:
n_switches (
int) – Number of switches between consecutive heterozygotes.n_consecutive_hets (
int) – Number of consecutive heterozygotes.switch_err_rate (
float) – Number of switches per consecutive heterozygote.switch_idx (
list) – SNP pairs where the variant is out of phase with its predecessor.het_idx (
np.ndarray) – Locations/indices of the heterozygotes.lods (
None) – LOD-scores for the estimated switches (currently always None).
Base Class¶
- class karyohmm.karyohmm.AneuploidyHMM[source]¶
Bases:
objectAbstract base class shared by all karyohmm HMM models.
Defines the common interface (
forward_algorithm,backward_algorithm,forward_backward,viterbi_algorithm) and shared parameter-estimation helpers. Concrete subclasses (MetaHMM,QuadHMM,PocHMM) implement the state space and emission model appropriate for their data type.- aploid¶
String tag identifying the active model variant (e.g.
"meta","disomy","meta+upd").
- est_sigma_pi0(bafs, pos, mat_haps, pat_haps, algo='L-BFGS-B', pi0_bounds=(0.01, 0.99), sigma_bounds=(0.01, 1.0), opt_tol=0.0001, opt_options=None, **kwargs)[source]¶
Estimate sigma and pi0 under the B-Allele Frequency model using optimization of forward algorithm likelihood.
- Parameters:
bafs (
np.ndarray) – B-allele frequencies across all m sites.pos (
np.ndarray) – Basepair positions of the SNPs.mat_haps (
np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.pat_haps (
np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.algo (
str) – One of Nelder-Mead, L-BFGS-B, or Powell algorithms for optimization.opt_tol (
float) – Master tolerance passed to scipy minimize (default 1e-4).opt_options (
dictorNone) – Algorithm-specific options passed to scipy minimize; merged over per-algorithm defaults, so only keys you want to override are needed.
- Returns:
- full_param_inference(bafs, pos, mat_haps, pat_haps, algo='L-BFGS-B', pi0_bounds=(0.01, 0.99), sigma_bounds=(0.01, 1.0))[source]¶
Full parameter inference under the B-allele frequency model using naive optimization of the forward algorithm.
- Parameters:
bafs (
np.ndarray) – B-allele frequencies across all m sites.pos (
np.ndarray) – Basepair positions of the SNPs.mat_haps (
np.ndarray) – A 2 x m array of 0/1 maternal haplotypes.pat_haps (
np.ndarray) – A 2 x m array of 0/1 paternal haplotypes.algo (
str) – One of Nelder-Mead, L-BFGS-B, or Powell algorithms for optimization.
Notes
Not implemented yet.
Simulators¶
- class karyohmm.simulator.PGTSim[source]¶
Bases:
PGTSimBaseSimulation of meiotic-origin whole chromosome ploidy changes.
- full_ploidy_sim(afs=None, reads=False, ploidy=2, m=10000, upd=None, length=10000000.0, rec_rate=0.0001, mat_skew=0.5, std_dev=0.15, mix_prop=0.3, alpha=1.0, coverage=5.0, a=10.0, b=10.0, switch_err_rate=0.01, err_rate=0.001, seed=42)[source]¶
Simulate a single embryo with a given ploidy status.
- sibling_euploid_sim(afs=None, reads=False, ploidy=2, m=10000, length=10000000.0, nsibs=3, rec_rate=1e-08, std_dev=0.15, mix_prop=0.3, alpha=1.0, coverage=5.0, a=10.0, b=10.0, switch_err_rate=0.01, err_rate=0.001, seed=42)[source]¶
Simulate euploid sibling embryos.
- class karyohmm.simulator.PGTSimMosaic[source]¶
Bases:
PGTSimBaseSimulator for mosaic aneuploidies.
- mixed_ploidy_sim(afs=None, ploidies=array([0, 1, 2, 3]), props=array([0., 0., 1., 0.]), ncells=10, m=10000, length=10000000.0, rec_rate=1e-08, mat_skew=0.5, std_dev=0.15, mix_prop=0.3, alpha=1.0, switch_err_rate=0.01, err_rate=0.001, seed=42)[source]¶
Simulate BAF from a mixture of ploidies.
- sim_from_haps(mat_haps, pat_haps, pos, afs=None, ploidies=array([0, 1, 2, 3]), props=array([0., 0., 1., 0.]), ncells=10, rec_rate=1e-08, mat_skew=0.5, std_dev=0.15, mix_prop=0.3, alpha=1.0, switch_err_rate=0.01, err_rate=0.001, seed=42)[source]¶
Simulate mosaic aneuploidies from known parental haplotypes.
- class karyohmm.simulator.PGTSimSegmental[source]¶
Bases:
PGTSimBaseSimulator for segmental aneuploidies.
- seg_aneuploidy(m=100, mean_size=10, ploidy=3, mat_skew=0.5, seed=42)[source]¶
Choose the position and type of the segmental aneuploidy.
- sim_haplotype_paths_segmental(mat_haps, pat_haps, pos, start, end, aneu_type='3m', rec_rate=0.0001, seed=42)[source]¶
Simulate haplotypes and segmental aneuploidies.
This simulates a local segmental aneuploidy in an otherwise disomic background.
- sim_b_allele_freq_segmental(mat_hap, pat_hap, ploidies, std_dev=0.1, mix_prop=0.6, seed=42)[source]¶
Simulate B-allele frequency conditional on parental haplotypes.
- sim_read_counts_segmental(mat_hap, pat_hap, ploidies, coverage=1.0, a=10.0, b=10.0, eps=0.01, seed=42)[source]¶
Simulate read counts for segmental embryo biopsies.
- Parameters:
mat_hap (np.array) – 1 x M numpy array for maternal haplotype.
pat_hap (np.array) – 1 x M numpy array for paternal haplotype.
ploidies (np.array) – ploidy count of number of chromosomes being simulated.
coverage (float) – mean coverage for a standard diploid individual.
a (float) – shape parameter for beta-binomial.
b (float) – scale parameter for beta-binomial.
eps (float) – small noise parameter.
seed (int) – random number seed.
- Output:
true_geno (np.array): the true genotype of the embryo. alt_reads (np.array): the number of reads with alt. allele. ref_reads (np.array): the number of reads with ref. allele.
- full_segmental_sim(reads=False, afs=None, ploidy=2, m=10000, length=50000000.0, mat_skew=0.5, rec_rate=1e-08, std_dev=0.1, mix_prop=0.7, mean_size=10, coverage=5.0, a=10.0, b=10.0, switch_err_rate=0.01, err_rate=0.001, seed=42)[source]¶
Conduct a full simulation of segmental aneuploidies conditional on parental haplotypes.
- class karyohmm.simulator.PGTSimVCF[source]¶
Bases:
PGTSimImplements PGT-simulation from parental haplotypes.
- gen_parental_haplotypes(vcf_fp, maternal_id=None, paternal_id=None, **kwargs)[source]¶
Generate parental haplotypes from an actual VCF.
- Parameters:
vcf_fp (str) – path to an input VCF file.
maternal_id (str) – ID of maternal individual.
paternal_id (str) – ID of paternal individual.
- Output:
mat_haps (np.array): paternal haplotypes. pat_haps (np.array): maternal haplotypes. pos (np.array): basepair positions of variants. afs (np.array): alternative allele frequency. df (pd.DataFrame): pandas dataframe of cleaned options.
- class karyohmm.simulator.PGTSimBase[source]¶
Bases:
objectBase-class for simulation of PGT-A data.
- draw_parental_genotypes(afs=None, m=100, seed=42)[source]¶
Draw parental genotypes from a beta distribution.
- Parameters:
afs (np.array) – realized allele frequencies.
m (int) – number of variants to simulate.
seed (int) – random number seed.
- Output:
maternal_haps (np.array): maternal haplotypes paternal_haps (np.array): paternal haplotypes ps (np.array): allele frequency of variants
- create_genotyping_errors(haps, err_rate=0.001, seed=42)[source]¶
Create genotyping errors in haplotypes sampled.
- Parameters:
haps (np.array) – haplotype array from either maternal or paternal
err_rate (float) – site-level error rate in genotyping
seed (int) – random seed
- Output:
err_idx (np.array): m-length array indicating simulated errors err_haps (np.array): haplotypes which reflect newly simulated errors
- create_switch_errors_help(haps, err_rate=0.001, seed=42)[source]¶
Revised method to create switch errors.
- Parameters:
haps (np.array) – 2 x M numpy array of haplotypes for bi-allelic variants.
err_rate (float) – switch error rate between heterozygotes.
seed (int) – random number seed.
- Output:
haps_prime (np.array): switched haplotypes. switches (np.array): snp-indices of haplotype switches.
- create_switch_errors(mat_haps, pat_haps, err_rate=0.001, seed=42)[source]¶
Create switch errors to evaluate impact of poor phasing.
- Parameters:
mat_haps (np.array) – 2 x M numpy array for maternal haplotypes.
pat_haps (np.array) – 2 x M numpy array for paternal haplotypes.
err_rate (float) – switch error rate between heterozygotes.
seed (int) – random number seed.
- Output:
mat_haps_prime (np.array): switched maternal haplotypes. pat_haps_prime (np.array): switched paternal haplotypes. m_switches (np.array): snp-indices of maternal haplotype switches. p_switches (np.array): snp-indices of maternal haplotype switches.
- sim_haplotype_paths(mat_haps, pat_haps, pos, ploidy=2, rec_rate=1e-08, mat_skew=0.5, seed=42)[source]¶
Simulate copying paths through the maternal and paternal haplotypes.
- Parameters:
mat_haps (np.array) – 2 x M numpy array for maternal haplotypes.
pat_haps (np.array) – 2 x M numpy array for paternal haplotypes.
pos (np.array) – position of individual variants.
ploidy (int) – integer of number of chromosomes being simulated.
rec_rate (float) – uniform recombination rate per basepair.
mat_skew (float) – probability of maternal-origin aneuploidy.
seed (int) – random number seed.
- Output:
zs_maternal (np.array): copying path through maternal haplotypes. zs_paternal (np.array): copying path through paternal haplotypes. mat_real_hap (np.array): simulated embryo maternal haplotype(s). pat_real_hap (np.array): simulated embryo paternal haplotype(s). aploid (str): indicator of aneuploidy status.
- sim_haplotype_paths_upd(mat_haps, pat_haps, pos, ploidy=2, aploid='2p0', rec_rate=1e-08, mat_skew=0.5, seed=42)[source]¶
Extension to simulate haplotype-copying paths under uniparental disomy.
- sim_b_allele_freq(mat_hap, pat_hap, ploidy=2, std_dev=0.2, mix_prop=0.3, seed=42)[source]¶
Simulate of B-allele frequency.
- sim_logR_ratio(mat_hap, pat_hap, ploidy=2, alpha=0.5, seed=42)[source]¶
Simulate logR-ratio conditional on ploidy.
Alpha is the degree to which the variance is increased for the LRR.
- sim_read_counts(mat_hap, pat_hap, ploidy, coverage=1.0, a=10.0, b=10.0, eps=0.01, seed=42)[source]¶
Simulate read counts for embryo biopsies.
- Parameters:
mat_hap (np.array) – 1 x M numpy array for maternal haplotype.
pat_hap (np.array) – 1 x M numpy array for paternal haplotype.
pos (np.array) – position of individual variants.
ploidy (int) – integer of number of chromosomes being simulated.
coverage (float) – mean coverage for a standard diploid individual.
a (float) – shape parameter for beta-binomial.
b (float) – scale parameter for beta-binomial.
eps (float) – small noise parameter.
seed (int) – random number seed.
- Output:
true_geno (np.array): the true genotype of the embryo. alt_reads (np.array): the number of reads with alt. allele. ref_reads (np.array): the number of reads with ref. allele.
I/O Utilities¶
Classes for reading in data for karyoHMM.
Implements methods for both MetaHMM and PoCHMM-based datasets.
- class karyohmm.io.DataReader(mode='Meta', duo_maternal=None)[source]¶
Bases:
objectInput / output class for reading in data for karyohmm.
- read_data_np(input_fp)[source]¶
Read data from an .npy or npz file and reformat for karyohmm.
- Parameters:
input_fp (str) – path to input NPZ or NPY file.
- Output:
df (pl.DataFrame): polars dataframe of cleaned options.