pGermlinePoly API Reference¶
This page documents the public API of the pGermlinePoly package.
Inference¶
- class pGermlinePoly.pGermlinePoly.ProbGermline(X, Theta, Phi=None, kappa=100.0, mu=0.001)[source]¶
Bases:
objectCompute the posterior probability of germline polymorphism from clonal sequencing data.
Implements an EM algorithm that jointly estimates logistic annotation weights (lambda, beta) and a Beta-Binomial error concentration (kappa) to discriminate germline heterozygotes from somatic variants.
- Parameters:
X (
numpy.ndarray) – Read-count array of shape (M, J, 2).X[:, :, 0]are reference read counts andX[:, :, 1]are alternative read counts. M = number of sites, J = number of clones.Theta (
numpy.ndarray) – Site-level annotation matrix of shape (M, L).Phi (
numpy.ndarray, optional) – Clone-level annotation array of shape (M, J, B). When provided, clone-specific beta weights are estimated in addition to lambda. Default is None.kappa (
float, optional) – Initial concentration parameter for the Beta-Binomial error prior. Default is 100.0.mu (
float, optional) – Fixed mean sequencing error rate used in the Beta-Binomial model. Default is 1e-3.
- impute_anno()[source]¶
Impute missing annotation values using column-wise means.
Replaces NaN entries in
self.Thetain-place with the mean of the corresponding annotation column, computed over all non-missing sites.
- mle_vaf(naive=True, eps=0.001, **kwargs)[source]¶
Estimate the per-site MLE variant allele frequency from pooled clone reads.
Results are stored in
self.vaf(shape (M,)) andself.logl_vaf(shape (M,)).
- loglik_ratio_het(**kwargs)[source]¶
Compute the likelihood ratio statistic for each site.
Returns 2*(LL_somatic - LL_het). Under the null hypothesis that the site is a germline heterozygote this is asymptotically chi-squared with 1 degree of freedom.
- Parameters:
**kwargs – Forwarded to
mle_vaf()(if VAF has not yet been computed) and tologlik_ratio().- Returns:
LRT statistics of shape (M,).
- Return type:
- prior_poly(lambdas=array([0., 0.]))[source]¶
Compute the log prior probability of germline heterozygosity.
Evaluates log sigma(Theta @ lambdas) for all M sites under the logistic model.
- Parameters:
lambdas (
numpy.ndarray, optional) – Site-level annotation weights, shape (L,). Default is zeros.- Returns:
Log prior probabilities of shape (M,), in (-inf, 0].
- Return type:
- post_prob_poly(lambdas=array([0., 0.]), betas=None, kappa=None, **kwargs)[source]¶
Compute the log posterior probability of germline heterozygosity for all sites.
Evaluates log P(z_k = het | A_k, R_k) for each of the M sites using the current or supplied model parameters.
- Parameters:
lambdas (
numpy.ndarray, optional) – Site-level annotation weights, shape (L,). Default is zeros.betas (
numpy.ndarrayorNone, optional) – Clone-level annotation weights, shape (B,). None uses zeros.kappa (
floatorNone, optional) – Beta-Binomial concentration parameter. None usesself.kappa.**kwargs – Accepted for interface compatibility; currently unused.
- Returns:
Log posterior probabilities of shape (M,), in (-inf, 0].
- Return type:
- est_vaf_CI(alpha=0.05, df=1, **kwargs)[source]¶
Estimate per-site profile-likelihood confidence intervals for the VAF.
Uses the Wilks approximation: the CI is the set of VAF values p for which 2*(LL_mle - LL_p) < chi2(1-alpha, df). Bounds are found via
scipy.optimize.brentq.- Parameters:
- Returns:
Array of shape (M, 3) with columns [lower_CI, MLE_VAF, upper_CI].
- Return type:
- est_germline_genotype(lambdas=None, betas=None, allele_freq=None, p_hom_alt=0.5)[source]¶
Compute per-site log-posterior probabilities over germline genotypes {0/0, 0/1, 1/1}.
Evaluates the joint binomial log-likelihood of clone read counts under each diploid genotype at the phylogenetic root, combined with a genotype prior, and returns the normalized log-posterior. Because J clones contribute independently, genotype uncertainty decreases quickly with increasing J and per-clone depth — providing far more resolution than a single germline sample. The 0/1 likelihood is identical to the
logprob_het()term used inpost_prob_poly().- Parameters:
lambdas (
numpy.ndarrayorNone, optional) – Site-level annotation weights, shape (L,). Only used whenallele_freqis None. None uses zeros.betas (
numpy.ndarrayorNone, optional) – Clone-level annotation weights, shape (B,). Only used whenallele_freqis None. None uses zeros.allele_freq (
numpy.ndarrayorNone, optional) – Per-site population allele frequencies, shape (M,), values in [0, 1]. When provided, the genotype prior follows Hardy-Weinberg equilibrium: P(0/0) = (1-p)^2, P(0/1) = 2p(1-p), P(1/1) = p^2. When None, the logistic annotation model supplies P(0/1) = sigma(Theta @ lambdas), and the remaining mass is split between 0/0 and 1/1 according top_hom_alt.p_hom_alt (
float, optional) – Fraction of the non-het prior mass assigned to 1/1 whenallele_freqis None. Must be strictly between 0 and 1. Default is 0.5 (symmetric split between 0/0 and 1/1).
- Returns:
Log-posterior probabilities of shape (M, 3), with columns [log P(0/0|data), log P(0/1|data), log P(1/1|data)], normalized so that logsumexp over columns equals 0 for every site.
- Return type:
- Raises:
ValueError – If
p_hom_altis not strictly between 0 and 1.
Notes
The per-clone binomial log-likelihoods for genotype G at site k, summed across J clones (combinatorial coefficient omitted):
log P(X_k | G=0/0) = sum_j a_j log(eps) + r_j log(1 - eps) log P(X_k | G=0/1) = sum_j n_j log(0.5) log P(X_k | G=1/1) = sum_j a_j log(1 - eps) + r_j log(eps)
where eps =
self.mu, a_j and r_j are the alt and ref read counts for clone j, and n_j = a_j + r_j. Clones with zero coverage contribute zero to the sum and therefore carry no information.
- complete_logll(lambdas=array([0., 0.]), betas=None, kappa=None, **kwargs)[source]¶
Compute the observed data log-likelihood summed over all M sites.
Evaluates sum_k log P(A_k, R_k) by marginalizing the latent class over each site.
- Parameters:
lambdas (
numpy.ndarray, optional) – Site-level annotation weights, shape (L,). Default is zeros.betas (
numpy.ndarrayorNone, optional) – Clone-level annotation weights, shape (B,). None uses zeros.kappa (
floatorNone, optional) – Beta-Binomial concentration parameter. None usesself.kappa.**kwargs – Accepted for interface compatibility; currently unused.
- Returns:
sum_k log P(A_k, R_k | lambdas, betas, kappa).
- Return type:
- naive_mle(algo='L-BFGS-B', **kwargs)[source]¶
Direct MLE of site-level annotation weights with betas fixed at zero.
Maximises the observed log-likelihood via
scipy.optimize.minimize.- Parameters:
algo (
str, optional) – Scipy minimisation algorithm. One of"L-BFGS-B","Powell", or"Nelder-Mead". Default is"L-BFGS-B".**kwargs – Forwarded to
scipy.optimize.minimize.
- Returns:
MLE site-level annotation weights, shape (L,).
- Return type:
- em_algo(lambdas=None, betas=None, kappa=None, algo='L-BFGS-B', delta_logll=0.0001, max_iter=50, **kwargs)[source]¶
Run the EM algorithm to jointly estimate (lambda, beta, kappa).
Iterates E-step / M-step until the absolute change in observed log-likelihood falls below
delta_logll, as described in Eqs. 10-14.- Parameters:
lambdas (
numpy.ndarrayorNone, optional) – Initial site-level annotation weights, shape (L,). None uses zeros.betas (
numpy.ndarrayorNone, optional) – Initial clone-level annotation weights, shape (B,). None uses zeros.kappa (
floatorNone, optional) – Initial Beta-Binomial concentration. None usesself.kappa.algo (
str, optional) – Scipy optimizer for the (lambda, beta) M-step. One of"L-BFGS-B","Powell", or"Nelder-Mead". Default is"L-BFGS-B".delta_logll (
float, optional) – Convergence threshold on the absolute change in observed log-likelihood between successive EM iterations. Default is 1e-4.max_iter (
int, optional) – Maximum number of EM iterations before stopping regardless of convergence. Default is 50.**kwargs – Currently unused; accepted for forward compatibility.
- Returns:
loglls (
numpy.ndarray) – Observed log-likelihood trace, length = number of iterations + 1.lambdas_hat (
numpy.ndarray) – Estimated site-level annotation weights, shape (L,).betas_hat (
numpy.ndarray) – Estimated clone-level annotation weights, shape (B,).kappa_hat (
float) – Estimated Beta-Binomial concentration parameter.
- class pGermlinePoly.pGermlinePoly.MutectLOD(X)[source]¶
Bases:
objectCompute per-site LOD scores following the Mutect2 / Williams et al. model.
- Parameters:
X (
numpy.ndarray) – Read-count array of shape (M, J, 2). Only biallelic variants are supported (X.shape[2]must equal 2).
- lod¶
Per-site log-likelihood matrix of shape (M, 3), populated by
lod_scores().- Type:
- lod_germline[source]¶
Per-site germline LOD scores of shape (M,), populated by
lod_germline().- Type:
- lod_scores(q=30.0)[source]¶
Compute per-site log-likelihoods under three VAF hypotheses.
Populates
self.lodwith shape (M, 3):column 0: log-likelihood under VAF = 0 (no mutation)
column 1: log-likelihood under the MLE VAF
column 2: log-likelihood under VAF = 0.5 (germline heterozygote)
- Parameters:
q (
float, optional) – Phred-scaled base quality used to derive the error rate. Must be positive. Default is 30.0.
- est_germline_prior(anno)[source]¶
Set per-site germline priors from a dbSNP-like annotation array.
- Parameters:
anno (
numpy.ndarray) – Binary or continuous annotation values, shape (M,).- Raises:
NotImplementedError – This method is not yet implemented.
- lod_germline(p_somatic=3e-06, p_germline=0.095)[source]¶
Compute the per-site LOD score for germline origin.
LOD_germline = (LL_mle + log p_somatic) - (LL_het + log p_germline), converted to base-10 log-odds. A positive LOD score favours somatic origin; a negative value favours germline origin.
Requires
lod_scores()to have been called first. Result is stored inself.lod_germline, shape (M,).
- class pGermlinePoly.pGermlinePoly.BetaOverdispersion(X)[source]¶
Bases:
objectEstimate per-site overdispersion under the Beta-Binomial model.
Implements the overdispersion test from Spencer-Chapman et al. by fitting the rho parameter of a Beta-Binomial distribution to the observed allele counts across clones.
- Parameters:
X (
numpy.ndarray) – Read-count array of shape (M, J, 2). Only biallelic variants are supported (X.shape[2]must equal 2).
- estimate_rhos()[source]¶
Estimate the per-site overdispersion parameter rho.
For each site, maximises the Beta-Binomial log-likelihood over rho in (0, 1) via
scipy.optimize.minimize_scalar, where the Beta parameters are alpha = phat*(1-rho)/rho and beta = (1-phat)*(1-rho)/rho.- Returns:
Per-site MLE overdispersion values, shape (M,).
- Return type:
Simulation¶
- class pGermlinePoly.pGermlinePoly.ClonalSim(seq_len=10000000.0, n_clones=10)[source]¶
Bases:
objectSimulate clonal sequencing data with germline and somatic variants.
Generates a synthetic dataset containing a germline sample and a set of clonal samples, complete with germline heterozygotes, somatic mutations placed on a neutral coalescent genealogy, and realistic read counts. Output can be written as a VCF file suitable for use with the CLI.
- Parameters:
- simulate_germline(afs=[0.31699444395046117, 6.067159920986527], het_rate=0.001, mean_coverage=15.0, sd_coverage=5.0, mut_rate=1.2e-08, q=30, seed=42)[source]¶
Simulate germline heterozygotes and de-novo mutations for the germline sample.
Draws the number of heterozygous sites from a Poisson distribution, samples allele frequencies from a Beta distribution parameterised by
afs, and simulates read counts under a Normal coverage model. Populatesself.germline_muts,self.germline_af,self.germline_alt_reads,self.germline_tot_reads, andself.germline_pl.- Parameters:
afs (
listoffloat, optional) – Shape parameters [a, b] of the Beta prior on allele frequencies in the external population. Default is [0.317, 6.067].het_rate (
float, optional) – Expected heterozygous site density per base-pair. Default is 1e-3.mean_coverage (
float, optional) – Mean sequencing coverage for the germline sample. Default is 15.0.sd_coverage (
float, optional) – Standard deviation of sequencing coverage. Default is 5.0.mut_rate (
float, optional) – De-novo mutation rate per base-pair. Default is 1.2e-8.q (
int, optional) – Phred-scaled read quality used for genotype likelihood computation. Default is 30.seed (
int, optional) – Random seed for reproducibility. Default is 42.
- Raises:
ValueError – If no heterozygous sites are simulated (increase
het_rateorseq_len).
- simulate_clone_genealogy(age=45, seed=42)[source]¶
Simulate a somatic genealogy for the clonal samples under a neutral coalescent.
Uses msprime to simulate a single-locus genealogy for
n_cloneshaploid samples. Branch lengths are later rescaled byagewhen simulating somatic mutations. Populatesself.genealogy.
- sim_somatic_mutations(age=45, mut_rate=5e-09, mean_coverage=15.0, sd_coverage=5.0, q=30, seed=42)[source]¶
Simulate somatic mutations on branches of the clonal genealogy.
Traverses each branch of
self.genealogy, draws the number of mutations from a Poisson distribution scaled by the branch length, age, genome size, and mutation rate, then assigns read counts to the leaf clones that descend from the mutated branch. Populatesself.somatic_muts,self.somatic_alt_reads,self.somatic_tot_reads, andself.somatic_mut_pl.- Parameters:
age (
int, optional) – Age of the individual in years, used to rescale branch lengths. Default is 45.mut_rate (
float, optional) – Somatic mutation rate in mutations per base-pair per year (diploid rate). Default is 5e-9.mean_coverage (
float, optional) – Mean sequencing coverage per clone. Default is 15.0.sd_coverage (
float, optional) – Standard deviation of sequencing coverage. Default is 5.0.q (
int, optional) – Phred-scaled read quality for genotype likelihood computation. Default is 30.seed (
int, optional) – Random seed for reproducibility. Default is 42.
- simulate_clonal_germline_muts(mean_coverage=15.0, sd_coverage=5.0, q=30, seed=42)[source]¶
Simulate germline heterozygote read counts across all clonal samples.
For each germline site, draws per-clone coverage from a Normal distribution and alt counts from a Binomial(p=0.5) distribution. Populates
self.germline_clone_tot_reads,self.germline_clone_alt_reads, andself.germline_clone_pl.- Parameters:
mean_coverage (
float, optional) – Mean sequencing coverage per clone. Default is 15.0.sd_coverage (
float, optional) – Standard deviation of sequencing coverage. Default is 5.0.q (
int, optional) – Phred-scaled read quality for genotype likelihood computation. Default is 30.seed (
int, optional) – Random seed for reproducibility. Default is 42.
- simulate_germline_somatic_muts(mean_coverage=15.0, sd_coverage=5.0, q=30, eps=0.01, seed=42)[source]¶
Simulate read counts for somatic mutations as seen in the germline sample.
Draws coverage from a Normal distribution and alt counts from a Binomial(p=``eps``) distribution (the germline sample should show only error-level alt reads at somatic sites). Populates
self.germline_somatic_tot_reads,self.germline_somatic_alt_reads, andself.germline_somatic_pl.- Parameters:
mean_coverage (
float, optional) – Mean sequencing coverage for the germline sample. Default is 15.0.sd_coverage (
float, optional) – Standard deviation of sequencing coverage. Default is 5.0.q (
int, optional) – Phred-scaled read quality. Default is 30.eps (
float, optional) – Error rate used to simulate alt read counts at somatic sites in the germline. Default is 1e-2.seed (
int, optional) – Random seed for reproducibility. Default is 42.
- create_read_matrix()[source]¶
Build a read-count matrix from the simulated somatic and germline data.
Stacks somatic sites above germline sites to produce a combined array. Each entry stores [ref_reads, alt_reads] per clone.
- Returns:
Integer read-count array of shape (M_somatic + M_germline, J, 2), where the last dimension is [ref_reads, alt_reads].
- Return type:
- create_gt_string(alt_reads=0, tot_reads=0, pl=array([0, 0, 0]))[source]¶
Format read counts and genotype likelihoods as a VCF genotype field string.
Determines the GT call from the read counts, formats AD, DP, GQ, and PL fields, and returns them as a colon-delimited string.
- Parameters:
alt_reads (
int, optional) – Number of alternative allele reads. Default is 0.tot_reads (
int, optional) – Total read depth. Default is 0.pl (
numpy.ndarray, optional) – Phred-scaled genotype likelihoods [PL(0/0), PL(0/1), PL(1/1)]. Default is [0, 0, 0].
- Returns:
- write_vcf(out=None)[source]¶
Write the simulated variants to a VCF file.
Produces a VCFv4.2 file containing one germline sample (
Agermline) followed by J clonal samples (Aclone0…AcloneN). Germline heterozygotes are written first, then somatic variants. The INFO field includes AC, AF, AN, DP, ExternalAF, and SM (somatic indicator).- Parameters:
out (
str) – Output file path. Must be writable.
I/O Utilities¶
Module to help with IO routines and validation.
- pGermlinePoly.io.validate_config(config_yaml_fp, schema={'age': {'min': 0.0, 'required': True, 'type': 'number'}, 'annotations': {'required': True, 'type': 'list'}, 'clones': {'required': True, 'schema': {'type': 'string'}, 'type': 'list'}, 'germline': {'required': False, 'schema': {'type': 'string'}, 'type': 'list'}, 'ind': {'required': True, 'type': 'string'}, 'sex': {'allowed': ['M', 'F'], 'maxlength': 1, 'required': True, 'type': 'string'}})[source]¶
Validate a YAML configuration file against the germline schema.
- Parameters:
- Returns:
Parsed and validated configuration dictionary.
- Return type:
- Raises:
AssertionError – If the configuration does not conform to the schema.
- pGermlinePoly.io.check_samples(vcf, samples=[])[source]¶
Assert that all requested sample names are present in the VCF.
- Parameters:
- Raises:
AssertionError – If any name in
samplesis not found invcf.samples.
- pGermlinePoly.io.check_annotations(vcf, annotations=['PL', 'AD'])[source]¶
Assert that required annotation fields are declared in the VCF header.
Checks both INFO and FORMAT fields via
vcf.contains.- Parameters:
- Raises:
AssertionError – If any field ID in
annotationsis not declared in the VCF header.
- pGermlinePoly.io.create_germline_anno(vcf, **kwargs)[source]¶
Extract per-site germline heterozygote log-likelihoods from a germline VCF.
Iterates over biallelic SNPs, reads the AD FORMAT field of the first sample (assumed to be the germline sample), computes Phred-scaled genotype likelihoods via
geno_loglik, and returns the heterozygote PL value (index 1) for each site.- Parameters:
vcf (
cyvcf2.VCF) – Opened VCF with an “AD” FORMAT field. The first sample is treated as the germline reference.**kwargs – Additional keyword arguments forwarded to
geno_loglik(e.g.,q).
- Returns:
Heterozygote genotype log-likelihoods for each biallelic SNP, shape (M,), dtype float64.
- Return type:
- Raises:
AssertionError – If the VCF does not contain the “AD” FORMAT field.
- pGermlinePoly.io.create_anno(vcf, annotations=[])[source]¶
Extract INFO annotation values for all variants in a VCF.
Iterates over all variants, collecting the requested INFO field values for biallelic SNPs. Non-SNP or multiallelic sites receive NaN for all requested annotations.
- Parameters:
- Returns:
Annotation matrix of shape (N, len(annotations)), where N is the total number of variants iterated.
- Return type:
- pGermlinePoly.io.create_read_matrix(vcf)[source]¶
Build a read-count matrix from the AD FORMAT field of a clonal VCF.
Iterates over all variants. For biallelic SNPs the allele depth (AD) matrix is stacked; non-SNP or multiallelic records are represented as rows of zeros.
- Parameters:
vcf (
cyvcf2.VCF) – Opened VCF object containing an “AD” FORMAT field with at least two samples (clones).- Returns:
Integer read-count array of shape (M, J, 2), where M is the number of variants, J is the number of samples, and the last dimension holds [ref_reads, alt_reads] per sample.
- Return type:
- Raises:
AssertionError – If the VCF does not contain the “AD” FORMAT field or has fewer than two samples.