pGermlinePoly API Reference

This page documents the public API of the pGermlinePoly package.

Inference

class pGermlinePoly.pGermlinePoly.ProbGermline(X, Theta, Phi=None, kappa=100.0, mu=0.001)

Bases: ReadCountUtils

Compute the posterior probability of germline polymorphism from clonal sequencing data.

Implements an EM algorithm that jointly estimates logistic annotation weights (\(\boldsymbol{\lambda}\), \(\boldsymbol{\beta}\)) and a Beta-Binomial error concentration (\(\kappa\)) to discriminate germline heterozygotes from somatic variants.

The log-posterior for site k is:

\[\log P(z_k = \text{het} \mid \mathbf{X}_k) \propto \log \sigma\!\left(\boldsymbol{\theta}_k^\top \boldsymbol{\lambda}\right) + \sum_{j=1}^{J} \log p_{\mathrm{BB}}\!\left(a_{kj} \mid n_{kj},\, \kappa\right)\]

where \(\sigma\) is the logistic function and \(p_{\mathrm{BB}}\) is the Beta-Binomial PMF, whose mean under the somatic hypothesis is the sequencing error rate \(\varepsilon\) (the mu argument).

Parameters:

X (numpy.ndarray) – Read-count array of shape (M, J, 2). X[:, :, 0] are reference read counts and X[:, :, 1] are alternative read counts. M = number of sites, J = number of clones.
Theta (numpy.ndarray) – Site-level annotation matrix of shape (M, L).
Phi (numpy.ndarray, optional) – Clone-level annotation array of shape (M, J, B). When provided, clone-specific beta weights are estimated in addition to lambda. Default is None.
kappa (float, optional) – Initial concentration parameter for the Beta-Binomial error prior. Default is 100.0.
mu (float, optional) – Fixed mean sequencing error rate used in the Beta-Binomial model. Default is 1e-3.
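The log-posterior above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: it assumes the common mean/concentration parameterisation \(\alpha = m\kappa\), \(\beta = (1-m)\kappa\) for the Beta-Binomial, and the mean of 0.5 used in the example call is an assumption, since the mean of the heterozygous term is not stated above. The names log_betabinom and unnormalised_log_post are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log of the logistic function."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_betabinom(a, n, mean, kappa):
    """Beta-Binomial log-PMF, assuming alpha = mean*kappa, beta = (1-mean)*kappa."""
    alpha, beta = mean * kappa, (1.0 - mean) * kappa
    log_choose = math.lgamma(n + 1) - math.lgamma(a + 1) - math.lgamma(n - a + 1)
    return log_choose + log_beta(a + alpha, n - a + beta) - log_beta(alpha, beta)


def unnormalised_log_post(theta_k, lambdas, alt, depth, mean, kappa):
    """Logistic log-prior plus the Beta-Binomial terms summed over clones."""
    linear = sum(t * w for t, w in zip(theta_k, lambdas))
    return log_sigmoid(linear) + sum(
        log_betabinom(a, n, mean, kappa) for a, n in zip(alt, depth)
    )


# One site, two annotations, three clones (all values are made up).
score = unnormalised_log_post(
    theta_k=[1.0, -2.0], lambdas=[0.0, 0.0],
    alt=[5, 7, 4], depth=[10, 12, 9],
    mean=0.5,      # assumed mean for the heterozygous term
    kappa=100.0,   # default concentration from the signature
)
```

With zero weights the prior term is log 0.5 for every site, so the score is driven entirely by the read counts.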

impute_anno(col_fill_values=None)

Impute missing annotation values using column-wise means.

Replaces NaN entries in self.Theta in-place with the mean of the corresponding annotation column, computed over all non-missing sites. Columns that are entirely NaN (no observed values) are filled with 0.0 and a warning is emitted; such columns carry no information and will be assigned zero weight by the optimizer.

Parameters:

col_fill_values (dict of {int: float}, optional) –

Per-column fill values that override the column-wise mean for specific columns. Keys are column indices into self.Theta.

This is important for population allele-frequency (AF) annotations where missingness is informative, not random. A site absent from gnomAD was never observed at detectable frequency — direct evidence that the allele is rare. By contrast, the column mean is computed only over sites present in the database, which skews toward common variants. Imputing missing sites with that mean assigns them a “common germline het” prior, which is the opposite of what absence implies and collapses the annotation’s discriminative power.

Pass a floor value well below the minimum observed value (e.g. nanmin(col) - 2 in log10 space, placing missing sites two orders of magnitude below the rarest in-DB entry) so that absent sites receive a low-AF prior consistent with their likely rarity.
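The floor-value rule above can be illustrated as follows. This is a sketch with made-up numbers: the column contents and the column index 0 are hypothetical, and plain Python lists stand in for a column of self.Theta.

```python
import math

# Hypothetical log10-transformed AF column; NaN marks sites absent from the database.
log10_af = [-1.2, float("nan"), -3.5, -0.4, float("nan")]

observed = [v for v in log10_af if not math.isnan(v)]

# Column-wise mean: what missing sites would receive by default.
col_mean = sum(observed) / len(observed)

# Floor two orders of magnitude below the rarest in-database entry.
floor = min(observed) - 2.0

af_col = 0  # hypothetical index of the AF column in Theta
col_fill_values = {af_col: floor}
```

Here the default mean would assign missing sites a log10 AF of about -1.7, whereas the floor assigns -5.5. The dictionary could then be passed as impute_anno(col_fill_values=col_fill_values).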

reflect_af_annotations(col_indices, transform_names=None)

Reflect allele-frequency annotation columns for reoriented sites.

For each site that was flipped by reorient_to_minor_allele(), the annotation value is mapped AF → 1−AF so that it continues to describe the minor allele rather than the original ALT allele. Reflection is performed in the raw (pre-transform) space: the current value is inverted back to a raw AF, reflected, and the transform re-applied. Sites with NaN annotation values are left untouched so that impute_anno() can handle them after reflection.

Must be called after reorient_to_minor_allele() (which sets self.flipped) and before impute_anno().

Parameters:

col_indices (list of int) – Column indices into self.Theta to reflect.
transform_names (list of str or None, optional) – Transform name applied to each column — "log10", "sqrt", or None for raw (untransformed) AF values in [0, 1]. Must be the same length as col_indices. Default is all None.

Raises:

RuntimeError – If called before reorient_to_minor_allele().
AssertionError – If transform_names is provided but its length does not match col_indices.
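The invert, reflect, re-apply sequence described above can be sketched for a single column. This is an illustration under stated assumptions rather than the method's source: FORWARD, INVERSE and reflect_value are hypothetical names, and plain lists stand in for a column of self.Theta and for self.flipped.

```python
import math

# Forward and inverse maps per transform name (None means raw AF in [0, 1]).
FORWARD = {None: lambda af: af, "log10": math.log10, "sqrt": math.sqrt}
INVERSE = {None: lambda v: v, "log10": lambda v: 10.0 ** v, "sqrt": lambda v: v * v}


def reflect_value(value, transform_name=None):
    """Map a transformed AF to the transformed value of 1 - AF."""
    if math.isnan(value):
        return value  # left untouched so it can be imputed later
    raw_af = INVERSE[transform_name](value)
    # Note: a raw AF of exactly 1 reflects to 0, which log10 cannot represent.
    return FORWARD[transform_name](1.0 - raw_af)


# Hypothetical log10 AF column and per-site flip flags.
column = [math.log10(0.9), float("nan"), math.log10(0.2)]
flipped = [True, True, False]

reflected = [
    reflect_value(v, "log10") if was_flipped else v
    for v, was_flipped in zip(column, flipped)
]
```

The first site moves from log10(0.9) to roughly log10(0.1), the NaN site stays NaN, and the unflipped site is unchanged.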

mle_vaf(naive=True, eps=0.001, **kwargs)

Estimate the per-site MLE variant allele frequency from pooled clone reads.

Results are stored in self.vaf (shape (M,)) and self.logl_vaf (shape (M,)).

Parameters:

naive (bool, optional) – If True (default), uses the empirical proportion alt/(alt+ref). If False, optimises var_loglik() via scipy.optimize.minimize_scalar.
eps (float, optional) – Sequencing error rate used in the likelihood. Default is 1e-3.
**kwargs – Forwarded to var_loglik().

loglik_ratio_het(**kwargs)

Compute the likelihood ratio statistic for each site.

\[\Lambda_k = 2\bigl[\ell(\hat{p}_k) - \ell(0.5)\bigr] \;\overset{H_0}{\sim}\; \chi^2_1\]

Under the null hypothesis that the site is a germline heterozygote (\(H_0 : p = 0.5\)), \(\Lambda_k\) is asymptotically chi-squared with 1 degree of freedom.

Parameters:

**kwargs – Forwarded to mle_vaf() (if VAF has not yet been computed) and to loglik_ratio().

Returns:

LRT statistics of shape (M,).

Return type:

numpy.ndarray
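As a worked illustration of the statistic, the sketch below pools reads across clones, takes the naive VAF estimate, and evaluates \(\Lambda_k\) for one site. The plain binomial likelihood and the clipping of p to [eps, 1 - eps] are assumptions made for this example and are not taken from var_loglik(). The p-value uses the identity that the \(\chi^2_1\) survival function equals erfc(sqrt(x / 2)).

```python
import math


def binom_loglik(p, alt, ref, eps=1e-3):
    """Pooled binomial log-likelihood; p is clipped to [eps, 1 - eps] (assumption)."""
    p = min(max(p, eps), 1.0 - eps)
    return alt * math.log(p) + ref * math.log(1.0 - p)


def lrt_het(alt_counts, ref_counts):
    """Return (naive VAF, LRT statistic, chi-squared(1) p-value) for one site."""
    alt, ref = sum(alt_counts), sum(ref_counts)  # assumes total depth > 0
    p_hat = alt / (alt + ref)
    stat = 2.0 * (binom_loglik(p_hat, alt, ref) - binom_loglik(0.5, alt, ref))
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return p_hat, stat, p_value


# Three clones each; counts are made up.
balanced = lrt_het([5, 6, 4], [5, 6, 6])    # pooled VAF close to 0.5
skewed = lrt_het([1, 0, 2], [20, 18, 22])   # pooled VAF far below 0.5
```

A small statistic (as for balanced) is compatible with the heterozygous null, while a large one (as for skewed) rejects it.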

prior_poly(lambdas=array([0., 0.]))

Compute the log prior probability of germline heterozygosity.

Evaluates the logistic prior for all M sites:

\[\log \pi_k = \log \sigma\!\left(\boldsymbol{\theta}_k^\top \boldsymbol{\lambda}\right)\]

Parameters:

lambdas (numpy.ndarray, optional) – Site-level annotation weights, shape (L,). Default is zeros.

Returns:

Log prior probabilities of shape (M,), in (-inf, 0].

Return type:

numpy.ndarray

post_prob_poly(lambdas=array([0., 0.]), betas=None, kappa=None, **kwargs)

Compute the log posterior probability of germline heterozygosity for all sites.

Evaluates \(\log P(z_k = \text{het} \mid \mathbf{X}_k)\) for each of the M sites by combining the logistic prior with the Beta-Binomial likelihood ratio via Bayes’ theorem.

Parameters:

lambdas (numpy.ndarray, optional) – Site-level annotation weights, shape (L,). Default is zeros.
betas (numpy.ndarray or None, optional) – Clone-level annotation weights, shape (B,). None uses zeros.
kappa (float or None, optional) – Beta-Binomial concentration parameter. None uses self.kappa.
**kwargs – Accepted for interface compatibility; currently unused.

Returns:

Log posterior probabilities of shape (M,), in (-inf, 0].

Return type:

numpy.ndarray

est_vaf_CI(alpha=0.05, df=1, **kwargs)

Estimate per-site profile-likelihood confidence intervals for the VAF.

Uses the Wilks approximation: the CI is the set of VAF values p whose likelihood-ratio statistic does not exceed the chi-squared quantile:

\[\widehat{\mathrm{CI}}_{1-\alpha} = \bigl\{p \in [0,1] : 2\bigl[\ell(\hat{p}_k) - \ell(p)\bigr] \leq \chi^2_{1-\alpha,\,\mathrm{df}}\bigr\}\]

Bounds are found via scipy.optimize.brentq.

Parameters:

alpha (float, optional) – Significance level for the CI. Default is 0.05 (95% CI).
df (int, optional) – Degrees of freedom for the chi-squared threshold. Default is 1.
**kwargs – Forwarded to var_loglik().

Returns:

Array of shape (M, 3) with columns [lower_CI, MLE_VAF, upper_CI].

Return type:

numpy.ndarray
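The interval construction can be sketched for a single site with df = 1. This example substitutes a simple bisection for scipy.optimize.brentq so that it runs on the standard library alone, and it uses a plain binomial likelihood rather than var_loglik(); both are assumptions of the sketch. For df = 1 the chi-squared quantile equals the square of the corresponding normal quantile.

```python
import math
from statistics import NormalDist


def binom_loglik(p, alt, ref):
    return alt * math.log(p) + ref * math.log(1.0 - p)


def bisect_root(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def profile_ci(alt, ref, alpha=0.05):
    """Return (lower, MLE, upper); requires 0 < alt < alt + ref."""
    threshold = NormalDist().inv_cdf(1.0 - alpha / 2.0) ** 2  # chi2 quantile, df=1
    p_hat = alt / (alt + ref)
    ll_max = binom_loglik(p_hat, alt, ref)

    def excess(p):
        # Positive outside the interval, negative inside it.
        return 2.0 * (ll_max - binom_loglik(p, alt, ref)) - threshold

    tiny = 1e-9
    lower = bisect_root(excess, tiny, p_hat)
    upper = bisect_root(excess, p_hat, 1.0 - tiny)
    return lower, p_hat, upper


lower, p_hat, upper = profile_ci(alt=15, ref=17)
```

Sites with alt = 0 or ref = 0 have the MLE on the boundary and would need separate handling, which this sketch omits.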

est_germline_genotype(lambdas=None, betas=None, allele_freq=None, p_hom_alt=0.5)

Compute per-site log-posterior probabilities over germline genotypes {0/0, 0/1, 1/1}.

Evaluates the joint binomial log-likelihood of clone read counts under each diploid genotype at the phylogenetic root, combines it with a genotype prior, and returns the normalized log-posterior. Because the J clones contribute independently, genotype uncertainty decreases quickly as J and per-clone depth increase, which gives more resolution than a single germline sample. The 0/1 likelihood is identical to the logprob_het() term used in post_prob_poly().

Parameters:

lambdas (numpy.ndarray or None, optional) – Site-level annotation weights, shape (L,). Only used when allele_freq is None. None uses zeros.
betas (numpy.ndarray or None, optional) – Clone-level annotation weights, shape (B,). Only used when allele_freq is None. None uses zeros.
allele_freq (numpy.ndarray or None, optional) – Per-site population allele frequencies, shape (M,), values in [0, 1]. When provided, the genotype prior follows Hardy-Weinberg equilibrium: P(0/0) = (1-p)^2, P(0/1) = 2p(1-p), P(1/1) = p^2. When None, the logistic annotation model supplies P(0/1) = sigma(Theta @ lambdas), and the remaining mass is split between 0/0 and 1/1 according to p_hom_alt.
p_hom_alt (float, optional) – Fraction of the non-het prior mass assigned to 1/1 when allele_freq is None. Must be strictly between 0 and 1. Default is 0.5 (symmetric split between 0/0 and 1/1).

Returns:

Log-posterior probabilities of shape (M, 3), with columns [log P(0/0|data), log P(0/1|data), log P(1/1|data)], normalized so that logsumexp over columns equals 0 for every site.

Return type:

numpy.ndarray

Raises:

ValueError – If p_hom_alt is not strictly between 0 and 1.

Notes

The per-clone binomial log-likelihoods for genotype G at site k, summed across the J clones (combinatorial coefficient omitted), are:

\[\begin{split}\begin{aligned} \log P(\mathbf{X}_k \mid G = 0/0) &= \sum_j \bigl[a_j \log \varepsilon + r_j \log(1-\varepsilon)\bigr] \\ \log P(\mathbf{X}_k \mid G = 0/1) &= \sum_j n_j \log 0.5 \\ \log P(\mathbf{X}_k \mid G = 1/1) &= \sum_j \bigl[a_j \log(1-\varepsilon) + r_j \log \varepsilon\bigr] \end{aligned}\end{split}\]

where \(\varepsilon\) = self.mu, \(a_j\) and \(r_j\) are the alt and ref read counts for clone j, and \(n_j = a_j + r_j\). Clones with zero coverage contribute zero to the sum and therefore carry no information.
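The three log-likelihoods in the Notes, combined with the Hardy-Weinberg prior described for allele_freq, can be sketched for one site as follows. This covers only the allele_freq branch and is an illustration rather than the method's source; genotype_log_post is a hypothetical name.

```python
import math


def logsumexp(values):
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def genotype_log_post(alt, ref, allele_freq, eps=1e-3):
    """Normalized log-posterior over [0/0, 0/1, 1/1]; needs 0 < allele_freq < 1."""
    a, r = sum(alt), sum(ref)  # per-clone terms add up, so pooling is equivalent
    loglik = [
        a * math.log(eps) + r * math.log(1.0 - eps),   # G = 0/0
        (a + r) * math.log(0.5),                       # G = 0/1
        a * math.log(1.0 - eps) + r * math.log(eps),   # G = 1/1
    ]
    p = allele_freq
    log_prior = [                                      # Hardy-Weinberg prior
        2.0 * math.log(1.0 - p),
        math.log(2.0 * p * (1.0 - p)),
        2.0 * math.log(p),
    ]
    joint = [ll + lp for ll, lp in zip(loglik, log_prior)]
    norm = logsumexp(joint)
    return [j - norm for j in joint]


# Made-up read counts for three clones.
het_like = genotype_log_post([5, 7, 4], [6, 5, 5], allele_freq=0.1)
ref_like = genotype_log_post([0, 0, 0], [10, 12, 9], allele_freq=0.1)
```

In both cases the returned values satisfy logsumexp equal to 0, matching the normalization stated under Returns.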

complete_logll(lambdas=array([0., 0.]), betas=None, kappa=None, **kwargs)

Compute the observed-data log-likelihood summed over all M sites.

Marginalises the latent germline indicator \(z_k\) at each site:

\[\mathcal{L}(\boldsymbol{\lambda}, \boldsymbol{\beta}, \kappa) = \sum_{k=1}^{M} \log P\!\left(A_k, R_k \mid \boldsymbol{\lambda}, \boldsymbol{\beta}, \kappa\right)\]

Parameters:

lambdas (numpy.ndarray, optional) – Site-level annotation weights, shape (L,). Default is zeros.
betas (numpy.ndarray or None, optional) – Clone-level annotation weights, shape (B,). None uses zeros.
kappa (float or None, optional) – Beta-Binomial concentration parameter. None uses self.kappa.
**kwargs – Accepted for interface compatibility; currently unused.

Returns:

sum_k log P(A_k, R_k | lambdas, betas, kappa).

Return type:

float

naive_mle(algo='L-BFGS-B', **kwargs)

Direct MLE of site-level annotation weights with betas fixed at zero.

Maximises the observed log-likelihood via scipy.optimize.minimize.

Parameters:

algo (str, optional) – Scipy minimisation algorithm. One of "L-BFGS-B", "Powell", or "Nelder-Mead". Default is "L-BFGS-B".
**kwargs – Forwarded to scipy.optimize.minimize.

Returns:

MLE site-level annotation weights, shape (L,).

Return type:

numpy.ndarray

m_step_lambda_beta(eta, gammas, lambdas0, betas0, algo='L-BFGS-B')

Run the M-step to update annotation weights via weighted logistic regression.

Maximises the EM lower bound Q with respect to (\(\boldsymbol{\lambda}\), \(\boldsymbol{\beta}\)):

\[Q(\boldsymbol{\lambda}, \boldsymbol{\beta}) = \sum_k \bigl[\eta_k \log \sigma(\pi_k) + (1-\eta_k)\log(1-\sigma(\pi_k))\bigr] + \sum_{k,j} \bigl[\gamma_{kj} \log \sigma(\phi_{kj}) + (1-\gamma_{kj})\log(1-\sigma(\phi_{kj}))\bigr]\]

where \(\pi_k = \boldsymbol{\theta}_k^\top \boldsymbol{\lambda}\) and \(\phi_{kj} = \boldsymbol{\theta}_k^\top \boldsymbol{\lambda} + \boldsymbol{\phi}_{kj}^\top \boldsymbol{\beta}\). Implemented by minimising \(-Q\) via scipy.optimize.minimize.

Parameters:

eta (numpy.ndarray) – Site-level responsibilities from the E-step, shape (M,).
gammas (numpy.ndarray) – Clone-level responsibilities from the E-step, shape (M, J).
lambdas0 (numpy.ndarray) – Initial site-level annotation weights, shape (L,).
betas0 (numpy.ndarray) – Initial clone-level annotation weights, shape (B,).
algo (str, optional) – Scipy minimisation algorithm. Default is "L-BFGS-B".

Returns:

lambdas (numpy.ndarray) – Updated site-level annotation weights, shape (L,).
betas (numpy.ndarray) – Updated clone-level annotation weights, shape (B,).

em_algo(lambdas=None, betas=None, kappa=None, algo='L-BFGS-B', delta_logll=0.0001, max_iter=50, **kwargs)

Run the EM algorithm to jointly estimate (lambda, beta, kappa).

Iterates E-step / M-step until the absolute change in observed log-likelihood falls below delta_logll, as described in Eqs. 10-14.

Parameters:

lambdas (numpy.ndarray or None, optional) – Initial site-level annotation weights, shape (L,). None uses zeros.
betas (numpy.ndarray or None, optional) – Initial clone-level annotation weights, shape (B,). None uses zeros.
kappa (float or None, optional) – Initial Beta-Binomial concentration. None uses self.kappa.
algo (str, optional) – Scipy optimizer for the (lambda, beta) M-step. One of "L-BFGS-B", "Powell", or "Nelder-Mead". Default is "L-BFGS-B".
delta_logll (float, optional) – Convergence threshold on the absolute change in observed log-likelihood between successive EM iterations. Default is 1e-4.
max_iter (int, optional) – Maximum number of EM iterations before stopping regardless of convergence. Default is 50.
**kwargs – Currently unused; accepted for forward compatibility.

Returns:

loglls (numpy.ndarray) – Observed log-likelihood trace, length = number of iterations + 1.
lambdas_hat (numpy.ndarray) – Estimated site-level annotation weights, shape (L,).
betas_hat (numpy.ndarray) – Estimated clone-level annotation weights, shape (B,).
kappa_hat (float) – Estimated Beta-Binomial concentration parameter.
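The stopping rule and the shape of the returned trace can be illustrated with a deliberately simplified toy model. The sketch below is not the package's EM: it replaces the logistic prior with a single mixing weight (equivalent to an intercept-only annotation), replaces the Beta-Binomial with two fixed binomial components, and estimates only that weight. The value p_other = 0.1 and all read counts are made up.

```python
import math


def binom_logpmf(a, n, p):
    log_choose = math.lgamma(n + 1) - math.lgamma(a + 1) - math.lgamma(n - a + 1)
    return log_choose + a * math.log(p) + (n - a) * math.log(1.0 - p)


def em_mixing_weight(sites, p_other=0.1, pi=0.5, delta_logll=1e-4, max_iter=50):
    """Toy EM for the het fraction pi; sites is a list of (alt, depth) pairs."""
    # Component likelihoods are fixed, so compute them once.
    lik = [
        (math.exp(binom_logpmf(a, n, 0.5)), math.exp(binom_logpmf(a, n, p_other)))
        for a, n in sites
    ]

    def observed_logll(weight):
        return sum(math.log(weight * h + (1.0 - weight) * o) for h, o in lik)

    loglls = [observed_logll(pi)]
    for _ in range(max_iter):
        # E-step: posterior responsibility of the het component per site.
        eta = [pi * h / (pi * h + (1.0 - pi) * o) for h, o in lik]
        # M-step: closed-form update of the mixing weight.
        pi = sum(eta) / len(eta)
        loglls.append(observed_logll(pi))
        # Stop when the observed log-likelihood barely changes.
        if abs(loglls[-1] - loglls[-2]) < delta_logll:
            break
    return loglls, pi


sites = [(15, 30), (14, 28), (3, 30), (2, 25), (16, 31), (4, 33)]
loglls, pi_hat = em_mixing_weight(sites, pi=0.2)
```

As in the description above, the trace holds one entry per iteration plus the initial value, and EM guarantees that it never decreases.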

class pGermlinePoly.pGermlinePoly.MutectLOD(X)

Bases: ReadCountUtils

Compute per-site LOD scores following the Mutect / Cibulskis et al. model.

Based on the somatic variant calling approach described in: Cibulskis et al., Nature Biotechnology (2013). https://doi.org/10.1038/nbt.2514

Parameters:

X (numpy.ndarray) – Read-count array of shape (M, J, 2). Only biallelic variants are supported (X.shape[2] must equal 2).

lod_scores(q=30.0)

Compute per-site log-likelihoods under three VAF hypotheses.

Populates self.lod with shape (M, 3), where each column is \(\ell(p) = \sum_j \bigl[a_j \log p_\varepsilon + r_j \log(1-p_\varepsilon)\bigr]\) evaluated at:

column 0: \(p = 0\) (no mutation)
column 1: \(p = \hat{p}\) (MLE VAF)
column 2: \(p = 0.5\) (germline heterozygote)

Parameters:

q (float, optional) – Phred-scaled base quality used to derive the error rate. Must be positive. Default is 30.0.

est_germline_prior(anno)

Set per-site germline priors from a dbSNP-like annotation array.

Parameters:

anno (numpy.ndarray) – Binary or continuous annotation values, shape (M,).

Raises:

NotImplementedError – This method is not yet implemented.

lod_germline(p_somatic=3e-06, p_germline=0.095)

Compute the per-site LOD score for germline origin.

\[\text{LOD}_\text{germ} = \frac{1}{\ln 10} \Bigl[\bigl(\ell(\hat{p}) + \ln p_\text{somatic}\bigr) - \bigl(\ell(0.5) + \ln p_\text{germline}\bigr)\Bigr]\]

A positive score favours somatic origin; a negative score favours germline origin.

Requires lod_scores() to have been called first. Result is stored in self.lod_germline, shape (M,).

Parameters:

p_somatic (float, optional) – Prior probability of somatic origin. Default is 3e-6.
p_germline (float, optional) – Prior probability of germline origin. Default is 0.095.
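The LOD formula can be evaluated end to end for one site as sketched below. Two parts are assumptions of this example: the error-adjusted frequency is taken to be \(p_\varepsilon = p(1-\varepsilon) + (1-p)\varepsilon\), which is not defined above, and the pooled alt fraction is used as a stand-in for the MLE VAF. The Phred conversion \(\varepsilon = 10^{-q/10}\) is the standard definition.

```python
import math


def phred_to_error(q):
    """Standard Phred conversion: q = 30 corresponds to an error rate of 1e-3."""
    return 10.0 ** (-q / 10.0)


def loglik(p, alt, ref, eps):
    # Assumed symmetric error mixing; the package's p_eps may differ.
    p_eps = p * (1.0 - eps) + (1.0 - p) * eps
    return sum(
        a * math.log(p_eps) + r * math.log(1.0 - p_eps) for a, r in zip(alt, ref)
    )


def lod_germline_sketch(alt, ref, q=30.0, p_somatic=3e-6, p_germline=0.095):
    eps = phred_to_error(q)
    p_hat = sum(alt) / (sum(alt) + sum(ref))  # pooled proportion, assumes depth > 0
    ll_mle = loglik(p_hat, alt, ref, eps)
    ll_het = loglik(0.5, alt, ref, eps)
    return (
        (ll_mle + math.log(p_somatic)) - (ll_het + math.log(p_germline))
    ) / math.log(10.0)


# Made-up counts: one het-like site and one low-VAF site.
lod_het_like = lod_germline_sketch([5, 6, 4], [5, 6, 6])
lod_low_vaf = lod_germline_sketch([0, 9, 0], [20, 10, 18])
```

With the default priors, the likelihood at the MLE must beat the heterozygous likelihood by roughly 4.5 log10 units before the score turns positive.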

class pGermlinePoly.pGermlinePoly.BetaOverdispersion(X)

Bases: ReadCountUtils

Estimate per-site overdispersion under the Beta-Binomial model.

Implements the overdispersion test from Spencer Chapman et al. by fitting the rho parameter of a Beta-Binomial distribution to the observed allele counts across clones.

Based on the approach described in: Spencer Chapman et al., Nature (2021). https://doi.org/10.1038/s41586-021-03548-6

The Beta-Binomial model places a Beta mixing distribution on the clone-level VAF \(p_{kj}\). At each site k, the J clone read counts are drawn as:

\[\begin{split}\begin{aligned} a_{kj} \mid p_{kj} &\sim \mathrm{Binomial}(n_{kj},\, p_{kj}) \\ p_{kj} &\sim \mathrm{Beta}(\alpha_k,\, \beta_k) \end{aligned}\end{split}\]

where \(\alpha_k\) and \(\beta_k\) are reparameterised in terms of the pooled MLE VAF \(\hat{p}_k\) and an overdispersion scalar \(\rho_k \in (0, 1)\):

\[\alpha_k = \frac{\hat{p}_k\,(1 - \rho_k)}{\rho_k}, \qquad \beta_k = \frac{(1 - \hat{p}_k)\,(1 - \rho_k)}{\rho_k}\]

Under this parameterisation \(\mathbb{E}[p_{kj}] = \hat{p}_k\) and \(\mathrm{Var}[p_{kj}] = \hat{p}_k(1-\hat{p}_k)\rho_k\), so \(\rho_k \to 0\) recovers the Binomial limit and large \(\rho_k\) indicates strong overdispersion (somatic or sub-clonal signal).

Parameters:

X (numpy.ndarray) – Read-count array of shape (M, J, 2). Only biallelic variants are supported (X.shape[2] must equal 2).
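The mean and variance stated for the reparameterisation can be checked numerically from the standard Beta moments. The helper names below are hypothetical and the values of \(\hat{p}\) and \(\rho\) are arbitrary.

```python
def beta_shapes(p_hat, rho):
    """Shape parameters (alpha, beta) for a pooled VAF p_hat and overdispersion rho."""
    scale = (1.0 - rho) / rho
    return p_hat * scale, (1.0 - p_hat) * scale


def beta_mean_var(alpha, beta):
    """Mean and variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return alpha / total, alpha * beta / (total ** 2 * (total + 1.0))


alpha, beta = beta_shapes(p_hat=0.3, rho=0.05)
mean, var = beta_mean_var(alpha, beta)
# mean matches p_hat, and var matches p_hat * (1 - p_hat) * rho
```

Because \(\alpha + \beta + 1 = 1/\rho\), the variance reduces to \(\hat{p}(1-\hat{p})\rho\) exactly.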

estimate_rhos()

Estimate the per-site overdispersion parameter rho.

For each site k, \(\hat{p}_k\) is computed from pooled read counts and then \(\rho_k\) is found by maximising the marginalised Beta-Binomial log-likelihood:

\[\hat{\rho}_k = \operatorname*{arg\,max}_{\rho \in (0,1)} \sum_{j=1}^{J} \log p_{\mathrm{BB}}\!\left(a_{kj} \mid n_{kj},\, \frac{\hat{p}_k(1-\rho)}{\rho},\, \frac{(1-\hat{p}_k)(1-\rho)}{\rho}\right)\]

where \(p_{\mathrm{BB}}\) is the Beta-Binomial PMF. Optimisation is performed via scipy.optimize.minimize_scalar on \((0, 1)\).

Returns:

Per-site MLE overdispersion values \(\hat{\rho}\), shape (M,).

Return type:

numpy.ndarray
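The per-site maximisation can be sketched with a coarse grid search in place of scipy.optimize.minimize_scalar. The grid, the function names, and the read counts are assumptions of this example; it also requires the pooled VAF to lie strictly between 0 and 1 so that both shape parameters are positive.

```python
import math


def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_logpmf(a, n, alpha, beta):
    log_choose = math.lgamma(n + 1) - math.lgamma(a + 1) - math.lgamma(n - a + 1)
    return log_choose + log_beta(a + alpha, n - a + beta) - log_beta(alpha, beta)


def estimate_rho(alt, depth, grid_size=999):
    """Grid-search MLE of rho on (0, 1) for one site; needs 0 < pooled VAF < 1."""
    p_hat = sum(alt) / sum(depth)
    best_rho, best_ll = None, -math.inf
    for i in range(1, grid_size + 1):
        rho = i / (grid_size + 1)
        scale = (1.0 - rho) / rho
        alpha, beta = p_hat * scale, (1.0 - p_hat) * scale
        ll = sum(betabinom_logpmf(a, n, alpha, beta) for a, n in zip(alt, depth))
        if ll > best_ll:
            best_rho, best_ll = rho, ll
    return best_rho


depth = [20] * 6
rho_even = estimate_rho([10, 9, 11, 10, 8, 12], depth)      # similar VAF in every clone
rho_patchy = estimate_rho([0, 10, 0, 9, 0, 11], depth)      # present in only some clones
```

Counts that are consistent across clones push the estimate toward the lower end of the grid, while a variant seen in only some clones yields a much larger value.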

Simulation

class pGermlinePoly.pGermlinePoly.ClonalSim(seq_len=10000000.0, n_clones=10)

Bases: object

Simulate clonal sequencing data with germline and somatic variants.

Generates a synthetic dataset containing a germline sample and a set of clonal samples, complete with germline heterozygotes, somatic mutations placed on a neutral coalescent genealogy, and realistic read counts. Output can be written as a VCF file suitable for use with the CLI.

Parameters:

seq_len (float or int, optional) – Simulated genome length in base-pairs. Default is 1e7.
n_clones (int, optional) – Number of clonal samples to simulate. Must be > 1. Default is 10.

__init__(seq_len=10000000.0, n_clones=10)

Initialize the ClonalSim object.

simulate_germline(afs=[0.31699444395046117, 6.067159920986527], het_rate=0.001, mean_coverage=15.0, sd_coverage=5.0, mut_rate=1.2e-08, q=30, seed=42)

Simulate germline heterozygotes and de-novo mutations for the germline sample.

The number of heterozygous sites and population allele frequencies are drawn from:

\[K \sim \mathrm{Poisson}(L \cdot \theta_\text{het}), \quad p_k \sim \mathrm{Beta}(1 + a,\, 1 + b)\]

where \(L\) is the genome length, \(\theta_\text{het}\) is the heterozygosity rate, and \((a, b)\) are the Beta shape parameters from afs. Read counts are then drawn under a Normal coverage model with Binomial sampling at \(p = 0.5\). Populates self.germline_muts, self.germline_af, self.germline_alt_reads, self.germline_tot_reads, and self.germline_pl.

Parameters:

afs (list of float, optional) – Shape parameters [a, b] of the Beta prior on allele frequencies in the external population. Default is [0.317, 6.067].
het_rate (float, optional) – Expected heterozygous site density per base-pair. Default is 1e-3.
mean_coverage (float, optional) – Mean sequencing coverage for the germline sample. Default is 15.0.
sd_coverage (float, optional) – Standard deviation of sequencing coverage. Default is 5.0.
mut_rate (float, optional) – De-novo mutation rate per base-pair. Default is 1.2e-8.
q (int, optional) – Phred-scaled read quality used for genotype likelihood computation. Default is 30.
seed (int, optional) – Random seed for reproducibility. Default is 42.

Raises:

ValueError – If no heterozygous sites are simulated (increase het_rate or seq_len).
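For a sense of scale, the sketch below computes the expected number of heterozygous sites under the defaults and draws allele frequencies from Beta(1 + a, 1 + b) as in the formula above. It uses the standard library's random module rather than whatever generator the simulator uses, so the draws will not match the package's output; the function names are hypothetical and the shape parameters are the rounded defaults.

```python
import random


def expected_het_sites(seq_len=1e7, het_rate=1e-3):
    """Poisson mean L * theta_het for the number of heterozygous sites."""
    return seq_len * het_rate


def draw_allele_freqs(k, afs=(0.317, 6.067), seed=42):
    """Draw k population allele frequencies from Beta(1 + a, 1 + b)."""
    rng = random.Random(seed)
    a, b = afs
    return [rng.betavariate(1.0 + a, 1.0 + b) for _ in range(k)]


mean_sites = expected_het_sites()
freqs = draw_allele_freqs(2000)
sample_mean = sum(freqs) / len(freqs)
```

With the default arguments the Poisson mean is 10,000 sites, and the Beta mean (1 + a) / (2 + a + b) is roughly 0.157, so most simulated alleles are uncommon.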

simulate_clone_genealogy(age=45, seed=42)

Simulate a somatic genealogy for the clonal samples under a neutral coalescent.

Uses msprime to simulate a single-locus genealogy for n_clones haploid samples. Branch lengths are later rescaled by age when simulating somatic mutations. Populates self.genealogy.

Parameters:

age (int, optional) – Age of the individual at time of sampling (years). Used to rescale coalescent branch lengths. Default is 45.
seed (int, optional) – Random seed forwarded to msprime. Default is 42.

sim_somatic_mutations(age=45, mut_rate=5e-09, mean_coverage=15.0, sd_coverage=5.0, q=30, seed=42)

Simulate somatic mutations on branches of the clonal genealogy.

For each branch e with length \(\ell_e\), the number of mutations is drawn as:

\[N_e \sim \mathrm{Poisson}\!\left( \ell_e \cdot \frac{\text{age}}{h} \cdot L \cdot \mu_\text{som} \right)\]

where \(h\) is the tree height (used to rescale coalescent branch lengths to years), \(L\) is the genome length, and \(\mu_\text{som}\) is the per-base-pair per-year somatic mutation rate. Traverses each branch of self.genealogy and assigns read counts to all leaf clones that descend from the mutated branch. Populates self.somatic_muts, self.somatic_alt_reads, self.somatic_tot_reads, and self.somatic_mut_pl.

Parameters:

age (int, optional) – Age of the individual in years, used to rescale branch lengths. Default is 45.
mut_rate (float, optional) – Somatic mutation rate in mutations per base-pair per year (diploid rate). Default is 5e-9.
mean_coverage (float, optional) – Mean sequencing coverage per clone. Default is 15.0.
sd_coverage (float, optional) – Standard deviation of sequencing coverage. Default is 5.0.
q (int, optional) – Phred-scaled read quality for genotype likelihood computation. Default is 30.
seed (int, optional) – Random seed for reproducibility. Default is 42.
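The Poisson mean in the formula above is simple to evaluate by hand. The sketch below does so for a branch spanning the full tree height and for one spanning half of it, using the default age, genome length, and mutation rate; the function name and the branch lengths are hypothetical.

```python
def expected_branch_mutations(branch_len, tree_height, age=45, seq_len=1e7, mut_rate=5e-9):
    """Poisson mean l_e * (age / h) * L * mu_som for one branch."""
    years = branch_len * (age / tree_height)  # branch length rescaled to years
    return years * seq_len * mut_rate


full_height = expected_branch_mutations(branch_len=2.0, tree_height=2.0)
half_height = expected_branch_mutations(branch_len=1.0, tree_height=2.0)
```

A branch covering the whole 45 years is expected to carry about 2.25 mutations on a 10 Mb genome, and a branch covering half of that time about 1.125.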

simulate_clonal_germline_muts(mean_coverage=15.0, sd_coverage=5.0, q=30, seed=42)

Simulate germline heterozygote read counts across all clonal samples.

For each germline site, draws per-clone coverage from a Normal distribution and alt counts from a Binomial(p=0.5) distribution. Populates self.germline_clone_tot_reads, self.germline_clone_alt_reads, and self.germline_clone_pl.

Parameters:

mean_coverage (float, optional) – Mean sequencing coverage per clone. Default is 15.0.
sd_coverage (float, optional) – Standard deviation of sequencing coverage. Default is 5.0.
q (int, optional) – Phred-scaled read quality for genotype likelihood computation. Default is 30.
seed (int, optional) – Random seed for reproducibility. Default is 42.

simulate_germline_somatic_muts(mean_coverage=15.0, sd_coverage=5.0, q=30, eps=0.01, seed=42)

Simulate read counts for somatic mutations as seen in the germline sample.

Draws coverage from a Normal distribution and alt counts from a Binomial(p=eps) distribution (the germline sample should show only error-level alt reads at somatic sites). Populates self.germline_somatic_tot_reads, self.germline_somatic_alt_reads, and self.germline_somatic_pl.

Parameters:

mean_coverage (float, optional) – Mean sequencing coverage for the germline sample. Default is 15.0.
sd_coverage (float, optional) – Standard deviation of sequencing coverage. Default is 5.0.
q (int, optional) – Phred-scaled read quality. Default is 30.
eps (float, optional) – Error rate used to simulate alt read counts at somatic sites in the germline. Default is 1e-2.
seed (int, optional) – Random seed for reproducibility. Default is 42.

create_read_matrix()

Build a read-count matrix from the simulated somatic and germline data.

Stacks somatic sites above germline sites to produce a combined array. Each entry stores [ref_reads, alt_reads] per clone.

Returns:

Integer read-count array of shape (M_somatic + M_germline, J, 2), where the last dimension is [ref_reads, alt_reads].

Return type:

numpy.ndarray

create_gt_string(alt_reads=0, tot_reads=0, pl=array([0, 0, 0]))

Format read counts and genotype likelihoods as a VCF genotype field string.

Determines the GT call from the read counts, formats AD, DP, GQ, and PL fields, and returns them as a colon-delimited string.

Parameters:

alt_reads (int, optional) – Number of alternative allele reads. Default is 0.
tot_reads (int, optional) – Total read depth. Default is 0.
pl (numpy.ndarray, optional) – Phred-scaled genotype likelihoods [PL(0/0), PL(0/1), PL(1/1)]. Default is [0, 0, 0].

Returns:

gt_str (str) – Full VCF FORMAT field string in GT:AD:DP:GQ:PL format.
gt (int) – Integer genotype call (0 = hom-ref, 1 = het).
an (int) – Allele number contribution (0 if missing, 2 otherwise).
tot_reads (int) – Total read depth.
gq (float) – Genotype quality (second-lowest minus lowest PL value).

write_vcf(out=None)

Write the simulated variants to a VCF file.

Produces a VCFv4.2 file containing one germline sample (Agermline) followed by J clonal samples (Aclone0 … AcloneN). Germline heterozygotes are written first, then somatic variants. The INFO field includes AC, AF, AN, DP, ExternalAF, and SM (somatic indicator).

Parameters:

out (str) – Output file path. Must be writable.

I/O Utilities

Module to help with IO routines and validation.

pGermlinePoly.io.parse_annotation(entry)

Return (field_name, transform_fn) from a string or dict annotation entry.

Parameters:

entry (str or dict) – Either a plain INFO field name (string) or a dict with keys "field" (required) and "transform" (optional, one of SUPPORTED_TRANSFORMS).

Returns:

field_name (str) – INFO field name to extract from the VCF.
transform_fn (callable or None) – Function to apply element-wise to the extracted column, or None.
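The two accepted entry forms can be illustrated with a small stand-in. This is a sketch of the documented behaviour, not the package's code: the contents of SUPPORTED_TRANSFORMS are assumed (the real table may hold NumPy functions), and the field names "DP" and "AF" are placeholders.

```python
import math

# Assumed transform table; keys follow the "log10" and "sqrt" names used on this page.
SUPPORTED_TRANSFORMS = {"log10": math.log10, "sqrt": math.sqrt}


def parse_annotation_sketch(entry):
    """Return (field_name, transform_fn) for a string or dict entry."""
    if isinstance(entry, str):
        return entry, None  # plain INFO field name, no transform
    transform = entry.get("transform")
    transform_fn = SUPPORTED_TRANSFORMS[transform] if transform is not None else None
    return entry["field"], transform_fn


plain = parse_annotation_sketch("DP")
with_transform = parse_annotation_sketch({"field": "AF", "transform": "log10"})
no_transform = parse_annotation_sketch({"field": "AF"})
```

A plain string and a dict without "transform" both yield None as the transform, so downstream code can treat the two forms uniformly.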

pGermlinePoly.io.is_af_annotation(entry)

Return True if the annotation entry is flagged as a population allele frequency.

AF annotations are reflected (AF → 1−AF) for sites where reorient_to_minor_allele swapped ref/alt, so the annotation continues to describe the minor allele. Only dict entries with is_af: true qualify; plain string entries always return False.

Parameters:

entry (str or dict) – Annotation entry as accepted by parse_annotation().

Returns:

True if entry is a dict with is_af: true, False otherwise.

Return type:

bool

pGermlinePoly.io.annotation_transform_name(entry)

Return the transform name string for an annotation entry, or None.

Useful when the transform name (rather than the callable) is needed — for example to invert a transform before reflecting an allele frequency.

Parameters:

entry (str or dict) – Annotation entry as accepted by parse_annotation().

Returns:

One of the keys in SUPPORTED_TRANSFORMS if a transform was specified, otherwise None.

Return type:

str or None

pGermlinePoly.io.validate_config(config_yaml_fp, schema={'age': {'min': 0.0, 'required': True, 'type': 'number'}, 'annotations': {'required': True, 'schema': {'anyof': [{'type': 'string'}, {'schema': {'field': {'required': True, 'type': 'string'}, 'is_af': {'required': False, 'type': 'boolean'}, 'transform': {'allowed': ['log10', 'sqrt'], 'type': 'string'}}, 'type': 'dict'}]}, 'type': 'list'}, 'clones': {'required': True, 'schema': {'type': 'string'}, 'type': 'list'}, 'germline': {'required': False, 'schema': {'type': 'string'}, 'type': 'list'}, 'ind': {'required': True, 'type': 'string'}, 'sex': {'allowed': ['M', 'F'], 'maxlength': 1, 'required': True, 'type': 'string'}})

Validate a YAML configuration file against the germline schema.

Parameters:

config_yaml_fp (str) – Path to the YAML configuration file.
schema (dict, optional) – Cerberus schema to validate against. Default is germline_schema.

Returns:

Parsed and validated configuration dictionary.

Return type:

dict

Raises:

AssertionError – If the configuration does not conform to the schema.
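A configuration that satisfies the default schema could look like the dictionary below, which mirrors what the parsed YAML would contain. All values are hypothetical; the sample names and the ExternalAF field are borrowed from the simulator's VCF output described above. The required-key check is a small stand-in for illustration and does not replace Cerberus validation.

```python
# Hypothetical parsed configuration (the YAML file would hold the same keys).
config = {
    "ind": "A",
    "sex": "F",
    "age": 45,
    "germline": ["Agermline"],                    # optional in the schema
    "clones": ["Aclone0", "Aclone1", "Aclone2"],
    "annotations": [
        "DP",                                     # plain INFO field name
        {"field": "ExternalAF", "transform": "log10", "is_af": True},
    ],
}

# Keys marked 'required': True in the default schema.
REQUIRED_KEYS = {"age", "annotations", "clones", "ind", "sex"}


def missing_required(cfg):
    """Return the required keys absent from cfg, sorted by name."""
    return sorted(REQUIRED_KEYS - set(cfg))


missing = missing_required(config)
```

The annotations list shows both accepted entry forms side by side: a bare field name and a dict carrying a transform and the is_af flag.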

pGermlinePoly.io.check_samples(vcf, samples=[])

Assert that all requested sample names are present in the VCF.

Parameters:

vcf (cyvcf2.VCF) – Opened VCF object with a samples attribute.
samples (list of str, optional) – Sample names to verify. Default is an empty list.

Raises:

AssertionError – If any name in samples is not found in vcf.samples.

pGermlinePoly.io.check_annotations(vcf, annotations=['PL', 'AD'])

Assert that required annotation fields are declared in the VCF header.

Checks both INFO and FORMAT fields via vcf.contains. Each entry in annotations may be a plain string (field name) or a dict with a "field" key (see parse_annotation()).

Parameters:

vcf (cyvcf2.VCF) – Opened VCF object.
annotations (list of str or dict, optional) – Annotation entries to verify. Default is ["PL", "AD"].

Raises:

AssertionError – If any field ID in annotations is not declared in the VCF header.

pGermlinePoly.io.create_germline_anno(vcf, **kwargs)

Extract per-site germline heterozygote log-likelihoods from a germline VCF.

Iterates over biallelic SNPs, reads the AD FORMAT field of the first sample (assumed to be the germline sample), computes Phred-scaled genotype likelihoods via geno_loglik, and returns the heterozygote PL value (index 1) for each site.

Parameters:

vcf (cyvcf2.VCF) – Opened VCF with an “AD” FORMAT field. The first sample is treated as the germline reference.
**kwargs – Additional keyword arguments forwarded to geno_loglik (e.g., q).

Returns:

Heterozygote genotype log-likelihoods for each biallelic SNP, shape (M,), dtype float64.

Return type:

numpy.ndarray

Raises:

AssertionError – If the VCF does not contain the “AD” FORMAT field.

pGermlinePoly.io.create_anno(vcf, annotations=[])

Extract INFO annotation values for all variants in a VCF.

Iterates over all variants, collecting the requested INFO field values for biallelic SNPs. Non-SNP or multiallelic sites receive NaN for all requested annotations. Per-annotation transforms (e.g. log10, sqrt) are applied column-wise after extraction; NaN values pass through unchanged and can be imputed later via impute_anno().

Parameters:

vcf (cyvcf2.VCF) – Opened VCF object.
annotations (list of str or dict, optional) – Annotation entries. Each entry is either a plain INFO field name (string) or a dict {"field": name, "transform": "log10"|"sqrt"}. See parse_annotation(). Default is an empty list.

Returns:

Float64 annotation matrix of shape (N, len(annotations)), where N is the total number of variants iterated.

Return type:

numpy.ndarray

pGermlinePoly.io.create_read_matrix(vcf)

Build a read-count matrix from the AD FORMAT field of a clonal VCF.

Iterates over all variants. For biallelic SNPs the allele depth (AD) matrix is stacked; non-SNP or multiallelic records are represented as rows of zeros.

Parameters:

vcf (cyvcf2.VCF) – Opened VCF object containing an “AD” FORMAT field with at least two samples (clones).

Returns:

Integer read-count array of shape (M, J, 2), where M is the number of variants, J is the number of samples, and the last dimension holds [ref_reads, alt_reads] per sample.

Return type:

numpy.ndarray

Raises:

AssertionError – If the VCF does not contain the “AD” FORMAT field or has fewer than two samples.