pGermlinePoly API Reference

This page documents the public API of the pGermlinePoly package.

Inference

class pGermlinePoly.pGermlinePoly.ProbGermline(X, Theta, Phi=None, kappa=100.0, mu=0.001)[source]

Bases: object

Compute the posterior probability of germline polymorphism from clonal sequencing data.

Implements an EM algorithm that jointly estimates logistic annotation weights (lambda, beta) and a Beta-Binomial error concentration (kappa) to discriminate germline heterozygotes from somatic variants.

Parameters:
  • X (numpy.ndarray) – Read-count array of shape (M, J, 2). X[:, :, 0] are reference read counts and X[:, :, 1] are alternative read counts. M = number of sites, J = number of clones.

  • Theta (numpy.ndarray) – Site-level annotation matrix of shape (M, L).

  • Phi (numpy.ndarray, optional) – Clone-level annotation array of shape (M, J, B). When provided, clone-specific beta weights are estimated in addition to lambda. Default is None.

  • kappa (float, optional) – Initial concentration parameter for the Beta-Binomial error prior. Default is 100.0.

  • mu (float, optional) – Fixed mean sequencing error rate used in the Beta-Binomial model. Default is 1e-3.

M

Number of sites.

Type:

int

J

Number of clones.

Type:

int

L

Number of site-level annotations.

Type:

int

B

Number of clone-level annotations (0 if Phi is None).

Type:

int

kappa

Current (or estimated) concentration parameter.

Type:

float

mu

Fixed mean sequencing error rate.

Type:

float

vaf

Per-site MLE variant allele frequencies, shape (M,). Set by mle_vaf().

Type:

numpy.ndarray or None

impute_anno()[source]

Impute missing annotation values using column-wise means.

Replaces NaN entries in self.Theta in-place with the mean of the corresponding annotation column, computed over all non-missing sites.

mle_vaf(naive=True, eps=0.001, **kwargs)[source]

Estimate the per-site MLE variant allele frequency from pooled clone reads.

Results are stored in self.vaf (shape (M,)) and self.logl_vaf (shape (M,)).

Parameters:
  • naive (bool, optional) – If True (default), uses the empirical proportion alt/(alt+ref). If False, optimises var_loglik() via scipy.optimize.minimize_scalar.

  • eps (float, optional) – Sequencing error rate used in the likelihood. Default is 1e-3.

  • **kwargs – Forwarded to var_loglik().

loglik_ratio_het(**kwargs)[source]

Compute the likelihood ratio statistic for each site.

Returns 2*(LL_somatic - LL_het). Under the null hypothesis that the site is a germline heterozygote this is asymptotically chi-squared with 1 degree of freedom.

Parameters:

**kwargs – Forwarded to mle_vaf() (if VAF has not yet been computed) and to loglik_ratio().

Returns:

LRT statistics of shape (M,).

Return type:

numpy.ndarray

prior_poly(lambdas=array([0., 0.]))[source]

Compute the log prior probability of germline heterozygosity.

Evaluates log sigma(Theta @ lambdas) for all M sites under the logistic model.

Parameters:

lambdas (numpy.ndarray, optional) – Site-level annotation weights, shape (L,). Default is zeros.

Returns:

Log prior probabilities of shape (M,), in (-inf, 0].

Return type:

numpy.ndarray

post_prob_poly(lambdas=array([0., 0.]), betas=None, kappa=None, **kwargs)[source]

Compute the log posterior probability of germline heterozygosity for all sites.

Evaluates log P(z_k = het | A_k, R_k) for each of the M sites using the current or supplied model parameters.

Parameters:
  • lambdas (numpy.ndarray, optional) – Site-level annotation weights, shape (L,). Default is zeros.

  • betas (numpy.ndarray or None, optional) – Clone-level annotation weights, shape (B,). None uses zeros.

  • kappa (float or None, optional) – Beta-Binomial concentration parameter. None uses self.kappa.

  • **kwargs – Accepted for interface compatibility; currently unused.

Returns:

Log posterior probabilities of shape (M,), in (-inf, 0].

Return type:

numpy.ndarray

est_vaf_CI(alpha=0.05, df=1, **kwargs)[source]

Estimate per-site profile-likelihood confidence intervals for the VAF.

Uses the Wilks approximation: the CI is the set of VAF values p for which 2*(LL_mle - LL_p) < chi2(1-alpha, df). Bounds are found via scipy.optimize.brentq.

Parameters:
  • alpha (float, optional) – Significance level for the CI. Default is 0.05 (95% CI).

  • df (int, optional) – Degrees of freedom for the chi-squared threshold. Default is 1.

  • **kwargs – Forwarded to var_loglik().

Returns:

Array of shape (M, 3) with columns [lower_CI, MLE_VAF, upper_CI].

Return type:

numpy.ndarray

est_germline_genotype(lambdas=None, betas=None, allele_freq=None, p_hom_alt=0.5)[source]

Compute per-site log-posterior probabilities over germline genotypes {0/0, 0/1, 1/1}.

Evaluates the joint binomial log-likelihood of clone read counts under each diploid genotype at the phylogenetic root, combined with a genotype prior, and returns the normalized log-posterior. Because J clones contribute independently, genotype uncertainty decreases quickly with increasing J and per-clone depth — providing far more resolution than a single germline sample. The 0/1 likelihood is identical to the logprob_het() term used in post_prob_poly().

Parameters:
  • lambdas (numpy.ndarray or None, optional) – Site-level annotation weights, shape (L,). Only used when allele_freq is None. None uses zeros.

  • betas (numpy.ndarray or None, optional) – Clone-level annotation weights, shape (B,). Only used when allele_freq is None. None uses zeros.

  • allele_freq (numpy.ndarray or None, optional) – Per-site population allele frequencies, shape (M,), values in [0, 1]. When provided, the genotype prior follows Hardy-Weinberg equilibrium: P(0/0) = (1-p)^2, P(0/1) = 2p(1-p), P(1/1) = p^2. When None, the logistic annotation model supplies P(0/1) = sigma(Theta @ lambdas), and the remaining mass is split between 0/0 and 1/1 according to p_hom_alt.

  • p_hom_alt (float, optional) – Fraction of the non-het prior mass assigned to 1/1 when allele_freq is None. Must be strictly between 0 and 1. Default is 0.5 (symmetric split between 0/0 and 1/1).

Returns:

Log-posterior probabilities of shape (M, 3), with columns [log P(0/0|data), log P(0/1|data), log P(1/1|data)], normalized so that logsumexp over columns equals 0 for every site.

Return type:

numpy.ndarray

Raises:

ValueError – If p_hom_alt is not strictly between 0 and 1.

Notes

The per-clone binomial log-likelihoods for genotype G at site k, summed across J clones (combinatorial coefficient omitted):

log P(X_k | G=0/0) = sum_j  a_j log(eps) + r_j log(1 - eps)
log P(X_k | G=0/1) = sum_j  n_j log(0.5)
log P(X_k | G=1/1) = sum_j  a_j log(1 - eps) + r_j log(eps)

where eps = self.mu, a_j and r_j are the alt and ref read counts for clone j, and n_j = a_j + r_j. Clones with zero coverage contribute zero to the sum and therefore carry no information.

complete_logll(lambdas=array([0., 0.]), betas=None, kappa=None, **kwargs)[source]

Compute the observed data log-likelihood summed over all M sites.

Evaluates sum_k log P(A_k, R_k) by marginalizing the latent class over each site.

Parameters:
  • lambdas (numpy.ndarray, optional) – Site-level annotation weights, shape (L,). Default is zeros.

  • betas (numpy.ndarray or None, optional) – Clone-level annotation weights, shape (B,). None uses zeros.

  • kappa (float or None, optional) – Beta-Binomial concentration parameter. None uses self.kappa.

  • **kwargs – Accepted for interface compatibility; currently unused.

Returns:

sum_k log P(A_k, R_k | lambdas, betas, kappa).

Return type:

float

naive_mle(algo='L-BFGS-B', **kwargs)[source]

Direct MLE of site-level annotation weights with betas fixed at zero.

Maximises the observed log-likelihood via scipy.optimize.minimize.

Parameters:
  • algo (str, optional) – Scipy minimisation algorithm. One of "L-BFGS-B", "Powell", or "Nelder-Mead". Default is "L-BFGS-B".

  • **kwargs – Forwarded to scipy.optimize.minimize.

Returns:

MLE site-level annotation weights, shape (L,).

Return type:

numpy.ndarray

em_algo(lambdas=None, betas=None, kappa=None, algo='L-BFGS-B', delta_logll=0.0001, max_iter=50, **kwargs)[source]

Run the EM algorithm to jointly estimate (lambda, beta, kappa).

Iterates E-step / M-step until the absolute change in observed log-likelihood falls below delta_logll, as described in Eqs. 10-14.

Parameters:
  • lambdas (numpy.ndarray or None, optional) – Initial site-level annotation weights, shape (L,). None uses zeros.

  • betas (numpy.ndarray or None, optional) – Initial clone-level annotation weights, shape (B,). None uses zeros.

  • kappa (float or None, optional) – Initial Beta-Binomial concentration. None uses self.kappa.

  • algo (str, optional) – Scipy optimizer for the (lambda, beta) M-step. One of "L-BFGS-B", "Powell", or "Nelder-Mead". Default is "L-BFGS-B".

  • delta_logll (float, optional) – Convergence threshold on the absolute change in observed log-likelihood between successive EM iterations. Default is 1e-4.

  • max_iter (int, optional) – Maximum number of EM iterations before stopping regardless of convergence. Default is 50.

  • **kwargs – Currently unused; accepted for forward compatibility.

Returns:

  • loglls (numpy.ndarray) – Observed log-likelihood trace, length = number of iterations + 1.

  • lambdas_hat (numpy.ndarray) – Estimated site-level annotation weights, shape (L,).

  • betas_hat (numpy.ndarray) – Estimated clone-level annotation weights, shape (B,).

  • kappa_hat (float) – Estimated Beta-Binomial concentration parameter.

class pGermlinePoly.pGermlinePoly.MutectLOD(X)[source]

Bases: object

Compute per-site LOD scores following the Mutect2 / Williams et al. model.

Parameters:

X (numpy.ndarray) – Read-count array of shape (M, J, 2). Only biallelic variants are supported (X.shape[2] must equal 2).

M

Number of sites.

Type:

int

J

Number of clones.

Type:

int

lod

Per-site log-likelihood matrix of shape (M, 3), populated by lod_scores().

Type:

numpy.ndarray or None

lod_germline[source]

Per-site germline LOD scores of shape (M,), populated by lod_germline().

Type:

numpy.ndarray or None

lod_scores(q=30.0)[source]

Compute per-site log-likelihoods under three VAF hypotheses.

Populates self.lod with shape (M, 3):

  • column 0: log-likelihood under VAF = 0 (no mutation)

  • column 1: log-likelihood under the MLE VAF

  • column 2: log-likelihood under VAF = 0.5 (germline heterozygote)

Parameters:

q (float, optional) – Phred-scaled base quality used to derive the error rate. Must be positive. Default is 30.0.

est_germline_prior(anno)[source]

Set per-site germline priors from a dbSNP-like annotation array.

Parameters:

anno (numpy.ndarray) – Binary or continuous annotation values, shape (M,).

Raises:

NotImplementedError – This method is not yet implemented.

lod_germline(p_somatic=3e-06, p_germline=0.095)[source]

Compute the per-site LOD score for germline origin.

LOD_germline = (LL_mle + log p_somatic) - (LL_het + log p_germline), converted to base-10 log-odds. A positive LOD score favours somatic origin; a negative value favours germline origin.

Requires lod_scores() to have been called first. Result is stored in self.lod_germline, shape (M,).

Parameters:
  • p_somatic (float, optional) – Prior probability of somatic origin. Default is 3e-6.

  • p_germline (float, optional) – Prior probability of germline origin. Default is 0.095.

class pGermlinePoly.pGermlinePoly.BetaOverdispersion(X)[source]

Bases: object

Estimate per-site overdispersion under the Beta-Binomial model.

Implements the overdispersion test from Spencer-Chapman et al. by fitting the rho parameter of a Beta-Binomial distribution to the observed allele counts across clones.

Parameters:

X (numpy.ndarray) – Read-count array of shape (M, J, 2). Only biallelic variants are supported (X.shape[2] must equal 2).

estimate_rhos()[source]

Estimate the per-site overdispersion parameter rho.

For each site, maximises the Beta-Binomial log-likelihood over rho in (0, 1) via scipy.optimize.minimize_scalar, where the Beta parameters are alpha = phat*(1-rho)/rho and beta = (1-phat)*(1-rho)/rho.

Returns:

Per-site MLE overdispersion values, shape (M,).

Return type:

numpy.ndarray

Simulation

class pGermlinePoly.pGermlinePoly.ClonalSim(seq_len=10000000.0, n_clones=10)[source]

Bases: object

Simulate clonal sequencing data with germline and somatic variants.

Generates a synthetic dataset containing a germline sample and a set of clonal samples, complete with germline heterozygotes, somatic mutations placed on a neutral coalescent genealogy, and realistic read counts. Output can be written as a VCF file suitable for use with the CLI.

Parameters:
  • seq_len (float or int, optional) – Simulated genome length in base-pairs. Default is 1e7.

  • n_clones (int, optional) – Number of clonal samples to simulate. Must be > 1. Default is 10.

__init__(seq_len=10000000.0, n_clones=10)[source]

Initialize the ClonalSim object.

simulate_germline(afs=[0.31699444395046117, 6.067159920986527], het_rate=0.001, mean_coverage=15.0, sd_coverage=5.0, mut_rate=1.2e-08, q=30, seed=42)[source]

Simulate germline heterozygotes and de-novo mutations for the germline sample.

Draws the number of heterozygous sites from a Poisson distribution, samples allele frequencies from a Beta distribution parameterised by afs, and simulates read counts under a Normal coverage model. Populates self.germline_muts, self.germline_af, self.germline_alt_reads, self.germline_tot_reads, and self.germline_pl.

Parameters:
  • afs (list of float, optional) – Shape parameters [a, b] of the Beta prior on allele frequencies in the external population. Default is [0.317, 6.067].

  • het_rate (float, optional) – Expected heterozygous site density per base-pair. Default is 1e-3.

  • mean_coverage (float, optional) – Mean sequencing coverage for the germline sample. Default is 15.0.

  • sd_coverage (float, optional) – Standard deviation of sequencing coverage. Default is 5.0.

  • mut_rate (float, optional) – De-novo mutation rate per base-pair. Default is 1.2e-8.

  • q (int, optional) – Phred-scaled read quality used for genotype likelihood computation. Default is 30.

  • seed (int, optional) – Random seed for reproducibility. Default is 42.

Raises:

ValueError – If no heterozygous sites are simulated (increase het_rate or seq_len).

simulate_clone_genealogy(age=45, seed=42)[source]

Simulate a somatic genealogy for the clonal samples under a neutral coalescent.

Uses msprime to simulate a single-locus genealogy for n_clones haploid samples. Branch lengths are later rescaled by age when simulating somatic mutations. Populates self.genealogy.

Parameters:
  • age (int, optional) – Age of the individual at time of sampling (years). Used to rescale coalescent branch lengths. Default is 45.

  • seed (int, optional) – Random seed forwarded to msprime. Default is 42.

sim_somatic_mutations(age=45, mut_rate=5e-09, mean_coverage=15.0, sd_coverage=5.0, q=30, seed=42)[source]

Simulate somatic mutations on branches of the clonal genealogy.

Traverses each branch of self.genealogy, draws the number of mutations from a Poisson distribution scaled by the branch length, age, genome size, and mutation rate, then assigns read counts to the leaf clones that descend from the mutated branch. Populates self.somatic_muts, self.somatic_alt_reads, self.somatic_tot_reads, and self.somatic_mut_pl.

Parameters:
  • age (int, optional) – Age of the individual in years, used to rescale branch lengths. Default is 45.

  • mut_rate (float, optional) – Somatic mutation rate in mutations per base-pair per year (diploid rate). Default is 5e-9.

  • mean_coverage (float, optional) – Mean sequencing coverage per clone. Default is 15.0.

  • sd_coverage (float, optional) – Standard deviation of sequencing coverage. Default is 5.0.

  • q (int, optional) – Phred-scaled read quality for genotype likelihood computation. Default is 30.

  • seed (int, optional) – Random seed for reproducibility. Default is 42.

simulate_clonal_germline_muts(mean_coverage=15.0, sd_coverage=5.0, q=30, seed=42)[source]

Simulate germline heterozygote read counts across all clonal samples.

For each germline site, draws per-clone coverage from a Normal distribution and alt counts from a Binomial(p=0.5) distribution. Populates self.germline_clone_tot_reads, self.germline_clone_alt_reads, and self.germline_clone_pl.

Parameters:
  • mean_coverage (float, optional) – Mean sequencing coverage per clone. Default is 15.0.

  • sd_coverage (float, optional) – Standard deviation of sequencing coverage. Default is 5.0.

  • q (int, optional) – Phred-scaled read quality for genotype likelihood computation. Default is 30.

  • seed (int, optional) – Random seed for reproducibility. Default is 42.

simulate_germline_somatic_muts(mean_coverage=15.0, sd_coverage=5.0, q=30, eps=0.01, seed=42)[source]

Simulate read counts for somatic mutations as seen in the germline sample.

Draws coverage from a Normal distribution and alt counts from a Binomial(p=``eps``) distribution (the germline sample should show only error-level alt reads at somatic sites). Populates self.germline_somatic_tot_reads, self.germline_somatic_alt_reads, and self.germline_somatic_pl.

Parameters:
  • mean_coverage (float, optional) – Mean sequencing coverage for the germline sample. Default is 15.0.

  • sd_coverage (float, optional) – Standard deviation of sequencing coverage. Default is 5.0.

  • q (int, optional) – Phred-scaled read quality. Default is 30.

  • eps (float, optional) – Error rate used to simulate alt read counts at somatic sites in the germline. Default is 1e-2.

  • seed (int, optional) – Random seed for reproducibility. Default is 42.

create_read_matrix()[source]

Build a read-count matrix from the simulated somatic and germline data.

Stacks somatic sites above germline sites to produce a combined array. Each entry stores [ref_reads, alt_reads] per clone.

Returns:

Integer read-count array of shape (M_somatic + M_germline, J, 2), where the last dimension is [ref_reads, alt_reads].

Return type:

numpy.ndarray

create_gt_string(alt_reads=0, tot_reads=0, pl=array([0, 0, 0]))[source]

Format read counts and genotype likelihoods as a VCF genotype field string.

Determines the GT call from the read counts, formats AD, DP, GQ, and PL fields, and returns them as a colon-delimited string.

Parameters:
  • alt_reads (int, optional) – Number of alternative allele reads. Default is 0.

  • tot_reads (int, optional) – Total read depth. Default is 0.

  • pl (numpy.ndarray, optional) – Phred-scaled genotype likelihoods [PL(0/0), PL(0/1), PL(1/1)]. Default is [0, 0, 0].

Returns:

  • gt_str (str) – Full VCF FORMAT field string in GT:AD:DP:GQ:PL format.

  • gt (int) – Integer genotype call (0 = hom-ref, 1 = het).

  • an (int) – Allele number contribution (0 if missing, 2 otherwise).

  • tot_reads (int) – Total read depth.

  • gq (float) – Genotype quality (second-lowest minus lowest PL value).

write_vcf(out=None)[source]

Write the simulated variants to a VCF file.

Produces a VCFv4.2 file containing one germline sample (Agermline) followed by J clonal samples (Aclone0AcloneN). Germline heterozygotes are written first, then somatic variants. The INFO field includes AC, AF, AN, DP, ExternalAF, and SM (somatic indicator).

Parameters:

out (str) – Output file path. Must be writable.

I/O Utilities

Module to help with IO routines and validation.

pGermlinePoly.io.validate_config(config_yaml_fp, schema={'age': {'min': 0.0, 'required': True, 'type': 'number'}, 'annotations': {'required': True, 'type': 'list'}, 'clones': {'required': True, 'schema': {'type': 'string'}, 'type': 'list'}, 'germline': {'required': False, 'schema': {'type': 'string'}, 'type': 'list'}, 'ind': {'required': True, 'type': 'string'}, 'sex': {'allowed': ['M', 'F'], 'maxlength': 1, 'required': True, 'type': 'string'}})[source]

Validate a YAML configuration file against the germline schema.

Parameters:
  • config_yaml_fp (str) – Path to the YAML configuration file.

  • schema (dict, optional) – Cerberus schema to validate against. Default is germline_schema.

Returns:

Parsed and validated configuration dictionary.

Return type:

dict

Raises:

AssertionError – If the configuration does not conform to the schema.

pGermlinePoly.io.check_samples(vcf, samples=[])[source]

Assert that all requested sample names are present in the VCF.

Parameters:
  • vcf (cyvcf2.VCF) – Opened VCF object with a samples attribute.

  • samples (list of str, optional) – Sample names to verify. Default is an empty list.

Raises:

AssertionError – If any name in samples is not found in vcf.samples.

pGermlinePoly.io.check_annotations(vcf, annotations=['PL', 'AD'])[source]

Assert that required annotation fields are declared in the VCF header.

Checks both INFO and FORMAT fields via vcf.contains.

Parameters:
  • vcf (cyvcf2.VCF) – Opened VCF object.

  • annotations (list of str, optional) – Annotation field IDs to verify. Default is ["PL", "AD"].

Raises:

AssertionError – If any field ID in annotations is not declared in the VCF header.

pGermlinePoly.io.create_germline_anno(vcf, **kwargs)[source]

Extract per-site germline heterozygote log-likelihoods from a germline VCF.

Iterates over biallelic SNPs, reads the AD FORMAT field of the first sample (assumed to be the germline sample), computes Phred-scaled genotype likelihoods via geno_loglik, and returns the heterozygote PL value (index 1) for each site.

Parameters:
  • vcf (cyvcf2.VCF) – Opened VCF with an “AD” FORMAT field. The first sample is treated as the germline reference.

  • **kwargs – Additional keyword arguments forwarded to geno_loglik (e.g., q).

Returns:

Heterozygote genotype log-likelihoods for each biallelic SNP, shape (M,), dtype float64.

Return type:

numpy.ndarray

Raises:

AssertionError – If the VCF does not contain the “AD” FORMAT field.

pGermlinePoly.io.create_anno(vcf, annotations=[])[source]

Extract INFO annotation values for all variants in a VCF.

Iterates over all variants, collecting the requested INFO field values for biallelic SNPs. Non-SNP or multiallelic sites receive NaN for all requested annotations.

Parameters:
  • vcf (cyvcf2.VCF) – Opened VCF object.

  • annotations (list of str, optional) – INFO field IDs to extract. Default is an empty list.

Returns:

Annotation matrix of shape (N, len(annotations)), where N is the total number of variants iterated.

Return type:

numpy.ndarray

pGermlinePoly.io.create_read_matrix(vcf)[source]

Build a read-count matrix from the AD FORMAT field of a clonal VCF.

Iterates over all variants. For biallelic SNPs the allele depth (AD) matrix is stacked; non-SNP or multiallelic records are represented as rows of zeros.

Parameters:

vcf (cyvcf2.VCF) – Opened VCF object containing an “AD” FORMAT field with at least two samples (clones).

Returns:

Integer read-count array of shape (M, J, 2), where M is the number of variants, J is the number of samples, and the last dimension holds [ref_reads, alt_reads] per sample.

Return type:

numpy.ndarray

Raises:

AssertionError – If the VCF does not contain the “AD” FORMAT field or has fewer than two samples.