{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting started with `geovar`\n", "\n", "This notebook walks through generating \"GeoVar\"-style plots from an example dataset of 5000 randomly chosen bi-allelic variants on Chromosome 22, taken from the new high-coverage sequencing of the [1000 Genomes Project from the New York Genome Center](http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20190425_NYGC_GATK/).\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import pkg_resources\n", "from geovar import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "\n", "The `geovar` package contains example frequency tables as well as a gzipped VCF dataset to illustrate how to move from a [VCF](https://samtools.github.io/hts-specs/VCFv4.2.pdf) file and a population panel file to a full \"GeoVar\" plot.
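
The next cell locates the bundled files with `pkg_resources`, which recent `setuptools` releases mark as deprecated. As a minimal sketch, assuming Python 3.9 or newer, the standard-library `importlib.resources.files` can look up a directory inside an installed package instead. The helper name `package_data_dir` is hypothetical and not part of `geovar`; the snippet is demonstrated on the standard-library `email` package so that it runs without `geovar` installed.

```python
from importlib.resources import files


def package_data_dir(package, subdir='data'):
    # Hypothetical helper (not part of geovar): return a path-like handle
    # to a directory shipped inside an installed package.
    return files(package).joinpath(subdir)


# Demonstrated on the standard library so the snippet runs anywhere.
mime_dir = package_data_dir('email', 'mime')
print(mime_dir.is_dir())
```

With `geovar` installed, `package_data_dir('geovar')` would be expected to point at the same directory as `data_path`, assuming the example files sit in a `data` folder inside the package, as the paths in this notebook suggest.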
\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [] }, "outputs": [], "source": [ "data_path = pkg_resources.resource_filename(\"geovar\", \"data\")\n", "\n", "# Filepath to the VCF File\n", "vcf_file = \"{}/new_1kg_nygc.chr22.biallelic_snps.filt.n5000.vcf.gz\".format(data_path)\n", "\n", "# Filepath to the population panel file\n", "population_panel = \"{}/integrated_call_samples_v3.20130502.1kg_superpops.panel\".format(data_path)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/aabiddanda/.pyenv/versions/3.9.1/envs/venv_geovar/lib/python3.9/site-packages/geovar/data/integrated_call_samples_v3.20130502.1kg_superpops.panel\n" ] } ], "source": [ "print(population_panel)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The population panel file is a two-column file with the columns `sample` and `pop`, separated by whitespace. The `sample` column must match the sample IDs in the VCF file. The `pop` column contains the population label for each sample." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "5000it [00:02, 2059.90it/s]\n" ] }, { "data": { "text/html": [ "
|   | CHR | SNP | A1 | A2 | MAC | MAF | AFR | AMR | EAS | EUR | SAS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 10662593 | C | T | 1 | 0.000201 | 0.000759 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 1 | 22 | 10664208 | G | A | 38 | 0.008137 | 0.028963 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 2 | 22 | 10666881 | C | A | 1 | 0.000218 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.001104 |
| 3 | 22 | 10670699 | T | A | 1633 | 0.354538 | 0.228395 | 0.379538 | 0.501029 | 0.259709 | 0.447137 |
| 4 | 22 | 10679257 | A | T | 35 | 0.007008 | 0.025797 | 0.001449 | 0.000000 | 0.000000 | 0.000000 |