sudo apt-get install tabix vcftools r-base
The 1000 Genomes data between positions 178404570 and 179734924 was downloaded using tabix. It was then converted to the IMPUTE format for phased data by vcftools (vcftools.sourceforge.net). The data was filtered to include only positions shared with our assays using a custom Python script, filter1000genomes.py. It was then prepared for the programs PHASE and haploview using R scripts, which selected only those individuals from the CEU and GBR populations that did not have parents in the data set and added homozygote wild type alleles at the disease position for the 1000 Genomes data.
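The position filter can be sketched as follows. This is not the contents of filter1000genomes.py, which is not reproduced here; it is a minimal illustration that assumes the IMPUTE output is a legend file with a header containing a `pos` column plus a haplotype file with one line per SNP in the same order. The function name `filter_impute` and the toy SNP IDs and positions are hypothetical.

```python
def filter_impute(legend_lines, hap_lines, keep_positions):
    """Keep only the SNPs whose position is in keep_positions.

    legend_lines: lines of an IMPUTE legend file, header first
                  (assumed columns: ID pos allele0 allele1).
    hap_lines:    lines of the matching haplotype file, one per SNP.
    Returns (filtered_legend_lines, filtered_hap_lines).
    """
    header, rows = legend_lines[0], legend_lines[1:]
    if len(rows) != len(hap_lines):
        raise ValueError("legend and haplotype files differ in SNP count")
    pos_col = header.split().index("pos")
    out_legend, out_hap = [header], []
    for legend_row, hap_row in zip(rows, hap_lines):
        if int(legend_row.split()[pos_col]) in keep_positions:
            out_legend.append(legend_row)
            out_hap.append(hap_row)
    return out_legend, out_hap


# Toy input: made-up SNP IDs and positions, two individuals (four haplotypes).
legend = [
    "ID pos allele0 allele1",
    "snp_a 178404600 A G",
    "snp_b 178500000 C T",
    "snp_c 179000000 G A",
]
haps = [
    "0 0 1 0",
    "1 1 0 0",
    "0 1 0 1",
]
assay_positions = {178404600, 179000000}  # positions genotyped by the assays

kept_legend, kept_haps = filter_impute(legend, haps, assay_positions)
for row in kept_legend:
    print(row)
```

Filtering the legend and haplotype rows in a single pass keeps the two files aligned, which matters because the haplotype file carries no position information of its own.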
Sample information was obtained from the file phase1_samples_integrated_20101123.ped, downloaded from the 1000 Genomes data repository.
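Selecting individuals without parents in the data set could look like the sketch below. The original selection was done in R; this Python version only illustrates the logic. It assumes the sample file is tab-separated with a header row containing `Individual ID`, `Paternal ID`, `Maternal ID` and `Population` columns, and that CEU and GBR are the population codes of interest. The function name `unrelated_samples` and the toy sample IDs are hypothetical.

```python
import csv
import io


def unrelated_samples(ped_text, populations):
    """Return IDs of individuals in the given populations whose
    parents do not themselves appear as individuals in the file."""
    rows = list(csv.DictReader(io.StringIO(ped_text), delimiter="\t"))
    all_ids = {row["Individual ID"] for row in rows}
    keep = []
    for row in rows:
        if row["Population"] not in populations:
            continue
        parents = {row["Paternal ID"], row["Maternal ID"]}
        if parents & all_ids:
            continue  # at least one parent is in the data set
        keep.append(row["Individual ID"])
    return keep


# Toy sample table with made-up IDs (assumed column layout).
toy_ped = "\n".join([
    "Family ID\tIndividual ID\tPaternal ID\tMaternal ID\tGender\tPopulation",
    "F1\tS1\t0\t0\t1\tCEU",
    "F1\tS2\t0\t0\t2\tCEU",
    "F1\tS3\tS1\tS2\t1\tCEU",
    "F2\tS4\t0\t0\t1\tGBR",
    "F3\tS5\t0\t0\t2\tFIN",
])

selected = unrelated_samples(toy_ped, {"CEU", "GBR"})
print(selected)
```

In the toy table, S3 is dropped because both parents are present, and S5 is dropped because it belongs to a different population.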
PHASE was used to phase haplotypes, with the known haplotypes option used to supply the 1000 Genomes data as known phase data. The pairs output file was used to determine the relative posterior probability of the different haplotype reconstructions for these data. R was used for post-processing of the PHASE pairs output. The haploview program was used to investigate the haplotype structure of the population data and to produce plots.
Files: 1000_genomes_script, haploview_phase.R, Transform_PHASE.R, HMERFmatch1K.csv, HMERF_Haplotype.csv, filter1000genomes.py and readphase.R.
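The post-processing of the PHASE pairs output can be illustrated with a short sketch. The actual post-processing was done in R (readphase.R), which is not reproduced here. The sketch assumes each individual's block in the pairs file starts with an `IND:` line and is followed by lines of the form `haplotype1 , haplotype2 , probability`; that layout, the function names and the toy haplotypes are all assumptions. "Relative" is taken here to mean each reconstruction's probability divided by that of the most probable reconstruction for the same individual.

```python
def parse_pairs(lines):
    """Parse pairs-style output into {individual: [(hap1, hap2, prob), ...]}."""
    result = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("IND:"):
            current = line[len("IND:"):].strip()
            result[current] = []
        elif current is not None:
            hap1, hap2, prob = [field.strip() for field in line.split(",")]
            result[current].append((hap1, hap2, float(prob)))
    return result


def relative_probabilities(pairs):
    """Scale each reconstruction's probability by the best one,
    returning the reconstructions sorted from most to least probable."""
    best = max(prob for _, _, prob in pairs)
    ranked = sorted(pairs, key=lambda item: item[2], reverse=True)
    return [(h1, h2, prob / best) for h1, h2, prob in ranked]


# Toy pairs output with made-up haplotypes (assumed layout).
toy_pairs = [
    "IND: #1",
    "0101 , 0101 , 1.000",
    "IND: #2",
    "0001 , 0111 , 0.200",
    "0101 , 0011 , 0.800",
]

parsed = parse_pairs(toy_pairs)
for individual, pairs in parsed.items():
    print(individual, relative_probabilities(pairs))
```

An individual with a single reconstruction gets a relative probability of 1, while competing reconstructions show how much less likely each alternative is than the best one.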
Run 1000_genomes_script using bash, then run readphase.R in R:
bash 1000_genomes_script
R --vanilla < readphase.R