fineRADstructure
Inference of population structure from RAD datasets
 

Understanding of shared ancestry in genetic datasets is almost always key to their interpretation. The fineSTRUCTURE package (Lawson et al., 2012) represents a powerful model-based approach to investigating population structure using genetic data. It offers especially high resolution in inference of recent shared ancestry, as evidenced for example in its application to investigation of genetic structure of the British population (Leslie et al., 2015). The high resolution of this method derives from utilizing haplotype linkage information and from focusing on the most recent coalescence (common ancestry) among the sampled individuals to derive a "co-ancestry matrix" - a summary of nearest neighbor haplotype relationships in the dataset. Further advantages when compared with other model-based methods (e.g. STRUCTURE and ADMIXTURE) include the ability to deal with a very large number of populations, explore relationships between them, and to quantify ancestry sources in each population.

The existing pipeline for co-ancestry matrix inference was designed to meet the needs of analyzing large-scale human genetic SNP datasets, where the chromosomal locations of the markers are known and haplotypes are typically assumed to be correctly phased. Therefore, these methods have so far been inaccessible to users without high quality genome-wide haplotypes.

Here we present RADpainter, a program designed specifically to infer the co-ancestry matrix from RAD-seq data, taking full advantage of its unique features. We package this new program together with the fineSTRUCTURE MCMC clustering algorithm into fineRADstructure - a complete, easy to use, and fast population inference package for RAD-seq data.


Download:
Note: On 25th August 2017 we released a new version (v0.2) of the fineRADstructure package. It is available for download from GitHub. The program should work on most UNIX-like systems (including Linux and OSX/Apple).
We recommend that all users update to the new version because it substantially improves handling of missing data and also gives a more reasonable default clustering by inferring statistical variance directly from the data.

Installation:

  1. Clone the package from GitHub: git clone https://github.com/millanek/fineRADstructure
  2. Enter the folder: cd fineRADstructure
  3. Run configure: ./configure
  4. Run make: make
Mac/Apple users need to have the Command Line Tools installed on their machine to run "configure" and "make".


Usage:

  1. Calculate the co-ancestry matrix: ./RADpainter paint INPUT_RAD_FILE.txt
  2. Assign individuals to populations: ./finestructure -x 100000 -y 100000 -z 1000 INPUT_RAD_FILE_chunks.out INPUT_RAD_FILE_chunks.mcmc.xml
  3. Tree building: ./finestructure -m T -x 10000 INPUT_RAD_FILE_chunks.out INPUT_RAD_FILE_chunks.mcmc.xml INPUT_RAD_FILE_chunks.mcmcTree.xml
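
The three steps above can also be chained from a small script. The following is a minimal Python sketch, not part of the package: the function names build_commands and run_pipeline are hypothetical, the binaries are assumed to sit in the current folder, and the output file names are assumed to follow the pattern shown in the steps above (the input name without its extension, plus "_chunks.out" and so on).

```python
import os
import subprocess
from pathlib import Path


def build_commands(input_file, bin_dir="."):
    """Return the three Usage command lines as argument lists.

    Assumption: output names are the input name without its extension
    plus the suffixes shown in the Usage steps.
    """
    prefix = str(Path(input_file).with_suffix(""))  # INPUT_RAD_FILE.txt -> INPUT_RAD_FILE
    chunks = prefix + "_chunks.out"
    mcmc = prefix + "_chunks.mcmc.xml"
    tree = prefix + "_chunks.mcmcTree.xml"
    radpainter = os.path.join(bin_dir, "RADpainter")
    finestructure = os.path.join(bin_dir, "finestructure")
    return [
        # 1. Calculate the co-ancestry matrix
        [radpainter, "paint", str(input_file)],
        # 2. Assign individuals to populations
        [finestructure, "-x", "100000", "-y", "100000", "-z", "1000", chunks, mcmc],
        # 3. Tree building
        [finestructure, "-m", "T", "-x", "10000", chunks, mcmc, tree],
    ]


def run_pipeline(input_file, bin_dir="."):
    """Run the three steps in order, stopping at the first failure."""
    for cmd in build_commands(input_file, bin_dir):
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling run_pipeline("INPUT_RAD_FILE.txt") from the fineRADstructure folder would then run all three steps in sequence; check=True makes the script stop if any step exits with an error.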

Input format:
The input file (INPUT_RAD_FILE.txt) should be in one of the three following formats:

  1. Stacks export_sql.pl output (-a haplo -o tsv -F snps_l=1): Example
  2. Tag Haplotype Matrix (for unmapped data): Example
  3. Chromosome/scaffold + Tag Haplotype Matrix (for mapped data): Example
In all cases, rows correspond to RAD loci. Data columns correspond to individual haplotype calls (only sites variable in the dataset are needed). The program assumes all individuals are diploid: if the two alleles are the same (e.g. TGAT/TGAT) they can be collapsed (to just TGAT), if not then both alleles need to be fully specified (e.g. TGAT/CGAT). Missing data is left blank (see the examples above).
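
To illustrate the diploid encoding described above, here is a minimal Python sketch of how a single haplotype cell could be read and written. The helper names parse_call and format_call are hypothetical and are not part of RADpainter; the sketch only restates the rules in the paragraph above (one allele means homozygous, two alleles are separated by "/", blank means missing).

```python
def parse_call(cell):
    """Turn one haplotype cell into a pair of alleles, or None if missing."""
    cell = cell.strip()
    if not cell:
        return None  # missing data is left blank
    alleles = cell.split("/")
    if any(a == "" for a in alleles) or len(alleles) > 2:
        raise ValueError("expected one or two alleles, got: %r" % cell)
    if len(alleles) == 1:
        # collapsed homozygous call, e.g. TGAT means TGAT/TGAT
        return (alleles[0], alleles[0])
    return (alleles[0], alleles[1])


def format_call(pair):
    """Write a pair of alleles back as a cell, collapsing identical alleles."""
    if pair is None:
        return ""
    first, second = pair
    return first if first == second else first + "/" + second
```

With these helpers, "TGAT" and "TGAT/TGAT" both parse to the same pair and are written back in the collapsed form "TGAT", while "TGAT/CGAT" is kept in full.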

A Python script contributed by Emiliano Trucchi can be used to do some filtering, screening and conversion from the output of the software "populations" in Stacks (a file usually called "batch_1.haplotypes.tsv") to a RADpainter input file. It has various filtering options. I have not tested it extensively but some people reported it works well. The script is available here.

Important note: The amount of missing data should not vary too much across individuals (remove outliers if necessary) or systematically between putative populations. This should now matter less with the new v0.2 version, but still take some care about missingness, e.g. test a few runs with filtering tags/individuals at various stringency levels.
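
One way to check missingness before running RADpainter is to count blank cells per individual. The following is a minimal Python sketch under stated assumptions: the file is assumed to be tab-separated, with a header row of individual names and one row per RAD locus (check this against the linked examples), and it does not handle the extra chromosome/scaffold column of the mapped format. The function names and the 0.5 threshold are hypothetical, chosen only for illustration.

```python
def missing_fraction(lines, sep="\t"):
    """Return {individual: fraction of loci with a blank cell}.

    Assumes the first non-empty line is a header of individual names
    and every following line is one RAD locus.
    """
    rows = [line.rstrip("\r\n").split(sep) for line in lines if line.strip("\r\n")]
    header, data = rows[0], rows[1:]
    missing = [0] * len(header)
    for row in data:
        # pad rows whose trailing blank cells were dropped
        row = row + [""] * (len(header) - len(row))
        for i in range(len(header)):
            if row[i].strip() == "":
                missing[i] += 1
    n_loci = len(data)
    return {name: (missing[i] / n_loci if n_loci else 0.0)
            for i, name in enumerate(header)}


def flag_outliers(fractions, max_missing=0.5):
    """List individuals whose missing fraction exceeds max_missing (hypothetical cutoff)."""
    return [name for name, frac in fractions.items() if frac > max_missing]
```

For a real file, pass an open file handle, e.g. missing_fraction(open("INPUT_RAD_FILE.txt")), and compare the resulting fractions across individuals and between putative populations at a few different cutoffs.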


Plotting results:
If you have R installed (and are a bit familiar with it) then the easiest way to plot the results may be using our R scripts: fineRADstructurePlot.R and FinestructureLibrary.R (download both and follow the instructions in the first one). These scripts are a simplified version of Daniel Lawson's scripts from the finestructure website.

Alternatively, you can try to install and use the fineStructure GUI.

Questions/feedback:
Please write to Milan Malinsky: mm812 at cam.ac.uk