Understanding of shared ancestry in genetic datasets is almost always key to their interpretation. The fineSTRUCTURE package (Lawson et al., 2012) represents a powerful model-based approach to investigating population structure using genetic data. It offers especially high resolution in inference of recent shared ancestry, as evidenced for example in its application to investigation of genetic structure of the British population (Leslie et al., 2015). The high resolution of this method derives from utilizing haplotype linkage information and from focusing on the most recent coalescence (common ancestry) among the sampled individuals to derive a "co-ancestry matrix" - a summary of nearest neighbor haplotype relationships in the dataset. Further advantages when compared with other model-based methods (e.g. STRUCTURE and ADMIXTURE) include the ability to deal with a very large number of populations, explore relationships between them, and to quantify ancestry sources in each population.
The existing pipeline for co-ancestry matrix inference was designed to meet the needs of analyzing large scale human genetic SNP datasets, where chromosomal location of the markers are known and haplotypes are typically assumed to be correctly phased. Therefore, these methods have so far been inaccessible to users without high quality genome-wide haplotypes.
Here we present RADpainter, a program designed specifically to infer the co-ancestry matrix from RAD-seq data, taking full advantage of its unique features. We package this new program together with the fineSTRUCTURE MCMC clustering algorithm into fineRADstructure - a complete, easy to use, and fast population inference package for RAD-seq data.
17th July 2018: v0.3.2 - A major speedup of RADpainter (about 5x faster); fixing a bug that could have led to incorrect results when different individuals have different numbers of alleles
05th March 2018: v0.3.1 - An improvement in the handling of missing genotypes (N characters) within the rad tags; eliminates a bug that could have led to a division by zero
30th January 2018: v0.3 - can now handle arbitrary ploidy (just set the maximum number of alleles per individual via the
25th August 2017: v0.2 - substantial improvememnts in handling of missing data + improvements in inferring statistical variance directly from the data (resulting in more reasonable clustering).
The software is available for download from GitHub. It works on most UNIX like systems (including Linux and OSX/Apple).
You'll need the GNU Scientific Development Library, (libgsl0-dev and libgsl0 in Ubuntu, gsl-devel in OpenSUSE), or download from https://www.gnu.org/software/gsl/
Mac/Apple users also need to have the Command Line Tools installed on their machine to run "configure" and "make".
- Clone the package from GitHub:
git clone https://github.com/millanek/fineRADstructure
- Enter the folder:
- Run configure:
- Run make:
On Linux/Unix you need a resonably new(-ish) version of the gcc compiler (gcc 4.6.2 and onwards work fine). The Apple llvm compiler that comes with the Command Line Tools works fine too (all versions as far as I know).
If you get an autotools version error like
/bin/sh: aclocal-1.14: command not found
then you probably need to re-do the autotools sequence. First run:
aclocal; autoconf; automake -a
and then re-run:
- Calculate the co-ancestry matrix:
./RADpainter paint INPUT_RAD_FILE.txt
- Assign individuals to populations:
./finestructure -x 100000 -y 100000 -z 1000 INPUT_RAD_FILE_chunks.out INPUT_RAD_FILE_chunks.mcmc.xml
- Tree building:
./finestructure -m T -x 10000 INPUT_RAD_FILE_chunks.out INPUT_RAD_FILE_chunks.mcmc.xml INPUT_RAD_FILE_chunks.mcmcTree.xml
The input file (
INPUT_RAD_FILE.txt) should be in one of the three following fomats:
In all cases, rows correspond to RAD loci. Data columns correspond to individual haplotype calls; only sites that are variable in the dataset are needed. In diploid individuals, if the two alleles are the same, e.g.
export_sql.pl output (
-a haplo -o tsv -F snps_l=1): Example
- Tag Haplotype Matrix (for unmapped data): Example
- Chromosome/scaffold + Tag Haplotype Matrix (for mapped data): Example
TGAT/TGAT, they can be collapsed to just
TGAT, if not then both alleles need to be fully specified (e.g.
TGAT/CGAT). Missing data is left blank; see the examples above. If you have polyploid individuals then the best way is to write out all the alleles; e.g.
TGAT/TGAG/TGAG/TGAG could be the four haplotypes for a tetraploid individual.
Stacks now has the:
populations --radpainter option to make suitable input files. This is probably the easiest option if you have your data in Stacks. Also the
RADpainter hapsFromVCF command may work if you have a VCF file; feel free to get in touch if it doesn't work for your specific VCF.
populations --radpainter option is for some reason not optiomal, the package also includes a python script
Stacks2fineRAD.py that can be used for conversion from the output of the software "populations" in Stacks (a file usually called "batch_1.haplotypes.tsv") to a fineRADpainter input file. It has various filtering options. The script is also available here. An example command line to run this would be:
python Stacks2fineRAD.py -i Stacks_output.tsv -n 10 -m 30. The
-n option, specifies the maximum number of SNPs allowed at a locus (misaligned paralogs can lead to unusually many SNPs). Using the
-m option, you discard individuals with unusually high levels of missing data (e.g. using option
-m 30 we set the cutoff to 30%).
Finally, there is an utility script from Edgardo M. Ortiz enabling conversion from the format used by the pyRAD and ipyrad toolkits https://github.com/edgardomortiz/fineRADstructure-tools.
Important note: If you have a reference genome, the RAD loci should ideally be ordered according to genomic coordinates. If you have unmapped loci, we provide a script
sampleLD.R that can reorder loci according to linkage disequilibrium (LD). If LD is strong and loci are not sorted, this could lead to overconfident clustering. Therefore, we recommend using the
sampleLD.R script before running RADpainter to users with unmapped data who want to be extra careful to ensure they obtain a CONSERVATIVE upper bound on the number of statistically identifiable clusters. Example command line:
Rscript sampleLD.R -s 1 -n 500 INPUT_RAD_FILE.txt INPUT_RAD_FILE_reordered.txt. This should do the job. If you want to understand the options, they are described in the R script.
Important note 2: The amount of missing data should not vary too much across individuals (remove outliers if necessary) or systematically between putative populations (this now matters less, as of the v0.2 vesrion; but still take some care about missingness - e.g. test a few runs with filtering tags/individuals at various stringency levels; as a rule of thumb, I'd say that missingness over 50% per individual can cause problems).
If you are familiar with R then the easiest way to plot the results may be by adapting our R scripts: fineRADstructurePlot.R and FinestructureLibrary.R (these are now included within the package; just follow the intructions in the first one). The R scripts are a simplified version of the Daniel Lawson's code from the finestructure website).
Alternatively you can try to install and use the fineStructure GUI
Please write to Milan Malinsky: email@example.com