br2 is a software designed to estimate bias-corrected OR and BETA values for association studies. The bootstrap method used is computationally intensive, however br2 can analyze most GWAS datasets and obtain point estimates in a few hours on a modern machine.
For cpu intensive calculations, br2 supports both multi-cpu machines and dividing the workload onto an arbitrary number of computers in a cluster. It has been tested on datasets with up to 10,000 invidivuals and 1,000,000 SNPs.
Both binary and quantitative traits can be analyzed. The statistical models supported include linear and logistic regression models, trend test, genotypic test and simple allelic test. br2 also supports the inclusion of covariates for linear and logistic regression.
The input files to br2 are common .ped / .map text files, as used by the software package plink.
To run br2 download a binary appropriate for your computing environment. Currently Linux and MacOS are supported. If your system is not in the above list, you can download and compile br2 yourself with a C++ compiler.
If a binary with multi-CPU support is offered for your platform then it recommended to download this, since most modern machines have more than one processors.
Note: For some tests, br2 uses plink to obtain the results. Currently this is only required for a linear model with a quantitative trait and covariates. It is necessary to have a copy of plink in your path in these cases. Any recent version of plink can be used, br2 has been tested with versions 1.04, 1.05 and 1.06.
The source code is also available in the homepage of br2.
For example if you are using Linux, after downloading the tar ball, you can extract it with:
tar -xvzf br2-1.05-linux.tar.gz
and then run br2, for example:
br2-1.05/br2 --file data --alpha 1e-3 --test TREND --B1 200
Command line options | |
---|---|
--file data | Read data.ped and data.map as input files |
--pheno phenofile | Read alternative phenotype file |
--covar coverfile | Read covariates |
br2 can read the input genotypes in a ped/map format, as used by the software plink. If you have your data in a .ped / .map format that plink can read, then br2 will be able to read that too.
The command line option --file test will read both the pedigree and the map file (named test.ped and test.map).
The ped file contains one line for each individual containing the following fields, separated by whitespace:
FID IID MID PID SEX PHENO A.A1 A.A2 B.A1 B.A2 ...
Where:
The only fields used by br2 are PHENO and the genotypes. The genotypes can be coded numerically (1 or 2) or with any letter. The value of 0 signifies a missing value for an allele. For each SNP, either both alleles should be missing or none at all. The order of genotypes should be the same as in the map file.
The map file has one line per marker, in the same order as in the ped file, and 4 colunms:
CHR NAME CM BP
Where:
The data in the map file are used for reporting purposes and have no effect in the estimation of parameters.
An example dataset with 3 individuals and 4 markers is given here:
test.ped
fam ind1 0 0 0 1 A C G G T T C A fam ind2 0 0 0 1 C C G A T T C A fam ind3 0 0 0 2 A C A A 0 0 0 0
test.map
1 rs456 0 1230922 1 rs23982 0 23489792 2 rs2397 0 309872 2 rs12937 0 293274982
In this data set we have 2 cases and 1 control, and 4 markers: 2 in chromosome 1 and 2 in chromosome 2.
If your phenotype is in a different file, br2 supports reading an alternate phenotype file. Specify --pheno <filename>
to read an alternative phenotype file. The format of this file is:
FID1 IID1 PHENO1 FID2 IID2 PHENO2 ...
The covariates file is formatted in a similar manner to the alternative phenotype file:
FID1 IID1 COV1 COV2 ... FID2 IID2 COV1 COV2 ... ...
Covariates should be coded numerically, with missing value set to -9.
Note: Not all tests support inclusion of covariates. The following tests do:
Command line options | |
---|---|
--alpha <alpha> | threshold for significance |
--test <test type> | select detection method |
--estimate <estimation type> | select estimation method |
When running br2 you should use the same statistical detection and estimation methods and thresholds as used in your original study. There are 3 things to configure:
p-value threshold for significance
Use the command line option --alpha val to select which p-value was used in your study to select significant SNPs. For example if you used 10^-5 as your threshold specify
--alpha 1e-5
Detection and estimation
br2 supports both quantitative and binary phenotypes, the types of tests offered for each case are:
Quantitative Trait
A linear model is used for both estimation of p values and beta coefficients. This is the same test as used by plink --assoc
.
Use command line option --qt when your trait is quantitative. There is no need to specify the tests to be used, since the default is a linear model. Missing phenotype values should be coded as -9
, in this case.
Binary Trait
The following tests can be used for calculation of p-values for a binary (case/control) trait:
p-value calculation | ||
---|---|---|
Command Line Option | Test type | Equivalent plink command |
--test ALLELIC | Allelic association test | plink --assoc |
--test TREND | Trend test | plink --model (TREND column) (default) |
--test GENOTYPIC | Genotypic test | plink --model (GENO column) |
--test minTRENDGENO | Minimum of trend and genotypic tests | plink --model (GENO,TREND column) |
--test LOGISTIC | Logistic regression | plink --logistic |
Note Using the LOGISTIC model requires an increased amount of computation time (approximately 10x more).
For the estimation of odds ratios two methods are implemented:
Estimation of OR | ||
---|---|---|
Command Line Option | Test type | Equivalent plink command |
--estimate ALLELIC | Basic allelic test for OR | plink --assoc |
--estimate LOGISTIC | Logistic regression | plink --logistic (default) |
Command line options | |
---|---|
--B1 <n> | How many replicates to use for point estimate (default is 500) |
--B2 <n> | Iterations to run for computing the SD (default is 0) |
The number of replicates used in the bootstrap affects the accuracy of the bias-corrected OR or BETA estimate but also the computation time. Command line option --B1 controls how many replicates are used. The default is 500, which is accurate for most datasets.
Note that the running time of br2 increasing linearly with B1, so 1000 iterations will take 2 times more time than 500.
Computation of SD
The default behaviour is to skip the computation of standard deviation of the bootrasp estimate. If you wish to compute the SD, use the command line option --B2 100. The number of replicates used for computation of SD affects the running time of BR2 significantly. 100 was observed to be enough for most GWAS studies.
WARNING Computing SD is very computationally intensive. When computing SD with –B2 100, you are going to need 100 times more CPU time compared to skipping the SD calculation. If you have more than 100k SNPs, it is recommended to run br2 on a cluster or on a multi-cpu machine.
For comparison purposed here are the expected running times on a fictional dataset, compared to the value of B1 and B2 parameters:
--B1 | --B2 | Running time |
---|---|---|
100 | 0 | 1 minute |
500 | 0 | 5 minutes |
1000 | 0 | 10 minutes |
1000 | 100 | 1010 minutes (16 hours) |
Once you have prepared your input files and know which test you are going to use, you are read to run br2:
br2 --file data --alpha 1e-5 --test TREND --B1 200
This might take some time to finish if your dataset is large. When it is done, you get an output file out.br2
containing the point estimates. The output is also printed on screen along with some diagnostic data that can be used to evaluate problems.
Command line options | |
---|---|
--out <file> | Specify a different name for the output file (default out.br2) |
The output file contains the list of significant SNPs, in order of significance. For each SNP the original and bias-corrected OR estimates are reported in columns OR and UBOOT. The p-value, snp-name, maf are also printed for comparison purposes.
The order of the SNPs, p-values and OR should be the same as your original study. If it isn't, you have probably used a different test and br2 output is not meaningful.
A typical output file is shown here:
out.br2
CHR SNP BP MAF A1 P OR UBOOT SD CI 1 snp817 818 0.420 C 9.08e-05 1.4176 1.0419 1.1662 (0.771,1.408) 1 snp533 534 0.439 T 0.000798 1.3565 1.0406 1.1713 (0.763,1.419) 1 snp977 978 0.467 C 0.000953 1.3404 1.0615 1.1131 (0.860,1.310) 1 snp652 653 0.438 T 0.00239 1.3111 1.0268 1.1503 (0.780,1.351) 1 snp586 587 0.470 C 0.00316 1.3036 1.0693 1.1091 (0.873,1.310) 1 snp499 500 0.473 C 0.00375 1.2724 1.0005 1.1528 (0.757,1.322) 1 snp695 696 0.453 C 0.00387 1.2815 1.0816 1.1379 (0.840,1.393)
The most interesting column is UBOOT which contains the bias-corrected OR estimate. P and OR contain the p-value and OR from the original dataset. The CI column constructs a confidence interval around UBOOT, based on simulation (this column is used only if option --B2 is present otherwise it is NA).
Command line options | |
---|---|
--cpu N | Specify number of processors to use |
If you have a multi-core or multi-cpu machine, br2 can use all the available cpus. Currently this is available for Linux and MacOS platforms.
Download a multi-cpu version for br2 suitable for your platform. The number of cpus is automatically recognized (see br2 output). If you would like to use less CPUs than that or the number is not correctly identifies, use option --cpu N, where N is the number of CPUs to use.
Command line options | |
---|---|
--cluster K/N | Divide processing among N instances and run instance K |
--join N | Join the results of a previous cluster run with N processors |