Documentation for br2

Documentation for br2

Introduction

br2 is a software designed to estimate bias-corrected OR and BETA values for association studies. The bootstrap method used is computationally intensive, however br2 can analyze most GWAS datasets and obtain point estimates in a few hours on a modern machine.

For cpu intensive calculations, br2 supports both multi-cpu machines and dividing the workload onto an arbitrary number of computers in a cluster. It has been tested on datasets with up to 10,000 invidivuals and 1,000,000 SNPs.

Both binary and quantitative traits can be analyzed. The statistical models supported include linear and logistic regression models, trend test, genotypic test and simple allelic test. br2 also supports the inclusion of covariates for linear and logistic regression.

The input files to br2 are common .ped / .map text files, as used by the software package plink.

Downloading and installing

To run br2 download a binary appropriate for your computing environment. Currently Linux and MacOS are supported. If your system is not in the above list, you can download and compile br2 yourself with a C++ compiler.

If a binary with multi-CPU support is offered for your platform then it recommended to download this, since most modern machines have more than one processors.

Note: For some tests, br2 uses plink to obtain the results. Currently this is only required for a linear model with a quantitative trait and covariates. It is necessary to have a copy of plink in your path in these cases. Any recent version of plink can be used, br2 has been tested with versions 1.04, 1.05 and 1.06.

The source code is also available in the homepage of br2.

For example if you are using Linux, after downloading the tar ball, you can extract it with:

tar -xvzf br2-1.05-linux.tar.gz

and then run br2, for example:

 br2-1.05/br2 --file data --alpha 1e-3 --test TREND --B1 200

Running br2

Input Files

Command line options
--file data	Read data.ped and data.map as input files
--pheno phenofile	Read alternative phenotype file
--covar coverfile	Read covariates

Genotype and markers files

br2 can read the input genotypes in a ped/map format, as used by the software plink. If you have your data in a .ped / .map format that plink can read, then br2 will be able to read that too.

The command line option --file test will read both the pedigree and the map file (named test.ped and test.map).

The ped file contains one line for each individual containing the following fields, separated by whitespace:

FID IID MID PID SEX PHENO A.A1 A.A2 B.A1 B.A2 ...

Where:

FID the family id of this individual (not used)
IID a unique id for this individual (not used)
MID mother ID (not used)
PID parent ID (not used)
SEX 1 for male or 2 for female (not used)
PHENO phenotype for this individual. For binary traits use 1 for controls and 2 for cases. For quantitative traits, should be a real number.
A.A1,A.A2 etc. A list of the genotypes for this individual, seperated by space (for example A A C A G T 0 0 T A).

The only fields used by br2 are PHENO and the genotypes. The genotypes can be coded numerically (1 or 2) or with any letter. The value of 0 signifies a missing value for an allele. For each SNP, either both alleles should be missing or none at all. The order of genotypes should be the same as in the map file.

The map file has one line per marker, in the same order as in the ped file, and 4 colunms:

CHR NAME CM BP

Where:

CHR Chromosome number (for reporting purposes)
NAME Marker name (for reporting purposes)
CM Absolute CM value. (ignored by br2)
BP Base pair number (ignored by br2)

The data in the map file are used for reporting purposes and have no effect in the estimation of parameters.

An example dataset with 3 individuals and 4 markers is given here:

test.ped

fam ind1 0 0 0 1 A C G G T T C A
fam ind2 0 0 0 1 C C G A T T C A
fam ind3 0 0 0 2 A C A A 0 0 0 0

test.map

1 rs456    0   1230922
1 rs23982  0  23489792
2 rs2397   0    309872
2 rs12937  0 293274982

In this data set we have 2 cases and 1 control, and 4 markers: 2 in chromosome 1 and 2 in chromosome 2.

Alternative phenotype files

If your phenotype is in a different file, br2 supports reading an alternate phenotype file. Specify --pheno <filename> to read an alternative phenotype file. The format of this file is:

FID1 IID1 PHENO1
FID2 IID2 PHENO2
...

Covariates file

The covariates file is formatted in a similar manner to the alternative phenotype file:

FID1 IID1 COV1 COV2 ...
FID2 IID2 COV1 COV2 ...
...

Covariates should be coded numerically, with missing value set to -9.

Note: Not all tests support inclusion of covariates. The following tests do:

linear model (for quantitative traits)
logistic regression (for binary traits)

Specifying statistical tests

Command line options
--alpha <alpha>	threshold for significance
--test <test type>	select detection method
--estimate <estimation type>	select estimation method

When running br2 you should use the same statistical detection and estimation methods and thresholds as used in your original study. There are 3 things to configure:

Alpha value for declaring a marker significant.
Test used for the p-value.
Test used to estimate the OR or BETA.

p-value threshold for significance

Use the command line option --alpha val to select which p-value was used in your study to select significant SNPs. For example if you used 10^-5 as your threshold specify

--alpha 1e-5

Detection and estimation

br2 supports both quantitative and binary phenotypes, the types of tests offered for each case are:

Quantitative Trait

A linear model is used for both estimation of p values and beta coefficients. This is the same test as used by plink --assoc.

Use command line option --qt when your trait is quantitative. There is no need to specify the tests to be used, since the default is a linear model. Missing phenotype values should be coded as -9, in this case.

Binary Trait

The following tests can be used for calculation of p-values for a binary (case/control) trait:

p-value calculation
Command Line Option	Test type	Equivalent plink command
--test ALLELIC	Allelic association test	`plink --assoc`
--test TREND	Trend test	`plink --model` (TREND column) (default)
--test GENOTYPIC	Genotypic test	`plink --model` (GENO column)
--test minTRENDGENO	Minimum of trend and genotypic tests	`plink --model` (GENO,TREND column)
--test LOGISTIC	Logistic regression	`plink --logistic`

Note Using the LOGISTIC model requires an increased amount of computation time (approximately 10x more).

For the estimation of odds ratios two methods are implemented:

Estimation of OR
Command Line Option	Test type	Equivalent plink command
--estimate ALLELIC	Basic allelic test for OR	`plink --assoc`
--estimate LOGISTIC	Logistic regression	`plink --logistic` (default)

Number of replicates and computation of SD

Command line options
--B1 <n>	How many replicates to use for point estimate (default is 500)
--B2 <n>	Iterations to run for computing the SD (default is 0)

The number of replicates used in the bootstrap affects the accuracy of the bias-corrected OR or BETA estimate but also the computation time. Command line option --B1 controls how many replicates are used. The default is 500, which is accurate for most datasets.

Note that the running time of br2 increasing linearly with B1, so 1000 iterations will take 2 times more time than 500.

Computation of SD

The default behaviour is to skip the computation of standard deviation of the bootrasp estimate. If you wish to compute the SD, use the command line option --B2 100. The number of replicates used for computation of SD affects the running time of BR2 significantly. 100 was observed to be enough for most GWAS studies.

WARNING Computing SD is very computationally intensive. When computing SD with –B2 100, you are going to need 100 times more CPU time compared to skipping the SD calculation. If you have more than 100k SNPs, it is recommended to run br2 on a cluster or on a multi-cpu machine.

For comparison purposed here are the expected running times on a fictional dataset, compared to the value of B1 and B2 parameters:

--B1	--B2	Running time
100	0	1 minute
500	0	5 minutes
1000	0	10 minutes
1000	100	1010 minutes (16 hours)

Running br2

Once you have prepared your input files and know which test you are going to use, you are read to run br2:

br2 --file data --alpha 1e-5 --test TREND --B1 200

This might take some time to finish if your dataset is large. When it is done, you get an output file out.br2 containing the point estimates. The output is also printed on screen along with some diagnostic data that can be used to evaluate problems.

Output files

Command line options
--out <file>	Specify a different name for the output file (default out.br2)

The output file contains the list of significant SNPs, in order of significance. For each SNP the original and bias-corrected OR estimates are reported in columns OR and UBOOT. The p-value, snp-name, maf are also printed for comparison purposes.

The order of the SNPs, p-values and OR should be the same as your original study. If it isn't, you have probably used a different test and br2 output is not meaningful.

A typical output file is shown here:

out.br2

CHR     SNP   BP    MAF  A1          P       OR    UBOOT      SD             CI  
1    snp817  818  0.420   C   9.08e-05   1.4176   1.0419  1.1662  (0.771,1.408)  
1    snp533  534  0.439   T   0.000798   1.3565   1.0406  1.1713  (0.763,1.419)  
1    snp977  978  0.467   C   0.000953   1.3404   1.0615  1.1131  (0.860,1.310)  
1    snp652  653  0.438   T    0.00239   1.3111   1.0268  1.1503  (0.780,1.351)  
1    snp586  587  0.470   C    0.00316   1.3036   1.0693  1.1091  (0.873,1.310)  
1    snp499  500  0.473   C    0.00375   1.2724   1.0005  1.1528  (0.757,1.322)  
1    snp695  696  0.453   C    0.00387   1.2815   1.0816  1.1379  (0.840,1.393)

The most interesting column is UBOOT which contains the bias-corrected OR estimate. P and OR contain the p-value and OR from the original dataset. The CI column constructs a confidence interval around UBOOT, based on simulation (this column is used only if option --B2 is present otherwise it is NA).

Computational Issues

Running br2 on multi-cpu machines

Command line options
--cpu N	Specify number of processors to use

If you have a multi-core or multi-cpu machine, br2 can use all the available cpus. Currently this is available for Linux and MacOS platforms.

Download a multi-cpu version for br2 suitable for your platform. The number of cpus is automatically recognized (see br2 output). If you would like to use less CPUs than that or the number is not correctly identifies, use option --cpu N, where N is the number of CPUs to use.

Running br2 on a cluster

Command line options
--cluster K/N	Divide processing among N instances and run instance K
--join N	Join the results of a previous cluster run with N processors