2LD -- two-locus LD calculator

JH Zhao 10/12/2003

>>> Program description

2LD is a simple program calculating linkage disequilibrium (LD) measures
between two polymorphic markers.  It is developed as a supplement for an Excel
calculator by Dr David Collier, adding standard errors and handling
multiallelic markers.

Denote the number of alleles at markers 1 and 2 as m and n, the observed data
can be organized into a contingency table of m(m+1)/2 x n(n+1)/2 cells.
However table can be parameterised by only m x n possible haplotype
frequencies.  As illustrated by ASSOCIATE, an ordinary chi-squared statistic
will have [m(m+1)/2-1][n(n+1)/2-1] degrees of freedom and this test of
independence may imply dependence other than allelic association.  The
haplotypes at linkage equilibrium can be obtained from allele frequencies of
each marker, while assuming LD they are estimated using a gene-counting
procedure.  Alternatively, haplotype frequency estimates from other sources can
also be used.  In addition, the program gives chi-squared test of the estimated
haplotype frequencies, as a global measure of LD, i.e.  Phi coefficient and
Cramer's V.  However this will need asymptotic approximation, Fisher's exact
test may be more appropriate for its significance.  Finally 2ld calculates
another popular measure of LD, D' and standard error.  As yet 2ld does not
handle missing genotypes.

Computer programs ASSOCIATE and EH also implement gene-counting procedure, and
are available from http://linkage.rockefeller.edu.

For more details see Klitz et al.  (1995), Long et al.  (1995), Weir (1996) and
other references listed below.  To obtain permutation-based LD measures use PM
or FASTEH+ at this site.


>>> Running 2ld

2ld will accept an input file in one of the three formats.

Format 1.  the raw genotype data in consecutive lines, each line containing
five columns as follows.

Column 1.    Individual ID
Column 2, 3. Marker 1 genotypes
Column 4, 5. Marker 2 genotypes

File genotype.dat is in this format.

Format 2.  genotype table as required by ASSOCIATE and EH programs

Line 1. < No of alleles at marker 1> <No of alleles at marker 2>
Line 2- actual two-locus genotype counts

File genotype.tab is in this format.  Note that since the two markers have
alleles 14 and 2, the genotype table has 14(14+1)/2 x 2(2+1)/2 = 105 x 3 cells.

Format 3.  observed haplotype frequencies from any other sources.

Line 1.
<# of alleles at marker 1> <# of alleles at marker 2> <# of subjects>
Line 2-
haplotype frequencies

File genotype.eh, 2by2.dat, kbyl.dat are in this format.  Note if haplotype
frequency estimates are based on EH then the column "w/association" in the
output has to be used.  Since 2ld was designed for diploid individuals, the
number of haplotypes is equal to twice the number of subjects.

The syntax of 2ld is as follows,

2ld <data file>

We use convention of DOS that ``data file'' within angled bracket is needed;
otherwise the program only gives the internal example.  The screen output can
be redirected to a ASCII file by using "> output file name".  For a
case-control study three separate input files can be created, for cases only,
controls only and combined.  To run 2ld on 2by2.dat the command is

2ld 2by2.dat

to output on computer screen,

or

2ld 2by2.dat >2by2.out

to output to file 2by2.out


>>> File listing

2ld.c        C source
2ld.exe      Win9x executable
2ld.linux    RedHat Linux executable
2ld.txt      this file
2by2.dat     sample data file for two biallelic markers
kbyl.dat     sample data file for two multiallelic markers
genotype.dat an example of directly using raw genotype data
genotype.tab an example of directly using EH input file
genotype.eh  EH output file on genotype.dat, genotype.tab

2ld.c is in ANSI C so should be portable to Unix/Linux. 2ld.exe is created
from 2ld.c by Symantec C/C++ 7.2 compiler with command "sc -mn 2ld.c".


>>> How to cite

2lD, GENECOUNTING and HAP: computer programs for linkage disequilibrium
analysis. Bioinformatics

>>> Known problems

> Q

2ld.exe always gives internal example if clicking 2ld.exe from Win9x, when
I press <Return>, even 2ld window disappears.

> A

Since 2ld is a DOS-based program, Windows users will have to enter MS-DOS
Prompt first.  For example under Win9x, click Start -> Program -> MS-DOS
Prompt, change to the appropriate directory and issue 2ld command described
above.  I put 2ld.exe in c:\iop\2ld directory, so my commands would be

c:
cd \iop\2ld
2ld 2by2.dat >2by2.out
edit 2by2.out

Alternatively I can add 2ld directory by DOS command

set path=c:\iop\2ld;%path%

then I can use 2ld anywhere under DOS.

> Q

Why my degree(s) of freedom differ from EH ?

> A

2LD calculates degree(s) of freedom from alleles present, rather than
user-specified.

> Q

I get two D's, one negative and one positive, for two biallelic data, why?

> A

When there are 2 biallelic markers involved, let p_1, p_2 be allele frequencies
at locus 1, and q_1, q_2 be allele frequencies at locus 2.  The first D' is
given under "Disequilibria, expectations and variances" as D'_{11}, i.e., D'
for haplotype 11, while the second D' is obtained as follows,

p_1q_1|D'_{11}|+p_1q_2|D'_{12}|+p_2q_1|D'_{21}|+p_2q_2|D'_{22}|

They happen to have the same quantity but may differ in signs.

> Q

How to get more sense of the global chi-square ?

> A

A chi-square based on the estimated haplotype frequency table, as a direct test
of global disequilibrium.  A simple check can be done with standard packages
such as SAS.  For example the following program computes chi-square and
Fisher's exact test for a biallelic marker and a three-allele marker from 21
diploid individuals (data kindly from Dr Zhicheng Lin).

data abc;
input a b w;
w=round(w*42);
cards;
1 1 0.095234
1 2 0.000005
1 3 0.047618
2 1 0.238099
2 2 0.571423
2 3 0.047621
;
proc freq;
  weight w;
  table a*b/chisq exact;
run;

Unfortunately there is no degree of freedom for a test between this and that
from EH, for both chi-squares have (m-1)(n-1) degrees of freedom.


>>> Acknowledgement

A chronicle list of people who help to make the program/documentation more useful.

David Collier <sphadco@iop.kcl.ac.uk>
Zhicheng Lin <ZLIN@intra.nida.nih.gov>
Mustafa Neamatallah <m_neamatallah@hotmail.com>
Carlos Zapata <bfcazaba@usc.es>
Tan Hui Hui Jenny <medp9193@nus.edu.sg>
Barry Chioza <spbcaja@iop.kcl.ac.uk>
Maria Arranz <m.arranz@iop.kcl.ac.uk>
Andrew Makoff <a.makoff@iop.kcl.ac.uk>
Hui-Hua Li <huihuali@NEMOURS.ORG>
Andrew Makoff <a.makoff@iop.kcl.ac.uk>
Pak Sham <spjupcs@iop.kcl.ac.uk>


>>> Contact information

Please let me know your problem or comments by e-mailing j.zhao@iop.kcl.ac.uk
or by post to


Jing Hua Zhao

Section of Genetic Epidemiology & Biostatistics
Division of Psychological Medicine
Institute of Psychiatry
De Crespigny Park
London SE5 8AF
UK
e-mail: j.zhao@iop.kcl.ac.uk


>>> Reference

Abramowitz M and Stegun IA (1968) Handbook of mathematical functions.  New York

Bishop YMM, Fienberg SE, Holland PW (1975) Discrete Multivariate Analysis
-- Theory and Practice, The MIT press

Cramer H (1946) Mathematical Methods of Statistics. Princeton Univ. Press

Klitz W, Stephen JC, Grote M, Carrington M (1995) Discordant patterns of
linkage disequilibrium of the peptide transporter loci within the HLA class II
region. Am. J. Hum. Genet. 57:1436-1444

Long JC, Williams RC, Unbanek M (1995) An E-M  algorithm and testing strategy
for multiple-locus haplotypes. Am. J. Hum. Genet. 56:799-810

Weir BS (1979) Inferences about linkage disequilibrium. Biometrics 35:235-254

Weir BS (1996) Genetic Data Analysis II. Sinaur

Xie X, Ott J (1993) Testing linkage disequilibrium between a disease gene and
marker loci. Am J Hum Genet 53:1107

Zapata C, Alvarez G, Carollo C (1997) Approximate variance of the standardized
measure of gametic disequilibrium D'. Am. J. Hum. Genet. 61:771-774

Zapata C, Carollo C, Rodriguez S (2001) Sampling variance and distribution of
the D' measure of overall gametic disequlibrium between multiallelic loci.
Ann Hum Genet 65:395-406
