Slide1
Data Standards and Statistical Issues for Immunogenetic Data
Richard M. Single
Associate Professor of Statistics
Department of Mathematics & Statistics
University of Vermont
Slide2
Outline
HLA nomenclature: Why it matters for analysis and interpretation
  Challenges for combining HLA data from different sources
Data standardization to facilitate meta-analyses and reproducibility
  Developing a community standard for HLA & KIR data reporting
Overview of HLA data curation & ambiguity resolution
  Example, ImmPort, Next steps: GL strings & QR codes
HLA (chrom 6) and KIR (chrom 19) interactions
  A brief overview
HLA and KIR: population-level evidence of co-evolution
  Population-genetic evidence of co-evolution
  Randomization tests and genomic controls
Slide3
HLA Nomenclature and why it matters
MHC
Slide4
HLA Nomenclature and why it matters
Challenges for HLA data management and analysis
  The HLA genes are very polymorphic;
  HLA nomenclature is complicated;
  There are multiple ways to generate HLA data;
  All common typing systems generate ambiguous data;
  There are multiple ways to report alleles and ambiguities;
  These issues make meta-analyses of HLA data from different sources very difficult.
Slide5
An extremely gene-rich region.
Klein J. et al., New Eng J Med, 2000; 343:702-709
Slide6
Structure of HLA molecules
HLA molecules are cell-surface proteins that present peptide fragments to T-cells.
They bind specific sets of peptides based on structure.
Slide7
HLA-C binding pocket
[Figure: ribbon drawing of the HLA-C binding pocket with positions 73, 77, 80, and 90 labeled. Ribbon drawing from Hedrick et al., PNAS, 88, 5897-5901.]
Slide8DP
DQ
DRB
C
A
50 kb
850 kb
100 kb
1270 kb
class II loci
class I loci
B1
A1
B1
A1
B1
A
400 kb
250 kb
1612
2211
1280
2
980
31
216
19
153
IMGT/HLA
Database Release 3.12.0 April 17,
2013
HLA classical loci and polymorphism
Protein-level allele numbers
:
Slide9
HLA Allele Nomenclature
Example: HLA-A*24:02:01:02L
  Locus: HLA-A
  Field 1 (2-digit): serological level (where possible)
  Field 2 (4-digit): peptide level (amino acid difference)
  Field 3 (6-digit): nucleotide level [silent] (synonymous substitutions)
  Field 4 (8-digit): intron level (3' or 5' polymorphism)
  Expression suffix: N = null, L = low, S = soluble, …
For most analyses, we want to distinguish among unique peptide sequences, i.e., the 2-field ("4-digit") level.
This level of resolution treats alleles with the same peptide sequence for exons 2 & 3 (class I) or exon 2 (class II) as being equivalent ["binning" alleles].
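The field structure above can be sketched in code. This is a minimal illustration, not an official parser: the regular expression, the `Allele` tuple, and the function names are all assumptions made for this example, and it only handles colon-delimited names of the form shown on the slide. Dropping the expression suffix when truncating is also a simplifying assumption; an analysis that must keep null alleles distinct would need a different rule.

```python
import re
from typing import NamedTuple, Tuple

# Hypothetical pattern for colon-delimited names such as "HLA-A*24:02:01:02L".
_ALLELE_RE = re.compile(
    r"^(?:HLA-)?(?P<locus>[A-Z0-9]+)\*"   # optional "HLA-" prefix, then the locus
    r"(?P<fields>\d+(?::\d+){0,3})"       # one to four colon-separated fields
    r"(?P<suffix>[A-Z]?)$"                # optional expression suffix (N, L, S, ...)
)

class Allele(NamedTuple):
    locus: str
    fields: Tuple[str, ...]
    suffix: str  # "" when no expression suffix is present

def parse_allele(name: str) -> Allele:
    """Split an allele name into locus, fields, and expression suffix."""
    m = _ALLELE_RE.match(name.strip())
    if m is None:
        raise ValueError(f"not a colon-delimited HLA allele name: {name!r}")
    return Allele(m["locus"], tuple(m["fields"].split(":")), m["suffix"])

def to_two_field(name: str) -> str:
    """Truncate to the 2-field ("4-digit") level; the suffix is dropped here."""
    allele = parse_allele(name)
    return f"{allele.locus}*{':'.join(allele.fields[:2])}"

print(to_two_field("HLA-A*24:02:01:02L"))
```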
Slide10
HLA Nomenclature & Polymorphism
HLA alleles are defined by a "patchwork" of sequence-level polymorphisms.
Most typing systems do not interrogate the same set of polymorphisms, e.g., DRB1*14:01:01 and DRB1*14:54 differ only in exon 3.
There is currently no simple way to identify which alleles could (or could not) have been detected by a given typing system.
Slide11
Distinctive Geographical Distribution of subtypes of HLA-DRB1*08
Slide12
Outline
Slide13
Data Standardization to facilitate Meta-analyses
Data standardization methods …
  allow re-evaluation of raw data in future contexts
  allow information/results to be combined across datasets more easily
Document the typing method (SSOP, SSP, SBT, …), version, exons interrogated, and the set of detectable alleles.
Perform data validation by checking against IMGT & IPD-KIR allele lists.
Slide14
Extending STREGA to Immunogenomic Studies
The STrengthening the REporting of Genetic Association studies (STREGA) statement provides community-based data reporting and analysis standards for genomic disease association studies.
The IDAWG (immunogenomics.org) has proposed an extension of STREGA: STrengthening the REporting of Immunogenomic Studies (STREIS).
Slide15
From STREGA to STREIS
Extensions to the STREGA guidelines for immunogenomic data include:
  Describing the system(s) used to store, manage, and validate genotype and allele data
  Documenting all methods applied to resolve ambiguity
  Defining any codes used to represent ambiguities
  Describing any binning or combining of alleles into common categories
  Avoiding the use of subjective terms (e.g., high-resolution typing) that may change over time
Slide16
Outline
Slide17
Allele-level Ambiguity
Ambiguous allele sets
  A*0201/0209/0243N/0266/0275/0283N/0289
Ambiguous alleles result from polymorphisms outside of assessed regions: outside of exons 2 & 3, or in sections of those exons that were not interrogated.
Group codes ("g"-codes) for alleles identical in exons 2 & 3 for class I, or exon 2 for class II.
  A*0201/0209/0243N/0266/0275/0283N/0289 = "A020101g"
NMDP ambiguity codes for 4-digit non-null alleles
  A*0201/0209 = A*02AF
  A*0201/0209/0266 = A*02AJEY
  A*0201/0209/0266/0275/0289 = A*02BSFJ
Slide18
Genotype-level Ambiguity
Ambiguous genotypes result from an inability to establish the phase of individual polymorphisms or entire exons.
Different combinations of alleles can lead to the same typing result.
Example: A typing result for one individual that could be explained by any of four different possible genotype sets at HLA-B.
  Genotype 1: 2705 4402
  Genotype 2: 2705 4411
  Genotype 3: 2709 4402
  Genotype 4: 2709 4411
B*2705 + B*4402 or
B*2705 + B*4411 or
B*2709 + B*4402 or
B*2709 + B*4411
Most analytical methods require a single genotype call for each individual sample.
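As a small illustration of how the four genotype sets in this example arise, the sketch below enumerates every pairing of two allele option lists. It assumes the simple case in which the typing result reduces to two independent lists of possible alleles; real phase ambiguity can produce genotype lists that are not a full cross product. The function name is hypothetical.

```python
from itertools import product

def possible_genotypes(allele1_options, allele2_options):
    """Enumerate unordered genotypes from two lists of possible alleles.

    Assumes every allele in the first list can pair with every allele in
    the second, which is true for the HLA-B example but not in general.
    """
    genotypes = {
        tuple(sorted(pair))
        for pair in product(allele1_options, allele2_options)
    }
    return sorted(genotypes)

for genotype in possible_genotypes(["B*2705", "B*2709"], ["B*4402", "B*4411"]):
    print(" + ".join(genotype))
```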
Slide19
Standardized Ambiguity Reduction
Sample #001
  Genotype 1
    HLA-B allele 1: 2703, 270502, 270503, 270504, 270505, 270506, 270508, 2710, 2713, 2717
    HLA-B allele 2: 44020101, 44020102S, 440203, 4419N, 4423N, 4424, 4427, 4433
  Genotype 2
    HLA-B allele 1: 2703, 270502, 270503, 270504, 270505, 270506, 270508, 2710, 2713, 2717
    HLA-B allele 2: 440202, 4411
  Genotype 3
    HLA-B allele 1: 2709
    HLA-B allele 2: 44020101, 44020102S, 440203, 4419N, 4423N, 4424, 4427, 4433
  Genotype 4
    HLA-B allele 1: 2709
    HLA-B allele 2: 440202, 4411
Peptide-level filtering, remove non-CWD alleles, binning alleles identical over exons 2 & 3
  2703, 2705
  4402
Regional population-level frequency data
Unambiguous data
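The filtering steps named on this slide can be sketched as follows. This is an illustration only, not the pipeline used by ImmPort or immunogenomics.org: the function name is hypothetical, the names are assumed to be in the older digit-only form used on the slide, and the common and well-documented (CWD) set below is a made-up stand-in chosen so the toy output matches the slide.

```python
def reduce_ambiguity(allele_options, cwd_alleles):
    """Truncate digit-only allele names to 4 digits and keep CWD alleles.

    A trailing expression letter (N, L, S, ...) is kept, so "4419N" stays
    distinct from "4419". Duplicates created by truncation are merged.
    """
    reduced = []
    for name in allele_options:
        suffix = name[-1] if name[-1].isalpha() else ""
        four_digit = name[:4] + suffix
        if four_digit in cwd_alleles and four_digit not in reduced:
            reduced.append(four_digit)
    return reduced

# Hypothetical CWD set, for illustration only.
CWD = {"2703", "2705", "2709", "4402", "4411"}

allele_1 = ["2703", "270502", "270503", "270504", "270505",
            "270506", "270508", "2710", "2713", "2717"]
allele_2 = ["44020101", "44020102S", "440203", "4419N",
            "4423N", "4424", "4427", "4433"]

print(reduce_ambiguity(allele_1, CWD))
print(reduce_ambiguity(allele_2, CWD))
```

Choosing among the remaining candidates (2703 vs. 2705 here) is the step that the slide assigns to regional population-level frequency data, which this sketch does not attempt.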
Slide20xxx
2703, 2705
440227054402
immunogenomics.org
Slide21
Slide22
Genotype List (GL) Strings
Use a hierarchical set of operators to describe the relationships between alleles, lists of possible alleles, phased alleles, genotypes, lists of possible genotypes, and multilocus unphased genotypes, without losing typing information or increasing typing ambiguity.
Are proposed to replace NMDP codes
Milius et al. (2013) Tissue Antigens
Slide23
Genotype List (GL) Strings
Example GL string for the genotype:
A*02:69 + A*23:30 or A*02:302 + A*23:26 or A*02:302 + A*23:39
and
B*44:02 + B*49:08
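A sketch of how such a string could be split into its hierarchy. The slide text does not list the operator symbols; the ones used here ("^" between loci, "|" between possible genotypes, "+" between the two gene copies, "~" between phased alleles, "/" between possible alleles) are given as I understand them from Milius et al. (2013) and should be checked against the GL String specification. The function name and the encoded example string are assumptions for illustration.

```python
def parse_gl_string(gl):
    """Split a GL string into nested lists.

    Nesting order, outermost first:
    loci > possible genotypes > gene copies > phased alleles > allele options.
    """
    return [
        [
            [
                [phased.split("/") for phased in copy.split("~")]
                for copy in genotype.split("+")
            ]
            for genotype in locus.split("|")
        ]
        for locus in gl.split("^")
    ]

# The slide's example, written with the assumed operators.
example = ("A*02:69+A*23:30|A*02:302+A*23:26|A*02:302+A*23:39"
           "^B*44:02+B*49:08")

parsed = parse_gl_string(example)
print(len(parsed), "loci;", len(parsed[0]), "possible genotypes at the first locus")
```

Because each operator is split in order from the outermost level inward, no typing information is discarded: joining the nested lists with the same symbols reproduces the original string.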
Slide24
Resources for HLA Data Validation & Analysis
Immunology Database and Analysis Portal (www.ImmPort.org)
Developed under the Bioinformatics Integration Support Contract (BISC) for NIH, NIAID, & DAIT (Division of Allergy, Immunology, and Transplantation)
  Data validation pipeline
  Analysis tools
  Standardized ambiguity reduction tools
  Data from a large number of immunogenomic studies
ImmunoGenomics Data Analysis Working Group (www.immunogenomics.org) (www.IgDAWG.org)
An international collaborative group working to …
  facilitate the sharing of immunogenomic data (HLA, KIR, etc.) and
  foster consistent analysis and interpretation of immunogenomic data
Slide25
Slide26
Slide27
Outline
Slide28
Killer cell Immunoglobulin-like Receptor (KIR)
The KIR gene complex is located on Chromosome 19 (19q13.4)
KIR are expressed on natural killer (NK) cells and a subset of T cells
Certain HLA alleles serve as ligands for KIR

KIR Gene   Function     Ligand
2DL1       Inhibitory   HLA-C group2
2DS1       Activating   HLA-C group2
2DL2/3     Inhibitory   HLA-C group1
2DS2       Activating   HLA-C group1
3DL1       Inhibitory   HLA-Bw4
3DS1       Activating   HLA-Bw4
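Slide29
KIR regulate NK cell activity
[Figure: two panels. "Dominant inhibition": NK cell and normal cell, with iKIR, HLA, activating receptor, and ligand labeled; outcome: no lysis (protection). "Missing-self recognition": NK cell and HIV+ targets, with iKIR, activating receptor, and ligand labeled; outcome: lysis and cytokines.]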
Slide30
HLA-C alleles can be divided into two groups based on the amino acid at position 80 (& 77), which determines KIR recognition
HLA-C1: Ser77, Asn80
  Cw1, Cw3, Cw7, Cw8, Cw12, Cw13, Cw14
  Recognized by KIR2DL3/2DL2
HLA-C2: Asn77, Lys80
  Cw2, Cw4, Cw5, Cw6, Cw15, Cw17
  Recognized by KIR2DL1
NK cell inhibition
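Slide31
Bifurcation of HLA-B allotypes
[Figure: HLA-B allotypes split into Bw4 (40%) and Bw6 (60%). Bw4, subdivided into 80I and 80T, is labeled "KIR3DL1 ligands", with KIR3DS1 also shown; Bw6 is labeled "Not a ligand for KIR".]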
Slide32
Outline
Slide33
KIR & HLA in 30 Global Populations
Slide34
Population-level evidence for Co-evolution & Natural Selection for KIR and HLA
Several studies hypothesized selection for KIR that suit the locale-specific HLA repertoire.
Disease association studies point to HLA-Bw4 alleles with isoleucine at position 80 ("Bw4-80I") as the strongest ligand for KIR3DS1.
Slide35
Correlations between frequencies for KIR and HLA Ligands
Inhibitory KIR
  KIR2DL3 vs. HLA-C group1: r = 0.184
  KIR3DL1 vs. HLA-Bw4: r = 0.426
  KIR2DL1 vs. HLA-C group2: r = 0.046
Slide36
Correlations between frequencies for KIR and HLA Ligands
Activating KIR
  KIR3DS1 vs. HLA-Bw4: r = -0.632
  KIR2DS1 vs. HLA-C group2: r = -0.478
  KIR2DS2 vs. HLA-C group1: r = -0.371
Slide37
Correlations between frequencies for KIR and HLA Ligands
Activating KIR3DS1
Subsets of Bw4 alleles based on amino acid position 80
  KIR3DS1 vs. HLA-Bw4: r = -0.632
  KIR3DS1 vs. HLA-Bw4-80I: r = -0.657
  KIR3DS1 vs. HLA-Bw4-80T: r = -0.190
Single et al., Nature Genetics
Slide38
Statistical Issues
Challenges for these and other population studies
  Demographic history shapes patterns of variation & can mimic the effects of selection.
  Gene frequencies are not statistically independent among populations, due to shared demographic history.
  Ordinary Pearson correlation p-values assume independence among the observations.
  We constructed a randomization test to account for the demographic histories of the populations and focus on the genetic effect.
Slide39
Hypothesis Test for a Correlation Coefficient
Assessing the significance of ρ = cor(X,Y)
Null Hypothesis: H0: ρ = 0
Statistic: Pearson's correlation coefficient

X     Y
4.1   4.9
8.6   5.4
2.3   4.2
5.4   7.4
9.2   8.8
7.7   6.7
6.4   8.8
4.3   5.1
7.6   9.4
3.4   5.3
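A minimal sketch of the statistic for this example. It assumes the X and Y values pair up row by row, which is how the flattened table is read here, and it uses the standard t statistic for H0: ρ = 0 with n - 2 degrees of freedom. The function names are chosen for illustration, and the p-value lookup itself is left out to keep the example free of external libraries.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for H0: rho = 0, compared against t with n - 2 df."""
    return r * sqrt((n - 2) / (1 - r * r))

# Values from the slide, assumed to be paired row-wise.
x = [4.1, 8.6, 2.3, 5.4, 9.2, 7.7, 6.4, 4.3, 7.6, 3.4]
y = [4.9, 5.4, 4.2, 7.4, 8.8, 6.7, 8.8, 5.1, 9.4, 5.3]

r = pearson_r(x, y)
t = t_statistic(r, len(x))
print(f"r = {r:.3f}, t = {t:.3f}, df = {len(x) - 2}")
```

The p-value from this t distribution is valid only when the observations are independent, which is the assumption questioned on the previous slide for population frequency data.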
Slide40
Randomization Test
Bw4 alleles: 1301, 1302, 1516, 1517, 2702, 2703, 2704, 2705, 3701, 3801, 3802, 4402, 4403, 4404, 4405, ...
Bw6 alleles: 0702, 0705, 0799, 0801, 1401, 1402, 1403, 1501, 1502, 1503, 1504, 1506, 1507, 1508, 1510, ...
Reassign Bw4/Bw6 status to simulate the null hypothesis
Compute correlation of frequencies for KIR3DS1 & reassigned HLA
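The procedure can be sketched roughly as below. This is not the authors' code: it assumes the number of alleles labeled Bw4 is held fixed when labels are reassigned, that the test is two-sided, and that the p-value uses the common (count + 1) / (permutations + 1) convention. All function names and the toy frequencies are hypothetical.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def ligand_freqs(allele_freqs_by_pop, ligand_alleles):
    """Summed frequency of the ligand alleles in each population."""
    return [sum(f for a, f in pop.items() if a in ligand_alleles)
            for pop in allele_freqs_by_pop]

def randomization_test(kir_freqs, allele_freqs_by_pop, bw4_alleles,
                       n_perm=1000, seed=0):
    """Reassign Bw4 status at random among alleles and recompute r."""
    rng = random.Random(seed)
    all_alleles = sorted({a for pop in allele_freqs_by_pop for a in pop})
    n_bw4 = len(set(bw4_alleles) & set(all_alleles))
    observed = pearson_r(kir_freqs,
                         ligand_freqs(allele_freqs_by_pop, bw4_alleles))
    extreme = 0
    for _ in range(n_perm):
        relabeled = set(rng.sample(all_alleles, n_bw4))
        r = pearson_r(kir_freqs, ligand_freqs(allele_freqs_by_pop, relabeled))
        if abs(r) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)

# Toy data: four populations with made-up frequencies.
kir3ds1 = [0.10, 0.25, 0.40, 0.55]
hla_b = [
    {"2705": 0.30, "4402": 0.30, "0702": 0.25, "0801": 0.15},
    {"2705": 0.25, "4402": 0.25, "0702": 0.30, "0801": 0.20},
    {"2705": 0.15, "4402": 0.25, "0702": 0.35, "0801": 0.25},
    {"2705": 0.10, "4402": 0.20, "0702": 0.30, "0801": 0.40},
]
r_obs, p_perm = randomization_test(kir3ds1, hla_b, {"2705", "4402"}, n_perm=200)
print(f"r = {r_obs:.3f}, permutation p = {p_perm:.3f}")
```

Because each relabeling keeps the real population frequencies and only changes which alleles count as ligands, shared demographic history is present in both the observed and the permuted correlations.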
Slide41
Permutation Distribution
KIR3DS1 – HLA-Bw4 correlation
r = -0.632
Permutation p-value = 0.012
Slide42
Genomic Controls
Empirical comparisons based on genomic data or other methods that incorporate information about the demographic histories of populations (Pritchard and Donnelly, 2001).
Our study used data from the ALFRED database to assess statistical significance: http://alfred.med.yale.edu
We selected 538 neutral sites from 202 genes typed in the same individuals.
Slide43
Genomic Data
Slide44
Genomic Data for Empirical Tests
Randomly select two SNP sites from different chromosomes
Find the frequencies in each population and compute the correlation
Repeat
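The three steps above could be sketched as follows. This is a hedged illustration rather than the analysis actually run on the ALFRED data: the input layout (a list of chromosome and per-population frequency pairs), the two-sided comparison, and the function names are all assumptions, and the toy frequencies are invented.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def empirical_null(snps, n_pairs=1000, seed=0):
    """Correlations for random SNP pairs drawn from different chromosomes.

    snps: list of (chromosome, [frequency in each population]) entries.
    """
    if len({chrom for chrom, _ in snps}) < 2:
        raise ValueError("need SNPs on at least two chromosomes")
    rng = random.Random(seed)
    null = []
    while len(null) < n_pairs:
        (chrom1, freqs1), (chrom2, freqs2) = rng.sample(snps, 2)
        if chrom1 != chrom2:          # keep only unlinked pairs
            null.append(pearson_r(freqs1, freqs2))
    return null

def empirical_p_value(observed_r, null):
    """Share of null correlations at least as extreme as the observed one."""
    return sum(abs(r) >= abs(observed_r) for r in null) / len(null)

# Toy data: five SNPs on three chromosomes, four populations each.
toy_snps = [
    ("1", [0.10, 0.20, 0.35, 0.50]),
    ("1", [0.60, 0.55, 0.40, 0.30]),
    ("2", [0.25, 0.20, 0.30, 0.45]),
    ("2", [0.80, 0.70, 0.75, 0.60]),
    ("3", [0.15, 0.40, 0.20, 0.35]),
]
null = empirical_null(toy_snps, n_pairs=100)
print(empirical_p_value(-0.632, null))
```

The design choice here is that unlinked, presumably neutral SNP pairs share the same population histories as the KIR and HLA loci, so their correlations give a baseline for what demography alone can produce.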
Slide45
Genomic Data – Empirical Distribution
KIR3DS1 – HLA-Bw4 correlation
r = -0.632
empirical p-value = 0.041
Slide46
Significance of Correlations *
* Ordinary Pearson p-values (in red) overestimate the significance of trends

locus pair      Correlation   p-value (1)     p-value (2)     p-value (3)
                              (correlation)   (permutation)   (empirical)
3DS1 - Bw4      -0.632        0.000           0.012           0.041
3DS1 - Bw480I   -0.657        0.000           0.009           0.038
3DS1 - Bw480T   -0.190        0.316           0.532           0.534
3DL1 - Bw4       0.426        0.019           0.106           0.218
3DL1 - Bw480I    0.416        0.022           0.115           0.191
3DL1 - Bw480T    0.171        0.367           0.540           0.758
2DS1 - C2       -0.478        0.008           0.243           0.149
2DL1 - C2        0.046        0.810           0.891           0.924
2DL2 - C1       -0.366        0.047           0.193           0.542
2DL3 - C1        0.184        0.331           0.458           0.328
2DS2 - C1       -0.371        0.044           0.170           0.479

(1) P-correlation is the ordinary Pearson product-moment correlation p-value.
(2) P-permutation is based on the permutation distribution under the null hypothesis.
(3) P-empirical is based on the empirical distribution for unlinked SNPs from ALFRED.
Slide47
Outline
Slide48
Acknowledgements
NCI: Mary Carrington, Pat Martin, Gao Xiaojiang
USP: Diogo Meyer, Rodrigo dos Santos Francisco
Yale University: Ken and Judy Kidd
Children's Hospital Oakland Research Inst.: Steven J. Mack, Jill A. Hollenbach
Harvard Medical School: Alex Lancaster
UC San Francisco: Owen Solberg
Roche Molecular Systems: Henry A. Erlich
Anthony Nolan Research Inst.: Steven G.E. Marsh
NCBI/NIH: Mike Feolo
NGIT: Jeff Wiser, Patrick Dunn, Tom Smith
Slide49
If time allows …
Slide50
Linkage Disequilibrium (LD) Measures
The two most common measures of the strength of LD are:
(1) the normalized measure of the individual LD values, namely D'ij = Dij / Dmax (Lewontin 1964); and
(2) the correlation coefficient r for bi-allelic data, which is most often reported as r² = D² / (pA1 pA2 pB1 pB2).
r = 1 only when the allelic variations at the two loci show 100% correlation.
Their multi-allelic extensions are the overall D' and Wn.
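For the bi-allelic case, both measures can be computed from one haplotype frequency and two allele frequencies. The sketch below follows the formulas quoted above; the choice of Dmax by the sign of D is the usual Lewontin convention, and the function name and argument names are assumptions for this example.

```python
def ld_biallelic(p_a1b1, p_a1, p_b1):
    """Return (D, D', r^2) for two bi-allelic loci.

    p_a1b1: frequency of the A1-B1 haplotype
    p_a1:   frequency of allele A1 (so A2 has frequency 1 - p_a1)
    p_b1:   frequency of allele B1 (so B2 has frequency 1 - p_b1)
    Assumes both loci are polymorphic (frequencies strictly between 0 and 1).
    """
    p_a2, p_b2 = 1 - p_a1, 1 - p_b1
    d = p_a1b1 - p_a1 * p_b1
    # Dmax depends on the sign of D (Lewontin's normalization).
    if d >= 0:
        d_max = min(p_a1 * p_b2, p_a2 * p_b1)
    else:
        d_max = min(p_a1 * p_b1, p_a2 * p_b2)
    d_prime = d / d_max if d_max > 0 else 0.0
    r_squared = d * d / (p_a1 * p_a2 * p_b1 * p_b2)
    return d, d_prime, r_squared

# Complete association: A1 always occurs with B1.
print(ld_biallelic(0.5, 0.5, 0.5))
# Partial association with unequal allele frequencies.
print(ld_biallelic(0.2, 0.2, 0.5))
```

The second call illustrates the point made on the slide: D' can reach 1 whenever one haplotype is missing, while r² reaches 1 only when the two loci are perfectly correlated.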
Slide51
Standard LD measures D' and Wn
Standard LD measures (overall D' & Wn) assume/force symmetry, even though with >2 alleles per locus that is not the case.
Data Source: ImmPort Study #SDY26: Identifying polymorphisms associated with risk for the development of myopericarditis following smallpox vaccine
Slide52
Asymmetric Linkage Disequilibrium (ALD)
ALD: row gene conditional on column gene
Interpretation:
ALD for HLA-DRB1 conditioning on HLA-DQA1: W(DRB1/DQA1) = .58
  The overall variation for DRB1 is relatively high given specific DQA1 alleles.
ALD for HLA-DQA1 conditioning on HLA-DRB1: W(DQA1/DRB1) = .95
  The overall variation for DQA1 is relatively low given specific DRB1 alleles.
Thomson and Single, 2014 Genetics
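A sketch of how the two conditional measures could be computed from haplotype frequencies. The slide does not give the formula; the one used here, W(A/B)² = (F(A/B) - F(A)) / (1 - F(A)), with F(A) the homozygosity of locus A and F(A/B) = Σ f(ij)² / p(Bj) the haplotype-specific homozygosity of A given B, is my reading of Thomson and Single (2014) and should be checked against the paper. The function name and toy data are hypothetical.

```python
from math import sqrt

def asymmetric_ld(hap_freqs):
    """Return (W_A/B, W_B/A) from two-locus haplotype frequencies.

    hap_freqs maps (allele_at_A, allele_at_B) -> haplotype frequency,
    with frequencies summing to 1 and both loci polymorphic.
    """
    p_a, p_b = {}, {}
    for (a, b), f in hap_freqs.items():
        p_a[a] = p_a.get(a, 0.0) + f
        p_b[b] = p_b.get(b, 0.0) + f
    f_a = sum(p * p for p in p_a.values())   # homozygosity at A
    f_b = sum(p * p for p in p_b.values())   # homozygosity at B
    if f_a >= 1.0 or f_b >= 1.0:
        raise ValueError("both loci must be polymorphic")
    # Conditional (haplotype-specific) homozygosities.
    f_a_given_b = sum(f * f / p_b[b] for (a, b), f in hap_freqs.items() if f > 0)
    f_b_given_a = sum(f * f / p_a[a] for (a, b), f in hap_freqs.items() if f > 0)
    w_a_given_b = sqrt(max(0.0, f_a_given_b - f_a) / (1 - f_a))
    w_b_given_a = sqrt(max(0.0, f_b_given_a - f_b) / (1 - f_b))
    return w_a_given_b, w_b_given_a

# Toy example: locus A has three alleles, locus B has two,
# and each A allele occurs with exactly one B allele.
toy = {("a1", "b1"): 0.25, ("a2", "b1"): 0.25, ("a3", "b2"): 0.50}
w_ab, w_ba = asymmetric_ld(toy)
print(f"W(A/B) = {w_ab:.3f}, W(B/A) = {w_ba:.3f}")
```

In the toy data, knowing the A allele fixes the B allele, so W(B/A) is 1, while A still varies given B, so W(A/B) is lower. This mirrors the DRB1/DQA1 asymmetry on the slide.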
Slide53
Balancing Selection Operates at Most HLA Loci
Balancing selection can result from:
  - Overdominance/Heterozygote advantage
  - Frequency-dependent selection
  - Selective regimes that change over time/space
For HLA, the common factor in these models is rare allele advantage, which is consistent with a pathogen-directed frequency-dependent selection model.
At the Amino Acid (AA) level we see
  High AA variability at antigen recognition sites (ARS)
  Relatively even AA frequencies at ARS sites
  Higher rates of non-synonymous vs. synonymous changes at ARS
Meyer & Mack, 2008
Slide54
Homozygosity (F) and the Normalized Deviate (Fnd)
Fnd = (F_OBS - F_EQ) / SD(F_EQ)
Neutrality:            F_OBS ≈ F_EQ   Fnd ≈ 0
Directional Selection: F_OBS > F_EQ   Fnd > 0
Balancing Selection:   F_OBS < F_EQ   Fnd < 0
Slide56LD for DRB1 AAsWn
ALD
row gene conditional on column geneAsymmetric LD (ALD)Wn (symmetric)
Slide57
Fnd for DRB1 AA sites (Meta-Analysis)
Fnd for all polymorphic sites in a meta-analysis of 57 populations
Fnd << 0 gives evidence of possible balancing selection.
Fnd >> 0 gives evidence of possible directional selection.
Slide58
Asymmetric Linkage Disequilibrium (ALD)
Thomson and Single (2014) Genetics
Slide59
Asymmetric Linkage Disequilibrium (ALD)
Thomson and Single (2014) Genetics