Having established a list of phenotypes where we have reasonable confidence in the LDSR results, we can now address the question of which phenotypes are significantly heritable. The primary question is how to account for multiple testing across the GWASed phenotypes.
As an initial observation, the distribution of \(h^2_g\) results does not appear fully null within any of the confidence levels. Especially strong results are observed among the high confidence phenotypes.
Note: Expected quantiles are computed within each confidence bin.
Weaker p-values among the lower confidence phenotypes are not surprising given that most phenotypes in those bins have reduced confidence due to smaller sample sizes. This is especially true for the phenotypes designated as “medium” confidence due to potential sex biases or nonlinear ordinal codings, where the potential biases are unlikely to completely remove true signal from their GWAS. Conversely, it is not surprising that there are non-significant results among the high confidence phenotypes since the confidence level is not assigned based on the \(h^2_g\) estimate, only based on expectations about the stability and potential biases in that estimate.
Notably, the distribution of \(h^2_g\) point estimates is similar across the confidence levels, albeit noisier in the low confidence set.
Given the large number of phenotypes, it’s important to account for multiple testing in defining significance for the \(h^2_g\) estimates. Although we might be comfortable with a conventional Bonferroni correction for significance, this is complicated by two considerations:
Focusing on the question of independent tests, we can adopt the method of Li et al. 2011 to estimate the number of effectively independent phenotypes (\(M_{eff}\)) based on the observed correlation between the phenotypes. Specifically, we compute \(M_{eff} = M - \sum_i I(\lambda_i > 1)(\lambda_i-1)\), where \(\lambda_i\) are the eigenvalues of the phenotypic correlation matrix. Thus asymptotically \(M_{eff}=M\) when the phenotypes are independent (i.e. all \(\lambda_i=1\)), and \(M_{eff}\) shrinks in proportion to the amount of redundancy from correlation between phenotypes.
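To make the formula concrete, here is a minimal Python sketch of the \(M_{eff}\) calculation as written above. The function name `effective_number_of_tests` is hypothetical, and the sketch assumes the input is a complete, symmetric correlation matrix; it is an illustration of the formula, not the code used for the reported results.

```python
import numpy as np


def effective_number_of_tests(corr):
    """Return M_eff = M - sum_i I(lambda_i > 1) * (lambda_i - 1).

    `corr` is assumed to be a symmetric M x M correlation matrix.
    """
    corr = np.asarray(corr, dtype=float)
    m = corr.shape[0]
    # eigvalsh is appropriate for symmetric matrices and returns real eigenvalues
    eigvals = np.linalg.eigvalsh(corr)
    # Only eigenvalues above 1 contribute; each contributes its excess over 1
    excess = np.clip(eigvals - 1.0, 0.0, None)
    return m - excess.sum()


# Independent phenotypes: all eigenvalues equal 1, so M_eff = M
m_eff_independent = effective_number_of_tests(np.eye(5))

# Three perfectly correlated phenotypes: eigenvalues (3, 0, 0), so M_eff = 1
m_eff_redundant = effective_number_of_tests(np.ones((3, 3)))
```

The two toy inputs bracket the behaviour described in the text: no shrinkage under independence, and full collapse to a single effective test under perfect correlation.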
We estimate these phenotypic correlations from the UK Biobank GWAS sample (minus a handful of individuals who have withdrawn since the Round 2 GWAS release) after residualizing on the GWAS covariates (\(sex, age, age^2, sex \times age, sex \times age^2\), and 20 PCs) using pairwise complete data. This leaves some phenotypic correlations that either cannot be estimated because the two phenotypes are never measured in the same individual (e.g. sex-specific items across sex, or other conditional dependencies on previous items), or whose estimates are highly unstable because few individuals are observed for both phenotypes. To resolve this, we conservatively set to zero all correlations between pairs of phenotypes where fewer than 1000 individuals are observed for both phenotypes.
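The pairwise-complete correlation step with a minimum-overlap rule can be sketched as follows. This is an assumption-laden illustration: it takes as input a pandas DataFrame of already residualized phenotypes (one column per phenotype, `NaN` where unmeasured), and the function name `pairwise_corr_with_min_overlap` is hypothetical. It relies on the `min_periods` argument of `DataFrame.corr`, which returns `NaN` for pairs with too few jointly observed rows.

```python
import numpy as np
import pandas as pd


def pairwise_corr_with_min_overlap(resid, min_overlap=1000):
    """Pairwise-complete Pearson correlations, zeroed where overlap is too small.

    `resid` is assumed to hold residualized phenotypes, one column each.
    """
    # Pairs with fewer than `min_overlap` jointly observed rows come back as NaN
    corr = resid.corr(method="pearson", min_periods=min_overlap)
    # Treat correlations that cannot be estimated reliably as zero.
    # Note: this also zeroes pairs that are NaN for other reasons
    # (e.g. a constant column).
    corr = corr.fillna(0.0)
    values = corr.to_numpy(copy=True)
    # A phenotype is always perfectly correlated with itself
    np.fill_diagonal(values, 1.0)
    return pd.DataFrame(values, index=corr.index, columns=corr.columns)


# Toy example with a small threshold so the behaviour is visible
rng = np.random.default_rng(0)
a = rng.normal(size=40)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=40),
    # "c" is observed in only 5 rows, below the overlap threshold
    "c": np.where(np.arange(40) < 5, rng.normal(size=40), np.nan),
})
toy_corr = pairwise_corr_with_min_overlap(toy, min_overlap=10)
```

Zeroing these entries pushes the matrix toward independence, which keeps \(M_{eff}\) larger and the resulting correction more conservative.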
This computation of \(M_{eff}\) suggests:
We skip computation of \(M_{eff}\) including phenotypes with no confidence since we generally don’t recommend use of those results.
The above process leaves us with a large number of possible p-value thresholds:
ldsc), treating them as independent

We observe that the differences between most of these options based on (effective) number of tests are fairly marginal. Splitting the phenotypes by confidence level, we see the number of phenotypes surpassing each p-value threshold is quite similar.
Threshold | Low Conf. | Medium Conf. | High Conf. |
---|---|---|---|
\(p<1.02\times 10^{-4}\) | 36 | 122 | 613 |
\(p<6.21\times 10^{-5}\) | 31 | 115 | 605 |
\(p<6.89\times 10^{-5}\) | 31 | 117 | 605 |
\(p<4.25\times 10^{-5}\) | 28 | 115 | 598 |
\(p<4.22\times 10^{-5}\) | 28 | 115 | 598 |
\(p<2.66\times 10^{-5}\) | 25 | 111 | 590 |
\(p<1.2\times 10^{-5}\) | 22 | 108 | 576 |
\(p < 3.17\times 10^{-5}\) | 26 | 112 | 591 |
We choose to focus on reporting the following levels:
Level | Criteria | Description |
---|---|---|
NA | low confidence | not evaluated due to risk of biases/instability |
NonSig | \(p > .05\) | insufficient evidence for \(h^2_g > 0\) |
Nominal | \(p < .05\) | if you only looked at one phenotype… |
z4 | \(p < 3.17 \times 10^{-5}\ (z > 4)\) | Bonferroni sig. for medium/high confidence phenotypes, sufficient for \(r_g\) analysis |
z7 | \(p < 1.28 \times 10^{-12}\ (z > 7)\) | significant enough for stratified LDSR |
We anticipate that these should cover most of the range of interests in using and interpreting the LDSR \(h^2_g\) results. We adopt \(z > 4\) as the primary significance threshold, since it conservatively approximates the Bonferroni thresholds of interest (among medium and high confidence phenotypes) and matches the previously suggested standard for recommending follow-up analyses. This conservative choice does mean that a few phenotypes that would reach significance under one of the other thresholds are omitted, but p-values and results for all phenotypes are reported so other thresholds can be applied by other researchers if desired.
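As an illustration of how the levels in the table relate to z-scores, the sketch below converts a z-score to a p-value and assigns a level. It assumes a one-sided (upper-tail) normal test, since the quoted thresholds \(3.17 \times 10^{-5}\) and \(1.28 \times 10^{-12}\) match the upper-tail probabilities at \(z = 4\) and \(z = 7\). The function names and the string labels for confidence are hypothetical.

```python
import math


def one_sided_p(z):
    """Upper-tail p-value of a standard normal z-score."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def significance_level(z, confidence):
    """Assign a reporting level following the criteria in the table above.

    `confidence` is assumed to be one of "low", "medium", "high".
    """
    if confidence == "low":
        return "NA"  # not evaluated due to risk of biases/instability
    if z > 7:
        return "z7"
    if z > 4:
        return "z4"
    if one_sided_p(z) < 0.05:
        return "Nominal"
    return "NonSig"


p_at_z4 = one_sided_p(4.0)  # approximately 3.17e-5
p_at_z7 = one_sided_p(7.0)  # approximately 1.28e-12
example_level = significance_level(5.2, "medium")  # "z4"
```

The levels are checked from most to least stringent, so each phenotype receives only the highest tier it reaches, which is how the breakdown by level is tabulated.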
The resulting breakdown of phenotypes with significant heritability is:
Confidence | NonSig | Nominal | z4 | z7 | NA |
---|---|---|---|---|---|
low | 0 | 0 | 0 | 0 | 703 |
medium | 126 | 134 | 64 | 48 | 0 |
high | 62 | 152 | 186 | 405 | 0 |
This totals 703 significant phenotypes (\(z > 4\) with medium or high confidence), including 405 highest-tier results (\(z > 7\) with high confidence).
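As a consistency check, the counts in the breakdown table can be summed to recover the totals and the plain Bonferroni thresholds they imply. The sketch below uses only the numbers from that table and \(\alpha = 0.05\); the variable names are illustrative. The resulting values coincide with three rows of the earlier threshold table, which suggests, though the text does not state it, that those rows are uncorrected-for-correlation Bonferroni thresholds for the corresponding phenotype sets.

```python
ALPHA = 0.05

# Counts copied from the breakdown table above
counts = {
    "low": {"NonSig": 0, "Nominal": 0, "z4": 0, "z7": 0, "NA": 703},
    "medium": {"NonSig": 126, "Nominal": 134, "z4": 64, "z7": 48, "NA": 0},
    "high": {"NonSig": 62, "Nominal": 152, "z4": 186, "z7": 405, "NA": 0},
}

# Number of phenotypes in each confidence level
n_by_conf = {conf: sum(levels.values()) for conf, levels in counts.items()}

# Phenotypes reaching z > 4 (levels z4 and z7) with medium or high confidence
n_significant = sum(counts[c]["z4"] + counts[c]["z7"] for c in ("medium", "high"))


def bonferroni(n_tests, alpha=ALPHA):
    """Plain Bonferroni threshold, treating all tests as independent."""
    return alpha / n_tests


thr_high = bonferroni(n_by_conf["high"])
thr_medium_high = bonferroni(n_by_conf["medium"] + n_by_conf["high"])
thr_all = bonferroni(sum(n_by_conf.values()))

print(n_significant)              # 703
print(f"{thr_high:.2e}")          # 6.21e-05
print(f"{thr_medium_high:.2e}")   # 4.25e-05
print(f"{thr_all:.2e}")           # 2.66e-05
```

Since \(3.17 \times 10^{-5} < 4.25 \times 10^{-5}\), the \(z > 4\) cutoff is slightly stricter than the plain Bonferroni threshold for the medium and high confidence phenotypes, consistent with describing it as a conservative approximation.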