Having established a list of phenotypes where we have reasonable confidence in the LDSR results, we can now address the question of which phenotypes are significantly heritable. The primary question is how to account for multiple testing across the GWASed phenotypes.

Distribution of \(h^2_g\) results

As an initial observation, the distribution of \(h^2_g\) results does not appear fully null within any of the confidence levels. Especially strong results are observed among the high confidence phenotypes.

Note: Expected quantiles are computed within each confidence bin.
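As a minimal sketch of what computing expected quantiles within each bin could look like, the following ranks each p-value only against other p-values in the same confidence bin. The function name and the plotting position \((rank - 0.5)/n\) are assumptions for illustration; the convention used for the actual figure is not stated here.

```python
import math
from collections import defaultdict

def expected_neglog10_by_bin(pvalues, bins):
    """Return the expected -log10(p) under the null for each p-value,
    ranking only against other p-values in the same confidence bin."""
    by_bin = defaultdict(list)
    for idx, (p, b) in enumerate(zip(pvalues, bins)):
        by_bin[b].append((p, idx))

    expected = [None] * len(pvalues)
    for members in by_bin.values():
        n = len(members)
        # The smallest p-value in the bin gets rank 1. The plotting position
        # (rank - 0.5) / n is an assumed convention, not taken from the source.
        for rank, (_, idx) in enumerate(sorted(members), start=1):
            expected[idx] = -math.log10((rank - 0.5) / n)
    return expected

# Toy p-values (made up for illustration only).
pvals = [0.001, 0.20, 0.04, 0.50, 0.90, 0.03]
conf = ["high", "high", "high", "low", "low", "low"]
exp = expected_neglog10_by_bin(pvals, conf)
```

Because ranks are assigned per bin, the smallest p-value in each bin is compared against the same expected quantile whenever the bins have equal size, regardless of how strong the results are in the other bins.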

Weaker p-values among the lower confidence phenotypes are not surprising given that most phenotypes in those bins have reduced confidence due to smaller sample sizes. This is especially true for the phenotypes designated as “medium” confidence due to potential sex biases or nonlinear ordinal codings, where the potential biases are unlikely to completely remove true signal from their GWAS. Conversely, it is not surprising that there are non-significant results among the high confidence phenotypes since the confidence level is not assigned based on the \(h^2_g\) estimate, only based on expectations about the stability and potential biases in that estimate.

Note: Range restricted for visibility. Zoom out to see additional low confidence results above and below the plotted region.

Notably, the distribution of \(h^2_g\) point estimates is similar across the confidence levels, albeit noisier in the low confidence set.

Multiple testing correction

Given the large number of phenotypes, it’s important to account for multiple testing in defining significance for the \(h^2_g\) estimates. Although we might be comfortable with a conventional Bonferroni correction for significance, this is complicated by two considerations:

  • Should we test low confidence results? (i.e. should phenotypes denoted as low confidence count towards the number of tests to be corrected for in the Bonferroni adjustment)
  • How should we address correlation between the phenotypes? We know many of the UK Biobank phenotypes are strongly correlated, and the Bonferroni significance threshold will be very conservative if we treat the tests of those phenotypes as independent.

Estimating the effective number of independent tests

Focusing on the question of independent tests, we can adopt the method of Li et al. 2011 to estimate the number of effectively independent phenotypes (\(M_{eff}\)) based on the observed correlation between the phenotypes. Specifically, we compute \(M_{eff} = M - \sum I(\lambda_i > 1)(\lambda_i-1)\) where \(\lambda_i\) are the eigenvalues of the phenotypic correlation matrix. Thus asymptotically \(M_{eff}=M\) when the phenotypes are independent (i.e. all \(\lambda_i=1\)), and \(M_{eff}\) shrinks in proportion to the amount of redundancy from correlation between phenotypes.
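To make the formula concrete, here is a small sketch that applies it to eigenvalues supplied as a list. The function names are hypothetical, and the equicorrelated matrix is used only because its eigenvalues are known in closed form; in practice the eigenvalues would come from an eigendecomposition of the estimated correlation matrix (for example with `numpy.linalg.eigvalsh`).

```python
def effective_tests(eigenvalues):
    """M_eff = M - sum over eigenvalues greater than 1 of (eigenvalue - 1)."""
    m = len(eigenvalues)
    return m - sum(lam - 1.0 for lam in eigenvalues if lam > 1.0)

def equicorrelated_eigenvalues(m, r):
    """Eigenvalues of an m x m correlation matrix whose off-diagonal entries
    all equal r (0 <= r <= 1): one is 1 + (m - 1) * r, the rest are 1 - r."""
    return [1.0 + (m - 1) * r] + [1.0 - r] * (m - 1)

# Ten phenotypes at three levels of shared correlation (illustrative values).
independent = effective_tests(equicorrelated_eigenvalues(10, 0.0))
moderate = effective_tests(equicorrelated_eigenvalues(10, 0.5))
duplicated = effective_tests(equicorrelated_eigenvalues(10, 1.0))
```

With ten independent phenotypes the estimate stays at 10, with pairwise correlation 0.5 it falls to 5.5, and with ten copies of the same phenotype it collapses to 1.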

We estimate these phenotypic correlations from the UK Biobank GWAS sample (minus a handful of individuals who have withdrawn since the Round 2 GWAS release) after residualizing on the GWAS covariates (\(sex, age, age^2, sex \times age, sex \times age^2, 20 PCs\)) using pairwise complete data. This leaves some phenotypic correlations that either cannot be estimated because the two phenotypes are never measured in the same individual (e.g. sex-specific items across sex, or other conditional dependencies on previous items), or whose estimates are highly unstable because few individuals are observed for both phenotypes. To resolve this, we conservatively set to zero all correlations between pairs of phenotypes where fewer than 1000 individuals are observed for both phenotypes.
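The pairwise-complete rule with a minimum overlap can be sketched as follows. This is an illustration only: it assumes the phenotypes have already been residualized on the covariates, represents missing values as `None`, and uses a hypothetical function name. Returning zero for a phenotype that is constant within the overlap is also an assumption.

```python
import math

MIN_OVERLAP = 1000  # minimum number of jointly observed individuals, per the text

def pairwise_correlation(x, y, min_overlap=MIN_OVERLAP):
    """Pearson correlation over individuals observed for both phenotypes.
    Missing values are None. Returns 0.0 when the overlap is too small."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < min_overlap:
        return 0.0  # too few shared individuals: set the correlation to zero
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # undefined correlation; treated as zero here (assumption)
    return sxy / math.sqrt(sxx * syy)

# Toy data: only three individuals are observed for both phenotypes.
x = [1.0, 2.0, 3.0, 4.0, None]
y = [2.0, 4.0, 6.0, None, 10.0]
r_small_cutoff = pairwise_correlation(x, y, min_overlap=3)
r_default_cutoff = pairwise_correlation(x, y)
```

With the cutoff lowered to 3 the toy pair is perfectly correlated, while under the default cutoff of 1000 the same pair is set to zero, which is how sparsely overlapping phenotypes end up being treated as independent when computing \(M_{eff}\).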

This computation of \(M_{eff}\) suggests:

  • 488.7 tests among high confidence phenotypes alone
  • 725.43 tests among medium and high confidence phenotypes combined
  • 1183.62 tests among low, medium, and high confidence phenotypes combined

We skip computation of \(M_{eff}\) including phenotypes with no confidence since we generally don’t recommend use of those results.

Potential thresholds

The above process leaves us with a large number of possible p-value thresholds:

  • \(p<.05\) for nominal significance
  • \(p<1.02\times 10^{-4}\) for the 488.7 effective tests in high confidence phenotypes
  • \(p<6.21\times 10^{-5}\) for the 805 high confidence phenotypes, treating them as independent
  • \(p<6.89\times 10^{-5}\) for the 725.43 effective tests in medium and high confidence phenotypes
  • \(p<4.25\times 10^{-5}\) for the 1177 medium and high confidence phenotypes, treating them as independent
  • \(p<4.22\times 10^{-5}\) for the 1183.62 effective tests in low, medium and high confidence phenotypes
  • \(p<2.66\times 10^{-5}\) for the 1880 low, medium, and high confidence phenotypes, treating them as independent
  • \(p<1.2\times 10^{-5}\) for the 4178 GWASed phenotypes (including those with no confidence for ldsc), treating them as independent
  • \(p < 3.167 \times 10^{-5}\ (z > 4)\) as previously suggested as a rule of thumb for the level of \(h^2_g\) signal necessary to support subsequent LDSR analyses of genetic correlation (\(r_g\); Bulik-Sullivan et al. 2015)
  • \(p < 1.280 \times 10^{-12}\ (z > 7)\) as previously suggested as a threshold for inclusion in stratified LDSR analyses (Finucane et al. 2015)
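The Bonferroni-style thresholds in this list are each \(0.05\) divided by the relevant number of (effective) tests. A short sketch reproducing that arithmetic is below; the dictionary labels are shorthand introduced here, while the test counts are the ones quoted above.

```python
ALPHA = 0.05

# Number of tests for each option: effective (M_eff) or raw phenotype counts.
test_counts = {
    "high, effective": 488.7,
    "high, independent": 805,
    "medium+high, effective": 725.43,
    "medium+high, independent": 1177,
    "low+medium+high, effective": 1183.62,
    "low+medium+high, independent": 1880,
    "all GWASed, independent": 4178,
}

# Bonferroni threshold: alpha divided by the number of tests.
thresholds = {label: ALPHA / m for label, m in test_counts.items()}

for label, p in thresholds.items():
    print(f"{label:30s} p < {p:.2e}")
```

Running this shows how little the threshold moves across options: every value falls between roughly \(1.2\times 10^{-5}\) and \(1.02\times 10^{-4}\), less than one order of magnitude.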

We observe that the differences between most of these options based on the (effective) number of tests are fairly marginal. Splitting the phenotypes by confidence level, we see the number of phenotypes surpassing each p-value threshold is quite similar.

| Threshold | Low Conf. | Medium Conf. | High Conf. |
|---|---|---|---|
| \(p<1.02\times 10^{-4}\) | 36 | 122 | 613 |
| \(p<6.21\times 10^{-5}\) | 31 | 115 | 605 |
| \(p<6.89\times 10^{-5}\) | 31 | 117 | 605 |
| \(p<4.25\times 10^{-5}\) | 28 | 115 | 598 |
| \(p<4.22\times 10^{-5}\) | 28 | 115 | 598 |
| \(p<2.66\times 10^{-5}\) | 25 | 111 | 590 |
| \(p<1.2\times 10^{-5}\) | 22 | 108 | 576 |
| \(p < 3.17\times 10^{-5}\) | 26 | 112 | 591 |

Chosen significance thresholds

We choose to focus on reporting the following levels:

| Level | Criteria | Description |
|---|---|---|
| NA | low confidence | not evaluated due to risk of biases/instability |
| NonSig | \(p > .05\) | insufficient evidence for \(h^2_g > 0\) |
| Nominal | \(p < .05\) | if you only looked at one phenotype… |
| z4 | \(p < 3.17 \times 10^{-5}\ (z > 4)\) | Bonferroni sig. for medium/high confidence phenotypes, sufficient for \(r_g\) analysis |
| z7 | \(p < 1.28 \times 10^{-12}\ (z > 7)\) | significant enough for stratified LDSR |

We anticipate that these should cover most of the range of interests in using and interpreting the LDSR \(h^2_g\) results. We adopt \(z > 4\) as the primary significance threshold, since it conservatively approximates the Bonferroni thresholds of interest (among medium and high confidence phenotypes) and matches the previously suggested standard for recommending follow-up analyses. This conservative choice does mean that a few phenotypes that would reach significance under one of the other thresholds are omitted, but p-values and results for all phenotypes are reported so other thresholds can be applied by other researchers if desired.
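For readers applying these levels to their own results, the mapping from a \(z\)-score to a level might be sketched as below. The function names are hypothetical. The one-sided p-value is an inference from the stated correspondence between \(z > 4\) and \(p < 3.17 \times 10^{-5}\), and treating any rating other than medium or high as not evaluated is an assumption.

```python
import math

def one_sided_p(z):
    """Upper-tail p-value for a standard normal z-score."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def significance_level(z, confidence):
    """Map an h2 z-score and a confidence rating to a reported level."""
    # Only "low" appears as NA in the source; handling every other
    # non-medium/high rating the same way is an assumption of this sketch.
    if confidence not in ("medium", "high"):
        return "NA"
    if z > 7:
        return "z7"
    if z > 4:
        return "z4"
    if one_sided_p(z) < 0.05:
        return "Nominal"
    return "NonSig"

# The z cutoffs correspond to the p-value thresholds quoted in the text.
p_at_z4 = one_sided_p(4.0)
p_at_z7 = one_sided_p(7.0)
```

For example, `significance_level(4.5, "high")` gives `"z4"`, while the same \(z\)-score for a low confidence phenotype gives `"NA"`. Each phenotype receives only its highest level, so a count of \(z > 4\) results needs to include the z7 tier as well.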

Summary of significant results

The resulting breakdown of phenotypes with significant heritability is:

| Confidence | NonSig | Nominal | z4 | z7 | NA |
|---|---|---|---|---|---|
| low | 0 | 0 | 0 | 0 | 703 |
| medium | 126 | 134 | 64 | 48 | 0 |
| high | 62 | 152 | 186 | 405 | 0 |

This totals 703 significant phenotypes (\(z > 4\) with medium or high confidence), including 405 highest-tier results (\(z > 7\) with high confidence).
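As a quick arithmetic check on these totals, the breakdown can be tallied directly. The sketch below copies the counts from the breakdown above and assumes the levels are mutually exclusive, so phenotypes passing \(z > 4\) are the sum of the z4 and z7 columns.

```python
# Counts copied from the breakdown of phenotypes by confidence and level.
counts = {
    "low":    {"NonSig": 0,   "Nominal": 0,   "z4": 0,   "z7": 0,   "NA": 703},
    "medium": {"NonSig": 126, "Nominal": 134, "z4": 64,  "z7": 48,  "NA": 0},
    "high":   {"NonSig": 62,  "Nominal": 152, "z4": 186, "z7": 405, "NA": 0},
}

# Phenotypes passing z > 4 include those that also pass z > 7.
significant = sum(counts[c]["z4"] + counts[c]["z7"] for c in ("medium", "high"))
top_tier = counts["high"]["z7"]

# Row totals give the number of phenotypes at each confidence level.
totals = {c: sum(row.values()) for c, row in counts.items()}
```

The row totals (703 low, 372 medium, 805 high) agree with the phenotype counts used for the thresholds earlier: 805 high confidence, 1177 medium and high combined, and 1880 including low confidence.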