The UKB Round 2 GWAS contains 11685 GWAS of 4236 unique phenotype codes (3011 PHESANT + 559 FinnGen + 633 ICD10 + 31 biomarkers + 2 covariates). For many of these phenotypes, however, there are multiple GWAS, due to:
Our first task for the LD Score regression analyses, then, is to get a “primary” result for each phenotype.
For the biomarker assays, UK Biobank has reported that some samples were unintentionally diluted (pdf) during processing. Although an effort has been made to estimate the dilution fraction and correct assay values accordingly, the estimated dilution fraction has also been reported for potential use in additional modelling. For instance, other initial analyses of the biomarker data have opted to include the dilution fraction as a covariate in regression analyses.
For the Neale Lab GWAS, the analysis was performed both with and without the dilution fraction as a covariate (along with the same GWAS covariates for age, sex, and PCs used for all phenotypes). We evaluate here whether to use the GWAS results with or without the dilution fraction covariate as the primary GWAS for the purposes of the ldsc \(h^2_g\) analyses.
If the dilution fraction covariate controls for substantial noise in the phenotype, we may anticipate stronger \(h^2_g\) results (higher point estimate, more significant) when the covariate is included.
Takeaway: \(h^2_g\) estimate and significance are not meaningfully affected by the addition of the dilution factor covariate.
If the dilution fraction is somehow correlated with the genetic data in a way that would lead to overall inflation of the GWAS results (e.g. some correlation with residual population structure), we may anticipate genome-wide inflation to be evident in the intercept results (higher point estimate, more significant) when the covariate is omitted.
Takeaway: The intercept estimate and significance are not substantially affected by the addition of the dilution factor covariate. If anything, adding the dilution factor covariate reduces the stability of the intercept (lower significance) without affecting the point estimate.
Overall, the inclusion of the dilution factor covariate has minimal impact on the results for the biomarkers. Therefore in the interest of simplicity we treat the GWAS without the dilution factor covariate as the primary analysis for the biomarker phenotypes. (Results for the GWAS with the dilution factor covariate still appear in the complete results file though.) The only remaining variation in covariates across the analyses is that sex is omitted as a covariate in sex-specific GWAS.
The Round 1 GWAS rank-normalized (IRNT) all continuous phenotypes. In Round 2, untransformed copies of the continuous phenotypes were also GWASed for the purposes of evaluating whether rank-normalizing was beneficial. Here we compare the raw and IRNT versions of all of the continuous phenotypes from PHESANT that were GWASed in both_sexes (i.e. that aren’t sex-specific).
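As a rough illustration of what rank-normalizing does, the sketch below implements one common form of the inverse rank-normal transform. The rank offset of 0.5, the averaging of tied ranks, and the function name `irnt` are assumptions made for this example, not a statement of the exact procedure used for the GWAS.

```python
from statistics import NormalDist


def irnt(values, offset=0.5):
    """Inverse rank-normal transform of a list; None entries stay missing.

    Assumption: quantile = (rank - offset) / n with average ranks for ties.
    """
    # Sort the observed values while remembering their original positions.
    observed = sorted((v, i) for i, v in enumerate(values) if v is not None)
    n = len(observed)
    ranks = {}
    start = 0
    while start < n:
        end = start
        # Extend the run over any tied values.
        while end + 1 < n and observed[end + 1][0] == observed[start][0]:
            end += 1
        avg_rank = (start + end) / 2 + 1  # average 1-based rank of the run
        for k in range(start, end + 1):
            ranks[observed[k][1]] = avg_rank
        start = end + 1
    norm = NormalDist()
    return [
        None if v is None else norm.inv_cdf((ranks[i] - offset) / n)
        for i, v in enumerate(values)
    ]


transformed = irnt([3.0, 1.0, None, 2.0])
```

The transformed values depend only on the ordering of the inputs, which is why heavily skewed raw phenotypes and their IRNT versions can give different LDSR results.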
Specifically, we evaluate whether:
We first look at heritability:
Takeaway: \(h^2_g\) results are generally consistent, but with higher \(h^2_g\) for the IRNT versions of each phenotype.
Takeaway: p-values for testing \(h^2_g\) are mostly consistent between scalings, but IRNT does average more significant \(h^2_g\) results, especially among the phenotypes that have high \(h^2_g\). Compared to the observed differences in \(h^2_g\), the moderate change in p-values here reflects that the SEs for \(h^2_g\) are often also nominally larger for IRNT (not shown).
Taken together, these seem to point towards IRNT providing a net benefit to the LDSR \(h^2_g\) results.
Before adopting that conclusion, however, we also look at the results for the intercept term:
Takeaway: Intercepts are maybe slightly larger on average for IRNT versions of each phenotype (especially for intercepts 1.05-1.20; zoom plot out for additional outliers), but the differences are marginal. The largest differences seem to occur among the biomarkers and haematology measures. Comparing these estimates with the mean \(\chi^2\) values, the estimated intercept ratios remain mostly unchanged between the IRNT and raw versions (not shown).
Takeaway: Focusing here on the majority of phenotypes with moderate/nominally significant intercept results (zoom out for p-values out to 1e-450), the p-values are strongly consistent between the IRNT and raw untransformed phenotypes. IRNT intercepts are maybe marginally more significant on average.
One noteworthy outlier is the estimated dilution fraction for the biomarker data (code 30897), which has a much higher intercept in the IRNT version of the GWAS (mean \(\chi^2 = 1.078\), intercept \(= 1.071\), SE \(= 0.0094\), \(p = 1.71\times 10^{-14}\)) than in the GWAS of the raw value (mean \(\chi^2 = 1.025\), intercept \(= 1.03\), SE \(= 0.0087\), \(p = 2.54\times 10^{-4}\)). The reason for this strong difference is unclear, though it may relate to the bimodal, left-skewed distribution of the dilution fraction estimate. The SNP heritability result is reassuringly null in either case (IRNT: \(h^2_g = 4.09\times 10^{-5}\), \(p = 0.493\); raw: \(h^2_g = -9.06\times 10^{-4}\), \(p = 0.664\)), as would be expected for GWAS of an estimate of sample contamination in the lab. As a result we give limited weight to this outlier in evaluating the choice of IRNT vs. raw versions of the GWAS.
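The quoted intercept p-values can be approximately recovered from the estimates and SEs. The sketch below assumes a one-sided normal test of intercept \(> 1\) (an assumption that appears consistent with the numbers above, not something stated in the text) and uses the standard ldsc ratio, (intercept \(- 1\)) / (mean \(\chi^2 - 1\)). Because the inputs are rounded, the outputs only roughly match the reported values.

```python
import math


def one_sided_p(estimate, se, null=0.0):
    """Upper-tail normal p-value for testing estimate > null."""
    z = (estimate - null) / se
    return 0.5 * math.erfc(z / math.sqrt(2))


def intercept_ratio(intercept, mean_chi2):
    """Share of the mean chi-square inflation attributed to the intercept."""
    return (intercept - 1.0) / (mean_chi2 - 1.0)


# Rounded values quoted above for the dilution fraction phenotype (30897).
p_irnt = one_sided_p(1.071, 0.0094, null=1.0)
p_raw = one_sided_p(1.03, 0.0087, null=1.0)
ratio_irnt = intercept_ratio(1.071, 1.078)

print(f"IRNT intercept p ~ {p_irnt:.2e}, raw intercept p ~ {p_raw:.2e}")
print(f"IRNT intercept ratio ~ {ratio_irnt:.2f}")
```

A ratio close to 1, as here, means nearly all of the mean \(\chi^2\) inflation is captured by the intercept, which fits the null \(h^2_g\) result.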
Overall, the results are largely consistent regardless of the choice of IRNT or raw untransformed phenotypes. Since IRNT does appear to provide a marginal boost to the \(h^2_g\) results, especially in terms of significance, we choose to treat the IRNT version as the primary analysis for continuous phenotypes. (Results for the raw, untransformed versions will still appear in the complete results file though.)
During review of the GWAS phenotypes, we identified some instances where a FinnGen code corresponds to a phenotype that is identical to another GWASed FinnGen code and is thus redundant. The most common case is pairs with codes C_* and C3_* with the same name, description, and sample size. For example, both C_OTHER_SKIN and C3_OTHER_SKIN are phenotypes for “Other malignant neoplasms of skin”, and both have 14,402 cases and 346,792 controls. We identify those phenotypes here and mark them as redundant.
We can systematically identify these redundant pairs by confirming that the phenotypic correlation between the codes is 1 and that the phenotype is observed in the same individuals with the same number of cases. From that process, we identify the following phenotype pairs:
For these pairs, we drop the phenotype listed in the second column as redundant. This leads to the removal of 58 FinnGen phenotype codes, leaving 4178 unique phenotype codes among the GWAS.
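The redundancy check described above can be sketched as follows. For 0/1 phenotypes observed in the same individuals, a phenotypic correlation of exactly 1 is equivalent to the two vectors being identical and non-constant, so the sketch tests that directly. The data layout (a dict from sample ID to 0/1, with missing individuals simply absent) and the function name are hypothetical.

```python
def is_redundant_pair(pheno_a, pheno_b):
    """True if two binary phenotypes are the same trait in the same samples.

    pheno_a, pheno_b: dict mapping sample ID -> 0 (control) or 1 (case).
    """
    # Same individuals observed for both codes.
    if pheno_a.keys() != pheno_b.keys():
        return False
    # Same number of cases.
    n_cases = sum(pheno_a.values())
    if n_cases != sum(pheno_b.values()):
        return False
    # Correlation is undefined if there are no cases or no controls.
    if n_cases == 0 or n_cases == len(pheno_a):
        return False
    # Identical case/control status everywhere, i.e. correlation of 1.
    return all(pheno_a[s] == pheno_b[s] for s in pheno_a)


# Toy data, not real UKB records.
code_c = {"s1": 1, "s2": 0, "s3": 0, "s4": 1}
code_c3 = {"s1": 1, "s2": 0, "s3": 0, "s4": 1}
code_other = {"s1": 0, "s2": 1, "s3": 0, "s4": 1}

print(is_redundant_pair(code_c, code_c3))     # True
print(is_redundant_pair(code_c, code_other))  # False
```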
UKB includes a number of sex-specific phenotypes, meaning that not all phenotypes have a both_sexes version of the analysis. In addition, although some of these are specified by UKB and thus coded as such (e.g. all members of the non-applicable sex are marked as missing), others do not exclude non-applicable sex, for example treating them as controls. The Round 2 GWAS included strong efforts to address phenotypes with this issue, particularly among the PHESANT phenotypes, but we still have some GWAS where there’s a both_sexes analysis of a sex-specific phenotype. Therefore the goal here is to verify that we have the appropriate version of each phenotype respecting sex-specificity where applicable.
Our process is as follows:
- If only a male or female version of the analysis exists (with no both_sexes), take that version.
- If both a male and female version of the analysis exist, verify that both sex-specific analyses have a non-trivial proportion of the sample size and of the number of cases. As long as that is true, take the both_sexes analysis, otherwise take the appropriate sex-specific analysis.
- If there is a both_sexes analysis but only one sex-stratified analysis, compare sample size and case count of the sex-specific analysis to the both_sexes analysis. If the sex-specific analysis is the dominant source of samples then use the sex-specific analysis, otherwise use the both_sexes analysis.

We inspect the results of this process to ensure face validity of the concluded sex-specificity of the phenotypes.
[1] "Cancer code, self-reported: breast cancer [20001_1002]"
[2] "Cancer code, self-reported: ovarian cancer [20001_1039]"
[3] "Cancer code, self-reported: uterine/endometrial cancer [20001_1040]"
[4] "Cancer code, self-reported: cervical cancer [20001_1041]"
[5] "Cancer code, self-reported: cin/pre-cancer cells cervix [20001_1072]"
[6] "Non-cancer illness code, self-reported: gestational hypertension/pre-eclampsia [20002_1073]"
[7] "Non-cancer illness code, self-reported: gestational diabetes [20002_1221]"
[8] "Non-cancer illness code, self-reported: gynaecological disorder (not cancer) [20002_1348]"
[9] "Non-cancer illness code, self-reported: ovarian cyst or cysts [20002_1349]"
[10] "Non-cancer illness code, self-reported: polycystic ovaries/polycystic ovarian syndrome [20002_1350]"
[11] "Non-cancer illness code, self-reported: uterine fibroids [20002_1351]"
[12] "Non-cancer illness code, self-reported: uterine polyps [20002_1352]"
[13] "Non-cancer illness code, self-reported: vaginal prolapse/uterine prolapse [20002_1353]"
[14] "Non-cancer illness code, self-reported: breast disease (not cancer) [20002_1364]"
[15] "Non-cancer illness code, self-reported: fibrocystic disease [20002_1366]"
[16] "Non-cancer illness code, self-reported: breast cysts [20002_1367]"
[17] "Non-cancer illness code, self-reported: endometriosis [20002_1402]"
[18] "Non-cancer illness code, self-reported: female infertility [20002_1403]"
[19] "Non-cancer illness code, self-reported: post-natal depression [20002_1531]"
[20] "Non-cancer illness code, self-reported: cervical intra-epithelial neoplasia (cin) / precancerous cells cervix [20002_1554]"
[21] "Non-cancer illness code, self-reported: cervical polyps [20002_1555]"
[22] "Non-cancer illness code, self-reported: menorrhagia (unknown cause) [20002_1556]"
[23] "Non-cancer illness code, self-reported: ectopic pregnancy [20002_1558]"
[24] "Non-cancer illness code, self-reported: miscarriage [20002_1559]"
[25] "Non-cancer illness code, self-reported: breast fibroadenoma [20002_1560]"
[26] "Non-cancer illness code, self-reported: abnormal smear (cervix) [20002_1663]"
[27] "Non-cancer illness code, self-reported: dysmenorrhoea / dysmenorrhea [20002_1664]"
[28] "Non-cancer illness code, self-reported: menopausal symptoms / menopause [20002_1665]"
[29] "Non-cancer illness code, self-reported: benign breast lump [20002_1666]"
[30] "Treatment/medication code: depo-provera 50mg/1ml injection [20003_1140857620]"
[31] "Treatment/medication code: prempak 0.625 tablet [20003_1140857636]"
[32] "Treatment/medication code: tranexamic acid [20003_1140861832]"
[33] "Treatment/medication code: climagest 1mg tablet [20003_1140864196]"
[34] "Treatment/medication code: climaval 1mg tablet [20003_1140868372]"
[35] "Treatment/medication code: premarin 625micrograms tablet [20003_1140868408]"
[36] "Treatment/medication code: hormonin tablet [20003_1140868458]"
[37] "Treatment/medication code: progynova 1mg tablet [20003_1140868460]"
[38] "Treatment/medication code: vagifem 25mcg pessary [20003_1140868472]"
[39] "Treatment/medication code: tibolone [20003_1140868482]"
[40] "Treatment/medication code: nuvelle tablet [20003_1140868518]"
[41] "Treatment/medication code: norethisterone [20003_1140868580]"
[42] "Treatment/medication code: progesterone product [20003_1140868588]"
[43] "Treatment/medication code: ortho-gynest 500micrograms pessary [20003_1140869034]"
[44] "Treatment/medication code: ovestin 0.1% vaginal cream [20003_1140869036]"
[45] "Treatment/medication code: mercilon tablet [20003_1140869164]"
[46] "Treatment/medication code: logynon tablet [20003_1140869176]"
[47] "Treatment/medication code: microgynon 30 tablet [20003_1140869180]"
[48] "Treatment/medication code: micronor tablet [20003_1140869276]"
[49] "Treatment/medication code: noriday tablet [20003_1140869278]"
[50] "Treatment/medication code: loestrin 20 tablet [20003_1140869324]"
[51] "Treatment/medication code: cilest tablet [20003_1140869346]"
[52] "Treatment/medication code: femulen tablet [20003_1140869362]"
[53] "Treatment/medication code: norgeston tablet [20003_1140869370]"
[54] "Treatment/medication code: tamoxifen [20003_1140870164]"
[55] "Treatment/medication code: livial 2.5mg tablet [20003_1140882946]"
[56] "Treatment/medication code: oestrogen product [20003_1140884622]"
[57] "Treatment/medication code: starflower oil [20003_1140911680]"
[58] "Treatment/medication code: menophase tablet [20003_1140912212]"
[59] "Treatment/medication code: evorel 25 patch [20003_1140916790]"
[60] "Treatment/medication code: kliofem tablet [20003_1140917056]"
[61] "Treatment/medication code: mirena 52mg intrauterine system [20003_1140921814]"
[62] "Treatment/medication code: mirena 20mcg/24hrs intrauterine system [20003_1140921822]"
[63] "Treatment/medication code: femoston 1/10 tablet [20003_1140922562]"
[64] "Treatment/medication code: premique 0.625mg/5mg tablet [20003_1140922804]"
[65] "Treatment/medication code: premique cycle 10mg tablet [20003_1140922806]"
[66] "Treatment/medication code: anastrozole [20003_1140923018]"
[67] "Treatment/medication code: arimidex 1mg tablet [20003_1140923022]"
[68] "Treatment/medication code: femseven 50 patch [20003_1140923738]"
[69] "Treatment/medication code: elleste-solo 1mg tablet [20003_1140923852]"
[70] "Treatment/medication code: climesse tablet [20003_1140926430]"
[71] "Treatment/medication code: estraderm mx 25 patch [20003_1140926592]"
[72] "Treatment/medication code: zumenon 1mg tablet [20003_1140928878]"
[73] "Treatment/medication code: letrozole [20003_1141145896]"
[74] "Treatment/medication code: evorel conti patch [20003_1141151718]"
[75] "Treatment/medication code: elleste duet conti tablet [20003_1141156644]"
[76] "Treatment/medication code: implanon 68mg subdermal implant [20003_1141166200]"
[77] "Treatment/medication code: kliovance 1mg/0.5mg tablet [20003_1141168326]"
[78] "Treatment/medication code: raloxifene hydrochloride [20003_1141168574]"
[79] "Treatment/medication code: evista 60mg tablet [20003_1141168578]"
[80] "Treatment/medication code: exemestane [20003_1141171100]"
[81] "Treatment/medication code: indivina 1mg/2.5mg tablet [20003_1141172436]"
[82] "Treatment/medication code: estriol product [20003_1141181594]"
[83] "Treatment/medication code: estradiol product [20003_1141181700]"
[84] "Treatment/medication code: cerazette 75micrograms tablet [20003_1141182800]"
[85] "Ever had breast cancer screening / mammogram [2674]"
[86] "Ever had cervical smear test [2694]"
[87] "Number of live births [2734]"
[88] "Ever had stillbirth, spontaneous miscarriage or termination [2774]"
[89] "Ever taken oral contraceptive pill [2784]"
[90] "Hospital episode type: Delivery episode [41231_2]"
[91] "Destinations on discharge from hospital (recoded): Transfer to other NHS provider: Obstetrics [41248_5002]"
[1] "Cancer code, self-reported: prostate cancer [20001_1044]"
[2] "Cancer code, self-reported: testicular cancer [20001_1045]"
[3] "Non-cancer illness code, self-reported: prostate problem (not cancer) [20002_1207]"
[4] "Non-cancer illness code, self-reported: testicular problems (not cancer) [20002_1214]"
[5] "Non-cancer illness code, self-reported: enlarged prostate [20002_1396]"
[6] "Non-cancer illness code, self-reported: bph / benign prostatic hypertrophy [20002_1516]"
[7] "Non-cancer illness code, self-reported: prostatitis [20002_1517]"
[8] "Non-cancer illness code, self-reported: erectile dysfunction / impotence [20002_1518]"
[9] "Non-cancer illness code, self-reported: undescended testicle [20002_1679]"
[10] "Treatment/medication code: cardura 1mg tablet [20003_1140860690]"
[11] "Treatment/medication code: xatral 2.5mg tablet [20003_1140864472]"
[12] "Treatment/medication code: testosterone product [20003_1140868532]"
[13] "Treatment/medication code: finasteride [20003_1140868550]"
[14] "Treatment/medication code: zoladex 3.6mg implant [20003_1140870196]"
[15] "Treatment/medication code: alfuzosin [20003_1140879774]"
[16] "Treatment/medication code: tamsulosin [20003_1140926934]"
[17] "Treatment/medication code: flomax mr 400micrograms m/r capsule [20003_1140926940]"
[18] "Treatment/medication code: sildenafil [20003_1141168936]"
[19] "Treatment/medication code: viagra 25mg tablet [20003_1141168944]"
[20] "Treatment/medication code: viagra 50mg tablet [20003_1141168946]"
[21] "Treatment/medication code: viagra 100mg tablet [20003_1141168948]"
[22] "Treatment/medication code: tadalafil [20003_1141187810]"
[23] "Treatment/medication code: cialis 10mg tablet [20003_1141187814]"
[24] "Treatment/medication code: cialis 20mg tablet [20003_1141187818]"
[25] "Treatment/medication code: dutasteride [20003_1141192000]"
[26] "Treatment/medication code: vardenafil [20003_1141192248]"
[27] "Treatment/medication code: testogel 50mg gel 5g sachet [20003_1141193272]"
[28] "Treatment/medication code: saw palmetto product [20003_1205]"
[29] "Underlying (primary) cause of death: ICD10: C61 Malignant neoplasm of prostate [40001_C61]"
Takeaway: Very strong face validity for each of these lists.
Take these lists as-is.
Takeaway: Sample size is nicely divided male/female for this set of phenotypes.
Takeaway: Cases and controls are also nicely balanced between males and females for most of these phenotypes.
We can examine more closely the phenotypes in the tails of these distributions, with \(>85\%\) of cases or controls coming from one sex:
Takeaway: We note that although these are strongly sex-biased, they are not necessarily sex-specific. For example, this list contains a large number of codes for jobs with a history of being strongly gendered (e.g. nursing) and treatments most commonly recommended for strongly sex-biased phenotypes (e.g. calcium supplements for osteoporosis).
[NB: The majority of these phenotypes will also end up below our effective sample size threshold for high confidence results, as described below, and thus will not be a primary contributor to the top level heritability results regardless of the treatment of these phenotypes here.]
Although a handful of these phenotypes are strongly sex-biased, none appear to be fully sex-specific. Therefore we take the both_sexes GWAS as the primary result for all phenotypes in this set.
both_sexes and one sex

There are 860 phenotypes with a GWAS in both_sexes and one sex but not the other (specifically, 389 in males and 471 in females).
For all phenotypes in this category, we evaluate what proportion of their total sample size comes from the GWASed sex.
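A minimal sketch of this proportion check is given below. The phenotype codes and sample sizes are invented for illustration, and the helper names are hypothetical; the \(85\%\) cutoff is the one used in this section to pick out visible outliers.

```python
def gwased_sex_fraction(n_single_sex, n_both_sexes):
    """Fraction of the both_sexes sample that belongs to the GWASed sex."""
    if n_both_sexes <= 0:
        raise ValueError("both_sexes sample size must be positive")
    return n_single_sex / n_both_sexes


def flag_outliers(sample_sizes, threshold=0.85):
    """Return the codes whose GWASed-sex fraction exceeds the threshold."""
    return sorted(
        code
        for code, (n_sex, n_both) in sample_sizes.items()
        if gwased_sex_fraction(n_sex, n_both) > threshold
    )


# Hypothetical sample sizes: code -> (n in GWASed sex, n in both_sexes).
toy = {
    "pheno_A": (1000, 1002),   # all but 2 individuals in one sex
    "pheno_B": (600, 1000),    # fairly balanced
    "pheno_C": (9000, 10000),  # strongly sex-biased
}
flagged = flag_outliers(toy)
print(flagged)
```

The same function applies to case counts or control counts in place of total sample size.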
The visible outliers from this distribution (\(>85\%\) of samples coming from the GWASed sex) are:
Takeaway: The majority of these outlying phenotypes clearly either are sex-specific (and just have a redundant both_sexes GWAS) or should be (i.e. the collection of traits here where only 2 individuals aren’t in the primary sex). Note that this includes items that aren’t on their face sex-specific but were asked as part of a sex-specific questionnaire (e.g. 6153 and 6177, administered to self-reported females and males, respectively).
The one exception is 5959 (“Previously smoked cigarettes on most/all days”) which is predominantly male (\(96.6\%\)) but not exclusively gendered. The strong sex bias comes from it being a follow-up question only administered to individuals who report currently smoking “on most or all days” and who “mainly” smoke “cigars or pipes”.
For binary phenotypes in this class (excluding the ones clearly indicated as sex-specific above), we additionally look at the proportion of cases and controls coming from the GWASed sex:
The outliers from these two distributions, again using the \(>85\%\) threshold, are:
Takeaway: This list of phenotypes with cases and/or controls primarily from one sex is dominated by job codes and medical outcomes that are either strongly sex-biased or (in the case of some medical phenotypes, e.g. pregnancy-related ICD codes) sex-specific. Considering the \(97\%\) threshold used to evaluate total sample size above, none of the phenotypes with \(85\%\) to \(97\%\) of cases or controls from a single sex are definitionally sex-specific; instead they are all instances where cases/controls from the rarer sex are entirely plausible. The both_sexes analysis of these phenotypes therefore seems appropriate to treat as the “primary” analysis (though we revisit expectations about the stability of these results below).
For the remaining phenotypes with \(>97\%\) of cases and/or controls coming from the GWASed sex (relisted below), we see three clear scenarios:
Following the standard applied above, we opt to use the sex-specific GWAS as the primary result for the truly sex-specific medical phenotypes, but keep the both_sexes analysis for the other two scenarios where cases/controls from both sexes are reasonable despite their rarity.
For all phenotypes in this set (i.e. having GWAS in both_sexes and one sex, but missing for the other sex), if \(<97\%\) of the total sample size and of the cases and controls (where applicable) come from a single sex then we treat the both_sexes GWAS as the “primary” analysis of that phenotype. If \(>97\%\) of the total sample size comes from a single sex then we adopt the sex-specific result. Otherwise, if \(>97\%\) of cases and/or controls come from a single sex and the phenotype is by definition expected to be sex-specific (e.g. many ICD codes, but excluding job codes) then we adopt the sex-specific GWAS as the primary GWAS; if not, we use the both_sexes GWAS.
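The decision rule in the preceding paragraph can be written as a small function. This is a sketch under stated assumptions: the function and argument names are hypothetical, the flag for "sex-specific by definition" stands in for a manual judgment, and both cutoffs are left as parameters (defaulting to the \(97\%\) stated in that paragraph) since the case/control cutoff in particular may be set more strictly.

```python
def choose_primary(frac_n, frac_cases=None, frac_controls=None,
                   sex_specific_by_definition=False,
                   n_thresh=0.97, cc_thresh=0.97):
    """Pick 'single_sex' or 'both_sexes' as the primary GWAS.

    frac_*: fraction of the total sample size, cases, or controls that
    comes from the GWASed sex; case/control fractions are None for
    non-binary phenotypes.
    """
    # Rule 1: nearly the whole sample comes from one sex.
    if frac_n > n_thresh:
        return "single_sex"
    # Rule 2: cases and/or controls nearly all from one sex AND the
    # phenotype is sex-specific by definition (job codes never qualify).
    fractions = [f for f in (frac_cases, frac_controls) if f is not None]
    if sex_specific_by_definition and any(f > cc_thresh for f in fractions):
        return "single_sex"
    # Otherwise the both_sexes GWAS is the primary result.
    return "both_sexes"


print(choose_primary(0.998))                                    # single_sex
print(choose_primary(0.55, frac_cases=0.999,
                     sex_specific_by_definition=True))          # single_sex
print(choose_primary(0.55, frac_cases=0.999))                   # both_sexes
print(choose_primary(0.55, frac_cases=0.90, frac_controls=0.5)) # both_sexes
```

Keeping the definitional judgment as an explicit input makes clear that the thresholds alone do not decide the strongly sex-biased but not sex-specific cases.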
To summarize, after excluding biomarker GWAS with the dilution factor covariate, raw versions of continuous phenotypes, and the redundant FinnGen codes, there are 10466 total GWAS.
##
##              both_sexes female male
##   biomarkers         31     31   31
##   covariate           2      1    1
##   finngen           501    369  356
##   icd10             633    482  439
##   phesant          2891   2393 2305
Following the above process of looking at sex balance of the available GWAS, we keep as the primary GWAS for the 4178 phenotypes:
- both_sexes GWAS for phenotypes with both male and female GWAS (2714 phenotypes)
- both_sexes GWAS for phenotypes with neither male nor female GWAS, i.e. due to low sample size or case counts (484 phenotypes)
- single-sex GWAS for phenotypes with no both_sexes GWAS (120 phenotypes)
- single-sex GWAS available for phenotypes where \(>97\%\) of the total sample size comes from a single sex (51 phenotypes)
- single-sex GWAS available for phenotypes where \(>99.7\%\) of the total cases or controls come from a single sex (which is sufficient to distinguish sex-specific vs. strongly sex-biased biomedical phenotypes), excluding job codes (96 phenotypes)
- both_sexes GWAS for all remaining phenotypes with only one single-sex GWAS, i.e. job codes and phenotypes where \(<97\%\) of the total samples and \(<99.7\%\) of the total cases and controls are from a single sex (713 phenotypes)
The resulting breakdown of the sexes used as the primary GWAS for the 4178 unique phenotypes is:
##
##              both_sexes female male
##   biomarkers         31      0    0
##   covariate           2      0    0
##   finngen           486     12    3
##   icd10             553     68   12
##   phesant          2839    130   42
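As a bookkeeping check, the counts reported in this section can be reconciled with a few lines of arithmetic. The sketch below only re-uses numbers quoted in the text and tables above; the variable names are arbitrary.

```python
# Primary GWAS per category: (both_sexes, female, male), from the table above.
primary = {
    "biomarkers": (31, 0, 0),
    "covariate": (2, 0, 0),
    "finngen": (486, 12, 3),
    "icd10": (553, 68, 12),
    "phesant": (2839, 130, 42),
}

# Phenotype counts per selection rule, from the list above.
rule_counts = {
    "both_sexes": [2714, 484, 713],
    "single_sex": [120, 51, 96],
}

column_totals = [sum(col) for col in zip(*primary.values())]
total_primary = sum(column_totals)
unique_after_finngen_drop = 4236 - 58  # initial codes minus redundant FinnGen

print("both_sexes / female / male:", column_totals)
print("total primary GWAS:", total_primary)
print("both_sexes by rule:", sum(rule_counts["both_sexes"]))
print("single-sex by rule:", sum(rule_counts["single_sex"]))
```

The table totals, the per-rule counts, and the number of unique codes left after dropping the redundant FinnGen phenotypes all agree at 4178.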
The above process leads us to a “primary” \(h^2_g\) result for each of the 4178 unique phenotypes in the Round 2 GWAS. We then work from this set of results to assess our confidence in each LDSR result and its statistical significance.