The relevance of risk exposures for incident disease typically differs, both qualitatively and quantitatively, by disease type (e.g. ischaemic stroke [IS] vs intracerebral haemorrhage [ICH]) and disease subtype (e.g. lacunar vs non-lacunar IS) (Lancet Glob Health. 2018, Nat Med. 2019, Int J Cancer. 2008). The value of prospective biobanks, such as the China Kadoorie Biobank (CKB) study, therefore depends on detailed assessment of exposures in a large number of study participants and on the collection and validation of a wide range of fatal and non-fatal disease outcomes.
CKB has developed integrated strategies to ensure completeness of follow-up and reliable characterisation of disease types and relevant subtypes beyond those recorded by routine disease registers.
Following completion of the baseline survey, we implemented secure electronic linkage, via participants’ unique national ID numbers, with established mortality and morbidity registers for stroke, ischaemic heart disease (IHD), cancer and diabetes, and with the national health insurance claims system for any episodes of hospitalisation. All reported disease outcomes, usually coded using ICD-10 (the 10th revision of the International Classification of Diseases, developed by the World Health Organization), were collected every six months by local project staff from the relevant government agencies. These disease outcomes were processed centrally, checked, standardised and integrated into a central database before being released to researchers. By 1 January 2021, about 60% of study participants had been hospitalised at least once, yielding a total of 1.5 million episodes of hospitalisation due to any cause. About 12% (more than 60,000) had died, and only 1% were lost to follow-up.
Because most disease outcomes were reported by the attending hospitals, the diagnoses were expected to be accurate and would not ordinarily require further verification. Nevertheless, it was considered prudent to conduct independent validation in a random sample of reported incident cases, in order to verify the quality of reported diagnoses for major diseases of public health significance that are relevant to research priorities.
Several validation studies have been conducted in CKB, each involving retrieval and independent review of medical records for about 1000 randomly selected cases (100 from each of the 10 study areas) for each of stroke, IHD, cancer, diabetes, chronic obstructive pulmonary disease (COPD) and chronic kidney disease (CKD). For diabetes and COPD, the reliability of reported diagnoses was high (Int J Chron Obstruct Pulmon Dis. 2016) and, as there were no major aetiological subtypes that warranted further characterisation, no additional verification or adjudication was required. In contrast, stroke, IHD, cancer and CKD each have a heterogeneous aetiology, and disease subtypes are typically reported incompletely (e.g. IS recorded without distinguishing lacunar from non-lacunar subtypes). Therefore, additional information was collected from medical records in order to reliably phenotype these major diseases into their relevant aetiological subtypes.
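The sampling design described above (a fixed number of randomly selected cases from each study area) can be illustrated with a minimal sketch. This is not the CKB implementation: the function name, the `(case_id, area)` record layout and the synthetic case list are all assumptions made purely for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(cases, per_area=100, seed=0):
    """Draw up to `per_area` cases at random from each study area.

    `cases` is an iterable of (case_id, area) pairs. Both field names are
    hypothetical and used only for this illustration.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_area = defaultdict(list)
    for case_id, area in cases:
        by_area[area].append(case_id)

    sample = {}
    for area in sorted(by_area):
        ids = by_area[area]
        k = min(per_area, len(ids))  # small areas contribute all their cases
        sample[area] = rng.sample(ids, k)  # sampling without replacement
    return sample


# Synthetic example (not CKB data): 10 areas with 250 reported cases each.
cases = [(f"case-{a}-{i}", f"area-{a}") for a in range(10) for i in range(250)]
sample = stratified_sample(cases)
total_selected = sum(len(ids) for ids in sample.values())
print(total_selected)
```

Sampling the same number of cases from every area, rather than sampling from the pooled case list, ensures that each study area is represented in the validation sample regardless of how many cases it reported.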
CKB developed detailed procedures and bespoke IT systems to manage disease outcome verification and adjudication, including selection and generation of participant and hospital lists, automated data transfer and task allocation, and collection of relevant information from medical records (e.g. discharge summaries, results of diagnostic tests and use of medication) (Population Biobank Studies: A practical guide. 2021). For cancer, in addition to assessing the accuracy of reported diagnoses, the portable verification device (PVD) system also collected information that was unavailable from routine sources of follow-up, including tumour histology, tumour stage and grade, and results of cancer biomarkers (e.g. oestrogen receptor status for breast cancer).
By early 2022, medical records from over 113,000 hospitalised participants had been retrieved using the portable verification device (PVD) system, including over 47,000 stroke, 39,000 IHD, 21,000 cancer and 7200 CKD cases. Overall, the reporting accuracy was 93%, 88%, 94% and 86% for stroke, IHD, cancer and CKD, respectively. For different stroke types, the overall reporting accuracy was 79%, 98% and 98% for IS, ICH and sub-arachnoid haemorrhage (SAH) cases, respectively. Independent clinical adjudication of retrieved medical records using i-CASE has been completed for over 36,000 stroke cases (and over 28,000 IHD cases). Overall, the diagnostic accuracy was 98% for both ICH and SAH, but only 79% for IS, owing to a high proportion of imaging-detected cerebral infarcts without typical focal neurological deficits (i.e. silent cerebral infarcts) (Lancet Reg Health West Pac. 2021). When silent cerebral infarcts were counted as IS, as advocated by the recent ICD-11 criteria, the diagnostic accuracy for IS increased to 93% (Lancet Reg Health West Pac. 2022). For ICH, 35% of cases were lobar and 65% were non-lobar by MRI, while for IS, 14% were lacunar and 86% were non-lacunar; 2% were cardio-embolic and 24% had unknown aetiology. Among cases of IS with unknown aetiology after adjudication, probability-based machine learning approaches were used to determine the likely aetiology according to baseline risk factors (Stroke Vasc Neurol. 2022; In press). Overall, two-thirds of these cases could be considered as small artery occlusion, one-third as large artery atherosclerosis and 2% as having a cardio-embolic aetiology (Stroke. 1993, Stroke. 2014). Such detailed stroke phenotyping will greatly facilitate novel discoveries in genetic and non-genetic studies (Lancet Glob Health. 2020).
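The effect of the case definition on the diagnostic accuracy of IS can be illustrated with a short worked example. The sketch below is not CKB code and uses no CKB data: the label names are hypothetical, and the synthetic counts were chosen only so that the two results mirror the percentages quoted above.

```python
from collections import Counter


def reporting_accuracy(adjudicated, accepted):
    """Share of reported cases whose adjudicated label is in `accepted`."""
    counts = Counter(adjudicated)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no adjudicated cases")
    confirmed = sum(counts[label] for label in accepted)
    return confirmed / total


# Synthetic adjudication results for 1000 reported IS cases (not CKB data).
# The label names are hypothetical.
adjudicated = (
    ["symptomatic_IS"] * 790    # infarct with focal neurological deficit
    + ["silent_infarct"] * 140  # imaging-detected infarct, no typical deficit
    + ["not_IS"] * 70           # reported diagnosis not confirmed
)

# Strict definition: only symptomatic infarcts count as confirmed IS.
strict = reporting_accuracy(adjudicated, {"symptomatic_IS"})
# Broader definition: silent cerebral infarcts are also counted as IS.
broad = reporting_accuracy(adjudicated, {"symptomatic_IS", "silent_infarct"})

print(f"strict: {strict:.0%}, broad: {broad:.0%}")
```

The denominator (all reported IS cases) is the same in both calculations; only the set of adjudicated labels accepted as confirming the reported diagnosis changes.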
Ongoing work includes adjudication of stroke, IHD and CKD outcomes by clinical specialists using the i-CASE outcome adjudication procedures. Planned future work will include verification and phenotyping of additional diseases, including heart failure, fatty liver disease, autoimmune disease and neurodegenerative disease. This work should facilitate further discoveries about the causes and consequences of these additional diseases in CKB.