Sat Aug 28 2021
In previous posts (Part 1, Part 2) we’ve covered the challenges that Wide Data, Big Data’s more complex cousin, presents to data scientists. We’ve also discussed mass sequencing as an attempt to make data bigger and less wide. While increasing sample size is helpful in general, we can only increase it linearly, whereas our feature space grows exponentially as we strive to discover even mildly complex patterns. Thus, ultimately, we need algorithmic solutions to solve Wide Data. In this post and the next, we examine the industry standard for the analysis of Wide Genomic Data, Polygenic Risk Scores, as well as a deep learning solution.
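To get a sense of the mismatch between linear sample growth and the size of the feature space, here is a quick back-of-the-envelope sketch (the variant count is illustrative):

```python
from math import comb

# Illustrative: with p genetic variants, the number of candidate
# k-way interaction features is C(p, k), which explodes with k.
p = 1_000_000  # e.g., roughly one million common SNPs

for k in (1, 2, 3):
    print(f"{k}-way interactions: {comb(p, k):,}")
# 1-way interactions: 1,000,000
# 2-way interactions: ~5 x 10^11
# 3-way interactions: ~1.7 x 10^17
```

No cohort we can realistically sequence keeps pace with numbers like these, which is why adding samples alone cannot solve Wide Data.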
Polygenic Risk Scores (PRS) promise that “scientists can capture in a single measurement how millions of sites across the genome can impact one patient’s health”. In essence, a PRS is a weighted sum over a patient’s genetic variants: each variant that bears influence on a given disease contributes its risk-allele count, weighted by its estimated effect size. Patients with a high risk score are more likely to develop the disease.
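To make that concrete, here is a minimal sketch of the computation (names and numbers are illustrative; in practice the weights come from GWAS summary statistics, and the variant set is typically pruned and thresholded first):

```python
import numpy as np

def polygenic_risk_scores(genotypes: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """One weighted-sum risk score per patient.

    genotypes: (n_patients, n_variants) risk-allele dosages in {0, 1, 2}
    weights:   (n_variants,) per-variant effect sizes, e.g. log odds ratios
    """
    return genotypes @ weights

# Toy example: 3 patients, 4 variants (hypothetical effect sizes).
genotypes = np.array([
    [0, 1, 2, 0],
    [1, 1, 0, 2],
    [2, 0, 1, 1],
])
weights = np.array([0.12, -0.05, 0.30, 0.08])

print(polygenic_risk_scores(genotypes, weights))
# Higher score -> higher estimated genetic risk.
```

Note that the model is strictly additive: a plain dot product with no interaction terms, which is part of why PRS scale to millions of variants in the first place.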
Even though PRS are possibly the most popular approach to dealing with Wide Genomic Data, they do little to address the obvious concerns discussed in previous posts.
Given that PRS raise a variety of statistical concerns and have limited predictive performance, why are they so popular?
This might be a question for social scientists to explore. Perhaps medical practitioners, abundantly cautious about making predictions in the face of uncertainty, gravitate toward metrics that bear their probabilistic nature in full view. I would be very interested to hear your thoughts on the matter. What accounts for the popularity of PRS, in your view? Do you use them yourself, and if so, why? Please let us know in the comments!