Genomics data analysis in Julia
1. Genomics data analysis in Julia
Jiahao Chen
Andreas Noack
w/ Stavros Papadopoulos (Intel)
Nikos Patsopoulos (Broad & BWH)
2. The Julia Labs at MIT
Vertically integrated: PL theory, compilers, numerics, and data science
2016 UROPs: Jacob Higgins, Mark Wang, Yingbo Ma (Lexington H.S.)
2016 GSoC students: Joseph Obiajulu, Juan López (+10 others)
726 open source developers as of Mar ’16
[Photo collage of lab members and collaborators: Alan, Andreas, Xianyi, Jarrett, David (UNAM), Jiahao, Stavros, Nikos, Arch, Lindsey, Ehsan, Tim, Eka, Valentin, Tim, David, Jake (now USAP), Jan, Jeremy, Vijay, Isaac, Steven, Simon, John, Yee Sian, Joey, Miles, Iain, Juan Pablo, Hua, Pete]
3. Publications (2015-16)
• V. Gadepally, J. Bolewski, D. Hook, D. Hutchison, B. Miller, J. Kepner, Graphulo: Linear Algebra Graph Kernels for NoSQL Databases, IPDPS 2015.
• J. Chen and W. Zhang, The right way to search evolving graphs, GABB 2016 (an IPDPS workshop).
• A. Chen, A. Edelman, J. Kepner, V. Gadepally, and D. Hutchison, Julia Implementation of the Dynamic Distributed Dimensional Data Model, IEEE HPEC 2016.
• J. Chen, A. Noack, A. Edelman, Fast computation of the principal components of genotype matrices in Julia, SIAM J. Sci. Comput., submitted.
• J. Chen and J. Revels, Robust benchmarking in noisy environments, IEEE HPEC 2016.
• J. Chen, The principal components of programming and why they matter for statistical genomics, Proc. JSM 2016, submitted.
4. Genome-wide association studies (GWAS)
a greatly oversimplified description
The dream of personalized medicine/precision medicine:
tailor treatments of disease to individual sensitivities to treatment,
allergies, or other genetic predispositions
[Figure: GWAS results from the International Multiple Sclerosis Genetics Consortium (IMSGC) and Wellcome Trust Case Control Consortium 2, Nature, 2011, doi:10.1038/nature10251]
Aims: identifying subpopulations, correlating disease with mutations, gaining insight into the biochemistry, and isolating risk factors.
5. Genome-wide association studies (GWAS)
a greatly oversimplified description
Regress comorbidities (“outcomes”) against single nucleotide polymorphisms (SNPs, or “gene data”): a matrix with one row per patient, whose entries (0, 1, 2, or ?) count how many mutations a patient carries relative to a reference genome.
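The regression step described above can be sketched as a toy in Julia. Everything here is illustrative, not from the talk: function name, data, and dimensions are made up, and a real GWAS would also adjust for covariates and principal components.

```julia
# Toy sketch of the GWAS regression step: for each SNP, regress the outcome
# on an intercept plus that SNP's mutation counts. Entirely illustrative --
# real analyses also adjust for covariates and top principal components.
function snp_effects(G::AbstractMatrix, y::AbstractVector)
    n, p = size(G)                # n patients, p SNPs
    betas = zeros(p)
    for j in 1:p
        X = [ones(n) G[:, j]]     # intercept + genotype counts (0/1/2)
        betas[j] = (X \ y)[2]     # least-squares effect size for SNP j
    end
    betas
end

G = [0 1; 2 0; 1 1; 0 2]          # toy 4-patient x 2-SNP genotype matrix
y = [0.1, 2.0, 1.1, 0.2]          # toy outcome ("comorbidity")
b = snp_effects(G, y)
```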
6. Genome-wide association studies (GWAS)
a greatly oversimplified description
Sources of unwanted variation, in linear algebra terms:
- Population stratification (Asians have black eyes, Scandinavians have blond hair, …): low-rank structure with large variance
- Kinship (blood relatives have highly correlated genomes): sparse, possibly full-rank structure with small variance
- Linkage disequilibrium (long-range correlations): removed by preprocessing (we won’t consider it here)
- …
7. The slowest part of the GWAS pipeline
Pipeline: read data from flat files → form data matrix → impute missing data → find largest principal components → regress against comorbidities and PCs.
Reading in an 80k × 40k matrix:
- Raw I/O of 32 GB at 500 MB/sec: 0.6 sec
- PLINK format decoding: 20 sec
- Computing top 10 principal components with FlashPCA (mostly matvecs): 2,933 sec (bottleneck)
8. Many algorithms for PCA
One algorithm commonly used in genomics is FlashPCA, a randomized SVD (subspace iteration) method.
https://github.com/gabraham/flashpca
Research question: is this the best available
algorithm for PCA on genomics data?
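The subspace-iteration family that FlashPCA belongs to can be sketched in a few lines of Julia. This is a toy under my own naming and defaults, not FlashPCA's actual implementation:

```julia
using LinearAlgebra, Random

# Toy sketch of randomized subspace iteration for the top-k singular values,
# the family of algorithm FlashPCA implements. Illustrative only; the
# function name, defaults, and stopping rule are not FlashPCA's.
function subspace_iteration(A, k; iters = 20)
    m, n = size(A)
    Q = Matrix(qr(randn(n, k)).Q)        # random starting subspace
    U = zeros(m, k)
    for _ in 1:iters
        U = Matrix(qr(A * Q).Q)          # orthonormalize A*Q
        Q = Matrix(qr(A' * U).Q)         # orthonormalize A'*U
    end
    svdvals(U' * A * Q)                  # top-k singular value estimates
end

Random.seed!(42)
A = randn(200, 3) * randn(3, 50) .+ 0.01 .* randn(200, 50)  # low rank + noise
s = subspace_iteration(A, 3)
```

With a strong low-rank signal the estimates converge quickly; the number of iterations needed grows as the gap between the k-th and (k+1)-th singular values shrinks.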
9. Genotype matrix is low rank + random
[Plot: singular value distribution of a genotype matrix. The bulk of the spectrum follows the Marchenko-Pastur law (random matrix theory); a few large outliers sit above the bulk: low rank + large outliers.]
Therefore, we expect iterative Lanczos-based methods to work well
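The low rank + random picture is easy to check numerically. A minimal sketch, where the matrix sizes and the planted spike strength are arbitrary choices of mine:

```julia
using LinearAlgebra, Random

# Illustrative demo of the "low rank + random" spectrum: the bulk singular
# values of a pure-noise matrix stay below the Marchenko-Pastur bulk edge
# 1 + sqrt(n/m), while a planted rank-1 signal pops out as a large outlier.
# Sizes and spike strength are arbitrary choices for illustration.
Random.seed!(1)
m, n = 2000, 400
E = randn(m, n) ./ sqrt(m)        # noise; bulk edge ≈ 1 + sqrt(n/m) ≈ 1.45
u = normalize(randn(m))
v = normalize(randn(n))
A = E .+ 4 .* (u * v')            # plant one strong low-rank direction
s = svdvals(A)
# s[1] lands well above the bulk edge while s[2] sits near it; such
# well-separated outliers are where Lanczos-type iterations converge fastest.
```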
10. Native Julia implementation enables
easy introspection into algorithm state
Chen, Noack and Edelman, 2016
Native plots to verify expected convergence of iterative SVD algorithm
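The kind of per-iteration state behind such plots can be sketched with a small Golub-Kahan-Lanczos (GKL) bidiagonalization that records the leading Ritz value at every step. This is my own illustrative toy, not the talk's implementation: it does no reorthogonalization, so it is only suitable for small, well-separated problems.

```julia
using LinearAlgebra, Random

# Sketch of Golub-Kahan-Lanczos (GKL) bidiagonalization recording the leading
# Ritz value after each iteration -- the per-iteration state a native-Julia
# solver can expose for convergence plots. Toy code: no reorthogonalization.
function gkl_ritz_history(A, niter)
    m, n = size(A)
    v = normalize(randn(n))
    u = zeros(m)
    β = 0.0
    αs, βs, history = Float64[], Float64[], Float64[]
    for k in 1:niter
        p = A * v - β * u
        α = norm(p); u = p / α
        r = A' * u - α * v
        β = norm(r); v = r / β
        push!(αs, α); push!(βs, β)
        B = Bidiagonal(copy(αs), βs[1:end-1], :U)    # k-by-k projected problem
        push!(history, maximum(svdvals(Matrix(B))))  # leading Ritz value
    end
    history
end

Random.seed!(7)
A = randn(80, 30)
h = gkl_ritz_history(A, 20)   # h climbs toward the largest singular value
```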
11. Fine control of algorithm
allows for fast, accurate results
[Bar charts: run time in seconds (axis 0–3000) and digits of accuracy in the 10th singular value (axis 0–16), comparing FlashPCA (published), FlashPCA (master), PROPACK, ARPACK, and Julia GKL]
Chen, Noack and Edelman, 2016
12. The slowest part of the GWAS pipeline
Pipeline: read data from flat files → form data matrix → impute missing data → find largest principal components → regress against comorbidities and PCs.
Reading in an 80k × 40k matrix:
- Raw I/O of 32 GB at 500 MB/sec: 0.6 sec
- PLINK format decoding: 20 sec
- Computing top 10 principal components with FlashPCA (mostly matvecs): 2,933 sec (bottleneck)
- Computing top 10 principal components with Julia (mostly matvecs): 81 sec
13. The slowest part of the GWAS pipeline
Pipeline: read data from flat files → form data matrix → impute missing data → find largest principal components → regress against comorbidities and PCs.
Reading in an 80k × 40k matrix:
- Raw I/O of 32 GB at 500 MB/sec: 0.6 sec
- PLINK format decoding: 20 sec (memory bottleneck)
- Computing top 10 principal components with FlashPCA (mostly matvecs): 2,933 sec (computation bottleneck)
- Computing top 10 principal components with Julia (mostly matvecs): 81 sec
Can we skip forming the data matrix without making the computation slower?
14. Custom matvecs
@inline function getindex(M::PLINK1Matrix, i, j)
    offset = (i-1)*M.n + (j-1)        # assume row-major storage
    byteoffset = (offset >> 2) + 4    # 4 crumbs/byte; +4 = 1-based index past the 3-byte header
    crumboffset = offset & 0b00000011
    @inbounds rawbyte = M.data[byteoffset]
    rawcrumb = (rawbyte >> (6 - 2crumboffset)) & 0b11  # extract the 2-bit crumb
    ifelse(rawcrumb == 0, rawcrumb, rawcrumb - 0x01)   # decode crumb to 0/1/2
end
function A_mul_B!{T}(y::AbstractVector{T},
                     M::PLINK1Matrix, b::AbstractVector{T})
    fill!(y, zero(T))
    @fastmath @inbounds for i = 1:M.m
        δy = zero(T)
        @simd for j = 1:4:M.n         # manually unrolled 4 elements at a time
            x  = M[i,j];   z  = b[j]
            x2 = M[i,j+1]; z2 = b[j+1]
            x3 = M[i,j+2]; z3 = b[j+2]
            x4 = M[i,j+3]; z4 = b[j+3]
            δy += x*z + x2*z2 + x3*z3 + x4*z4
        end
        y[i] += δy
    end
    y
end
Compute matvecs directly on a matrix stored in PLINK format. The SVD code runs with no modifications: clean separation of the numerical algorithm from the lower-level I/O and memory-management code.
8000 × 4000 matvec:
- Matrix{Float64} (OpenBLAS): 115 ms
- PLINK1Matrix: 115 ms
Same speed, 32× less memory use: a run that took overnight on a server now finishes on a laptop over lunch.
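The two-bit ("crumb") extraction at the heart of the getindex above can be exercised in isolation. The high-bits-first layout is this sketch's assumption for illustration; a real PLINK v1 reader must match the spec's actual bit order:

```julia
# Standalone illustration of two-bit ("crumb") decoding: one byte packs four
# genotypes, and (byte >> (6 - 2offset)) & 0b11 extracts one of them, high
# bits first. High-bits-first is this sketch's assumption, not PLINK's spec.
decode_crumb(byte::UInt8, offset) = (byte >> (6 - 2offset)) & 0b11

byte = 0b11_10_01_00               # UInt8 literal packing crumbs 3, 2, 1, 0
genos = [decode_crumb(byte, k) for k in 0:3]   # → UInt8[3, 2, 1, 0]
```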
15. Future directions
• More complex analytics: beyond linear regression models
• Linear mixed models, robust PCA, deep learning, etc.
• Each new algorithm has new implementation and optimization challenges
• Data imputation: explore different ways to fill in missing data fast while
retaining important structure of data
• Better out-of-core matrix operations:
• Distributed/piecewise matrix-vector product computation
• Future polystore access: reading binary blob data in old formats (e.g.
PLINK v1, gVCF v4.1) and new formats (PLINK v2, new gVCF built on
TileDB)
Julia 1.0 by ISTC BD Retreat 2017
16. Other work since BD Retreat 2015
• Formalization of Julia semantics w/ Lindsey Kuper (Intel), Jan Vitek, Ben Chung (NEU)
• Survey of how Julia language features are used by user code, by Isaac Virshup
• New algorithms for time-dependent graph analytics w/ Weijian Zhang (Manchester)
• Unstructured graph analytics of the Panama Papers by Mark Wang
• Jplyr: high-level semantics for data frames and database tables by David A. Gold
• Automated feature extraction for unlabeled ECG instrumentation data in the MIMIC II data set w/ Pete Szolovits (MIT CSAIL)
• Testing and benchmarking a library of iterative numerical methods by Juan López
• Automated benchmarking and regression testing by Jarrett Revels
• Automatic differentiation for optimization problems by Jarrett Revels w/ Miles Lubin (MIT)
• Numerous improvements to the Julia language: multithreading support with Intel, distributed linear algebra and parallel PageRank benchmarking, bulk asynchronous task scheduling, and prototype native GPU NVPTX code generation, by Andreas Noack, Valentin Churavy, Tim Besard, and Xianyi Zhang