Genomics data analysis in Julia
Jiahao Chen
Andreas Noack
w/ Stavros Papadopoulos (Intel)
Nikos Patsopoulos (Broad & BWH)
The Julia Labs at MIT
Vertically integrated PL theory, compilers, numerics, and data science
Alan Andreas Xianyi Jarrett David
(UNAM)
Jiahao
Stavros
Nikos
2016 UROPs: Jacob Higgins, Mark Wang, Yingbo Ma (Lexington H.S.)
2016 GSoC students: Joseph Obiajulu, Juan López (+10 others)
726 open source developers as of Mar ’16
Arch Lindsey Ehsan Tim
Eka
Valentin Tim David
Jake
(now USAP)
Jan
Jeremy
Vijay
Isaac
Steven Simon John Yee Sian Joey Miles Iain Juan Pablo
Hua
Pete
Collaborators
Publications (2015-6)
• V. Gadepally, J. Bolewski, D. Hook, D. Hutchison, B. Miller, J. Kepner, Graphulo: Linear Algebra Graph Kernels for NoSQL Databases, IPDPS 2015.
• J. Chen and W. Zhang, The right way to search evolving graphs, GABB 2016 (an IPDPS workshop).
• A. Chen, A. Edelman, J. Kepner, V. Gadepally, and D. Hutchison, Julia Implementation of the Dynamic Distributed Dimensional Data Model, IEEE HPEC 2016.
• J. Chen, A. Noack, A. Edelman, Fast computation of the principal components of genotype matrices in Julia, SIAM J. Sci. Comput., submitted.
• J. Chen and J. Revels, Robust benchmarking in noisy environments, IEEE HPEC 2016.
• J. Chen, The principal components of programming and why they matter for statistical genomics, Proc. JSM 2016, submitted.
Genome-wide association studies (GWAS)
a greatly oversimplified description
The dream of personalized medicine/precision medicine:
tailor treatments of disease to individual sensitivities to treatment,
allergies, or other genetic predispositions
International Multiple Sclerosis Genetics Consortium (IMSGC), Wellcome Trust Case Control Consortium 2, Nature, 2011, doi:10.1038/nature10251
- identifying subpopulations
- correlate disease with mutations
- insight into biochemistry
- isolate risk factors
Regress comorbidities (“outcomes”) against single nucleotide polymorphisms (SNPs, or “gene data”)
Data matrix: rows ~ patients; each entry is 0, 1, 2, or ? (missing) and counts how many mutations from a reference
Genome-wide association studies (GWAS)
a greatly oversimplified description
Sources of unwanted variation
- population stratification: Asians have black eyes, Scandinavians have blond hair, …
- kinship: blood relatives have highly correlated genomes
- linkage disequilibrium: long range correlations
- …
In linear algebra terms:
- low rank structure with large variance
- sparse, possibly full-rank structure with small variance
- removed by preprocessing (we won’t consider it here)
The slowest part of the GWAS pipeline
Read data from flat files → Form data matrix → Impute missing data → Find largest principal components → Regress against comorbidities and PCs
Reading in 80k x 40k matrix
- Raw I/O of 32 GB at 500 MB/sec: 0.6 sec
- PLINK format decoding: 20 sec
Computing top 10 principal components with FlashPCA (mostly matvecs): 2,933 sec (bottleneck)
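The slides do not say which imputation method the pipeline uses. As a minimal sketch of the "impute missing data" step, the Julia code below fills each missing genotype with the mean of the observed entries in the same SNP column. Column-mean imputation, the toy matrix `G`, and the function name `impute_column_means` are all assumptions made for illustration.

```julia
using Statistics

# Toy genotype matrix (hypothetical data): rows are patients, columns are SNPs.
G = Matrix{Union{Missing,Float64}}([0 1 0; 2 0 1; 1 1 2; 0 2 0])
G[1, 3] = missing   # mark three entries as unobserved ("?")
G[2, 2] = missing
G[4, 3] = missing

# Replace each missing entry with the mean of the observed entries in its column.
function impute_column_means(G)
    X = Matrix{Float64}(undef, size(G))
    for j in axes(G, 2)
        col = view(G, :, j)
        μ = mean(skipmissing(col))
        for i in axes(G, 1)
            X[i, j] = ismissing(col[i]) ? μ : col[i]
        end
    end
    return X
end

X = impute_column_means(G)
```

The result `X` is a dense `Matrix{Float64}`, which is the form the later PCA step would consume in this simplified picture.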
Many algorithms for PCA
One commonly used in genomics is FlashPCA, a randomized SVD algorithm (subspace iteration): https://github.com/gabraham/flashpca
Research question: is this the best available algorithm for PCA on genomics data?
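To make "randomized SVD by subspace iteration" concrete, here is a generic textbook-style sketch in Julia. It is not FlashPCA's code: the function name `randomized_svdvals`, the `oversample` and `iters` parameters, and the synthetic test matrix are assumptions chosen for illustration.

```julia
using LinearAlgebra, Random

# Randomized subspace iteration for the k largest singular values of A.
# Starts from a random orthonormal basis and alternates products with A and A'.
function randomized_svdvals(A, k; oversample=10, iters=8, rng=MersenneTwister(0))
    m, n = size(A)
    Q = Matrix(qr(randn(rng, n, k + oversample)).Q)   # n × (k + oversample), orthonormal columns
    for _ in 1:iters
        Y = Matrix(qr(A * Q).Q)     # orthonormal basis for the range of A * Q
        Q = Matrix(qr(A' * Y).Q)    # orthonormal basis for the range of A' * Y
    end
    return svdvals(A * Q)[1:k]      # singular values of A restricted to the subspace
end

# Hypothetical test matrix: rank-3 signal plus unit-variance noise.
rng = MersenneTwister(1)
A = randn(rng, 300, 3) * randn(rng, 3, 200) + randn(rng, 300, 200)
approx = randomized_svdvals(A, 3)
exact = svdvals(A)[1:3]
```

Each iteration costs a pair of block matrix products plus two QR factorizations, which is consistent with the slides' remark that the run time is dominated by matvecs.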
Genotype matrix is low rank + random
[Figure: plot not recoverable from the extraction. Surviving annotations: "Marchenko-Pastur law (random matrix theory)" for the random part and "large outliers" for the low rank part.]
Therefore, we expect iterative Lanczos-based methods to work well
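The "low rank + random" picture can be illustrated on synthetic data. The sketch below is an assumption-laden toy, not the authors' experiment: it builds a rank-3 matrix plus i.i.d. Gaussian noise and counts the singular values that sit above the approximate upper edge of the noise bulk, which for an m × n matrix of unit-variance noise is close to √m + √n.

```julia
using LinearAlgebra, Random

rng = MersenneTwister(1)
m, n, k = 400, 200, 3                      # hypothetical sizes and rank
L = randn(rng, m, k) * randn(rng, k, n)    # rank-k "signal"
N = randn(rng, m, n)                       # i.i.d. unit-variance noise
s = svdvals(L + N)                         # sorted in decreasing order

bulk_edge = sqrt(m) + sqrt(n)              # approximate top of the noise bulk
outliers = count(x -> x > 1.1 * bulk_edge, s)   # 10% margin is an arbitrary choice
```

In this toy setting `outliers` equals the rank of the signal, and the gap between `s[k]` and `s[k+1]` is large. A wide gap of this kind is what makes Krylov methods such as Lanczos converge quickly to the leading singular values.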
Native Julia implementation enables
easy introspection into algorithm state
Chen, Noack and Edelman, 2016
Native plots to verify expected convergence of iterative SVD algorithm
Fine control of algorithm
allows for fast, accurate results
[Figure: two bar charts comparing FlashPCA (published), FlashPCA (master), PROPACK, ARPACK, and Julia GKL. First chart: run time (s), axis from 0 to 3000. Second chart: digits of accuracy in 10th singular value, axis from 0 to 16. Bar values are not recoverable from the extraction.]
Chen, Noack and Edelman, 2016
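"GKL" here refers to Golub-Kahan-Lanczos bidiagonalization. The following is a minimal sketch of the basic iteration with full reorthogonalization and no restarting, so it is simpler than a production implementation and is not the authors' code. The function name `gkl_svdvals` and the synthetic test matrix are assumptions.

```julia
using LinearAlgebra, Random

# k steps of Golub-Kahan-Lanczos bidiagonalization with full reorthogonalization.
# Only size(A) and products with A and A' are used.
function gkl_svdvals(A, k; rng=MersenneTwister(0))
    m, n = size(A)
    U = zeros(m, k)
    V = zeros(n, k)
    α = zeros(k)        # diagonal of the bidiagonal matrix
    β = zeros(k - 1)    # superdiagonal of the bidiagonal matrix
    V[:, 1] = normalize!(randn(rng, n))
    u = A * V[:, 1]
    α[1] = norm(u)
    U[:, 1] = u / α[1]
    for j in 1:k-1
        v = A' * U[:, j] - α[j] * V[:, j]
        v -= V[:, 1:j] * (V[:, 1:j]' * v)    # reorthogonalize against earlier v's
        β[j] = norm(v)
        V[:, j+1] = v / β[j]
        u = A * V[:, j+1] - β[j] * U[:, j]
        u -= U[:, 1:j] * (U[:, 1:j]' * u)    # reorthogonalize against earlier u's
        α[j+1] = norm(u)
        U[:, j+1] = u / α[j+1]
    end
    # Singular values of the small bidiagonal matrix approximate the largest ones of A.
    return svdvals(Matrix(Bidiagonal(α, β, :U)))
end

# Hypothetical test matrix: rank-3 signal plus unit-variance noise.
rng = MersenneTwister(1)
A = randn(rng, 300, 3) * randn(rng, 3, 200) + randn(rng, 300, 200)
approx = gkl_svdvals(A, 40)[1:3]
exact = svdvals(A)[1:3]
```

Because the iteration touches `A` only through matrix-vector products, the same routine can in principle run on any matrix-like type that defines those products.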
The slowest part of the GWAS pipeline
Computing top 10 principal components (mostly matvecs)
- with FlashPCA: 2,933 sec
- with Julia: 81 sec
The slowest part of the GWAS pipeline
Computation bottleneck: finding the largest principal components
Memory bottleneck: forming the data matrix
Can we skip this step without making the computation slower?
Custom matvecs
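using LinearAlgebra  # provides mul!, the Julia 1.x name for A_mul_B!

# Assumed container (not shown on the slide): raw file bytes plus dimensions.
struct PLINK1Matrix
    data::Vector{UInt8}  # file contents, including the 3-byte header
    m::Int               # number of rows
    n::Int               # number of columns (assumed to be a multiple of 4)
end

@inline function Base.getindex(M::PLINK1Matrix, i, j)
    offset = (i-1)*M.n + (j-1)          # assume row-major storage
    byteoffset = (offset >> 2) + 4      # 4 entries per byte; skip 3 header bytes (1-based)
    crumboffset = offset & 0b00000011   # position of the 2-bit crumb within the byte
    @inbounds rawbyte = M.data[byteoffset]
    # >> binds tighter than -, so the shift amount must be parenthesized
    rawcrumb = (rawbyte >> (6 - 2crumboffset)) & 0b11
    # map 00 -> 0, 01 (missing) -> 0, 10 -> 1, 11 -> 2
    ifelse(rawcrumb == 0, rawcrumb, rawcrumb - 0x01)
end

function LinearAlgebra.mul!(y::AbstractVector{T},
        M::PLINK1Matrix, b::AbstractVector{T}) where {T}
    fill!(y, zero(T))
    @fastmath @inbounds for i = 1:M.m
        δy = zero(T)
        @simd for j = 1:4:M.n           # four entries (one byte) per iteration
            x = M[i,j]; z = b[j]
            x2 = M[i,j+1]; z2 = b[j+1]
            x3 = M[i,j+2]; z3 = b[j+2]
            x4 = M[i,j+3]; z4 = b[j+3]
            δy += x*z + x2*z2 + x3*z3 + x4*z4
        end
        y[i] += δy
    end
    y
end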
Compute matvecs for a matrix stored in PLINK format.
SVD code runs with no modifications
Clean separation of numerical algorithm from lower level I/O, memory management code
8000x4000 matvec:
- Matrix{Float64} (OpenBLAS): 115 ms
- PLINK1Matrix: 115 ms
same speed, 32x less memory use
overnight on server → run on laptop over lunch
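The 32x figure follows from the storage widths: a `Float64` takes 64 bits, while a genotype code in the packed format takes 2 bits. The sketch below illustrates the packing itself. The helper names `pack_crumbs` and `unpack_crumb` and the sample codes are hypothetical, and the bit order (first entry in the high-order bits) is chosen only to mirror the `getindex` code on the slide.

```julia
# Pack 2-bit codes (each in 0:3) four to a byte, first entry in the high-order bits.
function pack_crumbs(codes::AbstractVector{<:Integer})
    packed = zeros(UInt8, cld(length(codes), 4))
    for (idx, c) in enumerate(codes)
        byte = ((idx - 1) >> 2) + 1          # which byte holds this entry
        shift = 6 - 2 * ((idx - 1) & 3)      # bit position of its crumb within the byte
        packed[byte] |= UInt8(c) << shift
    end
    return packed
end

# Read back the idx-th 2-bit code.
function unpack_crumb(packed::Vector{UInt8}, idx::Integer)
    byte = ((idx - 1) >> 2) + 1
    shift = 6 - 2 * ((idx - 1) & 3)
    return (packed[byte] >> shift) & 0x03
end

codes = [0, 1, 2, 3, 2, 2, 0, 1, 3]          # hypothetical raw 2-bit codes
packed = pack_crumbs(codes)
ratio = (8 * sizeof(Float64)) ÷ 2            # bits per Float64 over bits per genotype
```

Nine codes fit in three bytes here, whereas the same entries stored as `Float64` would take 72 bytes.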
Future directions
• More complex analytics: beyond linear regression models
• Linear mixed models, robust PCA, deep learning, etc.
• Each new algorithm has new implementation and optimization challenges
• Data imputation: explore different ways to fill in missing data fast while retaining important structure of data
• Better out of core matrix operations:
• Distributed/piecewise matrix-vector product computation
• Future polystore access: reading binary blob data in old formats (e.g. PLINK v1, gVCF v4.1) and new formats (PLINK v2, new gVCF built on TileDB)
Julia 1.0 by ISTC BD Retreat 2017
Other work since BD Retreat 2015
• Formalization of Julia semantics w/ Lindsey Kuper (Intel), Jan Vitek, Ben Chung (NEU)
• Survey of how Julia language features are used by user code by Isaac Virshup
• New algorithms for time-dependent graph analytics w/ Weijian Zhang (Manchester)
• Unstructured graph analytics of the Panama Papers by Mark Wang
• Jplyr: high level semantics for data frames and database tables by David A. Gold
• Automated feature extraction for unlabeled ECG instrumentation data in the MIMIC II data set w/ Pete Szolovits (MIT CSAIL)
• Testing and benchmarking a library of iterative numerical methods by Juan López
• Automated benchmarking and regression testing by Jarrett Revels
• Automatic differentiation for optimization problems by Jarrett Revels w/ Miles Lubin (MIT)
• Numerous improvements to the Julia language: multithreading support with Intel, distributed linear algebra and Parallel PageRank benchmarking, bulk asynchronous task scheduling, prototype native GPU NVPTX code generation by Andreas Noack, Valentin Churavy, Tim Besard, and Xianyi Zhang

More Related Content

What's hot

Joint Word and Entity Embeddings for Entity Retrieval from Knowledge Graph
Joint Word and Entity Embeddings for Entity Retrieval from Knowledge GraphJoint Word and Entity Embeddings for Entity Retrieval from Knowledge Graph
Joint Word and Entity Embeddings for Entity Retrieval from Knowledge GraphFedorNikolaev
 
EDBT 2015: Summer School Overview
EDBT 2015: Summer School OverviewEDBT 2015: Summer School Overview
EDBT 2015: Summer School Overviewdgarijo
 
Determining the Credibility of Science Communication
Determining the Credibility of Science CommunicationDetermining the Credibility of Science Communication
Determining the Credibility of Science CommunicationIsabelle Augenstein
 
RuleML 2015: Ontology Reasoning using Rules in an eHealth Context
RuleML 2015: Ontology Reasoning using Rules in an eHealth ContextRuleML 2015: Ontology Reasoning using Rules in an eHealth Context
RuleML 2015: Ontology Reasoning using Rules in an eHealth ContextRuleML
 
Neural Semi-supervised Learning under Domain Shift
Neural Semi-supervised Learning under Domain ShiftNeural Semi-supervised Learning under Domain Shift
Neural Semi-supervised Learning under Domain ShiftSebastian Ruder
 
Machine learning with graph
Machine learning with graphMachine learning with graph
Machine learning with graphDing Li
 
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...Anubhav Jain
 
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...Isabelle Augenstein
 
Open & reproducible research - What can we do in practice?
Open & reproducible research - What can we do in practice?Open & reproducible research - What can we do in practice?
Open & reproducible research - What can we do in practice?Felix Z. Hoffmann
 
Ontology-based data access: why it is so cool!
Ontology-based data access: why it is so cool!Ontology-based data access: why it is so cool!
Ontology-based data access: why it is so cool!Josef Hardi
 
From Story-Telling to Production
From Story-Telling to ProductionFrom Story-Telling to Production
From Story-Telling to ProductionKwan-yuet Ho
 
Tensor Networks and Their Applications on Machine Learning
Tensor Networks and Their Applications on Machine LearningTensor Networks and Their Applications on Machine Learning
Tensor Networks and Their Applications on Machine LearningKwan-yuet Ho
 
Data Structures and Algorithm - Week 6 - Red Black Trees
Data Structures and Algorithm - Week 6 - Red Black TreesData Structures and Algorithm - Week 6 - Red Black Trees
Data Structures and Algorithm - Week 6 - Red Black TreesFerdin Joe John Joseph PhD
 
Open-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsOpen-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsAnubhav Jain
 
Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...Anubhav Jain
 
RuleML 2015: Semantics of Notation3 Logic: A Solution for Implicit Quantifica...
RuleML 2015: Semantics of Notation3 Logic: A Solution for Implicit Quantifica...RuleML 2015: Semantics of Notation3 Logic: A Solution for Implicit Quantifica...
RuleML 2015: Semantics of Notation3 Logic: A Solution for Implicit Quantifica...RuleML
 
Computing with Directed Labeled Graphs
Computing with Directed Labeled GraphsComputing with Directed Labeled Graphs
Computing with Directed Labeled GraphsMarko Rodriguez
 

What's hot (20)

Joint Word and Entity Embeddings for Entity Retrieval from Knowledge Graph
Joint Word and Entity Embeddings for Entity Retrieval from Knowledge GraphJoint Word and Entity Embeddings for Entity Retrieval from Knowledge Graph
Joint Word and Entity Embeddings for Entity Retrieval from Knowledge Graph
 
EDBT 2015: Summer School Overview
EDBT 2015: Summer School OverviewEDBT 2015: Summer School Overview
EDBT 2015: Summer School Overview
 
Determining the Credibility of Science Communication
Determining the Credibility of Science CommunicationDetermining the Credibility of Science Communication
Determining the Credibility of Science Communication
 
RuleML 2015: Ontology Reasoning using Rules in an eHealth Context
RuleML 2015: Ontology Reasoning using Rules in an eHealth ContextRuleML 2015: Ontology Reasoning using Rules in an eHealth Context
RuleML 2015: Ontology Reasoning using Rules in an eHealth Context
 
Neural Semi-supervised Learning under Domain Shift
Neural Semi-supervised Learning under Domain ShiftNeural Semi-supervised Learning under Domain Shift
Neural Semi-supervised Learning under Domain Shift
 
Machine learning with graph
Machine learning with graphMachine learning with graph
Machine learning with graph
 
Entity2rec recsys
Entity2rec recsysEntity2rec recsys
Entity2rec recsys
 
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
 
geekgap.io webinar #1
geekgap.io webinar #1geekgap.io webinar #1
geekgap.io webinar #1
 
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...
 
Open & reproducible research - What can we do in practice?
Open & reproducible research - What can we do in practice?Open & reproducible research - What can we do in practice?
Open & reproducible research - What can we do in practice?
 
Ontology-based data access: why it is so cool!
Ontology-based data access: why it is so cool!Ontology-based data access: why it is so cool!
Ontology-based data access: why it is so cool!
 
From Story-Telling to Production
From Story-Telling to ProductionFrom Story-Telling to Production
From Story-Telling to Production
 
Tensor Networks and Their Applications on Machine Learning
Tensor Networks and Their Applications on Machine LearningTensor Networks and Their Applications on Machine Learning
Tensor Networks and Their Applications on Machine Learning
 
Data Structures and Algorithm - Week 6 - Red Black Trees
Data Structures and Algorithm - Week 6 - Red Black TreesData Structures and Algorithm - Week 6 - Red Black Trees
Data Structures and Algorithm - Week 6 - Red Black Trees
 
Open-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsOpen-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data sets
 
Meta learning tutorial
Meta learning tutorialMeta learning tutorial
Meta learning tutorial
 
Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...
 
RuleML 2015: Semantics of Notation3 Logic: A Solution for Implicit Quantifica...
RuleML 2015: Semantics of Notation3 Logic: A Solution for Implicit Quantifica...RuleML 2015: Semantics of Notation3 Logic: A Solution for Implicit Quantifica...
RuleML 2015: Semantics of Notation3 Logic: A Solution for Implicit Quantifica...
 
Computing with Directed Labeled Graphs
Computing with Directed Labeled GraphsComputing with Directed Labeled Graphs
Computing with Directed Labeled Graphs
 

Viewers also liked

Julia, genomics data and their principal components
Julia, genomics data and their principal componentsJulia, genomics data and their principal components
Julia, genomics data and their principal componentsJiahao Chen
 
Programming languages: history, relativity and design
Programming languages: history, relativity and designProgramming languages: history, relativity and design
Programming languages: history, relativity and designJiahao Chen
 
Julia? why a new language, an an application to genomics data analysis
Julia? why a new language, an an application to genomics data analysisJulia? why a new language, an an application to genomics data analysis
Julia? why a new language, an an application to genomics data analysisJiahao Chen
 
Julia: Multimethods for abstraction and performance
Julia: Multimethods for abstraction and performanceJulia: Multimethods for abstraction and performance
Julia: Multimethods for abstraction and performanceJiahao Chen
 
Understanding ECG signals in the MIMIC II database
Understanding ECG signals in the MIMIC II databaseUnderstanding ECG signals in the MIMIC II database
Understanding ECG signals in the MIMIC II databaseJiahao Chen
 
Group meeting 3/11 - sticky electrons
Group meeting 3/11 - sticky electronsGroup meeting 3/11 - sticky electrons
Group meeting 3/11 - sticky electronsJiahao Chen
 
Resolving the dissociation catastrophe in fluctuating-charge models
Resolving the dissociation catastrophe in fluctuating-charge modelsResolving the dissociation catastrophe in fluctuating-charge models
Resolving the dissociation catastrophe in fluctuating-charge modelsJiahao Chen
 
Julia: compiler and community
Julia: compiler and communityJulia: compiler and community
Julia: compiler and communityJiahao Chen
 
Excitation Energy Transfer In Photosynthetic Membranes
Excitation Energy Transfer In Photosynthetic MembranesExcitation Energy Transfer In Photosynthetic Membranes
Excitation Energy Transfer In Photosynthetic MembranesJiahao Chen
 
A brief introduction to Hartree-Fock and TDDFT
A brief introduction to Hartree-Fock and TDDFTA brief introduction to Hartree-Fock and TDDFT
A brief introduction to Hartree-Fock and TDDFTJiahao Chen
 
What's next in Julia
What's next in JuliaWhat's next in Julia
What's next in JuliaJiahao Chen
 
Theory and application of fluctuating-charge models
Theory and application of fluctuating-charge modelsTheory and application of fluctuating-charge models
Theory and application of fluctuating-charge modelsJiahao Chen
 
Python as number crunching code glue
Python as number crunching code gluePython as number crunching code glue
Python as number crunching code glueJiahao Chen
 

Viewers also liked (13)

Julia, genomics data and their principal components
Julia, genomics data and their principal componentsJulia, genomics data and their principal components
Julia, genomics data and their principal components
 
Programming languages: history, relativity and design
Programming languages: history, relativity and designProgramming languages: history, relativity and design
Programming languages: history, relativity and design
 
Julia? why a new language, an an application to genomics data analysis
Julia? why a new language, an an application to genomics data analysisJulia? why a new language, an an application to genomics data analysis
Julia? why a new language, an an application to genomics data analysis
 
Julia: Multimethods for abstraction and performance
Julia: Multimethods for abstraction and performanceJulia: Multimethods for abstraction and performance
Julia: Multimethods for abstraction and performance
 
Understanding ECG signals in the MIMIC II database
Understanding ECG signals in the MIMIC II databaseUnderstanding ECG signals in the MIMIC II database
Understanding ECG signals in the MIMIC II database
 
Group meeting 3/11 - sticky electrons
Group meeting 3/11 - sticky electronsGroup meeting 3/11 - sticky electrons
Group meeting 3/11 - sticky electrons
 
Resolving the dissociation catastrophe in fluctuating-charge models
Resolving the dissociation catastrophe in fluctuating-charge modelsResolving the dissociation catastrophe in fluctuating-charge models
Resolving the dissociation catastrophe in fluctuating-charge models
 
Julia: compiler and community
Julia: compiler and communityJulia: compiler and community
Julia: compiler and community
 
Excitation Energy Transfer In Photosynthetic Membranes
Excitation Energy Transfer In Photosynthetic MembranesExcitation Energy Transfer In Photosynthetic Membranes
Excitation Energy Transfer In Photosynthetic Membranes
 
A brief introduction to Hartree-Fock and TDDFT
A brief introduction to Hartree-Fock and TDDFTA brief introduction to Hartree-Fock and TDDFT
A brief introduction to Hartree-Fock and TDDFT
 
What's next in Julia
What's next in JuliaWhat's next in Julia
What's next in Julia
 
Theory and application of fluctuating-charge models
Theory and application of fluctuating-charge modelsTheory and application of fluctuating-charge models
Theory and application of fluctuating-charge models
 
Python as number crunching code glue
Python as number crunching code gluePython as number crunching code glue
Python as number crunching code glue
 

Similar to Genomics data analysis in Julia

SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...
SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...
SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...San Diego Supercomputer Center
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
Towards automated phenotypic cell profiling with high-content imaging
Towards automated phenotypic cell profiling with high-content imagingTowards automated phenotypic cell profiling with high-content imaging
Towards automated phenotypic cell profiling with high-content imagingOla Spjuth
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and KnowledgeIan Foster
 
Alternative Computing
Alternative ComputingAlternative Computing
Alternative ComputingShayshab Azad
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for ScienceIan Foster
 
kantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptkantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptbutest
 
HPC and Precision Medicine: A New Framework for Alzheimer's and Parkinson's
HPC and Precision Medicine: A New Framework for Alzheimer's and Parkinson'sHPC and Precision Medicine: A New Framework for Alzheimer's and Parkinson's
HPC and Precision Medicine: A New Framework for Alzheimer's and Parkinson'sinside-BigData.com
 
Big data at experimental facilities
Big data at experimental facilitiesBig data at experimental facilities
Big data at experimental facilitiesIan Foster
 
Towards Automated AI-guided Drug Discovery Labs
Towards Automated AI-guided Drug Discovery LabsTowards Automated AI-guided Drug Discovery Labs
Towards Automated AI-guided Drug Discovery LabsOla Spjuth
 
BigDataInMedicine.pptx
BigDataInMedicine.pptxBigDataInMedicine.pptx
BigDataInMedicine.pptxFrank Meissner
 
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming GraphsScalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming GraphsJason Riedy
 
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy SciencesDiscovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy SciencesIan Foster
 
MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromo...
 MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromo... MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromo...
MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromo...Peter Rose
 
What’s New in the Berkeley Data Analytics Stack
What’s New in the Berkeley Data Analytics StackWhat’s New in the Berkeley Data Analytics Stack
What’s New in the Berkeley Data Analytics StackTuri, Inc.
 
2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML modelaimsnist
 
012517 ResumeJH Amex DS-ML
012517 ResumeJH Amex DS-ML012517 ResumeJH Amex DS-ML
012517 ResumeJH Amex DS-MLJeremy Hadidjojo
 
BioTeam Bhanu Rekepalli Presentation at BICoB 2015
 BioTeam Bhanu Rekepalli Presentation at BICoB 2015 BioTeam Bhanu Rekepalli Presentation at BICoB 2015
BioTeam Bhanu Rekepalli Presentation at BICoB 2015The BioTeam Inc.
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)University of Washington
 
2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible researchYannick Wurm
 

Similar to Genomics data analysis in Julia (20)

SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...
SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...
SciDB : Open Source Data Management System for Data-Intensive Scientific Anal...
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Towards automated phenotypic cell profiling with high-content imaging
Towards automated phenotypic cell profiling with high-content imagingTowards automated phenotypic cell profiling with high-content imaging
Towards automated phenotypic cell profiling with high-content imaging
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
Alternative Computing
Alternative ComputingAlternative Computing
Alternative Computing
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
kantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptkantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.ppt
 
HPC and Precision Medicine: A New Framework for Alzheimer's and Parkinson's
HPC and Precision Medicine: A New Framework for Alzheimer's and Parkinson'sHPC and Precision Medicine: A New Framework for Alzheimer's and Parkinson's
HPC and Precision Medicine: A New Framework for Alzheimer's and Parkinson's
 
Big data at experimental facilities
Big data at experimental facilitiesBig data at experimental facilities
Big data at experimental facilities
 
Towards Automated AI-guided Drug Discovery Labs
Towards Automated AI-guided Drug Discovery LabsTowards Automated AI-guided Drug Discovery Labs
Towards Automated AI-guided Drug Discovery Labs
 
BigDataInMedicine.pptx
BigDataInMedicine.pptxBigDataInMedicine.pptx
BigDataInMedicine.pptx
 
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming GraphsScalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs
 
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy SciencesDiscovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
 
MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromo...
 MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromo... MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromo...
MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromo...
 
What’s New in the Berkeley Data Analytics Stack
What’s New in the Berkeley Data Analytics StackWhat’s New in the Berkeley Data Analytics Stack
What’s New in the Berkeley Data Analytics Stack
 
2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model
 
012517 ResumeJH Amex DS-ML
012517 ResumeJH Amex DS-ML012517 ResumeJH Amex DS-ML
012517 ResumeJH Amex DS-ML
 
BioTeam Bhanu Rekepalli Presentation at BICoB 2015
 BioTeam Bhanu Rekepalli Presentation at BICoB 2015 BioTeam Bhanu Rekepalli Presentation at BICoB 2015
BioTeam Bhanu Rekepalli Presentation at BICoB 2015
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
 
2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research
 

Genomics data analysis in Julia

  • 1. Genomics data analysis in Julia. Jiahao Chen and Andreas Noack (Julia Labs), w/ Stavros Papadopoulos (Intel) and Nikos Patsopoulos (Broad & BWH)
  • 2. The Julia Labs at MIT: vertically integrated PL theory, compilers, numerics, and data science. 2016 UROPs: Jacob Higgins, Mark Wang, Yingbo Ma (Lexington H.S.). 2016 GSoC students: Joseph Obiajulu, Juan López (+10 others). 726 open source developers as of Mar ’16. Names on the slide: Alan Andreas Xianyi Jarrett David (UNAM) Jiahao; Stavros Nikos; Arch Lindsey Ehsan Tim Eka Valentin Tim David Jake (now USAP) Jan Jeremy Vijay Isaac Steven Simon John Yee Sian Joey Miles Iain Juan Pablo Hua Pete. Collaborators
  • 3. Publications (2015-6)
      • V. Gadepally, J. Bolewski, D. Hook, D. Hutchison, B. Miller, J. Kepner, Graphulo: Linear Algebra Graph Kernels for NoSQL Databases, IPDPS 2015.
      • J. Chen and W. Zhang, The right way to search evolving graphs, GABB 2016 (an IPDPS workshop).
      • A. Chen, A. Edelman, J. Kepner, V. Gadepally, and D. Hutchison, Julia Implementation of the Dynamic Distributed Dimensional Data Model, IEEE HPEC 2016.
      • J. Chen, A. Noack, A. Edelman, Fast computation of the principal components of genotype matrices in Julia, SIAM J. Sci. Comput., submitted.
      • J. Chen and J. Revels, Robust benchmarking in noisy environments, IEEE HPEC 2016.
      • J. Chen, The principal components of programming and why they matter for statistical genomics, Proc. JSM 2016, submitted.
  • 4. Genome-wide association studies (GWAS): a greatly oversimplified description. The dream of personalized medicine/precision medicine: tailor treatments of disease to individual sensitivities to treatment, allergies, or other genetic predispositions. Annotations: identifying subpopulations; correlate disease with mutations; insight into biochemistry; isolate risk factors. Source: International Multiple Sclerosis Genetics Consortium (IMSGC), Wellcome Trust Case Control Consortium 2, Nature, 2011, doi:10.1038/nature10251.
  • 5. Genome-wide association studies (GWAS): a greatly oversimplified description. The dream of personalized medicine/precision medicine: tailor treatments of disease to individual sensitivities to treatment, allergies, or other genetic predispositions. Regress comorbidities (“outcomes”) against single nucleotide polymorphisms (SNPs, or “gene data”). Data matrix (~patients): each entry is 0, 1, 2, or ?, and counts how many mutations from a reference.
  • 6. Genome-wide association studies (GWAS): a greatly oversimplified description. Sources of unwanted variation:
    - population stratification: Asians have black eyes, Scandinavians have blond hair, …
    - kinship: blood relatives have highly correlated genomes
    - linkage disequilibrium: long range correlations
    - …
    In linear algebra terms:
    - low rank structure with large variance
    - sparse, possibly full-rank structure with small variance
    - removed by preprocessing (we won’t consider it here)
  • 7. The slowest part of the GWAS pipeline
    Pipeline stages shown: read data from flat files; impute missing data; find largest principal components; regress against comorbidities and PCs; form data matrix.
    Reading in 80k x 40k matrix:
    - Raw I/O of 32 GB at 500 MB/sec: 0.6 sec
    - PLINK format decoding: 20 sec
    Computing top 10 principal components with FlashPCA (mostly matvecs): 2,933 sec (bottleneck)
  • 8. Many algorithms for PCA. One commonly used in genomics is FlashPCA (https://github.com/gabraham/flashpca): a randomized SVD algorithm (subspace iteration). Research question: is this the best available algorithm for PCA on genomics data?
  • 9. Genotype matrix is low rank + random. [Figure: Marchenko-Pastur law (random matrix theory) + low rank large outliers.] Therefore, we expect iterative Lanczos-based methods to work well.
  • 10. Native Julia implementation enables easy introspection into algorithm state. [Figure: native plots to verify expected convergence of iterative SVD algorithm. Chen, Noack and Edelman, 2016]
  • 11. Fine control of algorithm allows for fast, accurate results. [Figure: two bar charts comparing FlashPCA (published), FlashPCA (master), PROPACK, ARPACK, and Julia GKL. Left: run time (s), axis 0 to 3000. Right: digits of accuracy in 10th singular value, axis 0 to 16. Chen, Noack and Edelman, 2016]
  • 12. The slowest part of the GWAS pipeline
    Pipeline stages shown: read data from flat files; impute missing data; find largest principal components; regress against comorbidities and PCs; form data matrix.
    Reading in 80k x 40k matrix:
    - Raw I/O of 32 GB at 500 MB/sec: 0.6 sec
    - PLINK format decoding: 20 sec
    Computing top 10 principal components (mostly matvecs; bottleneck):
    - with FlashPCA: 2,933 sec
    - with Julia: 81 sec
  • 13. The slowest part of the GWAS pipeline
    Pipeline stages shown: read data from flat files; impute missing data; find largest principal components; regress against comorbidities and PCs; form data matrix.
    Reading in 80k x 40k matrix:
    - Raw I/O of 32 GB at 500 MB/sec: 0.6 sec
    - PLINK format decoding: 20 sec
    Computing top 10 principal components (mostly matvecs):
    - with FlashPCA: 2,933 sec
    - with Julia: 81 sec
    Labels on the pipeline: computation bottleneck; memory bottleneck. Can we skip this step without making the computation slower?
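  • 14. Custom matvecs: compute matvecs for a matrix stored in PLINK format.

        # Julia 0.4-era syntax, as on the slide. PLINK1Matrix (fields data, m, n) is defined elsewhere.
        import Base: getindex, A_mul_B!

        # Decode entry (i, j) from the packed 2-bit ("crumb") data
        @inline function getindex(M::PLINK1Matrix, i, j)
            offset = (i-1)*M.n+(j-1)            # Assume row major
            byteoffset = (offset >> 2) + 4      # four entries per byte; 1-based index past the 3-byte PLINK header
            crumboffset = offset & 0b00000011   # position of the crumb within its byte
            @inbounds rawbyte = M.data[byteoffset]
            # Parentheses are required: >> binds tighter than - in Julia
            rawcrumb = (rawbyte >> (6-2crumboffset)) & 0b11
            ifelse(rawcrumb==0, rawcrumb, rawcrumb-0x01)   # codes 0, 1, 2, 3 map to 0, 0, 1, 2
        end

        # y = M*b, reading entries straight from the packed data (assumes M.n is a multiple of 4)
        function A_mul_B!{T}(y::AbstractVector{T}, M::PLINK1Matrix, b::AbstractVector{T})
            y[:] = zero(T)
            @fastmath @inbounds for i=1:M.m
                δy = zero(T)
                @simd for j=1:4:M.n             # manually unrolled by 4
                    x  = M[i,j];   z  = b[j]
                    x2 = M[i,j+1]; z2 = b[j+1]
                    x3 = M[i,j+2]; z3 = b[j+2]
                    x4 = M[i,j+3]; z4 = b[j+3]
                    δy += x*z + x2*z2 + x3*z3 + x4*z4
                end
                y[i] += δy
            end
            y
        end

    SVD code runs with no modifications. Clean separation of numerical algorithm from lower level I/O, memory management code.
    8000x4000 matvec:
    - Matrix{Float64} (OpenBLAS): 115 ms
    - PLINK1Matrix: 115 ms
    Same speed, 32x less memory use. From overnight on server to run on laptop over lunch.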
  • 15. Future directions
      • More complex analytics: beyond linear regression models
      • Linear mixed models, robust PCA, deep learning, etc.
      • Each new algorithm has new implementation and optimization challenges
      • Data imputation: explore different ways to fill in missing data fast while retaining important structure of data
      • Better out of core matrix operations:
      • Distributed/piecewise matrix-vector product computation
      • Future polystore access: reading binary blob data in old formats (e.g. PLINK v1, gVCF v4.1) and new formats (PLINK v2, new gVCF built on TileDB)
    Julia 1.0 by ISTC BD Retreat 2017
  • 16. Other work since BD Retreat 2015
      • Formalization of Julia semantics w/ Lindsey Kuper (Intel), Jan Vitek, Ben Chung (NEU)
      • Survey of how Julia language features are used by user code by Isaac Virshup
      • New algorithms for time-dependent graph analytics w/ Weijian Zhang (Manchester)
      • Unstructured graph analytics of the Panama Papers by Mark Wang
      • Jplyr: high level semantics for data frames and database tables by David A. Gold
      • Automated feature extraction for unlabeled ECG instrumentation data in the MIMIC II data set w/ Pete Szolovits (MIT CSAIL)
      • Testing and benchmarking a library of iterative numerical methods by Juan López
      • Automated benchmarking and regression testing by Jarrett Revels
      • Automatic differentiation for optimization problems by Jarrett Revels w/ Miles Lubin (MIT)
      • Numerous improvements to the Julia language: multithreading support with Intel, distributed linear algebra and Parallel PageRank benchmarking, bulk asynchronous task scheduling, prototype native GPU NVPTX code generation by Andreas Noack, Valentin Churavy, Tim Besard, and Xianyi Zhang
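
Slide 8 describes FlashPCA as a randomized SVD based on subspace iteration. The following is a minimal sketch of that general technique in current Julia, not FlashPCA's actual implementation. The function name `randomized_svd` and the parameters `p` (oversampling) and `q` (number of iterations) are assumptions made for illustration, and the test matrix is synthetic low rank plus noise rather than genotype data.

```julia
using LinearAlgebra, Random

# Generic randomized subspace iteration for the top-k singular triplets of A.
# `p` and `q` are illustrative defaults, not values taken from the slides.
function randomized_svd(A::AbstractMatrix, k::Integer;
                        p::Integer=10, q::Integer=4,
                        rng::AbstractRNG=MersenneTwister(0))
    m, n = size(A)
    l = min(k + p, m, n)                      # sketch width
    Q = Matrix(qr(A * randn(rng, n, l)).Q)    # orthonormal basis of a random range sample
    for _ in 1:q
        Z = Matrix(qr(A' * Q).Q)              # re-orthonormalize after every product
        Q = Matrix(qr(A * Z).Q)
    end
    F = svd(Q' * A)                           # small l-by-n problem
    (Q * F.U)[:, 1:k], F.S[1:k], F.V[:, 1:k]
end

# Synthetic test matrix: rank 5 plus small noise (hypothetical data).
rng = MersenneTwister(42)
A = randn(rng, 200, 5) * Diagonal([50.0, 40.0, 30.0, 20.0, 10.0]) * randn(rng, 5, 100) +
    0.01 * randn(rng, 200, 100)
U, S, V = randomized_svd(A, 5; rng=rng)
```

Only products with `A` and `A'` touch the large matrix, which is consistent with the "mostly matvecs" cost noted on slide 7.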
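
Slides 9 to 11 argue for Lanczos-based methods and report results for "Julia GKL". Assuming GKL stands for Golub-Kahan-Lanczos bidiagonalization, the idea can be sketched as below. This is a textbook version with full reorthogonalization and no restarting or breakdown handling, so it is not the implementation benchmarked on the slides. The name `gkl_svdvals` and the `steps` parameter are hypothetical.

```julia
using LinearAlgebra, Random

# Golub-Kahan-Lanczos bidiagonalization with full reorthogonalization.
# Returns estimates of the k largest singular values of A.
# Builds U and V with A*V = U*B, where B is upper bidiagonal.
function gkl_svdvals(A::AbstractMatrix{T}, k::Integer;
                     steps::Integer=3k,
                     rng::AbstractRNG=MersenneTwister(0)) where {T<:AbstractFloat}
    m, n = size(A)
    steps = min(steps, m, n)
    U = zeros(T, m, steps)              # left Lanczos vectors
    V = zeros(T, n, steps)              # right Lanczos vectors
    α = zeros(T, steps)                 # diagonal of B
    β = zeros(T, steps - 1)             # superdiagonal of B
    V[:, 1] = normalize!(randn(rng, T, n))
    u = A * V[:, 1]
    α[1] = norm(u)
    U[:, 1] = u / α[1]
    for j in 1:steps-1
        v = A' * U[:, j] - α[j] * V[:, j]
        v -= V[:, 1:j] * (V[:, 1:j]' * v)     # reorthogonalize against earlier right vectors
        β[j] = norm(v)                        # a zero here (breakdown) is not handled
        V[:, j+1] = v / β[j]
        u = A * V[:, j+1] - β[j] * U[:, j]
        u -= U[:, 1:j] * (U[:, 1:j]' * u)     # reorthogonalize against earlier left vectors
        α[j+1] = norm(u)
        U[:, j+1] = u / α[j+1]
    end
    svdvals(Bidiagonal(α, β, :U))[1:k]
end

# Synthetic low rank + noise matrix (hypothetical data, not a genotype matrix).
rng = MersenneTwister(1)
A = randn(rng, 300, 5) * Diagonal([50.0, 40.0, 30.0, 20.0, 10.0]) * randn(rng, 5, 120) +
    0.01 * randn(rng, 300, 120)
σ = gkl_svdvals(A, 5; steps=30)
```

When a few large singular values sit well above a noise bulk, as slide 9 describes, the leading values of the small bidiagonal matrix tend to converge after few steps, which is the motivation the slides give for this family of methods.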
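
The packed-storage idea on slide 14 can be tried in current Julia with a small self-contained sketch. Everything here is an assumption for illustration: the type `PackedGenotypes`, the helpers `dosage`, `pack`, and `matvec!`, and the layout (four 2-bit codes per byte, row major, lowest-order bits first, no file header). It does not read real PLINK files. In current Julia the role of the slide's `A_mul_B!` is played by `LinearAlgebra.mul!`; a standalone function is used here to keep the example simple.

```julia
# Hypothetical container: 2-bit genotype codes, four per byte, row major.
struct PackedGenotypes <: AbstractMatrix{UInt8}
    data::Vector{UInt8}
    m::Int
    n::Int
end

Base.size(M::PackedGenotypes) = (M.m, M.n)

# Same mapping as the slide's ifelse: codes 0, 1, 2, 3 become 0, 0, 1, 2.
dosage(c::UInt8) = ifelse(c == 0x00, c, c - 0x01)

function Base.getindex(M::PackedGenotypes, i::Int, j::Int)
    @boundscheck checkbounds(M, i, j)
    offset = (i - 1) * M.n + (j - 1)       # zero-based position, row major
    byte = M.data[(offset >> 2) + 1]       # four entries per byte
    shift = 2 * (offset & 0b11)            # which crumb inside the byte
    dosage((byte >> shift) & 0b11)
end

# Pack a dense matrix of codes 0..3 into the layout above.
function pack(A::AbstractMatrix{<:Integer})
    m, n = size(A)
    data = zeros(UInt8, cld(m * n, 4))
    for i in 1:m, j in 1:n
        offset = (i - 1) * n + (j - 1)
        data[(offset >> 2) + 1] |= UInt8(A[i, j] & 0b11) << (2 * (offset & 0b11))
    end
    PackedGenotypes(data, m, n)
end

# y = M * b, decoding entries on the fly instead of forming a Float64 matrix.
function matvec!(y::AbstractVector{T}, M::PackedGenotypes,
                 b::AbstractVector{T}) where {T<:AbstractFloat}
    length(y) == M.m && length(b) == M.n || throw(DimensionMismatch("size mismatch"))
    for i in 1:M.m
        acc = zero(T)
        for j in 1:M.n
            acc += M[i, j] * b[j]
        end
        y[i] = acc
    end
    y
end

A = [0 1 2 3; 3 2 1 0; 2 2 0 3]            # raw 2-bit codes
M = pack(A)                                 # 12 entries stored in 3 bytes
y = matvec!(zeros(3), M, [1.0, 2.0, 3.0, 4.0])
# y == [11.0, 4.0, 11.0]
```

Storing 2 bits per entry instead of a 64-bit float is where the 32x memory saving quoted on the slide comes from (64 / 2 = 32).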