GigaSOM.jl: High-performance clustering and visualization of huge cytometry datasets

Background: The amount of data generated in large clinical and phenotyping studies that use single-cell cytometry is constantly growing. Recent technological advances allow the easy generation of data with hundreds of millions of single-cell data points with >40 parameters, originating from thousands of individual samples. Analyzing this volume of high-dimensional data places heavy demands on both the hardware and the software of high-performance computing resources. Current software tools often do not scale to datasets of this size; users are thus forced to downsample the data to manageable sizes, losing accuracy and the ability to detect many underlying complex phenomena.

Results: We present GigaSOM.jl, a fast and scalable implementation of clustering and dimensionality reduction for flow and mass cytometry data. Implementing GigaSOM.jl in Julia, a high-level and high-performance programming language, makes it accessible to the scientific community and allows efficient handling and processing of datasets with billions of data points on distributed computing infrastructures. We describe the design of GigaSOM.jl, measure its performance and horizontal scaling capability, and showcase its functionality on a large dataset from a recent study.

Conclusions: GigaSOM.jl facilitates the use of commonly available high-performance computing resources to process the largest available datasets within minutes, while producing results of the same quality as the current state-of-the-art software. Measurements indicate that the performance scales to much larger datasets. The example use on data from a massive mouse phenotyping effort confirms the applicability of GigaSOM.jl to huge-scale studies.

Research center :

ULHPC - University of Luxembourg: High Performance Computing
Luxembourg Centre for Systems Biomedicine (LCSB): Bioinformatics Core (R. Schneider Group)

Disciplines :

Life sciences: Multidisciplinary, general & others
Engineering, computing & technology: Multidisciplinary, general & others

Author, co-author :

KRATOCHVIL, Miroslav ^✱; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Bioinformatics Core ; Institute of Organic Chemistry and Biochemistry of the CAS, Prague

Hunewald, Oliver ^✱; Luxembourg Institute of Health - LIH

HEIRENDT, Laurent ; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Bioinformatics Core

Verissimo, Vasco; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB)

Vondrášek, Jiří; Institute of Organic Chemistry and Biochemistry of the CAS, Prague

SATAGOPAM, Venkata ; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Bioinformatics Core

SCHNEIDER, Reinhard ; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Bioinformatics Core

TREFOIS, Christophe ; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Bioinformatics Core

Ollert, Markus; Luxembourg Institute of Health - LIH

^✱ These authors have contributed equally to this work.

External co-authors :

yes

Language :

English

Title :

GigaSOM.jl: High-performance clustering and visualization of huge cytometry datasets

Publication date :

18 November 2020

Journal title :

GigaScience

ISSN :

2047-217X

Publisher :

Oxford University Press, United Kingdom

Volume :

Issue :

Peer reviewed :

Peer Reviewed verified by ORBi

Focus Area :

Computational Sciences

Additional URL :

https://academic.oup.com/gigascience/article/9/11/giaa127/5987271

Funders :

ELIXIR CZ LM2018131 (MEYS)
FNR AFR-RIKEN bilateral program (TregBar 2015/11228353)
FNR PRIDE Doctoral Training Unit program (PRIDE/11012546/NEXTIMMUNE)
Institute of Organic Chemistry and Biochemistry of the CAS (RVO: 61388963)
ELIXIR Staff Exchange programme 2020

Data Set :

3i T-cell Spleen IMPC

Available on ORBilu :

since 18 February 2021

Statistics

Number of views

275 (28 by Unilu)

Number of downloads

186 (4 by Unilu)
