Fernandes, Maria
Doctoral thesis (2020)

Privacy attacks reported in the literature have alerted the research community to serious privacy issues in current biomedical processing workflows. Since sharing biomedical data is vital for the advancement of research and the improvement of healthcare, reconciling sharing with privacy is of paramount importance. In this thesis, we state the need for effective privacy-preserving measures for biomedical data processing, and study solutions for the problem in one of the harder contexts, genomics. The thesis focuses on the specific properties of the human genome that make critical parts of it privacy-sensitive and tries to prevent the leakage of such critical information throughout the several steps of the sequenced genomic data analysis and processing workflow. To achieve this goal, it introduces efficient and effective privacy-preserving mechanisms, namely at the level of read filtering right after sequencing, and alignment. Human individuals share the majority of their genome (99.5%), the remaining 0.5% being what distinguishes one individual from all others. However, that information is only revealed after two costly processing steps, alignment and variant calling, which today are typically run in clouds for performance efficiency, but with the corresponding privacy risks. To reap the benefits of cloud processing, we set out to neutralize the privacy risks by identifying the sensitive (i.e., discriminating) nucleotides in raw genomic data, and acting upon that.
The first contribution is DNA-SeAl, a systematic classification of genomic data into different levels of sensitivity with regard to privacy, leveraging the output of a state-of-the-art automatic filter (SRF) that isolates the critical sequences. The second contribution is a novel filtering approach, LRF, which protects sensitive information in the raw reads right after sequencing, for sequences of arbitrary length (long reads), improving on SRF, which only dealt with short reads. The last contribution proposed in this thesis is MaskAl, an SGX-based privacy-preserving alignment approach based on the filtering method developed. These contributions entailed several findings. The first finding is that implementing multiple sensitivity levels improves the performance × privacy product. The proposed example of three sensitivity levels shows the benefits of mapping progressively more sensitive levels to classes of alignment algorithms with progressively stronger privacy protection (albeit at a performance cost). In this thesis, we demonstrate the effectiveness of the proposed sensitivity level classification, DNA-SeAl. Just by considering three levels of sensitivity and taking advantage of three existing classes of alignment algorithms, the performance of privacy-preserving alignment improves significantly compared with state-of-the-art approaches. For reads of 100 nucleotides, 72% have low sensitivity, 23% have intermediate sensitivity, and the remaining 5% are highly sensitive. With this distribution, DNA-SeAl is 5.85× faster and requires 5.85× fewer data transfers than the binary classification (two sensitivity levels). The second finding is that replacing the per-read classification with a per-nucleotide classification improves the filtering of sensitive genomic information.
With this change, the filtering approach proposed in this thesis (LRF) can filter sequences of arbitrary length (long reads), whereas the state-of-the-art filtering approach (SRF) is limited to classifying short reads. This thesis shows that around 10% of an individual's genome is classified as sensitive by the developed LRF approach. This improves on the 60% achieved by the previous state of the art, the SRF approach. The third finding is the possibility of building a privacy-preserving alignment approach based on read filtering. Sensitivity-adapted alignment relying on hybrid cloud environments, composed of common (e.g., public cloud) and trustworthy execution environments (e.g., SGX enclave cloud), gets the best of both worlds: it enjoys the resource and performance optimization of cloud environments, while providing a high degree of protection to genomic data. We demonstrate that MaskAl is 87% faster than existing privacy-preserving alignment algorithms (Balaur), with similar privacy guarantees. On the other hand, MaskAl is 58% slower than BWA, a highly efficient non-privacy-preserving alignment algorithm. In addition, MaskAl requires 95% less RAM and transfers between 5.7 GB and 15 GB less data than Balaur. This thesis breaks new ground on the simultaneous achievement of two important goals of genomic data processing: availability of data for sharing, and privacy preservation. We hope to have shown that our work, being generalisable, takes a significant step in the direction of, and opens new avenues for, wider-scale, secure, and cooperative efforts and projects within the biomedical information processing life cycle.

Fernandes, Maria
Scientific Conference (2019, October 22)

Fernandes, Maria
in IEEE Journal of Biomedical and Health Informatics (2019)

The advent of next-generation sequencing (NGS) machines made DNA sequencing cheaper, but also put pressure on the genomic life cycle, which includes aligning millions of short DNA sequences, called reads, to a reference genome. On the performance side, efficient algorithms have been developed and parallelized on public clouds. On the privacy side, since genomic data are highly sensitive, several cryptographic mechanisms have been proposed to align reads more securely, but with lower performance. This manuscript presents DNA-SeAl, a novel contribution to improving the privacy × performance product in current genomic workflows. First, building on recent works that argue that genomic data needs to be treated according to a threat-risk analysis, we introduce a multi-level sensitivity classification of genomic variations designed to prevent the amplification of possible privacy attacks. We show that the use of sensitivity levels reduces future re-identification risks, and that their partitioning helps prevent linkage attacks. Second, after extending this classification to reads, we show how to align and store reads using different security levels. To do so, DNA-SeAl extends a recent read filter to classify unaligned reads into sensitivity levels, and adapts existing alignment algorithms to the sensitivity of the reads. We show that using DNA-SeAl allows high performance gains whilst enforcing high privacy levels in hybrid cloud environments.

Lambert, Christoph
Scientific Conference (2018)

The recent introduction of new DNA sequencing techniques caused the amount of processed and stored biological data to skyrocket. To process these vast amounts of data, bio-centers have been tempted to use low-cost public clouds. However, genomes are privacy-sensitive, since they store personal information about their donors, such as their identity, disease risks, heredity and ethnic origin. The first critical DNA processing step that can be executed in a cloud, i.e., read alignment, consists in finding the location in the human genome of the DNA sequences produced by a sequencing machine. While recent developments aim at increasing performance, only a few approaches address the need for fast and privacy-preserving read alignment methods. This paper introduces MaskAl, a novel approach for read alignment. MaskAl combines a fast preprocessing step on raw genomic data (filtering and masking) with established algorithms to align sanitized reads, from which sensitive parts have been masked out, and refines the alignment score using the masked-out information with Intel's Software Guard Extensions (SGX). MaskAl is a highly competitive privacy-preserving read alignment software that can be massively parallelized with public clouds and emerging enclave clouds. Finally, MaskAl is nearly as accurate as plain-text approaches (more than 96% of aligned reads with MaskAl compared to 98% with BWA) and can process alignment workloads 87% faster than current privacy-preserving approaches while using less memory and network bandwidth.

Decouchant, Jérémie
in Journal of Biomedical Informatics (2018)

Sequencing thousands of human genomes has enabled breakthroughs in many areas, among them precision medicine, the study of rare diseases, and forensics. However, mass collection of such sensitive data entails enormous risks if not protected to the highest standards. In this article, we take the position that post-alignment privacy is not enough and that data should be automatically protected as early as possible in the genomics workflow, ideally immediately after the data is produced. We show that a previous approach for filtering short reads cannot extend to long reads and present a novel filtering approach that classifies raw genomic data (i.e., whose location and content are not yet determined) into privacy-sensitive (i.e., more affected by a successful privacy attack) and non-privacy-sensitive information. Such a classification allows the fine-grained and automated adjustment of protective measures to mitigate the possible consequences of exposure, in particular when relying on public clouds. We present the first filter that can be applied uniformly to reads of any length, making it usable with any recent or future sequencing technology. The filter is accurate, in the sense that it detects all known sensitive nucleotides except those located in highly variable regions (fewer than 10 nucleotides remain undetected per genome instead of 100,000 in previous works). It has far fewer false positives than previously known methods (10% instead of 60%) and can detect sensitive nucleotides despite sequencing errors (86% detected instead of 56% with 2% of mutations).
Finally, practical experiments demonstrate high performance, both in terms of throughput and memory consumption.

Volp, Marcus
Scientific Conference (2017, October)

Recent breakthroughs in genomic sequencing led to an enormous increase in DNA sampling rates, which in turn favored the use of clouds to efficiently process huge amounts of genomic data. However, while enabling possible achievements in personalized medicine and related areas, cloud-based processing of genomic information also entails significant privacy risks, calling for increased protection. In this paper, we focus on the first, but also most data-intensive, processing step of the genomics information processing pipeline: the alignment of raw genomic data samples (called reads) to a synthetic human reference genome. Even though privacy-preserving alignment solutions (e.g., based on homomorphic encryption) have been proposed, their slow performance encourages alternatives based on trusted execution environments, such as Intel SGX, to speed up secure alignment. Such alternatives have to deal with data structures whose size by far exceeds secure enclave memory, requiring the alignment code to reach out into untrusted memory. We highlight how sensitive genomic information can be leaked when those enclave-external alignment data structures are accessed, and suggest countermeasures to prevent privacy breaches. The overhead of these countermeasures indicates that the competitiveness of a privacy-preserving enclave-based alignment has yet to be precisely evaluated.

; ; et al
in 11th International Conference on Practical Applications of Computational Biology & Bioinformatics 2017 (2017)

People are usually aware of the privacy risks of publishing photos online, but these risks are less evident when sharing human genomes. Modern photos and sequenced genomes are both digital representations of real lives. They contain private information that may compromise people's privacy, and still, their highest value is most often achieved only when sharing them with others. In this work, we present an analogy between the privacy aspects of sharing photos and sharing genomes, which clarifies the privacy risks in the latter to the general public. Additionally, we illustrate an alternative informed model to share genomic data according to the privacy-sensitivity level of each portion. This article is a call to arms for collaborative work between geneticists and security experts to build more effective methods to systematically protect privacy, whilst promoting the accessibility and sharing of genomes.

Fernandes, Maria
in 11th International Conference on Practical Applications of Computational Biology & Bioinformatics 2017 (2017)

Thanks to the rapid advances in sequencing technologies, genomic data is now being produced at an unprecedented rate. To adapt to this growth, several algorithms and paradigm shifts have been proposed to increase the throughput of the classical DNA workflow, e.g.
by relying on the cloud to perform CPU-intensive operations. However, the scientific community has raised the alarm about the possible privacy-related attacks that can be executed on genomic data. In this paper, we review the state of the art in cloud-based alignment algorithms that have been developed for performance. We then present several privacy-preserving mechanisms that have been, or could be, used to align reads at an incremental performance cost. We finally argue for the use of risk analysis throughout the DNA workflow, to strike a balance between performance and protection of data.