A Combined Unsupervised Technique for Automatic Classification in Electronic Discovery

AYETIRAN, Eniafe Festus

No full text

Doctoral thesis (Dissertations and theses)

A Combined Unsupervised Technique for Automatic Classification in Electronic Discovery

AYETIRAN, Eniafe Festus

2017

Permalink
https://hdl.handle.net/10993/31289

Files (0)Send to Details Statistics Bibliography Similar publications

Files

Full Text

No document available.

Send to

RIS BibTex APA Chicago Permalink X Linkedin

Details

Keywords :

eDiscovery; unsupervised; classification

Abstract :

[en] Electronic data discovery (EDD), e-discovery or eDiscovery is any process by which electronically stored information (ESI) is sought, identified, collected, preserved, secured, processed, searched for the ones relevant to civil and/or criminal litigations or regulatory matters with the intention of using them as evidence. Searching electronic document collections for relevant documents is part of eDiscovery which poses serious problems for lawyers and their clients alike. Getting efficient and effective techniques for search in eDiscovery is an interesting and still an open problem in the field of legal information systems. Researchers are shifting away from traditional keyword search to more intelligent approaches such as machine learning (ML) techniques. State-of-the-art algorithms for search in eDiscovery focus mainly on supervised approaches, mainly; supervised learning and interactive approaches. The former uses labelled examples for training systems while the latter uses human assistance in the search process to assist in retrieving relevant documents. Techniques in the latter approach include interactive query expansion among others. Both approaches are supervised form of technology assisted review (TAR). Technology assisted review is the use of technology to assist or completely automate the process of searching and retrieval of relevant documents from electronically stored information (ESI). In text retrieval/classification, supervised systems are known for their superior performance over unsupervised systems. However, two serious issues limit their application in the electronic discovery search and information retrieval (IR) in general. First, they have associated high cost in terms of finance and human effort. This is particularly responsible for the huge amount of money expended on eDiscovery on annual basis. Secondly, their case/project-specific nature does not allow for resuse, thereby contributing more to organizations' expenses when they have two or more cases involving eDiscovery. Unsupervised systems on the other hand, is cost-effective in terms of finance and human effort. A major challenge in unsupervised ad hoc information retrieval is that of vocabulary problem which causes terms mismatch in queries and documents. While topic modelling techniques try to tackle this from the thematic point of view in the sense that both queries and documents are likely to match if they discuss about the same topic, natural language processing (NLP) approaches view it from the semantic perspective. Scalable topic modelling algorithms, just like the traditional bag of words technique, suffer from polysemy and synonymy problems. Natural language processing techniques on the other hand, while being able to considerably resolve the polysemy and synonymy problems are computationally expensive and not suitable for large collections as is the case in eDiscovery. In this thesis, we exploit the peculiarity of eDiscovery collections being composed mainly of e-mail communications and their attachments, mining topics of discourse from e-mails and disambiguating these topics and queries for terms matching has been proven to be effective for retrieving relevant documents when compared to traditional stem-based retrieval. In this work, we present an automated unsupervised approach for retrieval/classification in eDiscovery. This approach is an ad hoc retrieval which creates a representative for each original document in the collection using latent dirichlet allocation (LDA) model with Gibbs sampling and explores word sense disambiguation (WSD) to give these representative documents and queries deeper meanings for distributional semantic similarity. The word sense disambiguation technique by itself is a hybrid algorithm derived from the modified version of the original Lesk algorithm and the Jiang & Conrath similarity measure. Evaluation was carried out on this technique using the TREC legal track. Results and observations are discussed in chapter 8. We conclude that WSD can improve ad hoc retrieval effectiveness. Finally, we suggest further work focusing on efficient algorithms for word sense disambiguation which can further improve retrieval effectiveness if applied to original document collections in contrast to using representative collections.

Disciplines :

Computer science

Author, co-author :

AYETIRAN, Eniafe Festus ; University of Luxembourg > Faculty of Science, Technology and Communication (FSTC) > Computer Science and Communications Research Unit (CSC)

Language :

English

Title :

A Combined Unsupervised Technique for Automatic Classification in Electronic Discovery

Defense date :

2017

Number of pages :

121

Institution :

Unilu - University of Luxembourg, Luxembourg, Luxembourg

Degree :

Docteur en Informatique

Promotor :

Boella, Guido

Torre, Leon van der

Focus Area :

Computational Sciences

Available on ORBilu :

since 31 May 2017

Statistics

Number of views

234 (7 by Unilu)

Number of downloads

0 (0 by Unilu)

More statistics