Reference : A Combined Unsupervised Technique for Automatic Classification in Electronic Discovery
Dissertations and theses : Doctoral thesis
Engineering, computing & technology : Computer science
Computational Sciences
A Combined Unsupervised Technique for Automatic Classification in Electronic Discovery
Ayetiran, Eniafe Festus [University of Luxembourg > Faculty of Science, Technology and Communication (FSTC) > Computer Science and Communications Research Unit (CSC)]
University of Luxembourg, Luxembourg, Luxembourg
Docteur en Informatique
Boella, Guido
Torre, Leon van der
[en] eDiscovery ; unsupervised ; classification
[en] Electronic data discovery (EDD), also called e-discovery or eDiscovery, is any process by which electronically stored information (ESI) is sought, identified, collected, preserved, secured, processed and searched for items relevant to civil and/or criminal litigation or regulatory matters, with the intention of using them as evidence. Searching electronic document collections for relevant documents is a part of eDiscovery that poses serious problems for lawyers and their clients alike. Finding efficient and effective search techniques for eDiscovery is an interesting and still open problem in the field of legal information systems. Researchers are shifting away from traditional keyword search towards more intelligent approaches such as machine learning (ML) techniques. State-of-the-art algorithms for search in eDiscovery focus mainly on two supervised approaches: supervised learning and interactive approaches. The former uses labelled examples to train systems, while the latter uses human assistance during the search process to help retrieve relevant documents. Techniques in the latter approach include interactive query expansion, among others. Both approaches are supervised forms of technology-assisted review (TAR), which is the use of technology to assist or completely automate the search and retrieval of relevant documents from ESI. In text retrieval/classification, supervised systems are known for their superior performance over unsupervised systems. However, two serious issues limit their application to search in electronic discovery and to information retrieval (IR) in general. First, they carry a high cost in terms of finance and human effort, which is largely responsible for the huge amount of money spent on eDiscovery annually. Second, their case/project-specific nature does not allow for reuse, which further increases organizations' expenses when they have two or more cases involving eDiscovery.

Unsupervised systems, on the other hand, are cost-effective in terms of finance and human effort. A major challenge in unsupervised ad hoc information retrieval is the vocabulary problem, which causes term mismatch between queries and documents. Topic modelling techniques tackle this from a thematic point of view, in the sense that queries and documents are likely to match if they discuss the same topic, whereas natural language processing (NLP) approaches view it from a semantic perspective. Scalable topic modelling algorithms, just like the traditional bag-of-words technique, suffer from polysemy and synonymy problems. NLP techniques, while able to resolve the polysemy and synonymy problems to a considerable extent, are computationally expensive and not suitable for large collections such as those in eDiscovery. In this thesis, we exploit the peculiarity that eDiscovery collections are composed mainly of e-mail communications and their attachments. Mining topics of discourse from e-mails and disambiguating these topics and the queries for term matching proves effective for retrieving relevant documents when compared with traditional stem-based retrieval.

In this work, we present an automated unsupervised approach for retrieval/classification in eDiscovery. It is an ad hoc retrieval approach that creates a representative for each original document in the collection using the latent Dirichlet allocation (LDA) model with Gibbs sampling, and uses word sense disambiguation (WSD) to give these representative documents and the queries deeper meaning for distributional semantic similarity. The WSD technique is itself a hybrid algorithm derived from a modified version of the original Lesk algorithm and the Jiang & Conrath similarity measure.
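
As a rough illustration of the gloss-overlap idea behind Lesk-style disambiguation, the following minimal Python sketch picks the sense of an ambiguous word whose gloss shares the most words with the surrounding context. It is a simplified Lesk variant only, not the hybrid algorithm of the thesis, and it omits the Jiang & Conrath component entirely. The sense inventory, the glosses, the stopword list and the function names are all hypothetical and chosen purely for illustration.

```python
# Simplified Lesk sketch: choose the sense whose gloss overlaps most with the context.
# The sense inventory below is a toy, hypothetical example (not WordNet, not from the thesis).
SENSES = {
    "bank": {
        "bank.finance": "an institution that accepts deposits and lends money",
        "bank.river": "sloping land beside a river or lake",
    }
}

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "and", "the", "of", "or", "that", "to", "in", "on", "at"}


def tokens(text):
    """Lowercase the text, split on whitespace and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def simplified_lesk(word, context):
    """Return the sense label whose gloss shares the most words with the context.

    Ties are resolved in favour of the sense listed first in the inventory.
    """
    context_words = tokens(context) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(tokens(gloss) & context_words)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


query = "transfer money to the bank and open a deposits account"
email = "we walked along the river bank beside the lake"
print(simplified_lesk("bank", query))  # bank.finance
print(simplified_lesk("bank", email))  # bank.river
```

In a setting like the one described above, the context would be a query or an LDA-derived representative document rather than a single sentence, and the glosses would come from a lexical resource rather than a hand-written dictionary.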

The technique was evaluated on the TREC Legal Track. Results and observations are discussed in Chapter 8. We conclude that WSD can improve ad hoc retrieval effectiveness. Finally, we suggest further work on efficient algorithms for word sense disambiguation, which could further improve retrieval effectiveness if applied to the original document collections rather than to representative collections.
