Glauner, Patrick in Proceedings of the 26th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2018) (2018)

The underlying paradigm of big data-driven machine learning reflects the desire to derive better conclusions from simply analyzing more data, without the necessity of looking at theory and models. Is having simply more data always helpful? In 1936, The Literary Digest collected 2.3M filled-in questionnaires to predict the outcome of that year's US presidential election. The outcome of this big data prediction proved to be entirely wrong, whereas George Gallup only needed 3K handpicked people to make an accurate prediction. Generally, biases occur in machine learning whenever the distributions of the training set and the test set are different. In this work, we provide a review of different sorts of biases in (big) data sets in machine learning. We provide definitions and discussions of the most commonly appearing biases in machine learning: class imbalance and covariate shift. We also show how these biases can be quantified and corrected. This work is an introductory text for both researchers and practitioners to become more aware of this topic and thus to derive more reliable models for their learning problems.

Glauner, Patrick in Proceedings 13th International FLINS Conference on Data Science and Knowledge Engineering for Sensing Decision Support (FLINS 2018) (2018)

In machine learning, a bias occurs whenever training sets are not representative of the test data, which results in unreliable models. The most common biases in data are arguably class imbalance and covariate shift. In this work, we aim to shed light on this topic in order to increase the overall attention to this issue in the field of machine learning. We propose a scalable novel framework for reducing multiple biases in high-dimensional data sets in order to train more reliable predictors. We apply our methodology to the detection of irregular power usage from real, noisy industrial data. In emerging markets, irregular power usage, and electricity theft in particular, may range up to 40% of the total electricity distributed. Biased data sets are a particular issue in this domain. We show that reducing these biases increases the accuracy of the trained predictors. Our models have the potential to generate significant economic value in a real-world application, as they are being deployed in commercial software for the detection of irregular power usage.

Hommes, Stefan in Optimising Packet Forwarding in Multi-Tenant Networks using Rule Compilation (2017, November)

Packet forwarding in Software-Defined Networks (SDN) relies on a centralised network controller which enforces network policies expressed as forwarding rules. Rules are deployed as sets of entries into network device tables. With heterogeneous devices, deployment is strongly bounded by the respective table constraints (size, lookup time, etc.) and forwarding pipelines. Hence, minimising the overall number of entries is paramount in reducing resource consumption and speeding up the search.
Moreover, since multiple control plane applications can deploy their own rules, conflicts may occur. To avoid those and ensure overall correctness, a rule validation mechanism is required. Here, we present a compilation mechanism for rules of diverging origins that minimises the number of entries. Since it exploits the semantics of rules and entries, our compiler fits a heterogeneous landscape of network devices. We evaluated compiler implementations on both software and hardware switches using a realistic testbed. Experimental results show a reduction in both produced table entries and forwarding delay.

Glauner, Patrick in Proceedings of the 17th IEEE International Conference on Data Mining Workshops (ICDMW 2017) (2017)

Power grids are critical infrastructure assets that face non-technical losses (NTL) such as electricity theft or faulty meters. NTL may range up to 40% of the total electricity distributed in emerging countries. Industrial NTL detection systems are still largely based on expert knowledge when deciding whether to carry out costly on-site inspections of customers. Electricity providers are reluctant to move to large-scale deployments of automated systems that learn NTL profiles from data due to the latter's propensity to suggest a large number of unnecessary inspections. In this paper, we propose a novel system that combines automated statistical decision making with expert knowledge. First, we propose a machine learning framework that classifies customers into NTL or non-NTL using a variety of features derived from the customers' consumption data. The methodology used is specifically tailored to the level of noise in the data.
Second, in order to allow human experts to feed their knowledge into the decision loop, we propose a method for visualizing prediction results at various granularity levels in a spatial hologram. Our approach allows domain experts to put the classification results into the context of the data and to incorporate their knowledge when making the final decisions on which customers to inspect. This work has produced appreciable results on a real-world data set of 3.6M customers. Our system is being deployed in commercial NTL detection software.