Professor Schommer studied Computer Science at the
University of Saarbrücken and at the German Research Centre for
Artificial Intelligence. In 2000, he received his PhD degree from the
Goethe-University of Frankfurt/Main.
From 1997 until 2003, Professor Schommer worked for IBM Research and
Development as an IT Architect in several Business Intelligence projects
worldwide. Since 2003, he has been an Associate Professor at the
University of Luxembourg and an international expert in Data Science.
Q1. Is domain knowledge necessary for a data scientist?
Absolutely yes! Domain knowledge is needed, for example, in situations where additional (potentially unknown) data needs to be added, where seasonal effects emerge, and where decisions have to be taken regarding the preparation of the data in general. Knowledge about cultural differences is also critically important, for example concerning the color schemes applied in visualisations.
Q2. What should every data scientist know about machine learning?
I personally see Machine Learning as an attractor of the data life
cycle and the heart of each data science process. All this data, whether
in a raw or a prepared state, will eventually end up there. From that
point of view, I believe that each data scientist should not only be a
specialist in their own field but should also have enough methodological
grounding to put themselves in a Machine Learner's place. Moreover, I
believe that each data scientist who works with the results of Machine
Learning processes should be prepared to understand how those results
came about.
Q3. What are the most effective machine learning algorithms?
Simplicity, comprehensibility, and power are, in my eyes, the most conclusive arguments for effectiveness. For that reason, the Apriori algorithm (Association Discovery), C4.5 (Decision Tree; Classification), and k-means (Clustering) are to me the most effective algorithms. This is also confirmed by a vote among experts, published during an ICDM 2006 panel session (organised by Xindong Wu and Vipin Kumar): there, C4.5, k-means, and Apriori were ranked in the top four (together with SVM, which appeared in third position).
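As a minimal, self-contained sketch of one of these algorithms, here is k-means (Lloyd's algorithm) in plain Python on hypothetical two-dimensional data; the data, parameter choices, and function names are illustrative assumptions, not part of the interview:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm for 2-D points; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)       # initialise with k distinct points
    for _ in range(iters):
        # assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for x, y in points:
            i = min(range(k),
                    key=lambda c: (x - centroids[c][0]) ** 2
                                + (y - centroids[c][1]) ** 2)
            clusters[i].append((x, y))
        # update step: move each centroid to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids

# two visually obvious groups of points (hypothetical data)
data = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8),
        (8.0, 8.1), (7.9, 8.3), (8.2, 7.9)]
print(sorted(kmeans(data, 2)))
```

The appeal named above, simplicity and comprehensibility, is visible here: the whole method is one assignment step and one averaging step, repeated until the centroids stop moving.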
Q4. What is your experience with data blending? (*)
It is a very tedious but critical process, particularly because this kind of data fusion takes place before the analytics begins but after a (potentially) stable data architecture has been constructed. Working with data blending therefore means risking the re-building and changing of a running data system on the one side, while ensuring data quality for further processing on the other. Working on such a bridge is not an easy task.
Q5. Predictive Modeling: How can you perform accurate feature engineering/extraction?
A certain number of factors should be considered. First, predictive
modeling should not be performed alone but in a team, which itself
should be composed of experts from different fields (domain experts,
statisticians, data engineers, machine learning experts). Second, it
should be clear that predictive patterns do not necessarily imply
causality; it is wise to critically check the gained results and to
involve domain experts (see question 1). Third, a developed predictive
model is not necessarily the best one. Instead, alternatives should be
developed and tested under different conditions.
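The third point, developing and testing alternative models, can be sketched as a hold-out comparison; the synthetic data, the two candidate models, and the split below are illustrative assumptions, not from the interview:

```python
import random

def mean_baseline(train):
    """Trivial alternative: always predict the training mean."""
    m = sum(y for _, y in train) / len(train)
    return lambda x: m

def linear_fit(train):
    """Least-squares line y = intercept + slope * x."""
    n = len(train)
    sx = sum(x for x, _ in train); sy = sum(y for _, y in train)
    sxx = sum(x * x for x, _ in train); sxy = sum(x * y for x, y in train)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return lambda x: intercept + slope * x

def mse(model, data):
    """Mean squared error of a fitted model on held-out data."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# synthetic data: a noisy line, with the last points held out for testing
rng = random.Random(42)
data = [(x, 2 * x + 1 + rng.gauss(0, 0.5)) for x in range(30)]
train, test = data[:20], data[20:]

for fit in (mean_baseline, linear_fit):
    print(fit.__name__, mse(fit(train), test))
```

Neither model is "the" answer; the point is only that each candidate is judged on data it has not seen, so the alternatives can be compared under the same conditions.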
Q6. Can data ingestion be automated?
I do not think that automation will ever be complete, ranging from collection all the way to deciding which data can or should be stored and which can or should be removed. But I believe that, particularly in the age of big data and texts, a symbiosis of human (data) care, high-performance computing, and the right use of AI-related inventions (e.g., robots, self-healing systems) may become highly effective.
Q7. How do you ensure data quality?
Maintaining data quality is mostly an adaptive process, for example
because provisions of national law may change or because the analytical
aims and purposes of the data owner may vary. Therefore, data quality
assurance should be performed regularly, it should be consistent with
the law (data privacy aspects and others), and it should be jointly
performed by a team of experts from different disciplines (e.g., data
engineers, lawyers, computer scientists, mathematicians).
Q8. When is data mining a suitable technique to use, and when not?
The application of data mining in a serious way requires a sufficient
amount of data. Remember, the idea of exploring data has existed for
hundreds, even thousands, of years! It is dangerous and wrong to reason
about and interpret causality if a solid quantity of data is missing. A
second aspect is that Data Mining does not follow a straightforward
standard procedure but a data-driven principle: from that perspective,
each data mining project is new in itself and requires individual
treatment.
It should also be remembered that “if there is nothing in the data,
then nothing can be found”. Even this last point should be respected.
Apart from that, Data Mining is one of several directions in the Data
Life Cycle: analytical results should always be verified with other
methods.
Q9. How do you evaluate if the insight you obtain from data analytics is “correct” or “good” or “relevant” to the problem domain?
In my understanding, an insight is already valuable, evaluated
information, obtained after a detailed interpretation, which can be used
for any kind of follow-up activity, for example to relocate merchandise
or to dig deeper into clusters showing fraudulent behavior.
However, it is less opportune to rely only on statistical values: an
association rule that shows a conditional probability of, e.g., 90% or
more may be an “insight”, but if the right-hand side of the rule refers
only to a plastic bag (which has to be paid for (3 cents), at least in
Luxembourg), the discovered pattern might be uninteresting.
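The plastic-bag example can be made concrete: the confidence of an association rule is exactly this conditional probability, and a rule such as {milk} → {bag} can reach 100% confidence simply because the bag appears in every basket anyway. A small sketch with hypothetical basket data (item names and numbers invented for illustration):

```python
def support(baskets, itemset):
    """Fraction of baskets containing every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(baskets, lhs, rhs):
    """Conditional probability P(rhs | lhs) = supp(lhs ∪ rhs) / supp(lhs)."""
    return support(baskets, set(lhs) | set(rhs)) / support(baskets, lhs)

# hypothetical supermarket baskets: every purchase includes a plastic bag
baskets = [frozenset(b) for b in (
    {"milk", "bread", "bag"}, {"milk", "bag"}, {"beer", "bag"},
    {"bread", "bag"}, {"milk", "bread", "bag"}, {"beer", "chips", "bag"},
)]

# {milk} -> {bag} has perfect confidence, yet tells us nothing:
# the bag's own support is already 100%.
print(confidence(baskets, {"milk"}, {"bag"}))   # → 1.0
print(support(baskets, {"bag"}))                # → 1.0
```

This is the statistical trap described above: a high confidence value alone does not make a rule interesting; comparing it against the right-hand side's baseline support (as lift does) is what separates an insight from an artifact.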
Q10. What were your most successful data projects? And why?
The most successful projects I have been involved in are certainly
those where an effect could be seen immediately after some kind of
action was taken. In this regard, I remember a business project
concerning the detection of fraud in telco data, as well as a diverse
number of Market Basket Analysis projects where the customers' behavior
and profiling patterns were used to improve customer satisfaction.
Q11. What are the typical mistakes done when analyzing data for a large scale data project? Can they be avoided in practice?
I believe that missing expertise, inappropriate communication among
the team members, and the favouring of quick-and-dirty solutions are
serious problems. Personally, I am not a friend of sampling. The reason
is that interesting data patterns may disappear and that a subset of the
data does not necessarily reflect the overall data structure. Also,
statistical values should not have the final word (see question 9) and
should not be the only reason for an insight. The analysis of data is
also a multi-disciplinary and multi-cultural concern and should be
performed correspondingly.
Q12. What are the ethical issues that a data scientist must always consider?
Because of the masses of data sensors that emerge day by day, and that
affect both data loading and the further use of data, each data
scientist bears a kind of responsibility in terms of, e.g., data
correctness, data privacy, and data availability. As a central ethical
issue, this (new) burden of work should be accepted; each data scientist
should be aware of that. Data Science should also be seen as a chance
to do something good and meaningful.