Alcaraz, Benoît et al. In: Alcaraz, Benoît; Hosseini Kivanani, Nina; Najjar, Amro (Eds.) et al., Advances in Information Retrieval (2023, March)

Meetings are recurrent organizational tasks intended to drive progress in an interdisciplinary and collaborative manner. They are, however, prone to inefficiency due to factors such as differing knowledge among participants. The research goal of this paper is to design a recommendation-based meeting assistant that can improve the efficiency of meetings by helping to contextualize the information being discussed and by reducing distractions for listeners. Following a Wizard-of-Oz setup, we gathered user feedback by thematically analyzing focus group discussions and identifying the key challenges and requirements for this kind of system. The findings point to shortcomings in contextualization and raise concerns about distracting listeners from the main content. Based on these findings, we developed a set of design recommendations that address context, interactivity, and personalization issues. These recommendations could be useful for developing a meeting assistant tailored to the needs of meeting participants, thereby helping to optimize the meeting experience.

Gilles, Peter et al. In: IEEE Spoken Language Technology (Ed.), Proceedings - 2022 IEEE Spoken Language Technology Workshop (SLT) (2023)

We present a first system for automatic speech recognition (ASR) for the low-resource language Luxembourgish. By applying transfer learning, we were able to fine-tune Meta’s wav2vec2-xls-r-300m checkpoint with 35 hours of labeled Luxembourgish speech data. The best word error rate obtained is 14.47.

Najjar, Amro et al. XAI: Using Smart Photobooth for Explaining History of Art (2022, December)

The rise of Artificial Intelligence has led to advancements in daily life, including applications in industry, telemedicine, farming, and smart cities. Despite AI’s success in "less technical" fields, human-AI synergies are necessary to guarantee user engagement and provide interactive expert knowledge. In this article, we discuss possible synergies between humans and AI for explaining the development of art history and artistic style transfer. This study is part of the "Smart Photobooth" project, which automatically transforms a user’s picture into a well-known artistic style as an interactive way to introduce the fundamentals of the history of art to the general public and to provide a concise explanation of the various painting styles. The study investigates human-AI synergies by combining the explanation produced by an explainable AI mechanism with a human expert’s insights to provide explanations for school students and a wider audience.

Hosseini Kivanani, Nina et al. In: Proc. Interspeech 2022 (2022, September)

Motivational speaking usually conveys a highly emotional message, and its purpose is to invite action. The goal of this paper is to investigate the prosodic realization of one particular type of cheering, namely inciting cheering for single addressees at sport events (here, long-distance running), using the name of that person. 31 native speakers of German took part in the experiment. They were asked to cheer on an individual marathon runner in a sporting event shown on video by producing his or her name (1-5 syllables long). For comparison, the participants also produced the same names in isolation and in carrier sentences. Our results reveal that speakers use different strategies to meet their motivational communicative goals: while some speakers produced the runners’ names by dividing them into syllables, others pronounced the names as quickly as possible, putting more emphasis on the first syllable. A few speakers followed a mixed strategy. Contrary to our expectations, it was not intensity that contributed most to the differences between the speaking styles (cheering vs. neutral), at least with the methods we were using. Rather, participants employed higher fundamental frequency and longer duration when cheering for marathon runners.

Alcaraz, Benoît et al. IRRMA: An Image Recommender Robot Meeting Assistant (2022, July)

The number of people who attend virtual meetings has increased as a result of COVID-19.
In this paper, we present a system that consists of an expressive humanoid social robot called QTRobot and a recommender system that employs natural language processing techniques to recommend images related to the content of the presenter’s speech to the audience in real time. This is achieved using the QTRobot platform’s capabilities (microphone, computation power, and Wi-Fi).

Hosseini Kivanani, Nina et al. Poster (2022, July)

Speakers’ voices are highly individual, and for this reason speakers can be identified based on their voice. Nevertheless, voices are often more variable within the same speaker than they are between speakers, which makes it difficult for humans and machines to differentiate between speakers (Hansen & Hasan, 2015). To date, various machine learning methods have been developed to recognize speakers based on the acoustic characteristics of their speech; however, not all of them have proven equally effective in speaker identification, and the system achieves different results depending on the technique used. Here, different machine learning classifiers (Naïve Bayes (NB), support vector machines (SVM), random forests (RF), and k-nearest neighbors (KNN)) have been applied to identify the best classification model for categorizing 4 speaking styles based on segment type (voiceless fricatives), considering the acoustic features of center of gravity, standard deviation, and skewness. We used a dataset consisting of speech samples from 7 native Persian subjects speaking in 4 different styles: read, spontaneous, clear, and child-directed speech.
The results revealed that the best-performing model for predicting speakers based on segment type was the RF model with an accuracy of 81.3%, followed by SVM (76.3%), NB (75.4%), and KNN (74%) (Table 1). Our results showed that RF performed best for the voiceless fricatives /f/, /s/, and /ʃ/, which may indicate that these segments are much more speaker-specific than others (Gordon et al., 2002), while model performance was low for the voiceless fricatives /h/ and /x/. Performance can be seen in the confusion matrix (Figure 1), with high precision and recall values (above 80%) for /f/, /s/, and /ʃ/ (Table 2). We found that model performance improved on data from the clear speaking style; speaker-specific information in the voiceless fricatives is more distinguishable in clear speech than in the other styles (Table 1).

Hosseini Kivanani, Nina et al. In: Hosseini Kivanani, Nina; Gretter, Roberto; Matassoni, Marco (Eds.) et al., BNAIC/BeneLearn 2021 (2021, November)

Pronunciation is one of the fundamentals of language learning, and it is considered a primary factor of spoken language when it comes to understanding and being understood by others. The persistent presence of high error rates in speech recognition domains resulting from mispronunciations motivates us to find alternative techniques for handling mispronunciations. In this study, we develop a mispronunciation assessment system that checks the pronunciation of non-native English speakers, identifies the phonemes commonly mispronounced by Italian learners of English, and presents an evaluation of the non-native pronunciation observed in phonetically annotated speech corpora.
In this work, to detect mispronunciations, we used a phone-based ASR system implemented with Kaldi. We used two labeled corpora of non-native English: (i) a corpus of Italian adults containing 5,867 utterances from 46 speakers, and (ii) a corpus of Italian children consisting of 5,268 utterances from 78 children. Our results show that the selected error model can discriminate correct sounds from incorrect sounds in both native and non-native speech, and can therefore be used to detect pronunciation errors in non-native speech. The phone error rates show improvement when the error language model is used. Furthermore, the ASR system shows better accuracy after applying the error model to our selected corpora.
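The mispronunciation-detection abstract above reports improvements in phone error rate (PER). As a minimal illustrative sketch (not the paper's Kaldi pipeline; the function name and the reference-length normalization below are my own assumptions), the standard Levenshtein-based PER between a reference and a hypothesized phone sequence can be computed as:

```python
def phone_error_rate(ref, hyp):
    """Phone error rate: Levenshtein distance between the reference and
    hypothesized phone sequences, normalized by the reference length."""
    m, n = len(ref), len(hyp)
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all of ref[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all of hyp[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[m][n] / m

# e.g. one substitution in a three-phone reference gives a PER of 1/3
print(phone_error_rate(["k", "ae", "t"], ["k", "ah", "t"]))
```

In practice toolkits such as Kaldi compute this alignment over full decoded lattices, but the metric itself reduces to this edit-distance normalization.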
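The speaker-identification poster above builds its classifiers on three spectral-moment features of voiceless fricatives: center of gravity, standard deviation, and skewness. A rough sketch of one common formulation of these moments, assuming a magnitude spectrum as input (the function and the toy spectrum are illustrative, not taken from the study):

```python
import numpy as np

def spectral_moments(freqs, mags):
    """First three spectral moments of a magnitude spectrum:
    center of gravity, standard deviation, and skewness."""
    w = mags / mags.sum()                        # magnitudes as weights
    cog = np.sum(freqs * w)                      # center of gravity (Hz)
    sd = np.sqrt(np.sum((freqs - cog) ** 2 * w)) # spread around the CoG
    skew = np.sum((freqs - cog) ** 3 * w) / sd ** 3  # spectral asymmetry
    return cog, sd, skew

# a symmetric toy spectrum: CoG at the middle bin, skewness near zero
cog, sd, skew = spectral_moments(np.array([1000.0, 2000.0, 3000.0]),
                                 np.array([1.0, 2.0, 1.0]))
```

Feature vectors of this kind, extracted per fricative segment, are what classifiers such as RF or SVM would consume in a study like the one described.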