Conference Agenda

Overview and details of the sessions of this conference. Please select a date or location to show only sessions held on that day or at that location. Select a single session for a detailed view (with abstracts and downloads, if available).

Session Overview
Session
SES B4: Sound 2
Time:
Wednesday, 07/Sept/2022:
9:55am - 10:40am

Location: Room B


Room B is room S02 at the FME building (Faculty of Mathematics and Statistics). The address is: C. Pau Gargallo 14, 08028 Barcelona. https://goo.gl/maps/QDEwQGp995qWGftC9

Presentations

Who are you talking to? Considerations on Designing Gender Ambiguous Voice User Interfaces

Matheus Tymburiba Elian1, Soh Masuko2, Toshimasa Yamanaka1

1University of Tsukuba; 2Rakuten Institute of Technology

With the widespread availability of Voice Assistants in smart devices, the use of Voice User Interfaces has increased greatly in recent years. Although the anthropomorphized Voice Assistants in these systems can aid users in many tasks, they can also activate harmful gender biases and stereotypes. Since the use of gender-ambiguous voice agents in these interfaces is considered a solution for mitigating these gender effects, this paper analyzes studies in the field of Voice User Interface design and proposes a theoretical framework for designing gender-ambiguous voice agents, considering the type of recording, the method of sound manipulation, the method of evaluating gender identification, and contextual characteristics.



Visualization of Affective Information in Music Using Chironomie

Kana Tatsumi, Shinji Sako

Graduate School of Engineering, Nagoya Institute of Technology, Japan

The purpose of this study is to visualize affective information that cannot be conveyed by symbolic notation alone, in order to enhance the musical experience of the hearing impaired. To represent the rhythm of music effectively and uniquely, we focused on Chironomie, which represents the structure of rhythm together with its emotional impression. In general, Chironomie is drawn as a curve corresponding to the score, determined by whether a short segment of the score belongs to one of two classes, Arsis or Thesis. First, we applied machine learning to classify Arsis and Thesis from the score as input. We conducted experiments to confirm the accuracy of the classification and the usefulness of the estimated Chironomie in conveying the rhythm of music. In the latter experiment, four types of stimuli combining visual and sound information were used to confirm the effects of Chironomie: score only, Chironomie only, score and Chironomie, and score and sound. Results showed that Chironomie has a certain usefulness in conveying the rhythmic structure of a piece of music. This paper mainly focuses on the evaluation experiments and discusses the experimental and analytical methods under these conditions.



Modelling Emotional Valence and Arousal of Non-Linguistic Utterances for Sound Design Support

Ahmed Khota, Eric Cooper, Yu Yan, Mate Kovacs

Ritsumeikan University, Japan

Non-Linguistic Utterances (NLUs), produced for popular media, computers, robots, and public spaces, can quickly and wordlessly convey the emotional characteristics of a message. They have been studied in terms of their ability to convey affect in robot communication. The objective of this research is to develop a model that correctly infers the emotional Valence and Arousal of an NLU. On a Likert scale, 17 subjects evaluated the relative Valence and Arousal of 560 sounds collected from popular movies, TV shows, and video games, including NLUs and other character utterances. Three audio feature sets were used to extract features including spectral energy, spectral spread, zero-crossing rate (ZCR), Mel Frequency Cepstral Coefficients (MFCCs), and audio chroma, as well as Pitch, Jitter, Formant, Shimmer, Loudness, and Harmonics-to-Noise Ratio, among others. After feature reduction by Factor Analysis, the best-performing models inferred average Valence with a Mean Absolute Error (MAE) of 0.107 and Arousal with an MAE of 0.097 on audio samples held out from the training stages. This means that the model predicted the Valence and Arousal of a given NLU with an error smaller than the spacing between successive rating points on the 7-point Likert scale (0.14). This inference system is applicable to the development of novel NLUs to augment robot-human communication and to the design of sounds for other systems, machines, and settings.
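The pipeline the abstract describes (extract low-level audio features per clip, then regress mean Valence/Arousal ratings and score by MAE) can be illustrated with a minimal sketch. This is not the authors' code: the feature set here is reduced to just zero-crossing rate and log spectral energy, the regressor is plain ordinary least squares instead of their models, and the clips and ratings are synthetic stand-ins for the 560 rated sounds.

```python
import numpy as np

def extract_features(signal, frame_len=512):
    """Compute two of the descriptors named in the abstract for one clip:
    mean zero-crossing rate and mean log spectral energy over short frames."""
    n = len(signal) // frame_len * frame_len
    frames = signal[:n].reshape(-1, frame_len)
    # Zero-crossing rate: fraction of sample-to-sample sign changes.
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0)
    # Spectral energy: power of the magnitude spectrum, averaged over frames.
    mag = np.abs(np.fft.rfft(frames, axis=1))
    energy = np.mean(np.sum(mag ** 2, axis=1))
    return np.array([zcr, np.log1p(energy)])

# Toy corpus: noise clips at different amplitudes standing in for the rated
# sounds, with hypothetical mean valence ratings (not data from the paper).
rng = np.random.default_rng(0)
amplitudes = (0.1, 0.5, 1.0, 2.0)
X = np.stack([extract_features(a * rng.standard_normal(8192)) for a in amplitudes])
y = np.array([-0.5, -0.1, 0.3, 0.8])  # made-up average ratings

# Least-squares fit with a bias column, in place of the paper's models.
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ w
mae = np.mean(np.abs(pred - y))  # the evaluation metric the abstract reports
```

In the study itself, held-out clips (not the training set, as here) are scored, and feature reduction by Factor Analysis precedes the regression step.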