Project Overview
SentiSing is an interactive VR singing system that externalizes vocal and musical affect as real-time environmental feedback.
The system treats voice and music as continuous affective signals, mapping arousal and valence to dynamic changes in a virtual forest, including color, growth, and atmospheric motion.
This design forms a real-time multisensory affective loop in which singers directly perceive and shape their emotional expression through interaction with the environment.
Background
Singing is a highly embodied activity in which vocal expression unfolds dynamically over time. Beyond pitch and loudness, the human voice carries rich affective variation that reflects expressive intensity and emotional nuance.
However, many digital karaoke and music interaction systems support singing primarily through auditory output or symbolic visual cues such as pitch indicators and scores; the affective dimension is rarely considered, and immersive emotional feedback remains limited.
Virtual reality offers a space to reconsider what singing interfaces might respond to. Affective vocal expression remains an underexplored interaction input in immersive systems. Singing provides a continuous stream of emotional modulation that users intuitively shape with music. Treating vocal affect as interaction material suggests a way for VR environments to respond not only to action, but to felt experience through multisensory feedback.
Affective interaction research commonly represents emotional dynamics using arousal–valence models. Within music interaction systems, these representations support visual feedback that responds fluidly to expressive variation. Color, for example, has been widely used to convey affective valence through warm–cool mappings, while real-time visualization can support performers’ awareness and modulation of vocal expression.
Approach
The prototype is implemented in Unity 6 and deployed on an Oculus Quest 3, with vocal input captured via a dynamic microphone and analyzed in real time. Before singing begins, the system performs an individual calibration phase in which users complete controlled vocal tasks, including sustained phonation and variations in pitch and intensity, to establish a personal vocal baseline. This baseline normalizes subsequent arousal–valence outputs, so that affective variation is interpreted relative to the user's own expressive range. Signal-quality confidence gating further stabilizes real-time interaction while preserving differences in vocal range and expressive style.
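To make the baseline-and-gating step concrete, the sketch below shows one way such normalization could work; the class, the confidence threshold, and the statistics used are illustrative assumptions rather than the system's actual implementation.

    import numpy as np

    class VocalBaseline:
        """Per-user baseline for normalizing arousal-valence estimates.

        Illustrative sketch: stores mean/std of calibration-phase estimates
        and gates low-confidence frames. Names and thresholds are assumptions.
        """

        def __init__(self, min_confidence=0.6):
            self.min_confidence = min_confidence
            self.mean = None
            self.std = None

        def fit(self, calibration_av):
            # calibration_av: (n_frames, 2) arousal-valence values collected during
            # the controlled vocal tasks (sustained phonation, pitch/intensity sweeps).
            calibration_av = np.asarray(calibration_av, dtype=float)
            self.mean = calibration_av.mean(axis=0)
            self.std = calibration_av.std(axis=0) + 1e-6  # avoid division by zero

        def normalize(self, av, confidence):
            # Gate frames whose signal-quality confidence is too low by returning
            # None; the caller can then hold the previous stable value.
            if confidence < self.min_confidence:
                return None
            return (np.asarray(av, dtype=float) - self.mean) / self.std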
To drive the virtual environment, the system integrates two complementary affect streams within a shared arousal–valence space. Vocal affect is analyzed in real time using openSMILE (eGeMAPS), functioning as a direct control signal that reflects moment-to-moment expressive variation. Musical affect is pre-analyzed using Essentia to capture the song’s emotional trajectory, which shapes the overall atmospheric context.
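The sketch below illustrates this two-stream setup under stated assumptions: the real-time stream uses the opensmile Python package to extract eGeMAPS functionals from a microphone buffer (the mapping from features to arousal–valence is left abstract), and the pre-analyzed Essentia trajectory is assumed to be exported to a simple JSON file whose layout is invented here for illustration.

    import json
    import numpy as np
    import opensmile

    # Real-time vocal affect: eGeMAPS functionals over short voice windows.
    smile = opensmile.Smile(
        feature_set=opensmile.FeatureSet.eGeMAPSv02,
        feature_level=opensmile.FeatureLevel.Functionals,
    )

    def vocal_features(window, sample_rate=48000):
        # window: 1-D numpy array holding the latest microphone buffer.
        # A downstream regressor (not shown) would map these features to arousal-valence.
        return smile.process_signal(window, sample_rate)

    # Musical affect: the song's arousal-valence trajectory is computed offline with
    # Essentia and exported; the JSON layout below is an assumed format.
    def load_song_trajectory(path):
        with open(path) as f:
            points = json.load(f)  # e.g. [{"t": 0.0, "arousal": 0.2, "valence": 0.5}, ...]
        times = np.array([p["t"] for p in points])
        av = np.array([[p["arousal"], p["valence"]] for p in points])
        return times, av

    def song_affect_at(t, times, av):
        # Piecewise-linear interpolation of the pre-analyzed trajectory at playback time t.
        return np.array([np.interp(t, times, av[:, 0]), np.interp(t, times, av[:, 1])])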
Users are situated in a fantasy forest where singing transforms the environment through metaphorical representations of affect in color, growth, and atmospheric motion, making vocal expression and musical emotion perceptible in space.
Interaction is driven by singing. The user's voice guides a butterfly through the forest: pitch controls its vertical movement and its interaction with MIDI-generated note orbs, providing real-time feedback on vocal activity. Visual feedback evolves through the interplay of vocal and musical affect, situating individual expression within the song's emotional landscape.
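As an illustration of the pitch mapping, the sketch below converts a fundamental-frequency estimate into a normalized height using a logarithmic scale; the frequency and height ranges are placeholder values, not the project's actual parameters.

    import math

    def pitch_to_height(f0_hz, f_low=110.0, f_high=880.0, y_min=0.0, y_max=5.0):
        """Map fundamental frequency to the butterfly's vertical position.

        Illustrative sketch: a log-frequency mapping so equal musical intervals
        move the butterfly by equal distances. The range limits are assumptions;
        after calibration they could be set per user.
        """
        if f0_hz <= 0:  # unvoiced frame: no target height
            return None
        t = (math.log2(f0_hz) - math.log2(f_low)) / (math.log2(f_high) - math.log2(f_low))
        t = min(max(t, 0.0), 1.0)  # clamp to the playable range
        return y_min + t * (y_max - y_min)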
Affective features modulate multiple visual elements within the forest scene. Vocal valence influences the color of flower bloom effects that appear when the butterfly passes through note orbs, using warm and cool tones to convey affective polarity. Vocal arousal controls the density of magical vegetation emerging along the butterfly’s flight path, linking expressive intensity to visible growth activity within the environment.
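The following sketch shows one plausible form of these two mappings; the color endpoints, value ranges, and spawn rates are assumptions standing in for the project's actual palette and tuning.

    def valence_to_bloom_color(valence):
        """Blend between a cool and a warm tone according to valence in [-1, 1].

        Illustrative sketch: the RGB endpoints are assumptions standing in for
        the project's actual flower-bloom palette.
        """
        cool = (0.35, 0.55, 0.95)   # cool blue for negative valence
        warm = (0.98, 0.62, 0.25)   # warm orange for positive valence
        t = (max(-1.0, min(1.0, valence)) + 1.0) / 2.0
        return tuple(c + t * (w - c) for c, w in zip(cool, warm))

    def arousal_to_spawn_rate(arousal, base_rate=2.0, max_rate=12.0):
        # Higher arousal -> denser magical vegetation along the flight path
        # (spawned objects per second; the numbers are placeholders).
        t = (max(-1.0, min(1.0, arousal)) + 1.0) / 2.0
        return base_rate + t * (max_rate - base_rate)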
Musical affect modulates surrounding fog and leaf coloration to establish an ambient emotional atmosphere that mirrors the song’s mood dynamically. The broader background sky and aurora colors are jointly controlled by vocal and musical affect, with greater weight assigned to vocal expression to emphasize real-time performance nuances. To highlight voice-music affective alignment, the system triggers subtle particle effects as non-scoring visual cues when vocal and musical affect coincide within predefined regions of the arousal–valence space. These cues invite singers to sense moments when their expression aligns with the song’s emotional flow, supporting a felt experience of resonance.
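A minimal sketch of the blending and alignment logic is given below; the vocal weight, the alignment radius, and the use of a simple distance test are assumptions meant only to illustrate the described behavior.

    import numpy as np

    def blend_affect(vocal_av, music_av, vocal_weight=0.7):
        """Weighted combination driving the sky and aurora colors.

        The 0.7 vocal weight is an assumed value reflecting the design choice to
        emphasize real-time performance over the pre-analyzed song trajectory.
        """
        vocal_av = np.asarray(vocal_av, dtype=float)
        music_av = np.asarray(music_av, dtype=float)
        return vocal_weight * vocal_av + (1.0 - vocal_weight) * music_av

    def affect_aligned(vocal_av, music_av, radius=0.25):
        # Treat the two signals as aligned when they fall within the same
        # neighborhood of the arousal-valence plane; the radius is a placeholder
        # for the predefined regions that trigger the particle cues.
        return float(np.linalg.norm(np.asarray(vocal_av) - np.asarray(music_av))) <= radius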



