A Text-Steerable Instrument for Sketching Procedural Soundscapes via Language Models

Prabal Gupta

Image credit: Prabal Gupta

Abstract:

We present a real-time musical interface that converts natural-language scene descriptions into evolving procedural soundscapes. A performer types a prompt such as “warm jazz cafe at midnight” and then steers the result through direct parameter adjustments – stepping brightness down, switching a rhythm style – each of which produces a predictable, audible shift without re-prompting. Where GPU-bound text-to-audio systems synthesize monolithic waveforms, our instrument generates human-readable configurations over a categorical schema, which enables fine-grained performer control; the schema is designed so that most valid combinations sound musically coherent. Three interchangeable backends – embedding retrieval for sub-second CPU-only use, hosted LLMs via API, and a fine-tuned 270M-parameter local model – all emit the same schema. A live generator architecture emits audio continuously while resolving new instructions in the background and crossfades to the new configuration once it is ready; even when an LLM takes 5 to 12 seconds to respond, the audience hears uninterrupted sound. This reframes text-to-music as an ongoing, performable stream rather than a one-shot generation. We evaluate text-audio semantic alignment using LAION-CLAP on held-out prompts as a technical proxy and find that retrieval-based configuration outperforms random valid configurations on this metric; we note, however, that LAION-CLAP also informed retrieval-map construction, so the comparison is not fully independent. We also report performance observations and informal listener feedback, and we release materials for the SDK, dataset artifacts, model, and audiovisual performance interface.
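To make the live generator idea concrete, the following is a minimal Python sketch of the pattern described above: audio blocks keep flowing from the current configuration while a slow backend resolves a prompt on a background thread, and the result is brought in with an equal-power crossfade. It is an illustration under stated assumptions, not the released SDK. The names SceneConfig, LiveGenerator, slow_backend, and the fields brightness and rhythm are hypothetical, and the equal-power fade curve is one common choice rather than the one the instrument necessarily uses.

```python
import math
import threading
import time
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SceneConfig:
    # Hypothetical categorical fields; the real schema is not given in the abstract.
    brightness: str = "medium"
    rhythm: str = "none"


def slow_backend(prompt: str) -> SceneConfig:
    """Stand-in for a hosted LLM call (assumption: it returns a full config)."""
    time.sleep(0.05)  # simulated latency; the abstract cites 5 to 12 s for real LLMs
    return SceneConfig(brightness="warm", rhythm="swing")


class LiveGenerator:
    """Keeps emitting blocks from `current`; fades to `pending` once it exists."""

    def __init__(self, initial: SceneConfig, fade_blocks: int = 4):
        self.current = initial
        self.pending = None
        self.fade_blocks = fade_blocks
        self._fade_pos = 0
        self._lock = threading.Lock()

    def _set_pending(self, cfg: SceneConfig) -> None:
        # A request arriving mid-fade restarts the fade; a fuller version
        # would need to handle that case without a gain jump.
        with self._lock:
            self.pending = cfg
            self._fade_pos = 0

    def request(self, prompt: str, backend) -> threading.Thread:
        """Resolve a prompt in the background without blocking audio output."""
        worker = threading.Thread(
            target=lambda: self._set_pending(backend(prompt)), daemon=True
        )
        worker.start()
        return worker

    def steer(self, **changes) -> None:
        """Direct parameter adjustment: no backend call, no re-prompting."""
        self._set_pending(replace(self.current, **changes))

    def next_block(self):
        """Return (old_cfg, new_cfg, old_gain, new_gain) for one audio block."""
        with self._lock:
            if self.pending is None:
                return self.current, None, 1.0, 0.0
            self._fade_pos += 1
            x = self._fade_pos / self.fade_blocks
            # Equal-power crossfade: old_gain**2 + new_gain**2 == 1 throughout.
            old_gain = math.cos(x * math.pi / 2)
            new_gain = math.sin(x * math.pi / 2)
            old, new = self.current, self.pending
            if self._fade_pos >= self.fade_blocks:
                self.current, self.pending = new, None
            return old, new, old_gain, new_gain
```

In this sketch a synthesis loop would call next_block() once per audio buffer, render both configurations during a fade, and mix them with the returned gains. Because request() returns immediately, the loop never waits on the backend, which is the property that lets the audience hear uninterrupted sound.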