Building on the momentum of a landmark research year, our teams have made strides that not only propel the technology behind our next-generation generative AI studio but also enrich the broader scientific community through contributions to leading conferences and journals in signal processing, machine learning, and music information retrieval.
Together, these works showcase how our interdisciplinary approach—bridging generative modeling, signal processing, and large‑scale training—advances both practical applications and academic research. Below are some highlights from our 2025 publications.
Moises‑Light: Efficient Source Separation for Edge Devices
Y.‑N. Hung, I. Pereira, and F. Korzeniowski, “Moises-Light: Resource-efficient Band-split U-Net for Music Source Separation,” in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2025.
Moises‑Light introduces a lightweight band‑split U‑Net architecture for music source separation. By combining band splitting, rotary position embedding (RoPE) transformer blocks, and an optimized encoder–decoder design, it achieves signal‑to‑distortion ratios comparable to those of much larger systems such as BS‑RoFormer while using up to 13× fewer parameters, enabling state‑of‑the‑art separation in embedded and real‑time settings.
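To give a flavor of the core idea, the sketch below shows how a spectrogram can be cut into frequency bands, with each band projected to a fixed‑size embedding that a transformer can then process across bands and time. It is a minimal PyTorch illustration: the band edges, embedding size, and module structure are assumptions for the example, not the configuration used in Moises‑Light.

```python
import torch
import torch.nn as nn

class BandSplit(nn.Module):
    """Project each frequency band of a real/imaginary spectrogram to a fixed-size embedding."""
    def __init__(self, band_edges, emb_dim=128):
        super().__init__()
        self.band_edges = band_edges
        # One linear projection per band: (band_bins * 2 for real/imag) -> emb_dim.
        self.proj = nn.ModuleList(
            nn.Linear((hi - lo) * 2, emb_dim) for lo, hi in band_edges
        )

    def forward(self, spec):                            # spec: (batch, freq_bins, time, 2)
        tokens = []
        for (lo, hi), proj in zip(self.band_edges, self.proj):
            band = spec[:, lo:hi]                       # (batch, band_bins, time, 2)
            band = band.permute(0, 2, 1, 3).flatten(2)  # (batch, time, band_bins * 2)
            tokens.append(proj(band))                   # (batch, time, emb_dim)
        return torch.stack(tokens, dim=1)               # (batch, n_bands, time, emb_dim)

# Toy example: a 513-bin STFT split into three hypothetical bands.
spec = torch.randn(1, 513, 200, 2)
tokens = BandSplit([(0, 64), (64, 192), (192, 513)])(spec)
print(tokens.shape)  # torch.Size([1, 3, 200, 128])
```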
Diff‑DMX: Diffusion‑based Singing Voice Separation
G. Plaja‑Roglans, Y.‑N. Hung, X. Serra, and I. Pereira, “Generating Separated Singing Vocals Using a Diffusion Model Conditioned on Music Mixtures,” in Proc. WASPAA, 2025.
Diff‑DMX formulates vocal separation as a conditional generation problem. A diffusion model, guided by a dedicated conditioning module, directly synthesizes clean vocals from mixtures. The system attains separation quality on par with competitive non‑generative baselines, while adjustable sampling hyperparameters allow users to balance inference speed and output fidelity.
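As a rough illustration of that speed/fidelity knob, here is a generic conditional diffusion sampling loop in which the number of denoising steps is the main tunable. The Euler‑style update, the noise schedule, and the `denoiser` stand‑in are illustrative assumptions, not the Diff‑DMX implementation.

```python
import torch

def sample_vocals(denoiser, mixture, num_steps=50, sigma_max=1.0, sigma_min=0.002):
    """Start from noise and iteratively refine a vocal estimate, conditioning
    every denoising step on the input mixture."""
    sigmas = torch.linspace(sigma_max, sigma_min, num_steps)
    x = torch.randn_like(mixture) * sigma_max              # pure noise at the start
    for i in range(num_steps - 1):
        x0_hat = denoiser(x, sigmas[i], cond=mixture)       # predicted clean vocals
        d = (x - x0_hat) / sigmas[i]                        # step direction
        x = x + d * (sigmas[i + 1] - sigmas[i])             # move to the next noise level
    return x

# Shape check with a trivial stand-in denoiser; a trained network replaces it.
dummy_denoiser = lambda x, sigma, cond: cond
mixture = torch.randn(1, 1, 16000)                          # one second of mono audio at 16 kHz
fast = sample_vocals(dummy_denoiser, mixture, num_steps=10)    # quicker, typically rougher
slow = sample_vocals(dummy_denoiser, mixture, num_steps=100)   # slower, typically cleaner
print(fast.shape, slow.shape)
```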
LDM‑DMX: Latent Diffusion for Fast Separation
G. Plaja‑Roglans, Y.‑N. Hung, X. Serra, and I. Pereira, “Efficient and Fast Generative-Based Singing Voice Separation using a Latent Diffusion Model,” in Proc. Int. Joint Conf. Neural Netw. (IJCNN), 2025.
LDM‑DMX performs diffusion in the latent space of EnCodec, drastically reducing computational demands while retaining high separation quality. Trained solely on mixture–vocal pairs, the model produces clean spectral estimates with strong interference suppression and significantly faster inference, underscoring the practicality of diffusion‑based separation in production tools.
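Schematically, the approach wraps a denoising loop like the one above between a codec encoder and decoder, so every diffusion step operates on a short latent sequence instead of raw audio. The sketch below uses placeholder `codec_encode`, `codec_decode`, and `latent_denoiser` callables; the 128‑dimensional latents and roughly 75 frames per second in the comments are typical of EnCodec's 24 kHz model and are given for intuition only.

```python
import torch

def separate_vocals_latent(mixture, codec_encode, codec_decode, latent_denoiser,
                           num_steps=25):
    """Encode the mixture to codec latents, denoise in that compact space, decode to audio."""
    z_mix = codec_encode(mixture)                  # (batch, latent_dim, frames), ~75 frames/s
    z = torch.randn_like(z_mix)                    # start from noise in latent space
    sigmas = torch.linspace(1.0, 1e-3, num_steps)
    for i in range(num_steps - 1):                 # same Euler-style loop as above,
        z0_hat = latent_denoiser(z, sigmas[i], cond=z_mix)   # but over far shorter sequences
        z = z + (z - z0_hat) / sigmas[i] * (sigmas[i + 1] - sigmas[i])
    return codec_decode(z)                         # back to a waveform vocal estimate

# Shape check with trivial stand-ins; real codec and denoiser networks replace these.
enc = lambda wav: torch.randn(wav.shape[0], 128, wav.shape[-1] // 320)  # EnCodec-like frame rate
dec = lambda z: torch.randn(z.shape[0], 1, z.shape[-1] * 320)
den = lambda z, sigma, cond: cond
print(separate_vocals_latent(torch.randn(1, 1, 24000), enc, dec, den).shape)
```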
Consistency‑based Lyrics Transcription on Mixtures
J. Huang, F. Sousa, E. Demirel, E. Benetos, and I. Gadelha, “Enhancing Lyrics Transcription on Music Mixtures with Consistency Loss,” in Proc. INTERSPEECH, 2025.
This work fine‑tunes Whisper with LoRA for automatic lyrics transcription directly on music mixtures. A consistency loss aligns internal representations for isolated vocals and mixtures, improving robustness without a separate separation stage. On multilingual datasets, including new Italian and Portuguese test sets, the approach yields higher accuracy on mixtures than naïve fine‑tuning.
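The consistency idea is easy to sketch: run the same adapted encoder on an isolated‑vocal clip and on the corresponding mixture, and penalize the distance between the two hidden representations alongside the usual transcription loss. The snippet below is a minimal PyTorch sketch under that reading; the MSE distance, the loss weight, and the `encoder`/`asr_loss` placeholders are illustrative choices, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def training_step(encoder, asr_loss, vocals, mixture, target_tokens,
                  lambda_consistency=1.0):
    """One hypothetical training step: transcription loss on the mixture plus a
    consistency term pulling mixture encodings toward the isolated-vocal ones."""
    h_vocals = encoder(vocals)            # (batch, frames, hidden) from clean vocals
    h_mixture = encoder(mixture)          # same encoder applied to the full mixture
    # Consistency loss: treat the clean-vocal branch as the target (one possible choice).
    consistency = F.mse_loss(h_mixture, h_vocals.detach())
    # Standard lyrics-transcription loss, e.g. cross-entropy over predicted tokens.
    transcription = asr_loss(h_mixture, target_tokens)
    return transcription + lambda_consistency * consistency
```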
Simple and Effective Semantic Song Segmentation
F. Korzeniowski and R. Vogl, “Simple and Effective Semantic Song Segmentation,” in Proc. Int. Soc. Music Inf. Retrieval Conf. (ISMIR), 2025.
The paper proposes a compact convolutional–temporal model that jointly predicts structural boundaries and section labels (verse, chorus, etc.) from spectrograms and self‑similarity lag matrices. Using carefully cleaned datasets and a newly introduced benchmark, it surpasses more complex transformer‑ and graph‑based methods while remaining computationally efficient.
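For readers less familiar with the second input: a self‑similarity lag matrix records, for every frame, how similar it is to the frames a fixed number of steps earlier, so repeated sections show up as characteristic stripes that help locate boundaries. A minimal NumPy version is sketched below; the cosine similarity, the feature choice, and the maximum lag are illustrative assumptions rather than the paper's exact preprocessing.

```python
import numpy as np

def self_similarity_lag_matrix(features, max_lag=128):
    """features: (n_frames, n_dims) per-frame features (e.g. chroma or log-mel).
    Returns an (n_frames, max_lag) matrix of cosine similarities at each lag."""
    normed = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    n_frames = len(normed)
    lag = np.zeros((n_frames, max_lag), dtype=np.float32)
    for k in range(1, max_lag + 1):
        # Similarity between frame t and frame t - k (looking only at past frames).
        lag[k:, k - 1] = np.sum(normed[k:] * normed[:-k], axis=1)
    return lag

# Toy input: 1000 frames of 12-dimensional chroma-like features.
feats = np.random.rand(1000, 12)
print(self_similarity_lag_matrix(feats).shape)  # (1000, 128)
```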
Music AI researchers have also contributed to several collaborations that broaden the scope of music and audio intelligence—from new datasets and symbolic–audio alignment to diffusion‑based timbre transfer and efficient sound detection. These contributions reflect our commitment to advancing open research and supporting the wider scientific community.
J. Loth, P. Sarmento, S. Sarkar, Z. Guo, M. Barthet, and M. Sandler, “GOAT: A Large Dataset of Paired Guitar Audio Recordings and Tablatures,” in Proc. ISMIR, 2025.
A large‑scale, high‑quality dataset pairing synchronized guitar audio and tablature annotations, designed to accelerate research on tablature transcription, generation, and performance modeling.
J. C. Martinez‑Sevilla, F. Foscarin, P. Garcia‑Iasci, D. Rizo, J. Calvo‑Zaragoza, and G. Widmer, “Optical Music Recognition of Jazz Lead Sheets,” in Proc. ISMIR, 2025.
This work presents an optical music recognition pipeline tailored to the complexity of handwritten jazz lead sheets, addressing chord symbol variability and notation ambiguities.
M. Mancusi, Y. Halychanskyi, K. W. Cheuk, E. Moliner, C.‑H. Lai, S. Uhlich, J. Koo, M. A. Martínez‑Ramírez, W.‑H. Liao, G. Fabbro, and Y. Mitsufuji, “Latent Diffusion Bridges for Unsupervised Musical Audio Timbre Transfer,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2025.
Introduces a latent diffusion bridge approach for unsupervised timbre transfer, enabling smooth transitions between audio domains without parallel datasets.
T. Morocutti, F. Schmid, J. Greif, F. Foscarin, and G. Widmer, “Exploring Performance–Complexity Trade‑Offs in Sound Event Detection Models,” in Proc. Eur. Signal Process. Conf. (EUSIPCO), 2025.
Analyzes performance–complexity trade‑offs in sound event detection models and distills guidelines for building systems that stay competitive with large state‑of‑the‑art models at a fraction of the computational cost.
E. Karystinaios, F. Foscarin, and G. Widmer, “EngravingGNN: A Hybrid Graph Neural Network for End-to-End Piano Score Engraving,” in Proc. Int. Conf. on Technologies for Music Notation and Representation (TENOR), 2025.
Focuses on automatic music engraving, i.e., the creation of a human‑readable musical score from musical content, and proposes a unified graph neural network (GNN) framework for piano music with quantized symbolic input.
J. Loth, P. Sarmento, M. Sandler, and M. Barthet, “GuitarFlow: Realistic Electric Guitar Synthesis from Tablatures via Flow Matching and Style Transfer,” in Proc. Comput. Music Multidisciplinary Res. Conf. (CMMR), 2025.
Proposes GuitarFlow, a neural audio synthesis system that converts symbolic tablatures into expressive electric‑guitar performances using flow matching and timbre style transfer techniques.
(Science aside, Moises is also known for providing the most popular swag at ISMIR.)
Collectively, these publications highlight Music AI’s growing impact across the spectrum of music‑intelligence research—from efficient audio separation and transcription to generative synthesis, symbolic representation, and dataset creation. By engaging in both fundamental and applied studies, our researchers contribute not only to advancing the capabilities of our in‑house generative music technologies but also to fostering open, collaborative progress within the global research community.
(Francesco Foscarin pushing the latest beat tracker into production. Seoul, Korea.)
Looking ahead to 2026, Music AI is poised to build on the strong scientific foundation established over the past year. With new projects already underway in multimodal audio–text modeling, interactive music generation, and music analysis, our teams aim to push the boundaries of how people create, understand, and experience music through AI. We anticipate deeper collaboration between research and production, expanding partnerships with the academic community, and continued contributions to open datasets and benchmarks. The next year promises not only technological advances but also new ways to connect artistic creativity with scientific innovation.

