
Silicon Valley builds AI audio tools for the masses, not the masters. With a 1,000-to-1 ratio between casual smartphone users and high-end audio professionals, it makes economic sense for developers to prioritize the former. That simple math helps explain why many commercial AI generators work well for quick social media clips but often fall short when dropped into a feature film or video game workflow requiring deep context and flexibility.
Charles Maynes is an Emmy Award-winning sound designer, composer, and re-recording mixer with 30 years of experience in feature film, television, and gaming audio. A historical consultant on 20th-century warfare whose credits include the Academy Award-winning U-571 and Letters From Iwo Jima, he's part of that cohort of high-end audio professionals and understands how current software intersects with, and falls short of, the mechanical realities of cinematic sound.
"We’re essentially asking AI for an executive summary of what we’re trying to communicate, but that summary leaves out all the unused material that might still matter. That’s the core issue with AI in creative output," says Maynes. Belonging to a highly specialized group numbering fewer than 3,000 people globally, he notes that recent technology changes and Hollywood strikes have cut available work by an estimated 65 percent. Meanwhile, many prominent AI audio tools target the much larger universe of casual users.
- Blind to the battle: Training commercial models on broad, mass-market data helps explain why they can sometimes struggle with the narrative and cultural nuance needed in professional film sound. An AI model can easily generate the generic sound of crashing waves for a TikTok video. For that use case, a rough executive summary of the visual information usually suffices. Many professionals find the technology less effective, though, when asked to support sequences that depend on off-camera story and historical specificity. "If you're in a feature film and you have a long shot of the Dunkirk beachhead, the AI might look at it and generate something that would probably work as a background layer, but there's a battle happening. We don't see it happening inside the screen image, but we know this is the collapse of the British and French armies. You can script that into the AI prompting, but you might end up having to script a lot to get the objective you're looking for."
- Paint by numbers: High-end work grounds that kind of context in real-world research and archives rather than purely synthetic sound. Maynes points to the BBC World War II “battle actuality” tapes famously used in Saving Private Ryan. He questions whether today’s commercial toolkits even have the licensing or database access to draw on such historically precise recordings, let alone the judgment to know when to evoke them in a soundtrack. "You wouldn't want to go to the National Gallery and see AI-generated paintings. You want to see real Monets. You want to go see the real artwork because the value in so much of it is somewhat due to its scarcity."
Generative AI compounds the contextual problem through the way it outputs files. In a professional DAW environment, sound designers routinely work without the composer’s final musical score. Audio elements are typically broken down and separated into independent tracks so a director can mute or manipulate specific groups on the final mixing stage once the music arrives. These separated tracks provide the margins required for a reactive mix. By generating a flattened, baked-in audio file from a prompt, many current AI tools collapse those margins and make it nearly impossible to adjust later.
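To see why flattening matters, consider a minimal sketch, not drawn from any specific DAW or AI tool, with hypothetical stem names and gain values: when elements arrive as separate stems, each one can still be re-leveled or muted on the mixing stage, while a prompt-generated file has those decisions already summed in.

```python
import numpy as np

# Illustrative stems (1 second of audio at 48 kHz); in a real session these
# would be dialogue, ambience, and effects tracks delivered as separate files.
sr = 48_000
rng = np.random.default_rng(0)
stems = {
    "dialogue": rng.standard_normal(sr) * 0.2,
    "ambience": rng.standard_normal(sr) * 0.1,
    "effects":  rng.standard_normal(sr) * 0.3,
}

def mix(stems, gains):
    """Sum stems with per-track gain; muting a track is just gain 0.0."""
    return sum(gains.get(name, 1.0) * audio for name, audio in stems.items())

# On the mixing stage: mute the effects once the score covers that ground,
# and lift the dialogue slightly.
reactive_mix = mix(stems, {"effects": 0.0, "dialogue": 1.2})

# A flattened, prompt-generated file is the already-summed signal: the
# per-element balance is baked in, so the same adjustment is no longer
# possible after the fact.
flattened = mix(stems, {})
```

The point of the sketch is the asymmetry: the separated stems can always be summed down to the flattened version, but the flattened version cannot be pulled back apart when the music or the edit changes.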
Zipdo’s AI in the motion picture industry report notes that current deployments of AI tools are heavily concentrated in tedious, mechanical work in post-production, from noise reduction and clean-up to basic conforming, freeing humans to focus on higher-level decisions. But the highly modular layer of narrative mixing demands real-time human reaction to late-arriving music and editorial changes, something current AI tools struggle with.
- Unbaking the cake: "You want this stuff to all be separated because you might find that certain elements are simply not needed because music is doing something that is covering that ground. Hopefully, we aren't looking at a future where somebody unattached to our discipline is able to essentially do an end around to get the outcome that we would normally provide."
If studios start treating flattened AI output as an acceptable standard, some fear that non-specialists could be tempted to bypass professional crews altogether. That tension is already playing out in the music industry, where a high volume of standardized audio is testing platform economics. A recent Guardian analysis of Deezer cited a report suggesting that up to 70 percent of streams of AI-generated music on the service could be fraudulent. Other investigations into bot farms, fake artists, and “ghost artists” on streaming platforms highlight how artificial plays can oversaturate recommendation feeds.
- Fast food frequencies: As tech companies pitch new partnerships to reach mainstream audiences faster, Maynes points out the risk of identical, AI-assisted productions diluting distinctive human work. "Do we need 350 knockoffs of Nine Inch Nails? I guess those creative ideas essentially are in jeopardy of being diluted to such a degree that we end up with just kind of a McDonaldization of the output."
- Sweating the details: For Maynes, the most promising vision for AI in sound editing is a high-level “sounding board” inside the workstation itself. In this model, the human still executes the core creative work, and the AI analyzes that work—along with the picture—to suggest ideas or reference points. "With the sound itself, it would entirely lack context. But you might include the visual in the AI process, and it might look at it and suggest that Randy Thom might create a sound for fog because you've got a sense of humidity in the air quality of the image. Ideally, you wouldn't want the AI to create something like that for you. But you want to just get the idea of, oh, maybe I should do something that's going to fulfill that observation."
- Steal like an artist: Such a system could be prompted to compare the current track to the styles of other Oscar-winning sound designers, offering high-level observations based on those reference aesthetics in the same way that all artists pull from their influences to create something unique. "We're all processing through our influences. You listen to Nine Inch Nails, and you get to hear all sorts of other stuff that influenced Trent Reznor. And the Beatles or David Bowie, they were all influenced by other artists. So the magic comes from essentially the way that they interpreted those artists and created their own work, which really didn't sound like the artists that they were influenced by so much."
For Maynes, the defining question is how professionals will interact with their workstations as algorithms become more capable. If teams start leaning on automated systems to generate the final output, the most human parts of the process—the unexpected discoveries during a mix, the deep research behind a historically grounded scene—could be sidelined in favor of technically passable summaries that lose the soul of the creative work.
"It's a matter of coming to grips with how we can use it in a way that is going to be in the interest of creativity, as opposed to replacing our own creative input," Maynes concludes. "Because if we end up just being a bus driver, I don't know that that's where we want to go. We want to be able to determine where the bus is going as opposed to just being attached to the machine while it moves."
