Visual Language Models Train Robots to Read Human Emotions
This article is part of our exclusive IEEE Journal Watch series in partnership with IEEE Xplore. As robots advance i...
Source Evidence
Low Confidence Warning: This story lacks strong corroboration from primary or official sources. Treat details as developing or speculative.
Why It Matters
By grounding their models in full-scene vision rather than isolated facial cues, researchers show that VLMs can improve human-robot teamwork by delivering more socially congruent apologies. Yet the study confirms that functional reliability still outweighs emotional finesse in building trust. For robot vendors, this points to a strategic shift: invest in context-aware VLMs for smoother collaboration, but prioritize task accuracy to secure market adoption.
Confirmed Facts
This article is part of our exclusive IEEE Journal Watch series in partnership with IEEE Xplore.
As robots advance in terms of dexterity and other physical capabilities, it becomes more likely that humans may find themselves working alongside them. If that happens, how will robots’ emotional capabilities need to advance for them to successfully work with people?
In a recent study, researchers trained collaborative robots to read human emotions by accounting not only for facial expressions but also for contextual factors in the interaction. Through experiments with 40 volunteers, the researchers then evaluated how a robot’s ability to read human emotions and adjust its behavior accordingly affected a human’s perception of the robot and its capabilities as the two collaborated on tasks. The results, which show that the emotional capabilities of robots only go so far with humans, were published 18 May in IEEE Robotics and Automation Letters.
Seung Chan Hong led the study as part of his undergraduate thesis while studying at Monash University, in Melbourne, Australia. He notes that, while there has been a lot of hype around the advancing physical abilities of robots, this is only one piece of the puzzle. “We need to also innovate when it comes to them actually interacting with humans, not just their physical capabilities,” he says.
This prompted him to dig deeper into the emotional aspects of human-robot interactions. First, Hong and his co-authors decided to train a robot to read human emotions using a vision language model (VLM), which is similar to the large language models behind tools such as ChatGPT but can also take visual inputs.
Training VLMs for Human Emotion Recognition
To evaluate their VLM, which used Gemini 2.5, the researchers had volunteers watch videos of robots handing over objects to humans, with varying degrees of success, and describe the emotions the humans were expressing. Importantly, the volunteers labeling these videos were able to take into account the broader context of each interaction, rather than reporting solely on the facial expressions of the humans in the video. For example, a person pausing to think with a furrowed brow may simply be concentrating on the task at hand, and not necessarily be angry. Contextual cues, such as a person drumming their fingers or pursing their lips, can point to the real cause of a furrowed brow.
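One way to picture why context matters in this kind of labeling is to compare how often a set of model-predicted emotion labels agrees with the majority label from human annotators. The Python sketch below is purely illustrative and is not the researchers' method: the function names, emotion labels, and data are all made up for the example, and the article does not say how agreement was actually scored.

```python
from collections import Counter


def majority_label(labels):
    """Return the most common label among annotators (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def agreement_rate(model_labels, human_labels_per_clip):
    """Fraction of clips where the model's label matches the human majority label."""
    if len(model_labels) != len(human_labels_per_clip):
        raise ValueError("need exactly one model label per clip")
    if not model_labels:
        raise ValueError("no clips to compare")
    matches = sum(
        model == majority_label(humans)
        for model, humans in zip(model_labels, human_labels_per_clip)
    )
    return matches / len(model_labels)


# Hypothetical annotations: three human labels for each of four video clips.
human = [
    ["concentrating", "concentrating", "angry"],
    ["frustrated", "frustrated", "frustrated"],
    ["happy", "neutral", "happy"],
    ["concentrating", "angry", "concentrating"],
]

# Hypothetical model outputs: one reading faces only, one using scene context.
face_only = ["angry", "frustrated", "happy", "angry"]
with_context = ["concentrating", "frustrated", "happy", "concentrating"]

print(agreement_rate(face_only, human))     # 0.5
print(agreement_rate(with_context, human))  # 1.0
```

In this toy data, the face-only labels mistake a furrowed brow for anger in two clips, while the context-aware labels match the human majority on every clip.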
Who Is Affected
- OpenAI
- Google DeepMind
- AI infrastructure teams
- AI product teams
What To Watch Next
- Watch for availability, cloud support, benchmark claims, and production timelines.
- Watch whether additional sources confirm the same claim.
Still Developing
- Source confidence is below the high-confidence threshold.