Research

(* equal contribution)

2026

Towards Scene-Aware Video-to-Spatial Audio Generation

Jaeyeon Kim*, Heeseung Yun*, Gunhee Kim
International Journal of Computer Vision (IJCV 2026)

Extending ViSAGe with a 4x more efficient framework & a 3x larger benchmark for video-to-ambisonics generation

#AudioVisual #360 #SpatialAudio

2025

WoW-Bench: Evaluating Fine-Grained Acoustic Perception in Audio-Language Models via Marine Mammal Vocalization

Preprint; under review

Assessing low-level auditory perception and cognition capabilities of large audio-language models (In collaboration with NVIDIA)

#LargeMultimodalModel #AudioUnderstanding

Gaze Beyond the Frame: Forecasting Egocentric 3D Visual Span

Advances in Neural Information Processing Systems (NeurIPS 2025 Spotlight)

Introducing SLAM-based visual attention lifting to anticipate where we look (before we leap) in the 3D world

#Egocentric #3D #Multisensory

FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games

Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)

Assessing lateral thinking and long-term memory of GUI agents with adventure games (In collaboration with Krafton)

#LargeMultimodalModel #EmbodiedAgent

Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates

Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2025)

Demonstrating vulnerabilities of arbitrary X-language models using LLMs combined with diversity-promoting self-training

#LargeMultimodalModel #KnowledgeDistillation

ReSpec: Relevance and Specificity Grounded Online Filter for Learning on Video-Text Data Streams

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025)

Training a state-of-the-art video-language model with as little as 5% of the pretraining data (In collaboration with AI2 & LG AI Research)

#LargeMultimodalModel #VideoUnderstanding

ViSAGe: Video to Spatial Audio Generation

Jaeyeon Kim, Heeseung Yun, Gunhee Kim
International Conference on Learning Representations (ICLR 2025)

Proposing the first end-to-end framework and benchmark for generating immersive ambisonics from silent video

#AudioVisual #360 #SpatialAudio

2024

Spherical World-Locking for Audio-Visual Localization in Egocentric Videos

European Conference on Computer Vision (ECCV 2024)

Leveraging egocentric self-motion to better stabilize and localize any user-centric multisensory signals (Work done during internship at Meta)

#Egocentric #AudioVisual #Multisensory #SpatialAudio #3D

2023

Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation

Heeseung Yun*, Joonil Na*, Gunhee Kim
Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2023)

Predicting 2D depth/semantics and 3D indoor structure solely from binaural audio observations without seeing

#Egocentric #SpatialAudio #KnowledgeDistillation #3D #Multisensory

Fusing Pre-trained Language Models with Multimodal Prompts through Reinforcement Learning

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023)

Aligning text-only models to multimodal prompts without paired image/audio supervision (In collaboration with AI2)

#LargeMultimodalModel #VideoUnderstanding #AudioUnderstanding

2022

Panoramic Vision Transformer for Saliency Detection in 360° Videos

Heeseung Yun, Sehun Lee, Gunhee Kim
European Conference on Computer Vision (ECCV 2022)

Enabling ViT variants to process omnidirectional imagery with minimal distortion via a single-step geometric approximation

#360 #Multisensory #VideoUnderstanding

2021

Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos

Heeseung Yun, Youngjae Yu, Wonsuk Yang, Kangil Lee, Gunhee Kim
Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2021)

Establishing the first large-scale benchmark & framework for audio-visual QA with panoramic videos

#AudioVisual #360 #VideoUnderstanding

Transitional Adaptation of Pretrained Models for Visual Storytelling

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021)

Proposing a transitional stage to better align pretrained vision encoders and language models prior to fine-tuning

#LargeMultimodalModel #VideoUnderstanding

2020

Character Grounding and Re-Identification in Story of Videos and Text Descriptions

European Conference on Computer Vision (ECCV 2020)

Unifying movie character identity matching across video, text, and story context into a single end-to-end loop

#VideoUnderstanding

2019

A Social Robot Generating Video Summaries of Seniors' Indoor Activities

Chih-Yuan Yang, Heeseung Yun, Srenavis Varadaraj, Jane Yung-jen Hsu
ACM International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI 2019)

Developing a social robot framework that actively tracks seniors to synthesize long-term indoor footage into glanceable summaries

#EmbodiedAgent #VideoUnderstanding