Research

(* equal contribution)

2026

Towards Scene-Aware Video-to-Spatial Audio Generation

Jaeyeon Kim*, Heeseung Yun*, Gunhee Kim
International Journal of Computer Vision (IJCV 2026)

Extending ViSAGe with a 4x more efficient framework & a 3x larger benchmark for video-to-ambisonics generation

#AudioVisual #360 #SpatialAudio

2025

WoW-Bench: Evaluating Fine-Grained Acoustic Perception in Audio-Language Models via Marine Mammal Vocalization

Preprint; under review

Assessing low-level auditory perception and cognition capabilities of large audio-language models (In collaboration with NVIDIA)

#LargeMultimodalModel #AudioUnderstanding

Gaze Beyond the Frame: Forecasting Egocentric 3D Visual Span

Advances in Neural Information Processing Systems (NeurIPS 2025 Spotlight)

Introducing SLAM-based visual attention lifting to anticipate where we look (before we leap) in the 3D world

#Egocentric #3D #Multisensory

FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games

Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)

Assessing lateral thinking and long-term memory of GUI agents with adventure games (In collaboration with Krafton)

#LargeMultimodalModel #EmbodiedAgent

Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates

Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2025)

Demonstrating vulnerabilities of arbitrary X-language models using LLMs combined with diversity-promoting self-training

#LargeMultimodalModel #KnowledgeDistillation

ReSpec: Relevance and Specificity Grounded Online Filter for Learning on Video-Text Data Streams

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025)

Training a state-of-the-art video-language model with as little as 5% of the pretraining data (In collaboration with AI2 & LG AI Research)

#LargeMultimodalModel #VideoUnderstanding

ViSAGe: Video to Spatial Audio Generation

Jaeyeon Kim, Heeseung Yun, Gunhee Kim
International Conference on Learning Representations (ICLR 2025)

Proposing the first end-to-end framework and benchmark for generating immersive ambisonics from silent video

#AudioVisual #360 #SpatialAudio

2024

Spherical World-Locking for Audio-Visual Localization in Egocentric Videos

European Conference on Computer Vision (ECCV 2024)

Leveraging egocentric self-motion to better stabilize and localize any user-centric multisensory signals (Work done during internship at Meta)

#Egocentric #AudioVisual #Multisensory #SpatialAudio #3D

2023

Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation

Heeseung Yun*, Joonil Na*, Gunhee Kim
Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2023)

Predicting 2D depth/semantics and 3D indoor structure solely from binaural audio observations without seeing

#Egocentric #SpatialAudio #KnowledgeDistillation #3D #Multisensory

Fusing Pre-trained Language Models with Multimodal Prompts through Reinforcement Learning

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023)

Aligning text-only models to multimodal prompts without paired image/audio supervision (In collaboration with AI2)

#LargeMultimodalModel #VideoUnderstanding #AudioUnderstanding

2022

Panoramic Vision Transformer for Saliency Detection in 360° Videos

Heeseung Yun, Sehun Lee, Gunhee Kim
European Conference on Computer Vision (ECCV 2022)

Enabling ViT variants to process omnidirectional imagery with minimal distortion via a single-step geometric approximation

#360 #Multisensory #VideoUnderstanding

2021

Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos

Heeseung Yun, Youngjae Yu, Wonsuk Yang, Kangil Lee, Gunhee Kim
Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2021)

Establishing the first large-scale benchmark & framework for audio-visual QA with panoramic videos

#AudioVisual #360 #VideoUnderstanding

Transitional Adaptation of Pretrained Models for Visual Storytelling

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021)

Proposing a transitional stage to better align pretrained vision encoders and language models prior to fine-tuning

#LargeMultimodalModel #VideoUnderstanding

2020

Character Grounding and Re-Identification in Story of Videos and Text Descriptions

European Conference on Computer Vision (ECCV 2020)

Unifying movie character identity matching across video, text, and story context into a single end-to-end loop

#VideoUnderstanding

2019

A Social Robot Generating Video Summaries of Seniors' Indoor Activities

Chih-Yuan Yang, Heeseung Yun, Srenavis Varadaraj, Jane Yung-jen Hsu
ACM International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI 2019)

Developing a social robot framework that actively tracks seniors to synthesize long-term indoor footage into glanceable summaries

#EmbodiedAgent #VideoUnderstanding