2026
Towards Scene-Aware Video-to-Spatial Audio Generation
Extending ViSAGe with a 4x more efficient framework \& a 3x larger benchmark for video-to-ambisonics generation
2025
WoW-Bench: Evaluating Fine-Grained Acoustic Perception in Audio-Language Models via Marine Mammal Vocalization
Assessing the low-level auditory perception and cognition capabilities of large audio-language models (In collaboration with NVIDIA)
Gaze Beyond the Frame: Forecasting Egocentric 3D Visual Span
Introducing SLAM-based visual attention lifting to anticipate where we look (before we leap) in the 3D world
FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games
Assessing the lateral thinking and long-term memory of GUI agents via adventure games (In collaboration with Krafton)
Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates
Demonstrating vulnerabilities of any X-language model using LLMs combined with diversity-promoting self-training
ReSpec: Relevance and Specificity Grounded Online Filter for Learning on Video-Text Data Streams
Training a state-of-the-art video-language model with as little as 5\% of the pretraining data (In collaboration with AI2 \& LG AI Research)
ViSAGe: Video to Spatial Audio Generation
Proposing the first end-to-end framework and benchmark for generating immersive ambisonics from silent video
2024
Spherical World-Locking for Audio-Visual Localization in Egocentric Videos
Leveraging egocentric self-motion to better stabilize and localize user-centric multisensory signals (Work done during internship at Meta)
2023
Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation
Predicting 2D depth/semantics and 3D indoor structure solely from binaural audio observations, without ever seeing the scene
Fusing Pre-trained Language Models with Multimodal Prompts through Reinforcement Learning
Aligning text-only models to multimodal prompts without paired image/audio supervision (In collaboration with AI2)
2022
Panoramic Vision Transformer for Saliency Detection in 360° Videos
Enabling ViT variants to process omnidirectional imagery with minimal distortion via a single-step geometric approximation
2021
Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos
Establishing the first large-scale benchmark \& framework for audio-visual QA with panoramic videos
Transitional Adaptation of Pretrained Models for Visual Storytelling
Proposing a transitional stage to better align pretrained vision encoders and language models prior to fine-tuning
2020
Character Grounding and Re-Identification in Story of Videos and Text Descriptions
Unifying movie character identity matching across video, text, and story context into a single end-to-end loop
2019
A Social Robot Generating Video Summaries of Seniors' Indoor Activities
Developing a social robot framework that actively tracks seniors to synthesize long-term indoor footage into glanceable summaries