1

Mitigating the Linguistic Gap with Phonemic Representations for Robust Multilingual Language Understanding

Approaches to improving multilingual language understanding often struggle with significant performance gaps between high-resource and low-resource languages. While there are efforts to align the languages in a single latent space to mitigate such …

Zero-Shot Cross-Lingual NER Using Phonemic Representations for Low-Resource Languages

Finetuning Pre-trained Model with Limited Data for LiDAR-based 3D Object Detection by Bridging Domain Gaps

LiDAR-based 3D object detectors have been largely utilized in various applications, including autonomous vehicles or mobile robots. However, LiDAR-based detectors often fail to adapt well to target domains with different sensor configurations (e.g., …

Enhanced Motion Forecasting with Visual Relation Reasoning

In this work, we emphasize and demonstrate the importance of visual relation learning for motion forecasting task in autonomous driving (AD). Since exploiting the benefits of RGB images in the existing vision-based joint perception and prediction …

LRSLAM: Low-rank Representation of Signed Distance Fields in Dense Visual SLAM System

Simultaneous Localization and Mapping (SLAM) has been crucial across various domains, including autonomous driving, mobile robotics, and mixed reality. Dense visual SLAM, leveraging RGB-D camera systems, offers advantages but faces challenges in …

MEVG Multi-event Video Generation with Text-to-Video Models

We introduce a novel diffusion-based video generation method, generating a video showing multiple events given multiple individual sentences from the user. Our method does not require a large-scale video dataset since our method uses a pre-trained …

VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions

Predicting future trajectories for other road agents is an essential task for autonomous vehicles. Established trajectory prediction methods primarily use agent tracks generated by a detection and tracking system and HD map as inputs. In this work, …

Learning Temporal Cues by Predicting Objects Move for Multi-camera 3D Object Detection

In autonomous driving and robotics, there is a growing interest in utilizing short-term historical data to enhance multi camera 3D object detection, leveraging the continuous and correlated nature of input video streams. Recent work has focused on …

Robust Sound-guided Image Manipulation

Recent successes suggest that an image can be manipulated by a text prompt, e.g., a landscape scene on a sunny day is manipulated into the same scene on a rainy day driven by a text input “raining”. These approaches often utilize a StyleCLIP-based …

EGTR: Extracting Graph from Transformer for Scene Graph Generation

Scene Graph Generation (SGG) is a challenging task of detecting objects and predicting relationships between objects. After DETR was developed, one-stage SGG models based on a one-stage object detector have been actively studied. However, complex …