Bridging the Domain Gap by Clustering-based Image-Text Graph Matching

Learning domain-invariant representations is important to train a model that can generalize well to unseen target task domains. Text descriptions inherently contain semantic structures of concepts and such auxiliary semantic cues can be used as …

Learning Temporal Cues by Predicting Objects Move for Multi-camera 3D Object Detection

In autonomous driving and robotics, there is a growing interest in utilizing short-term historical data to enhance multi-camera 3D object detection, leveraging the continuous and correlated nature of input video streams. Recent work has focused on …

Robust Sound-guided Image Manipulation

Recent successes suggest that an image can be manipulated by a text prompt, e.g., a landscape scene on a sunny day is manipulated into the same scene on a rainy day driven by a text input “raining”. These approaches often utilize a StyleCLIP-based …

EGTR: Extracting Graph from Transformer for Scene Graph Generation

Higher-order Relational Reasoning for Pedestrian Trajectory Prediction

Social relations have substantial impacts on the potential trajectories of each individual. Modeling these dynamics has been a central solution for more precise and accurate trajectory forecasting. However, previous works ignore the importance of …

Mitigating the Linguistic Gap with Phonemic Representations for Robust Multilingual Language Understanding

Approaches to improving multilingual language understanding often require multiple languages during the training phase, rely on complicated training techniques, and -- importantly -- struggle with significant performance gaps between high-resource …

CMDA: Cross-Modal and Domain Adversarial Adaptation for LiDAR-based 3D Object Detection

Recent LiDAR-based 3D Object Detection (3DOD) methods show promising results, but they often do not generalize well to target domains outside the source (or training) data distribution. To reduce such domain gaps and thus to make 3DOD models more …

InstructBooth: Instruction-following Personalized Text-to-Image Generation

Personalizing text-to-image models using a limited set of images for a specific object has been explored in subject-specific image generation. However, existing methods often face challenges in aligning with text prompts due to overfitting to the …

BEVMap: Map-Aware BEV Modeling for 3D Perception

In autonomous driving applications, there is a strong preference for modeling the world in Bird’s-Eye View (BEV), as it leads to improved accuracy and performance. BEV features are widely used in perception tasks since they allow fusing information …

Localization and Manipulation of Immoral Visual Cues for Safe Text-to-Image Generation

Current text-to-image generation methods produce high-resolution and high-quality images, but they should not produce immoral images that may contain inappropriate content from the perspective of commonsense morality. Conventional approaches, …