
ENTP: Encoder-only Next Token Prediction

Next-token prediction models have predominantly relied on decoder-only Transformers with causal attention, driven by the common belief that causal attention is essential to prevent “cheating” by masking future tokens. We challenge this widely …
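
For reference, below is a minimal sketch of the conventional causal (lower-triangular) attention mask used in decoder-only Transformers, i.e., the convention this abstract questions. The function name, tensor shapes, and PyTorch usage are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of the standard causal attention mask used by
# decoder-only Transformers (the convention questioned above).
# Shapes and names are illustrative only.
import torch
import torch.nn.functional as F

def causal_self_attention(q, k, v):
    # q, k, v: (batch, seq_len, dim)
    seq_len, dim = q.size(1), q.size(2)
    scores = q @ k.transpose(-2, -1) / dim ** 0.5              # (batch, seq, seq)
    # Lower-triangular mask: position i may attend only to positions <= i,
    # which is what prevents "cheating" on future tokens.
    mask = torch.tril(torch.ones(seq_len, seq_len, device=q.device)).bool()
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# An encoder-only (bidirectional) variant would simply omit the masking step,
# letting every position attend to every other position.
```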

InstructBooth: Instruction-following Personalized Text-to-Image Generation

Personalizing text-to-image models using a limited set of images for a specific object has been explored in subject-specific image generation. However, existing methods often face challenges in aligning with text prompts due to overfitting to the …

Sparse-to-Dense LiDAR Point Generation by LiDAR-Camera Fusion for 3D Object Detection

Accurately detecting objects at long distances remains a critical challenge in 3D object detection when relying solely on LiDAR sensors, due to the inherent sparsity of LiDAR data. To address this issue, we propose the LiDAR-Camera Augmentation …

H-Direct: Homeostasis-aware Direct Spike Encoding for Deep Spiking Neural Networks

Deep spiking neural networks (SNNs), gaining attention as the next generation of artificial neural networks, have been successfully applied across many applications thanks to the development of various algorithms, such as spike encoding. Spike encoding …
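
To illustrate what a spike-encoding scheme looks like, here is a minimal sketch of generic Poisson rate coding. This is only a common baseline encoding, not the homeostasis-aware direct encoding proposed in H-Direct, and the function name and shapes are assumptions.

```python
# Minimal sketch of one common spike-encoding scheme (Poisson rate coding).
# A generic example for context, not the paper's direct encoding method.
import torch

def poisson_rate_encode(image, num_steps, generator=None):
    # image: tensor with values in [0, 1]; returns (num_steps, *image.shape)
    # binary spike trains whose firing rate is proportional to pixel intensity.
    rand = torch.rand((num_steps, *image.shape), generator=generator)
    return (rand < image).float()
```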

Just Add $100 More: Augmenting Pseudo-LiDAR Point Cloud for Resolving Class-imbalance Problem

Typical LiDAR-based 3D object detection models are trained on real-world data collections, which are often imbalanced over classes. To deal with this imbalance, augmentation techniques are commonly used, such as copying ground-truth LiDAR points and pasting them …
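
Below is a minimal sketch of the ground-truth copy-and-paste augmentation described above, assuming a simple per-object point database. It illustrates the common baseline rather than the paper's pseudo-LiDAR augmentation; all names and data layouts are assumptions.

```python
# Minimal sketch of ground-truth copy-and-paste augmentation: points inside
# annotated boxes are stored in a database and pasted into other scenes.
# Data layout is assumed for illustration only.
import numpy as np

def paste_ground_truth(scene_points, gt_database, rng, num_paste=5):
    """scene_points: (N, 3) LiDAR points of the target scene.
    gt_database: list of dicts {"points": (M_i, 3) array, "box": 7-vector}.
    Returns augmented points and the list of pasted boxes."""
    pasted_boxes = []
    chosen = rng.choice(len(gt_database),
                        size=min(num_paste, len(gt_database)),
                        replace=False)
    for idx in chosen:
        sample = gt_database[idx]
        scene_points = np.concatenate([scene_points, sample["points"]], axis=0)
        pasted_boxes.append(sample["box"])
    return scene_points, pasted_boxes
```

A practical version would additionally reject pastes whose boxes collide with objects already present in the scene.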

Unified Domain Generalization and Adaptation for Multi-View 3D Object Detection

Recent advances in 3D object detection leveraging multi-view cameras have demonstrated their practical and economical value in various challenging vision tasks. However, typical supervised learning approaches face challenges in achieving satisfactory …

Bridging the Domain Gap by Clustering-based Image-Text Graph Matching

Learning domain-invariant representations is important to train a model that can generalize well to unseen target task domains. Text descriptions inherently contain semantic structures of concepts, and such auxiliary semantic cues can be used as …

Text-Driven Prototype Learning for Few-Shot Class-Incremental Learning

Few-shot class-incremental learning (FSCIL) aims to learn generalizable representations with large amounts of initial data and incrementally adapt to new classes with limited data (i.e., few-shot). Recently, prototype-based approaches have shown …
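
For context on the prototype-based approaches mentioned here, a minimal sketch of class-mean prototypes with nearest-prototype classification follows. The names and the Euclidean distance choice are illustrative assumptions and do not reproduce the paper's text-driven prototypes.

```python
# Minimal sketch of a prototype-based classifier: each class is represented
# by the mean of its (few) embedded examples, and queries are assigned to
# the nearest prototype. Illustrative only, not the paper's method.
import torch

def build_prototypes(features, labels, num_classes):
    # features: (N, D) embeddings, labels: (N,) integer class ids
    return torch.stack([features[labels == c].mean(dim=0)
                        for c in range(num_classes)])

def classify_by_prototype(query, prototypes):
    # query: (Q, D); assign each query to the nearest prototype (Euclidean)
    distances = torch.cdist(query, prototypes)   # (Q, num_classes)
    return distances.argmin(dim=1)
```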

Who Should Have Been Focused: Transferring Attention-based Knowledge from Future Observations for Trajectory Prediction

Accurately predicting the trajectories of dynamic agents is crucial for the safe navigation of autonomous robots. However, achieving precise predictions based solely on past and current observations is challenging due to the inherent uncertainty in …

Leveraging Inductive Bias in ViT for Medical Image Diagnosis

Recent advances in attention-based models have raised expectations for automated diagnosis applications in computer vision due to their high performance. However, attention-based models tend to lack some of the inductive biases inherent to images, …