Cream: Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models

Recent advances in Large Language Models (LLMs) have stimulated a surge of research aimed at extending their applications to the visual domain. While these models exhibit promise in generating abstract image captions and facilitating natural …

Distillation for High-Quality Knowledge Extraction via Explainable Oracle Approach

Recent successes suggest that knowledge distillation techniques can usefully transfer knowledge between deep neural networks as compression and acceleration techniques, e.g., by effectively and reliably compressing a large teacher model into a smaller …

The Power of Sound (TPoS): Audio Reactive Video Generation with Stable Diffusion

In recent years, video generation has become a prominent generative tool and has drawn significant attention. However, audio-to-video generation has received little consideration, even though audio contains unique qualities like temporal semantics and …

Bridging the Domain Gap by Clustering-based Image-Text Graph Matching

Learning domain-invariant representations is important for training a model that generalizes well to unseen target task domains. Text descriptions inherently contain semantic structures of concepts, and such auxiliary semantic cues can be used as …

Localization and Manipulation of Immoral Visual Cues for Safe Text-to-Image Generation

Current text-to-image generation methods produce high-resolution and high-quality images, but they should not produce immoral images that may contain inappropriate content from the perspective of commonsense morality. Conventional approaches, …

RUFI: Reducing Uncertainty in Behavior Prediction with Future Information

Autonomous driving has shown significant progress in recent years, but accurately predicting the movements of surrounding traffic agents remains a challenge for ensuring safety. Previous studies have focused on behavior prediction using large-scale …

CloudNet: A LiDAR-Based Face Anti-Spoofing Model That Is Robust Against Light Variation

Face anti-spoofing (FAS) is a technology that protects face recognition systems from presentation attacks. The current challenge faced by FAS studies is the difficulty in creating a generalized light variation model. This is because face data are …

An Embedding-Dynamic Approach to Self-supervised Learning

A number of recent self-supervised learning methods have shown impressive performance on image classification and other tasks. A somewhat bewildering variety of techniques have been used, not always with a clear understanding of the reasons for their …

Resolving Class Imbalance Problem for LiDAR-based Object Detector by Balanced Gradients and Contextual Ground Truth Sampling

An autonomous driving system requires a 3D object detector, which must perceive all present road agents reliably to navigate an environment safely. However, real-world driving datasets often suffer from the problem of data imbalance, which causes …