An Embedding-Dynamic Approach to Self-supervised Learning

A number of recent self-supervised learning methods have shown impressive performance on image classification and other tasks. A somewhat bewildering variety of techniques have been used, not always with a clear understanding of the reasons for their …

Resolving Class Imbalance Problem for LiDAR-based Object Detector by Balanced Gradients and Contextual Ground Truth Sampling

An autonomous driving system requires a 3D object detector, which must perceive all present road agents reliably to navigate an environment safely. However, real-world driving datasets often suffer from the problem of data imbalance, which causes …

ORA3D: Overlap Region Aware Multi-view 3D Object Detection

Current multi-view 3D object detection methods often fail to detect objects in the overlap region properly, and the networks' understanding of the scene is often limited to that of a monocular detection network. Moreover, objects in the overlap …

Zero-shot Visual Commonsense Immorality Prediction

Artificial intelligence is currently powering diverse real-world applications. These applications have shown promising performance, but raise complicated ethical issues, i.e. how to embed ethics to make AI applications behave morally. One way toward …

Bridging the Domain Gap towards Generalization in Automatic Colorization

We propose a novel automatic colorization technique that learns domain-invariance across multiple source domains and is able to leverage such invariance to colorize grayscale images in unseen target domains. This would be particularly useful for …

Grounding Visual Representations with Texts for Domain Generalization

Reducing the representational discrepancy between source and target domains is a key component to maximize the model generalization. In this work, we advocate for leveraging natural language supervision for the domain generalization task. We …

Sound-guided Semantic Video Generation

The recent success of StyleGAN demonstrates that the pre-trained StyleGAN latent space is useful for realistic video generation. However, the generated motion in the video is usually not semantically meaningful due to the difficulty of determining the …

Zero-shot Visual Commonsense Immorality Prediction (Abstracted Version)

Artificial intelligence is currently powering diverse real-world applications. These applications have shown promising performance, but raise complicated ethical issues, i.e. how to embed ethics to make AI applications behave morally. One way toward …

Sound-Guided Semantic Image Manipulation

The recent success of the generative model shows that leveraging the multi-modal embedding space can manipulate an image using text information. However, manipulating an image with sources other than text, such as sound, is not easy due to the …

StopNet: Scalable Trajectory and Occupancy Prediction for Urban Autonomous Driving

We introduce a motion forecasting (behavior prediction) method that meets the latency requirements for autonomous driving in dense urban environments without sacrificing accuracy. A whole-scene sparse input representation allows StopNet to scale to …