Left: Inpainting results on ultra-high-resolution images. Right: Comparison of LPIPS performance and latency among different state-of-the-art methods.

Abstract

Existing image inpainting methods have shown impressive completion results for low-resolution images. However, most of these algorithms fail at high resolutions and require powerful hardware, limiting their deployment on edge devices. Motivated by this, we propose the first baseline for REal-Time High-resolution image INpainting on Edge Devices (RETHINED) that is able to inpaint at ultra-high resolution and runs in real time ($\leq$ 30ms) on a wide variety of mobile devices. Our method is simple yet effective: a lightweight Convolutional Neural Network (CNN) recovers structure, followed by a resolution-agnostic patch replacement mechanism that provides detailed texture. Specifically, our pipeline leverages the structural capacity of CNNs and the high-level detail of patch-based methods, a key component for high-resolution image inpainting. To demonstrate the real-world applicability of our method, we conduct an extensive analysis on various mobile-friendly devices and demonstrate similar inpainting performance while being 100x faster than existing state-of-the-art methods. Furthermore, we release DF8K-Inpainting, the first free-form mask UHD inpainting dataset.

Method

Proposed Inpainting Pipeline. Given a HR image $y$ and a binary mask $m$ with corrupted pixels as inputs (left), our model first downsamples $x = y \odot m$ to a lower resolution $x_{LR}$, and forwards it to the coarse model $f_\theta$, obtaining $\hat{y}_{coarse}$. It is then refined by the NeuralPatchMatch module, obtaining $\hat{y}_{LR}$ and the attention map $A$. From $A$ and $x$, our Attention Upscaling module yields $\hat{y}_{HR}$.

Given a high-resolution RGB image $y \in \mathbb{R}^{H_{HR} \times W_{HR} \times 3}$ (where $H_{HR}$ and $W_{HR}$ denote, respectively, the height and width of the high-resolution image in pixels) and a binary mask $m \in \mathbb{R}^{H_{HR} \times W_{HR}}$ marking the corrupted pixels, our goal is to fill in the masked image $x = y \odot m$ with plausible content.

To achieve this goal, we first downsample $x$ to a lower resolution, obtaining $x_{LR} \in \mathbb{R}^{H \times W \times 3}$ (where $H < H_{HR}$ and $W < W_{HR}$), and forward it to the coarse model, obtaining the coarse inpainted image $\hat{y}_{coarse}$ of size $H \times W$. Then, we use the NeuralPatchMatch module to refine $\hat{y}_{coarse}$ by propagating known content from the input image $x_{LR}$, obtaining $\hat{y}_{LR}$ and the corresponding attention map $A$.
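To make this low-resolution stage concrete, the following is a minimal PyTorch-style sketch of it. The `coarse_model` and `neural_patch_match` callables stand in for $f_\theta$ and the NeuralPatchMatch module; their interfaces and the target resolution `lr_size` are illustrative assumptions rather than the paper's exact API.

```python
import torch
import torch.nn.functional as F

def inpaint_low_res(y_hr, m_hr, coarse_model, neural_patch_match,
                    lr_size=(512, 512)):
    """Low-resolution stage: mask, downsample, coarse fill, refine.

    y_hr: (B, 3, H_HR, W_HR) image; m_hr: (B, 1, H_HR, W_HR) binary mask
    (0 on corrupted pixels). `coarse_model`, `neural_patch_match`, and
    `lr_size` are assumptions for this sketch.
    """
    # Masked input x = y ⊙ m, then downsample image and mask.
    x = y_hr * m_hr
    x_lr = F.interpolate(x, size=lr_size, mode="bilinear", align_corners=False)
    m_lr = F.interpolate(m_hr, size=lr_size, mode="nearest")

    # Lightweight CNN recovers global structure at low resolution.
    y_coarse = coarse_model(x_lr, m_lr)

    # NeuralPatchMatch propagates known content from x_lr into the holes,
    # returning the refined image and the patch attention map A.
    y_lr, attn = neural_patch_match(y_coarse, x_lr, m_lr)
    return y_lr, attn
```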

Finally, our Attention Upscaling module uses the learned attention map $A$ together with $x$ to recover the high-frequency texture details present in the base image, obtaining the high-resolution result $\hat{y}_{HR}$.
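The property exploited here is that $A$ encodes patch-to-patch correspondences, which are resolution-agnostic: the same weights computed at low resolution can blend full-resolution patches. Below is a hedged sketch of this idea, assuming non-overlapping patches, a hypothetical patch size, and a simple unfold/fold layout where the HR patch grid matches the LR token grid; the actual module may handle patch overlap and normalization differently.

```python
import torch
import torch.nn.functional as F

def attention_upscale(x_hr, attn, corrupted, patch_hr=64):
    """Blend full-resolution patches with the low-resolution attention map.

    x_hr:      (B, C, H, W) masked HR image, H and W divisible by patch_hr.
    attn:      (B, N, N) attention map A over the N patch tokens.
    corrupted: (B, N) bool, True where a patch overlaps the hole.
    The patch size and layout are illustrative assumptions.
    """
    b, c, h, w = x_hr.shape
    # Extract non-overlapping HR patches: (B, C*p*p, N).
    patches = F.unfold(x_hr, kernel_size=patch_hr, stride=patch_hr)
    # Each output patch is a weighted sum of input patches, with the
    # weights learned at low resolution.
    blended = torch.einsum("bdn,bmn->bdm", patches, attn)
    # Keep original content for uncorrupted patches; only holes change.
    out = torch.where(corrupted.unsqueeze(1), blended, patches)
    # Fold the patches back into an image.
    return F.fold(out, output_size=(h, w),
                  kernel_size=patch_hr, stride=patch_hr)
```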

Figure 3. Proposed NeuralPatchMatch Inpainting Module. (Corrupted patches are displayed in red, uncorrupted ones in green.) First, we project each patch to an embedding space of dimension $d_k$ (Sect. 3.2). Token similarity is then computed in a self-attention manner, obtaining the attention map $A$ (lighter colors correspond to large softmax values, darker colors to low ones). Self-attention masking restricts inpainting to the corrupted regions, maintaining high-frequency details in uncorrupted zones. To obtain the final inpainted image, we mix the tokens via a weighted sum based on the attention map $A$.
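For illustration, here is a minimal sketch of the masked patch self-attention described in the caption. The projection matrices `w_q` and `w_k`, the tensor layout, and the pass-through of uncorrupted tokens are assumptions for the sketch, not the paper's exact implementation.

```python
import torch

def neural_patch_match_attention(tokens, corrupted, w_q, w_k):
    """Masked patch self-attention sketch.

    tokens:    (B, N, D) patch embeddings.
    corrupted: (B, N) bool, True where a patch overlaps the hole.
    w_q, w_k:  assumed learned (D, d_k) projection matrices.
    """
    d_k = w_q.shape[-1]
    q = tokens @ w_q                                  # (B, N, d_k) queries
    k = tokens @ w_k                                  # (B, N, d_k) keys
    logits = q @ k.transpose(1, 2) / d_k ** 0.5       # (B, N, N) similarities

    # Corrupted patches must not act as sources: mask their key columns
    # so the softmax distributes weight over uncorrupted patches only.
    logits = logits.masked_fill(corrupted.unsqueeze(1), float("-inf"))
    attn = logits.softmax(dim=-1)                     # attention map A

    # Weighted sum of tokens; uncorrupted tokens pass through unchanged,
    # preserving their high-frequency content.
    mixed = attn @ tokens
    out = torch.where(corrupted.unsqueeze(-1), mixed, tokens)
    return out, attn
```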

Results