Left: Inpainting results on ultra-high-resolution images. Right: Comparison of LPIPS performance and latency among different state-of-the-art methods.

Abstract

Existing image inpainting methods have shown impressive completion results for low-resolution images. However, most of these algorithms fail at high resolutions and require powerful hardware, limiting their deployment on edge devices. Motivated by this, we propose the first baseline for REal-Time High-resolution image INpainting on Edge Devices (RETHINED) that is able to inpaint at ultra-high resolution and runs in real time ($\leq$ 30ms) on a wide variety of mobile devices. Our method is simple yet effective: a lightweight Convolutional Neural Network (CNN) recovers structure, followed by a resolution-agnostic patch replacement mechanism that provides detailed texture. Specifically, our pipeline leverages the structural capacity of CNNs and the high-level detail of patch-based methods, a key component for high-resolution image inpainting. To demonstrate the real-world applicability of our method, we conduct an extensive analysis on various mobile-friendly devices and demonstrate similar inpainting performance while being 100x faster than existing state-of-the-art methods. Furthermore, we release DF8K-Inpainting, the first free-form mask UHD inpainting dataset.

Method

Proposed Inpainting Pipeline. Given a HR image $y$ and a binary mask $m$ with corrupted pixels as inputs (left), our model first downsamples $x = y \odot m$ to a lower resolution $x_{LR}$, and forwards it to the coarse model $f_\theta$, obtaining $\hat{y}_{coarse}$. It is then refined by the NeuralPatchMatch module, obtaining $\hat{y}_{LR}$ and the attention map $A$. From $A$ and $x$, our Attention Upscaling module yields $\hat{y}_{HR}$.

Given a high-resolution RGB image $y \in \mathbb{R}^{H_{HR} \times W_{HR} \times 3}$ (where $H_{HR}$ and $W_{HR}$ denote, respectively, the height and width of the high-resolution image in pixels) and a binary mask $m \in \mathbb{R}^{H_{HR} \times W_{HR}}$ marking the corrupted pixels, our goal is to fill in the masked image $x = y \odot m$ with plausible content.

To achieve this goal, we first downsample $x$ to a lower resolution, obtaining $x_{LR} \in \mathbb{R}^{H \times W \times 3}$ (where $H < H_{HR}$ and $W < W_{HR}$), and forward it to the coarse model, obtaining the coarse inpainted image $\hat{y}_{coarse}$ of size $H \times W$. Then, we use the NeuralPatchMatch module to refine $\hat{y}_{coarse}$ by propagating known content from the input image $x_{LR}$, obtaining $\hat{y}_{LR}$ and the corresponding attention map $A$.
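To make this low-resolution stage concrete, the following is a minimal PyTorch-style sketch of it. The `coarse_model` and `neural_patch_match` callables stand in for $f_\theta$ and the NeuralPatchMatch module; their interfaces and the target resolution `lr_size` are illustrative assumptions rather than the paper's exact API.

```python
import torch
import torch.nn.functional as F

def inpaint_low_res(y_hr, m_hr, coarse_model, neural_patch_match,
                    lr_size=(512, 512)):
    """Low-resolution stage: mask, downsample, coarse fill, refine.

    y_hr: (B, 3, H_HR, W_HR) image; m_hr: (B, 1, H_HR, W_HR) binary mask
    (0 on corrupted pixels). `coarse_model`, `neural_patch_match`, and
    `lr_size` are assumptions for this sketch.
    """
    # Masked input x = y ⊙ m, then downsample image and mask.
    x = y_hr * m_hr
    x_lr = F.interpolate(x, size=lr_size, mode="bilinear", align_corners=False)
    m_lr = F.interpolate(m_hr, size=lr_size, mode="nearest")

    # Lightweight CNN recovers global structure at low resolution.
    y_coarse = coarse_model(x_lr, m_lr)

    # NeuralPatchMatch propagates known content from x_lr into the holes,
    # returning the refined image and the patch attention map A.
    y_lr, attn = neural_patch_match(y_coarse, x_lr, m_lr)
    return y_lr, attn
```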

Finally, our Attention Upscaling module uses the learned attention map $A$ together with $x$ to recover the high-frequency texture details present in the base image, obtaining the high-resolution result $\hat{y}_{HR}$.
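The property exploited here is that $A$ encodes patch-to-patch correspondences, which are resolution-agnostic: the same weights computed at low resolution can blend full-resolution patches. Below is a hedged sketch of this idea, assuming non-overlapping patches, a hypothetical patch size, and a simple unfold/fold layout where the HR patch grid matches the LR token grid; the actual module may handle patch overlap and normalization differently.

```python
import torch
import torch.nn.functional as F

def attention_upscale(x_hr, attn, corrupted, patch_hr=64):
    """Blend full-resolution patches with the low-resolution attention map.

    x_hr:      (B, C, H, W) masked HR image, H and W divisible by patch_hr.
    attn:      (B, N, N) attention map A over the N patch tokens.
    corrupted: (B, N) bool, True where a patch overlaps the hole.
    The patch size and layout are illustrative assumptions.
    """
    b, c, h, w = x_hr.shape
    # Extract non-overlapping HR patches: (B, C*p*p, N).
    patches = F.unfold(x_hr, kernel_size=patch_hr, stride=patch_hr)
    # Each output patch is a weighted sum of input patches, with the
    # weights learned at low resolution.
    blended = torch.einsum("bdn,bmn->bdm", patches, attn)
    # Keep original content for uncorrupted patches; only holes change.
    out = torch.where(corrupted.unsqueeze(1), blended, patches)
    # Fold the patches back into an image.
    return F.fold(out, output_size=(h, w),
                  kernel_size=patch_hr, stride=patch_hr)
```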

Figure 3. Proposed NeuralPatchMatch Inpainting Module. (Corrupted patches are displayed in red, uncorrupted ones in green.) First, we project each patch to an embedding space of dimension $d_k$ (Sect. 3.2). Token similarity is then computed in a self-attention manner, obtaining the attention map $A$ (lighter colors correspond to large softmax values, darker colors to low ones). Self-attention masking restricts inpainting to the corrupted regions, maintaining high-frequency details in uncorrupted zones. To obtain the final inpainted image, we mix the tokens via a weighted sum based on the attention map $A$.
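For illustration, here is a minimal sketch of the masked patch self-attention described in the caption. The projection matrices `w_q` and `w_k`, the tensor layout, and the pass-through of uncorrupted tokens are assumptions for the sketch, not the paper's exact implementation.

```python
import torch

def neural_patch_match_attention(tokens, corrupted, w_q, w_k):
    """Masked patch self-attention sketch.

    tokens:    (B, N, D) patch embeddings.
    corrupted: (B, N) bool, True where a patch overlaps the hole.
    w_q, w_k:  assumed learned (D, d_k) projection matrices.
    """
    d_k = w_q.shape[-1]
    q = tokens @ w_q                                  # (B, N, d_k) queries
    k = tokens @ w_k                                  # (B, N, d_k) keys
    logits = q @ k.transpose(1, 2) / d_k ** 0.5       # (B, N, N) similarities

    # Corrupted patches must not act as sources: mask their key columns
    # so the softmax distributes weight over uncorrupted patches only.
    logits = logits.masked_fill(corrupted.unsqueeze(1), float("-inf"))
    attn = logits.softmax(dim=-1)                     # attention map A

    # Weighted sum of tokens; uncorrupted tokens pass through unchanged,
    # preserving their high-frequency content.
    mixed = attn @ tokens
    out = torch.where(corrupted.unsqueeze(-1), mixed, tokens)
    return out, attn
```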

Results