MAD: Makeup Removal Issues & Troubleshooting

MAD Makeup Removal: A Deep Dive into Troubleshooting

Hey guys! 👋 We're diving deep into some tricky issues when using MAD for reference-free makeup removal. We've been putting the model through its paces, and while it's super promising, we've hit a few snags. This article is all about sharing those struggles and hopefully finding some solutions. Let's get into it!

1. The Mystery of the Missing Makeup Removal: MT Dataset Masks

Alright, so the first thing we ran into was the official example for reference-free makeup removal not quite delivering the goods. We followed the instructions, using the --source-label 0 --target-label 1 settings, which should, in theory, remove makeup. But the results? Pretty underwhelming. We’re talking almost zero visible change between the input and output images. Kinda frustrating, right?

We double-checked everything. We made sure the parsing masks were in the right place (data/mtdataset/parsing/makeup/), and the code was loading them without any errors. But, it's like the blending step, where the magic should happen, just wasn't working as intended.

Here’s the command we were using, just like the documentation suggests:

python generate_translation.py \
  --config configs/model_256_256.yaml \
  --save-folder removal_results \
  --source-root data/mtdataset/images \
  --source-list assets/mt_makeup.txt \
  --source-label 0 --target-label 1 \
  --num-process 1 \
  --opts MODEL.PRETRAINED Justin900/MAD

We spent a good chunk of time on this, and the results consistently showed minimal to no makeup removal. Even more confusing is that the processing time for each image was still around 9-10 minutes on an RTX 3080 Ti. So the code was running, just not… well, working as expected.
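To put a rough number on "almost zero visible change," one quick sanity check is to compute the mean absolute pixel difference between an input image and its generated output. This is just a sketch with placeholder file names; point it at one of your own MT images and the matching file in removal_results/.

import numpy as np
from PIL import Image

def mean_abs_diff(src_path: str, out_path: str) -> float:
    # Load both images as RGB float arrays, resizing the output to match the source.
    src = np.asarray(Image.open(src_path).convert("RGB"), dtype=np.float32)
    out_img = Image.open(out_path).convert("RGB").resize((src.shape[1], src.shape[0]))
    out = np.asarray(out_img, dtype=np.float32)
    return float(np.abs(src - out).mean())

# Placeholder paths -- values near 0 mean the output is basically the input,
# while a genuine removal on a heavily made-up face should score noticeably higher.
print(mean_abs_diff("data/mtdataset/images/makeup/example.png",
                    "removal_results/example.png"))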

Key Takeaway:

  • The Problem: The provided MT dataset masks aren't effectively removing makeup in the reference-free setting.

2. Custom Masks to the Rescue (Sort Of): Segformer & Face Parsing

Now, here's where things got interesting. We decided to try generating our own masks using jonathandinu/face-parsing together with the CONVERT_DICT from misc/convert_beauty_face.py. In other words: segment each face into regions (skin, lips, eyes, and so on), remap the labels to what MAD expects, and then run the removal with those masks.

Guess what? With these custom masks, makeup removal worked like a charm on the MT dataset images. We were seeing clean, natural-looking results – exactly what we were hoping for!

Here’s a snippet of the code we used for generating these custom masks:

import numpy as np
import torch
from PIL import Image
from torch import nn
from transformers import SegformerForSemanticSegmentation, SegformerImageProcessor

# Remap from jonathandinu/face-parsing (Segformer) label IDs to the label IDs MAD expects,
# based on CONVERT_DICT in misc/convert_beauty_face.py
CONVERT_DICT = {
    1: 4,   # skin → face skin
    2: 8,   # nose → nose
    4: 6,   # l_eye → left eye
    5: 1,   # r_eye → right eye
    6: 7,   # l_brow → left eyebrow
    7: 2,   # r_brow → right eyebrow
    8: 3,   # l_ear → left ear
    9: 5,   # r_ear → right ear
    10: 11, # mouth → teeth
    11: 9,  # u_lip → upper lip
    12: 13, # l_lip → lower lip
    13: 12, # hair → hair
    15: 0,  # ear_r (earring) → bg
    16: 0,  # neck_l (necklace) → bg
    17: 10, # neck → neck
    18: 0,  # cloth → bg
    3: 0,   # eye_g (glasses) → bg
    14: 0,  # hat → bg
    0: 0,   # bg
}

def generate_mask(img_path: str, save_path: str):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    image_processor = SegformerImageProcessor.from_pretrained("jonathandinu/face-parsing")
    model = SegformerForSemanticSegmentation.from_pretrained("jonathandinu/face-parsing")
    model.to(device).eval()

    image = Image.open(img_path).convert("RGB")
    inputs = image_processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs)

    # Upsample the logits back to the original resolution before taking the argmax.
    upsampled_logits = nn.functional.interpolate(
        outputs.logits,
        size=image.size[::-1],  # PIL size is (width, height); interpolate wants (height, width)
        mode="bilinear",
        align_corners=False,
    )
    labels = upsampled_logits.argmax(dim=1)[0].cpu().numpy()

    # Remap the face-parsing labels to the IDs MAD expects.
    new_labels = np.copy(labels)
    for key, value in CONVERT_DICT.items():
        new_labels[labels == key] = value
    Image.fromarray(new_labels.astype("uint8")).save(save_path)

generate_mask('/home/muaz/Desktop/Upwork/makeup-removal/MAD/data/custom/images/3147_aligned_256.jpg', '/home/muaz/Desktop/Upwork/makeup-removal/MAD/data/custom/parsing/3147_aligned_256.png')

This script takes an image, runs it through a Segformer model to create a face parsing mask, and then remaps the labels to match the expected format for MAD. The CONVERT_DICT is crucial here, as it translates the labels from the face parsing model to the labels MAD expects.
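For anyone who wants to reproduce this, here's a hypothetical batch driver for the script above. The folder names are assumptions based on our own layout (the parsing folder mirrors what generate_translation.py reads), not something MAD prescribes.

from pathlib import Path

image_dir = Path("data/custom/images")
parsing_dir = Path("data/custom/parsing")
parsing_dir.mkdir(parents=True, exist_ok=True)

# One parsing PNG per input image, named to match the image stem.
for img_path in sorted(image_dir.glob("*.jpg")):
    save_path = parsing_dir / (img_path.stem + ".png")
    generate_mask(str(img_path), str(save_path))
    print(f"wrote {save_path}")

One caveat: generate_mask reloads the Segformer weights on every call, so for a big folder you'd want to hoist the processor and model out of the function and pass them in.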

The Twist:

Here’s the kicker. While these custom masks worked beautifully on the MT dataset, they didn't perform well on real-world images. We saw incomplete removal and some pretty obvious artifacts. So, we're left with a bit of a puzzle.

Key Takeaway:

  • The Good: Custom masks can achieve excellent makeup removal on the MT dataset.
  • The Bad: The same custom masks don't work well on real-world images.

3. Seeking Answers: Questions for the MAD Experts

We're hoping to get some clarity from the folks behind MAD. Here are a few questions we're pondering:

  1. Preprocessing Power: What exactly is the secret sauce when it comes to preprocessing and alignment for real-world images? Are we talking about centering the face, aligning the eyes, or maybe even a specific zoom level? Understanding this is key to getting the model to work reliably. (A sketch of the kind of alignment we mean follows right after this list.)
  2. Mask Mysteries: Why aren't the official MT parsing masks working as expected? Is there something fundamentally different about the masks or the way they're being applied?
  3. Label Landscape: Are the MT parsing masks already in the right format for MAD, or do they need some kind of conversion, like the CONVERT_DICT we used? A label-format mismatch could explain why the official masks seem ineffective while our converted ones work on the MT images, and might be part of why real-world photos behave differently.
  4. Mask Label Differences: The label IDs in the MT masks don't seem to match what we'd expect from CONVERT_DICT (eyes at 1/6, lips at 9/13). Is this expected, or is it pointing to a deeper problem? (A quick way to check is the small snippet after this list.)
  5. Performance Ponderings: Is it normal for a 256x256 image to take over 9 minutes to process on an RTX 3080 Ti? We've included some log output in the Log Snippets section below, covering the encode and generate steps, but we're not sure whether this is a typical processing time.
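On question 1, here's the kind of alignment we mean: detect the face, fit 68 landmarks, and crop a 256x256 chip. This is a minimal dlib-based sketch of our assumption about what "aligned" might look like, not MAD's documented preprocessing, and it needs the standard shape_predictor_68_face_landmarks.dat file from the dlib model zoo.

import dlib
from PIL import Image

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def align_face(img_path: str, save_path: str, size: int = 256, padding: float = 0.25):
    img = dlib.load_rgb_image(img_path)
    faces = detector(img, 1)  # upsample once so smaller faces are still detected
    if not faces:
        raise RuntimeError(f"no face found in {img_path}")
    landmarks = predictor(img, faces[0])  # 68-point landmarks
    # Rotate, center, and crop a fixed-size face chip around the landmarks.
    chip = dlib.get_face_chip(img, landmarks, size=size, padding=padding)
    Image.fromarray(chip).save(save_path)

# Placeholder file names for a real-world photo and its aligned crop.
align_face("real_world_photo.jpg", "real_world_photo_aligned_256.jpg")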
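And on questions 3 and 4, one quick way to check is to print the label IDs present in an official MT parsing mask next to those in one of our converted masks. A rough sketch, assuming the masks are stored as single-channel label PNGs and using placeholder file names from our local copies:

import numpy as np
from PIL import Image

# One official MT parsing mask and one mask produced by the Segformer script above.
for name, path in [
    ("official MT mask", "data/mtdataset/parsing/makeup/example.png"),
    ("custom mask", "data/custom/parsing/3147_aligned_256.png"),
]:
    labels = np.asarray(Image.open(path))
    print(name, "->", sorted(np.unique(labels).tolist()))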

Log Snippets:

Here are some snippets from our output logs:

encode:  699/699 [04:50<...]  # The encoding phase
generate: 700/700 [04:52<...] # The generation phase

This shows the progress, but it takes a long time!

4. Environment Details

Here's a look at the environment we're working in:

python=3.10.19
torch=2.7.1+cu118
transformers=4.52.3
diffusers=0.33.1
accelerate=1.7.0
dlib=20.0.0
opencv-python=4.11.0.86
GPU: RTX 3080 Ti (laptop)

5. Wrapping Up & Seeking Solutions

We're super excited about the potential of MAD and want to use it in our projects. Right now, though, we’re stuck. We need some help figuring out how to get reliable makeup removal on real images. We're hoping that by sharing our experiences and questions, we can spark a conversation and find solutions.

Any insights into preprocessing, mask formats, or known issues would be a massive help. We'd also love to know if the long processing times are typical or if we're doing something wrong. Thank you all for the amazing work on this model!

Let's get those makeup-free selfies looking flawless! 😉