vLLM DeepSeekR1 FP4/FP8: Async Scheduling Performance Drop

Hey guys, have you ever run into a puzzling performance snag that just makes you scratch your head? Well, we’ve got a big one on our hands involving vLLM, specifically with the DeepSeekR1 model when using FP4/FP8 quantization and async-scheduling. It turns out, enabling asynchronous scheduling, which is usually meant to boost efficiency, is actually causing a noticeable performance drop! This is a pretty big deal for anyone pushing the limits of large language model (LLM) inference on powerful hardware like the B200.

Unraveling the Mystery: DeepSeekR1 FP4/FP8 with vLLM Async Scheduling Performance Issues

When we talk about optimizing LLM inference, vLLM is often the first name that comes to mind. It's an incredible library designed to maximize throughput and minimize latency, especially with its continuous batching and efficient KV cache management. But, like any complex system, sometimes things don't go exactly as planned. We've discovered a significant regression that impacts the DeepSeekR1-0528 model in both its FP8 and FP4 quantized configurations (each running with an FP8 KV cache) combined with speculative decoding (specifically the MTP method with num_speculative_tokens 3), particularly when the --async-scheduling flag is activated. This isn't just a minor blip; we're talking about a tangible performance hit that can severely impact the efficiency of your deployment.

DeepSeekR1, a powerful language model, benefits immensely from low-precision formats: FP8 (8-bit floating point) for its KV cache, and FP8 or FP4 (4-bit floating point) quantization for its weights and MoE kernels. These formats are crucial for reducing memory footprint and increasing effective memory bandwidth, allowing us to serve larger models or more concurrent requests on the same hardware. On a beastly machine like the B200, every bit of optimization counts. The idea behind asynchronous scheduling in vLLM is to improve device utilization and overall throughput by decoupling the scheduler from the execution engine, so the CPU can prepare the next batch while the GPU is still busy with the current one. However, in this specific scenario, enabling --async-scheduling doesn't deliver the expected gains; instead, it reduces output token throughput. This is counter-intuitive and points to a deeper interaction problem within the system, potentially related to how speculative decoding or the FLASHINFER_MLA attention backend interacts with the async scheduler under these low-precision settings. The regression appears to stem from a specific change in vllm-project/vllm/pull/24799, which suggests that a recent update inadvertently introduced this bottleneck. Understanding and fixing this is crucial for anyone relying on DeepSeekR1's efficiency with vLLM.
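
To make that memory argument concrete, here's a quick back-of-the-envelope sizing sketch in Python. The layer count (61) and the per-token cache width (576 elements, roughly what an MLA-style compressed cache stores per layer) are illustrative assumptions rather than values read out of the checkpoint; the point is simply that an FP8 KV cache is half the size of a BF16 one at the benchmark's batch and context settings:

# Back-of-the-envelope KV-cache sizing, purely illustrative. The layer count and
# per-token cache width are assumptions for the arithmetic, not values read out
# of vLLM or the DeepSeek checkpoint; plug in your model's real config.
BYTES_PER_ELEM = {"bf16": 2.0, "fp8": 1.0}

def kv_cache_gib(num_tokens, num_layers=61, elems_per_token_per_layer=576, dtype="fp8"):
    # Total KV-cache size in GiB for num_tokens cached tokens.
    total_bytes = num_tokens * num_layers * elems_per_token_per_layer * BYTES_PER_ELEM[dtype]
    return total_bytes / 2**30

# 1024 concurrent sequences at the benchmark's 2176-token max model length:
tokens = 1024 * 2176
for dt in ("bf16", "fp8"):
    print(f"{dt}: {kv_cache_gib(tokens, dtype=dt):.1f} GiB")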

Diving Deep into the Bug: What's Really Going On?

So, what exactly are we seeing here? The core of the problem is a stark difference in output token throughput. When we enable --async-scheduling with the DeepSeekR1 model, in either the FP8 or the FP4 configuration, the throughput takes a dive. Let's look at the numbers, because they tell the real story. For FP8, with async-scheduling enabled, we're seeing an output token throughput of around 1994 tokens per second (tok/s). But here's the kicker: when we disable async-scheduling, that number jumps to roughly 2421 tok/s. That's a hefty drop of over 400 tokens per second! Now, switch over to FP4, aiming for even greater memory efficiency. With async-scheduling, the throughput is roughly 2375 tok/s. Turn it off, and boom, it shoots up to 2895 tok/s. Again a substantial performance hit, this time around 520 tok/s. This isn't just a rounding error, guys; these are major regressions that directly impact the cost and responsiveness of running these models in production.
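
If you want to double-check those deltas, the arithmetic is simple; here's a tiny sketch using the throughput numbers quoted above:

# Sanity-checking the deltas quoted above (throughput numbers taken from the runs).
runs = {
    "fp8": {"on": 1994, "off": 2421},
    "fp4": {"on": 2375, "off": 2895},
}
for name, r in runs.items():
    drop = r["off"] - r["on"]
    print(f"{name}: -{drop} tok/s "
          f"({drop / r['off']:.1%} lower with async scheduling, "
          f"{drop / r['on']:.1%} recovered by disabling it)")
# fp8: -427 tok/s (17.6% lower with async scheduling, 21.4% recovered by disabling it)
# fp4: -520 tok/s (18.0% lower with async scheduling, 21.9% recovered by disabling it)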

These numbers highlight that something is amiss when the asynchronous scheduling mechanism interacts with the specific configuration required for DeepSeekR1, especially with a low-precision KV cache and speculative decoding. The context here involves highly tuned settings: VLLM_USE_NCCL_SYMM_MEM=1, NCCL_NVLS_ENABLE=1, NCCL_CUMEM_ENABLE=1, VLLM_USE_TRTLLM_RAGGED_DEEPSEEK_PREFILL=1, and VLLM_ATTENTION_BACKEND=FLASHINFER_MLA. These flags are all designed to squeeze every last drop of performance out of the GPUs in a B200 machine by leveraging specialized hardware features and optimized kernels. The fact that a typically beneficial feature like async-scheduling causes a negative impact suggests a complex interplay between these optimizations. It might be that the async scheduler isn't correctly accounting for the latency characteristics of the low-precision kernels, or perhaps there's an unforeseen synchronization overhead when it is combined with speculative decoding (mtp) and the compilation fusion passes (enable_fi_allreduce_fusion, enable_attn_fusion). Whatever the root cause, the implication for real-world deployments is clear: if you're running DeepSeekR1 with these advanced optimizations, be aware of this pitfall and consider disabling async scheduling until a fix lands. Otherwise, you're leaving a lot of throughput on the table, which translates directly into higher operational costs or a degraded user experience.

The FP8 Scenario: A Closer Look at the Throughput Dip

Let's really zoom in on the FP8 performance numbers because they set the stage for this whole conundrum. We're talking about running the deepseek-ai/DeepSeek-R1-0528 model, a formidable LLM, on a B200 machine. The configuration is quite specific, including --kv-cache-dtype fp8 and the crucial environment variable VLLM_USE_FLASHINFER_MOE_FP8=1. This tells vLLM to use 8-bit floating-point precision for the key-value cache, which is fantastic for saving GPU memory and often boosting performance by allowing more sequences or larger context windows. The FLASHINFER_MLA attention backend, combined with VLLM_FLASHINFER_MOE_BACKEND=latency, indicates a highly optimized setup for Mixture-of-Experts (MoE) models, aiming for minimal latency. We're also using speculative decoding with method mtp and num_speculative_tokens 3, which should further accelerate generation by predicting future tokens.
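
For a bit of intuition on why those three speculative tokens matter, here's the standard back-of-the-envelope throughput model for speculative decoding. It's a generic textbook formula, not vLLM's internal accounting, and the acceptance rates are made-up illustrative values:

def expected_tokens_per_step(k: int, a: float) -> float:
    # Expected tokens emitted per target-model step with k draft tokens and
    # an (assumed i.i.d.) per-token acceptance probability a.
    return (1 - a ** (k + 1)) / (1 - a)

for a in (0.6, 0.7, 0.8):
    print(f"k=3, acceptance={a:.1f}: ~{expected_tokens_per_step(3, a):.2f} tokens per step")

With num_speculative_tokens 3 and reasonable acceptance rates, each engine step emits roughly two to three tokens instead of one, so any extra per-step scheduling overhead gets multiplied straight into the token throughput numbers we're about to compare.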

In this highly tuned FP8 environment, the regression is undeniable. When --async-scheduling is enabled, our output token throughput clocks in at approximately 1994 tok/s. Run the exact same setup without the --async-scheduling flag, and that throughput rises to roughly 2421 tok/s, which means async scheduling is costing about 18% of the achievable throughput (or, put the other way, disabling it buys you roughly a 21% gain). This is a head-scratcher, folks, because the flag is supposed to improve performance. The compilation settings, compilation_config.pass_config.enable_fi_allreduce_fusion true and enable_attn_fusion true, are designed to further optimize the execution graph, and async-scheduling would normally complement such optimizations by keeping the GPU pipeline saturated. Here we see the opposite effect. For DeepSeekR1, especially on modern architectures like the B200 that excel at these low-precision operations, achieving maximum throughput is paramount. A drop of this size isn't something you can just ignore; it has real implications for how many users you can serve and the inference costs you incur. It suggests a potential contention or miscoordination between the asynchronous scheduler and the deeply optimized FP8 kernels or the speculative decoding logic. This scenario underscores the importance of rigorous benchmarking and paying close attention to specific flag interactions, even with highly optimized open-source libraries like vLLM. It's a reminder that sometimes less is more, especially when a feature inadvertently introduces overhead instead of removing it.

Unpacking the FP4 Impact: Even Greater Efficiency, Same Async-Scheduling Pitfall

Now, let's talk about FP4. If FP8 is good for memory and speed, FP4 takes it a step further, this time on the weight side. With nvidia/DeepSeek-R1-0528-FP4-v2, the model weights are quantized to FP4 (4-bit floating point), and VLLM_USE_FLASHINFER_MOE_FP4=1 tells vLLM to use the matching FlashInfer FP4 MoE kernels; the KV cache itself still runs in FP8, just like in the previous scenario. Shrinking the weights this aggressively significantly reduces the memory footprint on your GPUs, leaving more HBM for the KV cache, larger batch sizes, or longer context windows. In theory, this setup should give us blazing-fast, memory-efficient inference for the DeepSeekR1 model on our B200 machine, making the most out of every byte and every cycle.
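
To see why an FP4 checkpoint is so attractive on an eight-GPU node, here's some rough weight-memory arithmetic. The 671B total parameter count is the commonly cited figure for DeepSeek-R1 and is used purely for illustration:

TOTAL_PARAMS = 671e9  # commonly cited total parameter count for DeepSeek-R1 (illustrative)
BITS_PER_PARAM = {"bf16": 16, "fp8": 8, "fp4": 4}

for dtype, bits in BITS_PER_PARAM.items():
    gib = TOTAL_PARAMS * bits / 8 / 2**30
    print(f"{dtype}: ~{gib:,.0f} GiB of weights, ~{gib / 8:,.0f} GiB per GPU at TP=8")

Roughly speaking, FP4 halves weight memory versus FP8, freeing HBM for KV cache and bigger batches, which is exactly the headroom this benchmark configuration is chasing.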

However, the same async-scheduling pitfall that plagued FP8 rears its head, arguably with an even more pronounced effect. When we enable --async-scheduling in this highly optimized FP4 environment, the output token throughput registers at roughly 2375 tok/s. Disable that very same flag, and the throughput surges to an impressive 2895 tok/s. That's an even larger absolute difference of around 520 tok/s, a performance increase of over 21% from simply turning off async-scheduling! This is a massive improvement that you'd definitely want to capture. FP4 weights are a huge win for deploying massive models, but if async-scheduling actively erodes the benefit, it becomes a critical bottleneck. The pattern also suggests that the root cause is likely not specific to 8-bit or 4-bit precision itself, but rather a more fundamental interaction problem between the async scheduler and the speculative decoding mechanism (mtp), the FLASHINFER_MLA backend, or the overall execution flow under these highly compressed formats. The challenge with low-precision inference often lies in managing quantization and dequantization overheads, along with potential numerical stability issues. While vLLM and FlashInfer generally handle this elegantly, the combination with asynchronous scheduling appears to introduce an unexpected serialization or synchronization overhead that negates the intended parallelism. For developers and researchers pushing for the ultimate in inference efficiency with models like DeepSeek-R1-0528-FP4-v2, understanding and mitigating this issue is essential to fully realize the power of 4-bit quantization.
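
For intuition on the quantize/dequantize round trip mentioned above, here's a toy fake-quantization sketch with numpy. It simulates plain symmetric per-tensor scaling and is emphatically not the actual NVFP4/FP8 kernel path inside vLLM or FlashInfer; it just illustrates the extra arithmetic that sits on the read and write path of a compressed weight or cache block:

import numpy as np

def fake_quant(x: np.ndarray, levels: int = 7) -> np.ndarray:
    # Symmetric per-tensor fake quantization to `levels` signed steps (~4-bit).
    scale = np.abs(x).max() / levels
    q = np.clip(np.round(x / scale), -levels, levels)  # quantize (write path)
    return q * scale                                   # dequantize (read path)

x = np.random.randn(8, 576).astype(np.float32)  # a toy activation/KV block
err = float(np.abs(x - fake_quant(x)).mean())
print(f"mean absolute round-trip error: {err:.4f}")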

Reproducing the Issue: Your Guide to Validation

Alright, if you want to see this issue firsthand and validate it in your own environment, here’s how you can reproduce the performance drop with DeepSeekR1 FP4/FP8 and async-scheduling on a B200 machine. It’s pretty straightforward, but you need to follow the steps precisely to ensure you’re hitting the exact conditions. First things first, you'll need a B200 machine and a suitable Python environment. Remember, this setup uses specific vLLM features and optimizations, so ensure your vLLM installation is up-to-date and compatible with these flags.

Step 1: Get the Benchmark Script

You'll need a specific benchmarking tool. Clone the bench_serving repository, which provides the benchmark_serving.py script we'll be using:

git clone https://github.com/kimbochen/bench_serving.git
cd bench_serving
pip install pandas datasets --break-system-packages
cd .. # Go back to your main directory where you'll run the vLLM server

Step 2: Set Up Environment Variables (Common for FP8 and FP4)

Before launching the vLLM server, you need to set some critical environment variables that enable advanced optimizations. These are essential for both FP8 and FP4 scenarios:

export VLLM_USE_NCCL_SYMM_MEM=1
export NCCL_NVLS_ENABLE=1
export NCCL_CUMEM_ENABLE=1
export VLLM_USE_TRTLLM_RAGGED_DEEPSEEK_PREFILL=1
export VLLM_ATTENTION_BACKEND=FLASHINFER_MLA
export VLLM_FLASHINFER_MOE_BACKEND=latency
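
Before launching the server, a two-second sanity check that these exports actually took effect in the shell you're about to use can save a confusing benchmark run; here's a minimal sketch:

import os

EXPECTED = {
    "VLLM_USE_NCCL_SYMM_MEM": "1",
    "NCCL_NVLS_ENABLE": "1",
    "NCCL_CUMEM_ENABLE": "1",
    "VLLM_USE_TRTLLM_RAGGED_DEEPSEEK_PREFILL": "1",
    "VLLM_ATTENTION_BACKEND": "FLASHINFER_MLA",
    "VLLM_FLASHINFER_MOE_BACKEND": "latency",
}
mismatched = {k: os.environ.get(k) for k, v in EXPECTED.items() if os.environ.get(k) != v}
print("all expected variables are set" if not mismatched else f"check these: {mismatched}")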

Step 3: Run the FP8 Scenario

  • FP8-Specific Environment Variable:

    export VLLM_USE_FLASHINFER_MOE_FP8=1
    
  • Server-Side (vLLM API Server) - With Async Scheduling:

    python3 -m vllm.entrypoints.openai.api_server \
        --host 0.0.0.0 --port 8087 \
        --model deepseek-ai/DeepSeek-R1-0528 \
        --tokenizer deepseek-ai/DeepSeek-R1-0528 \
        --dtype auto --kv-cache-dtype fp8 \
        --tensor-parallel-size 8 --pipeline-parallel-size 1 --data-parallel-size 1 \
        --swap-space 16 --max-num-seqs 1024 --trust-remote-code \
        --max-model-len 2176 --gpu-memory-utilization 0.9 \
        --max-num-batched-tokens 8192 --no-enable-prefix-caching \
        --async-scheduling \
        --compilation_config.pass_config.enable_fi_allreduce_fusion true \
        --compilation_config.pass_config.enable_attn_fusion true \
        --compilation_config.max_cudagraph_capture_size 2048 \
        --speculative_config.method mtp --speculative_config.num_speculative_tokens 3
    
  • Client-Side (Benchmark) - Run in a separate terminal:

    python3 bench_serving/benchmark_serving.py --backend vllm --host 0.0.0.0 --port 8087 \
        --model deepseek-ai/DeepSeek-R1-0528 --num-prompts 1280 --trust-remote-code \
        --ignore-eos --max-concurrency 256 --random-input-len 1024 --random-output-len 1024 \
        --random-range-ratio 1.0 --use-chat-template --dataset-name random --save-result \
        --result-filename benchmark_serving_results_fp8_async.json
    
  • Server-Side (vLLM API Server) - Without Async Scheduling (repeat after killing the previous server): Remove the --async-scheduling flag from the server command; the client run below saves its result to a different file:

    # ... (all other flags are the same, just remove --async-scheduling)
    python3 -m vllm.entrypoints.openai.api_server ... (flags excluding --async-scheduling) ...
    
  • Client-Side (Benchmark):

    python3 bench_serving/benchmark_serving.py ... --result-filename benchmark_serving_results_fp8_no_async.json
    

Compare the Output token throughput (tok/s) from both benchmark_serving_results_fp8_async.json and benchmark_serving_results_fp8_no_async.json. You should observe the performance drop.
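
If you'd rather script that comparison than eyeball two JSON files, a small helper like the one below does the job. The throughput key name is an assumption (benchmark_serving.py variants differ in what they write), so adjust it to match your result files:

import json

def output_throughput(path: str, key: str = "output_throughput") -> float:
    # `key` is an assumption; adjust it to whatever your benchmark_serving.py writes.
    with open(path) as f:
        return float(json.load(f)[key])

on = output_throughput("benchmark_serving_results_fp8_async.json")
off = output_throughput("benchmark_serving_results_fp8_no_async.json")
print(f"async on : {on:.0f} tok/s")
print(f"async off: {off:.0f} tok/s")
print(f"throughput lost to --async-scheduling: {(off - on) / off:.1%}")
# Swap in the *_fp4_*.json filenames to do the same comparison for the FP4 run.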

Step 4: Run the FP4 Scenario

  • FP4-Specific Environment Variable:

    export VLLM_USE_FLASHINFER_MOE_FP4=1
    # Make sure to unset VLLM_USE_FLASHINFER_MOE_FP8 if it was set for FP8 testing
    unset VLLM_USE_FLASHINFER_MOE_FP8
    
  • Server-Side (vLLM API Server) - With Async Scheduling:

    python3 -m vllm.entrypoints.openai.api_server \
        --host 0.0.0.0 --port 8087 \
        --model nvidia/DeepSeek-R1-0528-FP4-v2 \
        --tokenizer nvidia/DeepSeek-R1-0528-FP4-v2 \
        --dtype auto --kv-cache-dtype fp8 \
        --tensor-parallel-size 8 --pipeline-parallel-size 1 --data-parallel-size 1 \
        --swap-space 16 --max-num-seqs 1024 --trust-remote-code \
        --max-model-len 2176 --gpu-memory-utilization 0.9 \
        --max-num-batched-tokens 8192 --no-enable-prefix-caching \
        --async-scheduling \
        --compilation_config.pass_config.enable_fi_allreduce_fusion true \
        --compilation_config.pass_config.enable_attn_fusion true \
        --compilation_config.max_cudagraph_capture_size 2048 \
        --speculative_config.method mtp --speculative_config.num_speculative_tokens 3
    
  • Client-Side (Benchmark) - Run in a separate terminal:

    python3 bench_serving/benchmark_serving.py --backend vllm --host 0.0.0.0 --port 8087 \
        --model nvidia/DeepSeek-R1-0528-FP4-v2 --num-prompts 1280 --trust-remote-code \
        --ignore-eos --max-concurrency 256 --random-input-len 1024 --random-output-len 1024 \
        --random-range-ratio 1.0 --use-chat-template --dataset-name random --save-result \
        --result-filename benchmark_serving_results_fp4_async.json
    
  • Server-Side (vLLM API Server) - Without Async Scheduling (Repeat after killing the previous server): Remove the --async-scheduling flag from the server command.

    # ... (all other flags are the same, just remove --async-scheduling)
    python3 -m vllm.entrypoints.openai.api_server ... (flags excluding --async-scheduling) ...
    
  • Client-Side (Benchmark):

    python3 bench_serving/benchmark_serving.py ... --result-filename benchmark_serving_results_fp4_no_async.json
    

Again, compare the Output token throughput (tok/s) from benchmark_serving_results_fp4_async.json and benchmark_serving_results_fp4_no_async.json. You should see a similar, if not more pronounced, performance drop. This detailed reproduction guide will help you confirm the issue and aid in any debugging efforts.

What's Next? Community, Solutions, and Workarounds

Alright, so we've identified and rigorously reproduced this performance regression in vLLM when using async-scheduling with DeepSeekR1 FP4/FP8 and speculative decoding. What’s the game plan now? The immediate and most straightforward workaround for anyone currently deploying or testing DeepSeekR1 with these advanced optimizations is simple: disable the --async-scheduling flag. As the benchmarks clearly show, running without this flag currently yields significantly higher output token throughput for both FP8 and FP4 configurations. While async-scheduling is designed to be a performance enhancer, in this specific interaction, it's acting as a bottleneck, so turning it off is your best bet for maximizing performance right now.

Looking forward, this is a critical issue for the vLLM project and the broader community leveraging these cutting-edge optimizations. The regression points to a complex interaction, possibly between the asynchronous scheduling logic, the FLASHINFER_MLA attention backend, the low-precision FP4/FP8 KV cache, and the speculative decoding (MTP) method. It's plausible that there's an unforeseen synchronization overhead, a scheduling inefficiency that doesn't properly account for the specific latencies or resource utilization patterns of these highly specialized kernels, or even a subtle race condition introduced by the async nature. This bug report is already on the radar for the vllm-project team, and active investigation will be needed to pinpoint the exact root cause and implement a fix. This is where the power of open-source really shines, guys! The more eyes we have on this, the quicker we can get to a stable and performant solution.

If you're a vLLM contributor, a developer with expertise in GPU scheduling or low-precision inference, or a researcher passionate about LLM optimization, your insights and contributions could be invaluable. Engaging with the vLLM GitHub repository, discussing this issue, and even proposing potential fixes would be incredibly helpful. For users who encounter similar unexpected performance drops with other models or configurations, remember to thoroughly benchmark your setup and, if warranted, file a detailed bug report. Providing clear reproduction steps and concrete performance numbers, just like in this case, is the fastest way to get these issues addressed. Ultimately, resolving this will ensure that vLLM continues to deliver on its promise of unparalleled LLM inference efficiency, allowing all of us to fully harness the power of models like DeepSeekR1, even with the most aggressive quantization techniques. Stay tuned for updates, and let’s work together to make vLLM even better!