GDS With NVMe-oF RDMA On Nixl: P2PDMA=false?
Hey guys, I've been wrestling with getting GDS (GPUDirect Storage) to play nice with NVMe-oF RDMA (Remote Direct Memory Access) on a nixl system, and I wanted to share my findings and see if anyone else has run into this. Specifically, I'm trying to pin down the correct configuration, especially the use_pci_p2pdma setting in cufile.json. Let's dive in, shall we?
The Setup and the Problem
So, I've been validating GDS with NVMe-oF RDMA on my nixl setup. The goal is to get the GPU directly accessing storage over RDMA, which should give us a nice performance boost by bypassing the CPU for data transfers. Sounds great, right? Well, there's a little snag. When the use_pci_p2pdma parameter within my cufile.json is set to true, things just don't work as expected. The cufile_sample_001 program, which is a test utility for GDS, throws an Input/output error when it tries to write data from the device memory. Essentially, the write operation fails. Let's take a closer look at the error when P2PDMA is enabled.
$ ./cufile_sample_001 /mnt_nvmf/test/reg1G 0
opening file /mnt_nvmf/test/reg1G
registering device memory of size :131072
writing from device memory
write failed : Input/output error
deregistering device memory
As you can see, the sample program registers the device buffer fine, but the write to the NVMe-oF target fails with a generic Input/output error, which on its own doesn't point at a root cause. Based on my experimentation, though, the failure is tied to how GDS and NVMe-oF RDMA interact when P2PDMA is enabled: the very same command succeeds the moment I disable it (more on that below). My suspicion is that with use_pci_p2pdma turned on, the way the GPU memory is mapped for DMA doesn't line up with what the RDMA path to the NIC expects, so the I/O gets rejected somewhere between cuFile, the nvidia-fs driver, and the NVMe-oF stack. Possible ingredients include improper memory mapping, incorrect RDMA configuration, or a conflict between GDS and the NIC's settings.
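A generic EIO like this usually has more detail behind it in cufile.log, which cuFile writes into the application's working directory unless the "logging" section of cufile.json says otherwise; raising the level to TRACE makes the nvidia-fs/RDMA decisions visible. A quick sketch of what I mean (paths are the defaults, adjust to taste):
# cufile.log lands in the app's working directory unless "logging": { "dir": ... }
# is set in cufile.json; "logging": { "level": "TRACE" } adds the gory detail.
$ tail -n 50 cufile.log
$ grep -i -E "error|rdma|nvme" cufile.log | tail -n 20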
The Solution: P2PDMA = false
Now, here's the interesting part. When I set the use_pci_p2pdma parameter to false, everything works perfectly fine. The cufile_sample_001 program successfully writes data, and the test completes without any errors. This is a clear indication that disabling P2PDMA is a crucial step for getting GDS to function correctly with NVMe-oF RDMA in my nixl environment. Let's look at the success log when P2PDMA is disabled:
$ ./cufile_sample_001 /mnt_nvmf/test/reg1G 0
opening file /mnt_nvmf/test/reg1G
registering device memory of size :131072
writing from device memory
written bytes :131072
deregistering device memory
As the output shows, the same 128 KiB write now completes, so GDS is functioning correctly with NVMe-oF RDMA once P2PDMA is disabled. The Input/output error from before is gone entirely, which makes use_pci_p2pdma the deciding setting in this setup. To me that points at a conflict or incompatibility between the PCI P2PDMA mechanism and the RDMA data path used by NVMe-oF: with P2PDMA off, cuFile presumably falls back to a transfer method that the RDMA stack handles correctly.
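Before celebrating, it's worth confirming that the successful run really went over the GPUDirect path and not some fallback (allow_compat_mode is false in my config, so a silent POSIX fallback shouldn't happen, but checking is cheap). A sketch; the gdscheck path below is where a typical GDS install puts it, yours may differ:
# Print cuFile/GDS configuration and which drivers/protocols are supported.
$ /usr/local/cuda/gds/tools/gdscheck -p
# nvidia-fs keeps read/write counters; they should increase after a GDS write.
$ cat /proc/driver/nvidia-fs/stats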
Deep Dive into cufile.json and Configuration
Let's break down the relevant parts of my cufile.json. This file controls how GDS behaves, and it's the key to making everything work correctly. Here it is:
{
    "execution": {
        "max_io_queue_depth": 128,
        "max_io_threads": 4,
        "parallel_io": true,
        "min_io_threshold_size_kb": 8192,
        "max_request_parallelism": 4
    },
    "properties": {
        "max_direct_io_size_kb": 16384,
        "max_device_cache_size_kb": 131072,
        "per_buffer_cache_size_kb": 1024,
        "max_device_pinned_mem_size_kb": 33554432,
        "use_poll_mode": false,
        "poll_mode_max_size_kb": 4,
        "use_pci_p2pdma": true,
        "allow_compat_mode": false,
        "gds_rdma_write_support": true,
        "io_batchsize": 128,
        "io_priority": "default",
        "rdma_dev_addr_list": ["192.168.150.109"],
        "rdma_load_balancing_policy": "RoundRobin",
        "rdma_dynamic_routing": false,
        "rdma_dynamic_routing_order": ["GPU_MEM_NVLINKS", "GPU_MEM", "SYS_MEM", "P2P"]
    }
}
The line that matters most here is "use_pci_p2pdma": true, which, as we've seen, is the culprit. The other settings play a role too: gds_rdma_write_support must be true for GDS writes to RDMA-based storage, rdma_dev_addr_list lists the IP addresses of the RDMA-capable interfaces, rdma_load_balancing_policy controls how RDMA memory registrations (MRs) are distributed across NICs, and rdma_dynamic_routing controls whether cuFile may dynamically route I/O across NICs (with rdma_dynamic_routing_order giving the preference order).
Changing use_pci_p2pdma to false is the adjustment that made GDS work with NVMe-oF RDMA here. With it set to false, cuFile stops trying to use the direct peer-to-peer DMA path over the PCI bus and instead uses a transfer path that, at least in my setup, the RDMA stack handles correctly. That strongly suggests GDS has trouble combining the P2PDMA mechanism with NVMe-oF RDMA; whether the root cause is memory mapping, data alignment, or some other low-level interaction, I can't say yet.
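If you keep more than one cufile.json around while experimenting, note that cuFile reads /etc/cufile.json by default and that the CUFILE_ENV_PATH_JSON environment variable overrides it; the path below is just an example. A quick way to confirm which setting is actually in effect before re-running the sample:
# Point cuFile at the config you edited and double-check the setting.
$ export CUFILE_ENV_PATH_JSON=/etc/cufile.json
$ grep use_pci_p2pdma $CUFILE_ENV_PATH_JSON
        "use_pci_p2pdma": false,
$ ./cufile_sample_001 /mnt_nvmf/test/reg1G 0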
The Question: Why?
So, the million-dollar question: why does P2PDMA need to be false here? I'm still digging into the exact technical reasons. My working theory is that the PCI P2PDMA path is aimed at direct peer-to-peer transfers between the GPU and a local PCIe NVMe device, while with NVMe-oF RDMA the data actually has to flow from GPU memory to the NIC, with that memory registered against the RDMA adapter. If cuFile drives the I/O through the P2PDMA plumbing while the NVMe-oF RDMA path expects an RDMA memory registration, the two can disagree about how the same buffer is mapped, and the write gets rejected. It could also simply be that the drivers and libraries underneath GDS don't yet support the P2PDMA mode in combination with NVMe-oF RDMA. Either way, the symptoms look like a low-level conflict over how the GPU memory is addressed and handed to the fabric, not a logic error in the application.
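Two sanity checks that might help narrow this down (the module names below are the usual ones on my distro; yours may differ): whether the running kernel was built with the upstream PCI P2PDMA framework at all, and which peer-memory/RDMA modules are actually loaded for the GPU-to-NIC path.
# Was the kernel built with the upstream PCI P2PDMA framework?
$ grep CONFIG_PCI_P2PDMA /boot/config-$(uname -r)
# Which GPU peer-memory / GDS / NVMe-oF modules are loaded?
$ lsmod | grep -E "nvidia_peermem|nvidia_fs|nvme_rdma|rdma_cm"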
Potential Causes and Troubleshooting Tips
Let's brainstorm some potential causes and some things to check if you run into similar problems. Here are some thoughts:
- Driver Versions: Make sure you're using the latest compatible drivers for your GPUs, network cards, and storage devices. Driver incompatibilities can cause a wide array of issues.
- Firmware: Ensure that the firmware for your network cards and storage devices is up-to-date. Firmware updates often include performance improvements and bug fixes.
- Memory Alignment: Check if memory alignment is correct. Misaligned memory can lead to performance issues and errors.
- RDMA Configuration: Double-check your RDMA configuration, including settings like the IP addresses and the load balancing policy. Incorrect RDMA configuration can cause communication failures.
- GDS Version: Make sure you are using a compatible version of GDS with your NVMe-oF RDMA setup. Sometimes, new versions of GDS have different requirements.
- Kernel Modules: Ensure that the necessary kernel modules for NVMe-oF and RDMA are loaded correctly.
- Resource Conflicts: Check for any resource conflicts, especially in the PCI bus. Conflicts could interfere with data transfers.
- Debugging Tools: Use debugging tools like ibstat (InfiniBand status) or tools specific to your network card to diagnose potential issues; a few concrete commands are sketched below.
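To make that last bullet concrete, here are a few generally useful commands for this kind of setup (standard tools: ibstat from the InfiniBand utilities, nvme from nvme-cli); nothing here is specific to my machine:
# RDMA link state and port counters (InfiniBand/RoCE).
$ ibstat
# Confirm the NVMe-oF namespaces are connected and which transport they use.
$ nvme list
$ nvme list-subsys
# Kernel messages from the NVMe/RDMA stack around the time of the failure.
$ dmesg | grep -i -E "nvme|rdma" | tail -n 30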
Conclusion
In conclusion, if you're setting up GDS with NVMe-oF RDMA on a nixl system, remember to set use_pci_p2pdma to false in your cufile.json configuration file. This seems to be a critical step for getting everything to work correctly. I'm still trying to understand the exact reasons behind this, but for now, this setting is the key to unlocking the power of GDS and NVMe-oF RDMA. I hope this helps anyone else out there struggling with a similar setup. If you have any insights or have encountered similar issues, please share them! Let's help each other out and get the most out of our hardware!
I'll keep digging and update this post if I find more specific information. Happy computing, guys! If you have any ideas, let me know! It's always great to learn from each other!