[Discussion] Bandwidth speeds of Hetzner's multi-SSD VPS: a cautionary note

Posted on 2024-7-29 14:34:42
Let me start by sharing my experience.
I have a 5950X box from Hetzner (HZ) Germany with five 3.84TB NVMe drives, model MZQL23T8HCLS, i.e. Samsung PM9A3 series SSDs. I'm a PT (private tracker) user and not very experienced with this sort of thing, so after installing the OS I simply threw all the SSDs into a soft RAID0 without giving it any further thought.
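Roughly speaking, the kind of five-disk soft RAID0 I mean looks like the sketch below (illustrative only: the filesystem and mount point are not my exact setup, and whether the array is built by hand or by the installer doesn't matter here):
```bash
# Sketch of a 5-disk mdadm RAID0 across the drives described above;
# filesystem and mount point are illustrative choices.
sudo mdadm --create /dev/md0 --level=0 --raid-devices=5 \
    /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1
sudo mkfs.ext4 /dev/md0
sudo mount /dev/md0 /mnt/data
```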
After filling it with nearly 10TB of data over the next few days, I noticed the system load climbing far higher than usual during some workloads, especially when long-running processes were active. IO wait was also quite high, yet the process "xxxxxx" showed hardly any CPU usage.
I checked my QBS client and found no significant transfer tasks running, so I could rule out client-side load as the source of the CPU or IO bottleneck. The issue kept coming and going.
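For anyone who wants to do the same kind of triage, the usual tools look something like this (generic examples, not my exact commands; iostat and pidstat come from the sysstat package):
```bash
# Generic load / IO-wait triage commands (iostat and pidstat need sysstat).
uptime           # 1-, 5- and 15-minute load averages
vmstat 1 5       # the "wa" column is CPU time stuck waiting on IO
iostat -x 1 5    # per-device utilisation and await latency
pidstat -d 1 5   # per-process read/write throughput
```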
Then I remembered people saying that SSDs can get slower after heavy use, that TRIM alone doesn't always bring them back, and that only a full zero-write will fix it for good; if performance still doesn't recover after that, the SSD is ready for the bin. So I booted into the rescue system, stopped the md array, and zero-filled all the drives:
```bash
sudo mdadm --stop /dev/md/*              # stop any running md arrays
sudo wipefs -fa /dev/nvme*               # wipe filesystem/RAID signatures
sudo blkdiscard -z -f -v /dev/nvme0n1    # zero-fill the whole drive
sudo blkdiscard -z -f -v /dev/nvme1n1
# ...and the same for the remaining drives
```
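(For what it's worth, the wear theory itself can be sanity-checked without wiping anything, by reading the drives' SMART counters; a sketch, assuming the nvme-cli package is installed:)
```bash
# Check wear/health indicators on each drive (requires nvme-cli).
for d in /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1; do
    echo "== $d =="
    sudo nvme smart-log "$d" | grep -Ei 'percentage_used|data_units_written|media_errors'
done
```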
After the zero-fill I reinstalled with soft RAID0 and went through the same setup again, but the issue persisted.
By now I suspected the RAID itself, so I went back into the rescue system and reinstalled the OS onto the first drive, nvme0n1, on its own, with no RAID at all. After installation I set up a script to log the output of uptime every minute.
The problem disappeared and has stayed away for nearly three months now: the 1-, 5- and 15-minute load averages have never exceeded 1 at any point.
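(The actual script isn't shown here, but something as simple as one cron entry gets the same effect; the log path below is arbitrary:)
```bash
# Log a load-average line every minute (add via `crontab -e` as root;
# the log path is an arbitrary choice).
* * * * * /usr/bin/uptime >> /var/log/load-watch.log 2>&1
```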
  
At that point I could confirm that the earlier problem was tied to the RAID setup, but the root cause still needed digging into.
Then one day I remembered a friend mentioning that the SX133 machines have their 12 internal HDDs hanging off an expansion card. Since I had my machine right there, I did some digging into how its drives are actually attached:
```bash
ls -l /dev/disk/by-path/
```
This gave the following results:
```bash
lrwxrwxrwx 1 root root 13 Jul 11 16:23 pci-0000:04:00.0-nvme-1 -> ../../nvme0n1
lrwxrwxrwx 1 root root 15 Jul 11 16:23 pci-0000:04:00.0-nvme-1-part1 -> ../../nvme0n1p1
lrwxrwxrwx 1 root root 15 Jul 11 16:23 pci-0000:04:00.0-nvme-1-part2 -> ../../nvme0n1p2
lrwxrwxrwx 1 root root 15 Jul 11 16:23 pci-0000:04:00.0-nvme-1-part3 -> ../../nvme0n1p3
lrwxrwxrwx 1 root root 13 Jul 11 16:23 pci-0000:08:00.0-nvme-1 -> ../../nvme1n1
lrwxrwxrwx 1 root root 13 Jul 11 16:23 pci-0000:09:00.0-nvme-1 -> ../../nvme2n1
lrwxrwxrwx 1 root root 13 Jul 11 16:23 pci-0000:0a:00.0-nvme-1 -> ../../nvme3n1
lrwxrwxrwx 1 root root 13 Jul 11 16:23 pci-0000:0b:00.0-nvme-1 -> ../../nvme4n1
```
This shows that each NVMe SSD sits behind its own PCI address, i.e. its own interface or adapter:
04:00.0 → nvme0n1
08:00.0 → nvme1n1
09:00.0 → nvme2n1
0a:00.0 → nvme3n1
0b:00.0 → nvme4n1
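(As an aside, the PCI address and the negotiated link speed/width of every NVMe controller can also be pulled out of sysfs in one loop; a sketch using only standard sysfs attributes, no extra packages:)
```bash
# List each NVMe controller with its PCI address and negotiated link speed/width.
for ctrl in /sys/class/nvme/nvme*; do
    pci=$(basename "$(readlink -f "$ctrl/device")")
    speed=$(cat "$ctrl/device/current_link_speed" 2>/dev/null)
    width=$(cat "$ctrl/device/current_link_width" 2>/dev/null)
    printf '%s  %s  %s  x%s\n' "$(basename "$ctrl")" "$pci" "${speed:-?}" "${width:-?}"
done
```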
Now let's look at what lspci reports for each of these interfaces:
```bash
# lspci -vv -s 04:00.0 | grep -i LnkSta
LnkSta: Speed 8GT/s, Width x4
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
# lspci -vv -s b3:00.0 | grep -i LnkSta
LnkSta: Speed 8GT/s, Width x4
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
# lspci -vv -s 08:00.0 | grep -i LnkSta
LnkSta: Speed 16GT/s, Width x4
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
# lspci -vv -s 09:00.0 | grep -i LnkSta
LnkSta: Speed 16GT/s, Width x4
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
# lspci -vv -s 0a:00.0 | grep -i LnkSta
LnkSta: Speed 16GT/s, Width x4
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
# lspci -vv -s 0b:00.0 | grep -i LnkSta
LnkSta: Speed 16GT/s, Width x4
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
```
So one of the SSDs (nvme0n1, behind 04:00.0) is only linking at PCIe 3.0 x4, while the other four SSDs are at PCIe 4.0 x4. Since striped IO in a RAID0 array ends up waiting on its slowest member, this mismatch may well be why the load exploded under the soft RAID0 setup.
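If you want to see the mismatch directly rather than infer it from link speeds, each drive can be benchmarked on its own; a hedged sketch with fio (read-only, so it won't touch any data; the job parameters are arbitrary choices):
```bash
# Measure each drive's sequential read speed individually (read-only).
for d in /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1; do
    echo "== $d =="
    sudo fio --name=seqread --filename="$d" --readonly --rw=read --bs=1M \
        --ioengine=libaio --iodepth=32 --direct=1 --runtime=20 --time_based
done
```
On a healthy drive, a PCIe 3.0 x4 link should top out around 3.5 GB/s, while the 4.0 x4 ones can reach roughly double that.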
Finally, thanks to everyone who took the time to read this post. My conclusions may not be correct. If anyone else has similar experiences, please feel free to share them. It will help us learn together!
Thank you!