Drop-In Hardware Diagnostics: How I Built (and use) a GPT to Analyze Systems (running Linux) in Minutes
https://chatgpt.com/g/g-693ccd5961008191be1c44b2945adf2e-diagzip-analyzer
(GPT link)
Modern systems are complicated. Not just servers — everything.
Between PCIe Gen5, NUMA weirdness, NIC lane constraints, GPU fabrics, RDMA stacks, and vendor-specific quirks, it’s incredibly easy to misinterpret, or flat out not understand, what a system is actually doing versus what the spec sheet says it could or should be doing.
To solve that, I built a drop-in diagnostic GPT that lets me:
- Run one script on any Linux system (x86, ARM, DGX, NUC, etc.)
- Upload the resulting bundle
- Get an instant, structured analysis highlighting:
- PCIe bottlenecks
- NIC capabilities vs reality
- GPU / accelerator topology
- Storage constraints
- Kernel and firmware quirks
- And the actual limits of the platform
This post explains how it works, how to use it, and why this approach is better than ad-hoc unstructured analysis.
The Problem: Specs Lie, Topology Doesn’t
If you’ve ever debugged performance issues, you’ve probably seen this:
- A NIC that supports Gen5 x16… running at x4
- A GPU that reports PCIe Gen1 x1 (and it’s actually fine)
- NUMA nodes that don’t exist — or worse, half exist
- “400Gb capable” networking capped by PCIe before the PHY even matters
The problem isn’t lack of tools — it’s too many tools, inconsistent outputs, and zero standardization.
I wanted:
- One portable diagnostic bundle
- One repeatable analysis format
- One place to drop the results and get an answer
So I built both.
Step 1: Collecting the Diagnostic Bundle
The foundation is a single portable shell script that runs on virtually any modern Linux distribution.
It works on:
- x86_64
- ARM / aarch64
- NVIDIA DGX-style images
- Cloud images
- Bare metal
- Laptops
- Mini PCs
- Servers
It degrades gracefully if tools aren’t installed, and it works even when:
dmesgis restricted- Mellanox tools aren’t present
- RDMA isn’t configured
- GPUs aren’t installed
How to Run It
- Download the script:diag_bundle.sh (full code at bottom of this page)
- Make it executable:
chmod+x diag_bundle.sh - Run it (sudo recommended):
sudo./diag_bundle.sh - When it finishes, you’ll have:
- A directory:
diag_<hostname>_<timestamp>/ - And two archives:
diag_<hostname>_<timestamp>.tar.gzdiag_<hostname>_<timestamp>.zip
- A directory:
Upload either one to the GPT.
What the Script Actually Collects
This is where most “diagnostic scripts” fall down — they’re either too shallow or absurdly noisy.
This one focuses on high-signal data:
System & CPU
- Architecture (x86 vs ARM)
- Core counts
- Memory
- NUMA topology (or lack thereof)
- CPU governors
PCIe (The Important Part)
- Full PCIe device inventory
- Negotiated link speed & width for every device
- PCIe drivers
- Optional AER / DPC / ACS capability blocks
- Raw PCI config snapshots (safe, read-only)
GPUs & Accelerators
- NVIDIA (
nvidia-smi, topology maps) - AMD ROCm
- Intel iGPU visibility
- Grace / DGX fabric quirks
Networking
- Interface list
- Drivers
ethtoolcapability dumps- Supported link modes, speeds, FEC
- RDMA presence (or absence)
Storage
- NVMe topology
- PCIe generation and width
- SMART/NVMe health where available
Kernel & Firmware Clues
- PCIe training issues
- IOMMU hints
- GPU mailbox errors
- Silent performance killers
The goal isn’t to log everything.
The goal is to log what actually explains performance.
Step 2: Drop the Bundle Into the GPT
Once you have the .zip or .tar.gz:
- Open the Diagnostic Analysis GPT
- Upload the archive
- Ask for:
- Highlights (default)
- Or Deep Dive
- Or Comparison (if you upload two bundles)
The GPT automatically:
- Extracts the bundle
- Parses the relevant files
- Normalizes ARM vs x86 differences
- Flags anomalies
- Produces a Discord-ready summary
You don’t need to tell it what platform it is — the data speaks for itself.
What You Get Back
A typical output includes:
- System overview
- GPU / NIC / Storage topology
- PCIe bottlenecks ranked by severity
- Things that look wrong (and why)
- What’s expected behavior vs misconfiguration
- Concrete repro commands to validate the findings yourself
It’s the difference between:
“This feels slow”
and:
“Your NIC is Gen5 x4, your GPU doesn’t use PCIe for data, and your workload is PCIe-bound — not compute-bound.”
Why a GPT Instead of a Script Alone?
Because interpretation is the hard part.
Raw outputs don’t tell you:
- Which constraints are intentional
- Which ones are bugs
- Which ones matter
- And which ones are just noise
The GPT encodes:
- Architectural knowledge
- Vendor quirks
- Real-world performance implications
And in early testing, it does it consistently, every time.
Who This Is For
This workflow is ideal if you:
- Debug performance issues?
- Review hardware?
- Build AI or HPC systems?
- Work with PCIe Gen5/Gen6 platforms?
- Deal with networking at 200G+ speeds?
- Want answers without spending hours in
lspci
Try It Yourself
https://chatgpt.com/g/g-693ccd5961008191be1c44b2945adf2e-diagzip-analyzer
Run the script.
Upload the bundle.
Get answers.
If you want to understand what your hardware is actually doing, not what it claims to do, this is the fastest way I know how to do it.
#!/usr/bin/env bash
# diag_bundle.sh - Portable Linux hardware + PCIe + GPU + NIC + storage diagnostic bundle
# Works across x86_64 / aarch64 (including NVIDIA DGX-style images) with graceful degradation.
# Output: a timestamped directory plus a compressed .tar.gz bundle.
set -euo pipefail
HOST="$(hostname -s 2>/dev/null || hostname)"
TS="$(date +%Y%m%d_%H%M%S)"
OUT="diag_${HOST}_${TS}"
# Allow user override: OUTDIR=/path ./diag_bundle.sh
OUT="${OUTDIR:-$OUT}"
mkdir -p "$OUT"
log() { echo "[*] $*"; }
have() { command -v "$1" >/dev/null 2>&1; }
run() {
# run "cmd..." > file ; never fail the whole script
local outfile="$1"; shift
{
echo "==== CMD: $* ===="
"$@" 2>&1 || true
} > "$outfile"
}
run_sh() {
# run a shell pipeline safely
local outfile="$1"; shift
{
echo "==== SH: $* ===="
bash -lc "$*" 2>&1 || true
} > "$outfile"
}
log "Collecting diagnostics into: $OUT"
###############################################################################
# 0) System / OS / kernel
###############################################################################
log "System / OS / kernel"
run "$OUT/uname.txt" uname -a
run "$OUT/os_release.txt" bash -lc 'cat /etc/os-release 2>/dev/null || true; cat /etc/issue 2>/dev/null || true'
run "$OUT/cmdline.txt" bash -lc 'cat /proc/cmdline 2>/dev/null || true'
run "$OUT/uptime.txt" bash -lc 'uptime; who -b 2>/dev/null || true'
run "$OUT/limits.txt" bash -lc 'ulimit -a 2>/dev/null || true'
###############################################################################
# 1) CPU / memory / NUMA / topology
###############################################################################
log "CPU / memory / NUMA"
have lscpu && run "$OUT/lscpu.txt" lscpu || run "$OUT/lscpu.txt" bash -lc 'echo "lscpu not found"'
run "$OUT/meminfo.txt" bash -lc 'cat /proc/meminfo 2>/dev/null || true'
have numactl && run "$OUT/numa.txt" bash -lc 'numactl --hardware; echo; numactl --show' || run "$OUT/numa.txt" bash -lc 'echo "numactl not found"'
have lstopo && run "$OUT/hwloc_lstopo.txt" lstopo --no-io || true
have dmidecode && run "$OUT/dmidecode.txt" bash -lc 'dmidecode -t system -t baseboard -t bios 2>/dev/null || true' || true
run "$OUT/cpu_governor.txt" bash -lc 'for p in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do [ -f "$p" ] && echo "$p: $(cat "$p")"; done'
###############################################################################
# 2) PCIe inventory + topology + drivers
###############################################################################
log "PCIe inventory + topology + drivers"
# Full lspci dumps if present
if have lspci; then
run "$OUT/pci_tree.txt" bash -lc 'lspci -tvnn'
run "$OUT/pci_full.txt" bash -lc 'lspci -nnvvv'
run "$OUT/pci_drivers.txt" bash -lc 'lspci -k'
else
run "$OUT/pci_tree.txt" bash -lc 'echo "lspci not found"; ls -1 /sys/bus/pci/devices 2>/dev/null || true'
run "$OUT/pci_full.txt" bash -lc 'echo "lspci not found"'
run "$OUT/pci_drivers.txt" bash -lc 'echo "lspci not found"'
fi
# Sysfs link speeds/widths for ALL PCI devices
mkdir -p "$OUT/pci_links"
for devpath in /sys/bus/pci/devices/*; do
dev="$(basename "$devpath")"
{
echo "=== $dev ==="
for f in max_link_speed max_link_width current_link_speed current_link_width; do
if [ -f "$devpath/$f" ]; then
echo "$f = $(cat "$devpath/$f")"
fi
done
# NUMA node (if present)
if [ -f "$devpath/numa_node" ]; then
echo "numa_node = $(cat "$devpath/numa_node")"
fi
# Driver
if [ -L "$devpath/driver" ]; then
echo "driver = $(basename "$(readlink "$devpath/driver")")"
fi
# Vendor/device IDs
for f in vendor device subsystem_vendor subsystem_device class; do
[ -f "$devpath/$f" ] && echo "$f = $(cat "$devpath/$f")"
done
} > "$OUT/pci_links/$dev.txt"
done
# Optional: capture PCI config space (small + safe)
mkdir -p "$OUT/pci_config"
for devpath in /sys/bus/pci/devices/*; do
dev="$(basename "$devpath")"
if [ -r "$devpath/config" ]; then
# Limit to first 4096 bytes; some kernels expose 256/4k config
dd if="$devpath/config" of="$OUT/pci_config/$dev.bin" bs=1 count=4096 status=none 2>/dev/null || true
fi
done
###############################################################################
# 3) PCIe capabilities (AER/DPC/ACS) if lspci present
###############################################################################
log "PCIe capability blocks (AER/DPC/ACS)"
mkdir -p "$OUT/pci_caps"
if have lspci; then
for dev in $(ls /sys/bus/pci/devices); do
run_sh "$OUT/pci_caps/$dev.txt" "lspci -vvv -s $dev | sed -n '1,220p'; echo; lspci -vvv -s $dev | grep -A25 -Ei 'Advanced Error Reporting|Downstream Port Containment|Access Control Services|AER|DPC|ACS' || true"
done
else
run "$OUT/pci_caps/README.txt" bash -lc 'echo "lspci not available; pci_caps limited to sysfs links in pci_links/"'
fi
###############################################################################
# 4) GPU / accelerator inventory
###############################################################################
log "GPU / accelerator inventory"
# NVIDIA
if have nvidia-smi; then
run "$OUT/nvidia_smi_q.txt" nvidia-smi -q
run "$OUT/nvidia_topo.txt" nvidia-smi topo -m
run "$OUT/nvidia_smi_l.txt" nvidia-smi -L
else
run "$OUT/nvidia_smi_q.txt" bash -lc 'echo "nvidia-smi not found"'
run "$OUT/nvidia_topo.txt" bash -lc 'echo "nvidia-smi not found"'
run "$OUT/nvidia_smi_l.txt" bash -lc 'echo "nvidia-smi not found"'
fi
# AMD ROCm
if have rocm-smi; then
run "$OUT/rocm_smi.txt" rocm-smi -a
elif have amd-smi; then
run "$OUT/amd_smi.txt" amd-smi list
else
run "$OUT/rocm_smi.txt" bash -lc 'echo "rocm-smi/amd-smi not found"'
fi
# Generic GPU (useful on Intel iGPU / headless nodes)
have lshw && run "$OUT/lshw_display.txt" bash -lc 'lshw -C display 2>/dev/null || true' || true
have lsmod && run "$OUT/lsmod_gpu.txt" bash -lc "lsmod | egrep -i 'nvidia|amdgpu|i915|nouveau|mlx5|irdma|bnxt|ice|ixgbe|igc|r8169|r8125|r8126' || true" || true
###############################################################################
# 5) Networking: interfaces, drivers, ethtool capabilities, stats
###############################################################################
log "Networking"
run "$OUT/ip_link.txt" ip link show
run "$OUT/ip_addr.txt" ip -br addr
run "$OUT/ip_route.txt" ip route
have ss && run "$OUT/ss.txt" ss -s || true
mkdir -p "$OUT/ethtool"
if have ethtool; then
for iface in $(ls /sys/class/net); do
run "$OUT/ethtool/${iface}.txt" bash -lc "echo '=== ethtool $iface ==='; ethtool $iface; echo; echo '=== ethtool -i $iface ==='; ethtool -i $iface || true; echo; echo '=== ethtool -k $iface ==='; ethtool -k $iface || true; echo; echo '=== ethtool --show-fec $iface ==='; ethtool --show-fec $iface 2>/dev/null || true; echo; echo '=== ethtool -S $iface (truncated) ==='; ethtool -S $iface 2>/dev/null | head -n 200 || true"
done
else
run "$OUT/ethtool/README.txt" bash -lc 'echo "ethtool not found"'
fi
# RDMA device inventory (if present)
mkdir -p "$OUT/rdma"
if [ -d /sys/class/infiniband ]; then
run "$OUT/rdma/ibv_devices.txt" bash -lc 'ls -l /sys/class/infiniband; echo; for d in /sys/class/infiniband/*; do echo "=== $(basename "$d") ==="; ls -l "$d"; done'
have ibv_devinfo && run "$OUT/rdma/ibv_devinfo.txt" ibv_devinfo || true
else
run "$OUT/rdma/README.txt" bash -lc 'echo "No /sys/class/infiniband found (RDMA not configured or not present)"'
fi
###############################################################################
# 6) Storage: NVMe, block devices, filesystems, SMART-ish hints
###############################################################################
log "Storage"
run "$OUT/lsblk.txt" lsblk -o NAME,TYPE,SIZE,MODEL,SERIAL,TRAN,ROTA,MOUNTPOINT,FSTYPE -e7
run "$OUT/df.txt" df -hT
if have nvme; then
run "$OUT/nvme_list.txt" nvme list
# id-ctrl for namespaces/devices if present
mkdir -p "$OUT/nvme"
for n in /dev/nvme*n1 /dev/nvme*; do
[ -e "$n" ] || continue
run "$OUT/nvme/$(basename "$n")_idctrl.txt" nvme id-ctrl "$n"
run "$OUT/nvme/$(basename "$n")_smart.txt" nvme smart-log "$n"
done
else
run "$OUT/nvme_list.txt" bash -lc 'echo "nvme-cli not found"'
fi
# SATA / generic SMART (if smartctl exists)
if have smartctl; then
run "$OUT/smartctl_scan.txt" smartctl --scan
mkdir -p "$OUT/smartctl"
while read -r dev rest; do
[[ "$dev" =~ ^/dev/ ]] || continue
run "$OUT/smartctl/$(basename "$dev")_info.txt" smartctl -i "$dev"
done < <(smartctl --scan 2>/dev/null | awk '{print $1}' | sort -u)
else
run "$OUT/smartctl_scan.txt" bash -lc 'echo "smartctl not found"'
fi
###############################################################################
# 7) Kernel logs: handle restricted dmesg by preferring journalctl
###############################################################################
log "Kernel logs"
mkdir -p "$OUT/logs"
# dmesg may be restricted; capture if possible
if dmesg >/dev/null 2>&1; then
run "$OUT/logs/dmesg.txt" dmesg -T
else
run "$OUT/logs/dmesg.txt" bash -lc 'echo "dmesg not permitted (kernel.dmesg_restrict or lockdown)"'
fi
# journalctl kernel logs (often works even when dmesg is restricted, if user has permission)
if have journalctl; then
run "$OUT/logs/journal_kernel_err.txt" bash -lc 'journalctl -k -p err --no-pager 2>/dev/null || true'
run "$OUT/logs/journal_kernel_warn.txt" bash -lc 'journalctl -k -p warning --no-pager 2>/dev/null || true'
run "$OUT/logs/journal_kernel_pcie.txt" bash -lc "journalctl -k --no-pager 2>/dev/null | egrep -i 'pcie|aer|dpc|acs|link down|training|nvme|mlx|nvidia|amdgpu|i915|iommu|dma' || true"
else
run "$OUT/logs/journalctl.txt" bash -lc 'echo "journalctl not found"'
fi
###############################################################################
# 8) IOMMU / DMA / hugepages / misc
###############################################################################
log "IOMMU / DMA / hugepages / misc"
run "$OUT/iommu.txt" bash -lc 'dmesg 2>/dev/null | egrep -i "iommu|dma remap|smmu|amd-vi|intel-iommu" || true; echo; cat /proc/cmdline 2>/dev/null || true'
run "$OUT/hugepages.txt" bash -lc 'cat /proc/meminfo | egrep -i "Huge|AnonHuge|ShmemHuge" || true; grep -R . /sys/kernel/mm/hugepages/*/nr_hugepages 2>/dev/null || true'
###############################################################################
# 9) Package it up
###############################################################################
log "Packaging bundle"
TAR="${OUT}.tar.gz"
tar -czf "$TAR" "$OUT"
ZIP="${OUT}.zip"
if have zip; then
# -r recurse, -q quiet, -9 best compression
(cd "$(dirname "$OUT")" && zip -rq9 "$(basename "$ZIP")" "$(basename "$OUT")") || true
else
# Try python3 zipfile as a fallback (common on many images)
if have python3; then
python3 - <<'PY' || true
import os, zipfile
out = os.environ.get("OUT")
zip_path = f"{out}.zip"
def add_dir(z, d):
for root, dirs, files in os.walk(d):
for f in files:
p = os.path.join(root, f)
arc = os.path.relpath(p, os.path.dirname(d))
z.write(p, arc)
if out and os.path.isdir(out):
with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=9) as z:
add_dir(z, out)
PY
else
echo "[!] zip not found and python3 not found; skipping .zip creation" >> "$OUT/_warnings.txt"
fi
fi
log "Done."
echo
echo "Bundle directory: $OUT/"
echo "Compressed archive (tar.gz): $TAR"
echo "Compressed archive (zip): $ZIP"