Drop-In Hardware Diagnostics: How I Built (and use) a GPT to Analyze Systems (running Linux) in Minutes

Drop-In Hardware Diagnostics: How I Built (and use) a GPT to Analyze Systems (running Linux) in Minutes
https://chatgpt.com/g/g-693ccd5961008191be1c44b2945adf2e-diagzip-analyzer
(GPT link)

Modern systems are complicated. Not just servers — everything.

Between PCIe Gen5, NUMA weirdness, NIC lane constraints, GPU fabrics, RDMA stacks, and vendor-specific quirks, it’s incredibly easy to misinterpret, or flat out not understand, what a system is actually doing versus what the spec sheet says it could or should be doing.

To solve that, I built a drop-in diagnostic GPT that lets me:

  • Run one script on any Linux system (x86, ARM, DGX, NUC, etc.)
  • Upload the resulting bundle
  • Get an instant, structured analysis highlighting:
    • PCIe bottlenecks
    • NIC capabilities vs reality
    • GPU / accelerator topology
    • Storage constraints
    • Kernel and firmware quirks
    • And the actual limits of the platform

This post explains how it works, how to use it, and why this approach is better than ad-hoc unstructured analysis.


The Problem: Specs Lie, Topology Doesn’t

If you’ve ever debugged performance issues, you’ve probably seen this:

  • A NIC that supports Gen5 x16… running at x4
  • A GPU that reports PCIe Gen1 x1 (and it’s actually fine)
  • NUMA nodes that don’t exist — or worse, half exist
  • “400Gb capable” networking capped by PCIe before the PHY even matters

The problem isn’t lack of tools — it’s too many tools, inconsistent outputs, and zero standardization.

I wanted:

  • One portable diagnostic bundle
  • One repeatable analysis format
  • One place to drop the results and get an answer

So I built both.


Step 1: Collecting the Diagnostic Bundle

The foundation is a single portable shell script that runs on virtually any modern Linux distribution.

It works on:

  • x86_64
  • ARM / aarch64
  • NVIDIA DGX-style images
  • Cloud images
  • Bare metal
  • Laptops
  • Mini PCs
  • Servers

It degrades gracefully if tools aren’t installed, and it works even when:

  • dmesg is restricted
  • Mellanox tools aren’t present
  • RDMA isn’t configured
  • GPUs aren’t installed

How to Run It

  1. Download the script:diag_bundle.sh (full code at bottom of this page)
  2. Make it executable:chmod +x diag_bundle.sh
  3. Run it (sudo recommended):sudo ./diag_bundle.sh
  4. When it finishes, you’ll have:
    • A directory:diag_<hostname>_<timestamp>/
    • And two archives:diag_<hostname>_<timestamp>.tar.gz
      diag_<hostname>_<timestamp>.zip

Upload either one to the GPT.


What the Script Actually Collects

This is where most “diagnostic scripts” fall down — they’re either too shallow or absurdly noisy.

This one focuses on high-signal data:

System & CPU

  • Architecture (x86 vs ARM)
  • Core counts
  • Memory
  • NUMA topology (or lack thereof)
  • CPU governors

PCIe (The Important Part)

  • Full PCIe device inventory
  • Negotiated link speed & width for every device
  • PCIe drivers
  • Optional AER / DPC / ACS capability blocks
  • Raw PCI config snapshots (safe, read-only)

GPUs & Accelerators

  • NVIDIA (nvidia-smi, topology maps)
  • AMD ROCm
  • Intel iGPU visibility
  • Grace / DGX fabric quirks

Networking

  • Interface list
  • Drivers
  • ethtool capability dumps
  • Supported link modes, speeds, FEC
  • RDMA presence (or absence)

Storage

  • NVMe topology
  • PCIe generation and width
  • SMART/NVMe health where available

Kernel & Firmware Clues

  • PCIe training issues
  • IOMMU hints
  • GPU mailbox errors
  • Silent performance killers

The goal isn’t to log everything.
The goal is to log what actually explains performance.


Step 2: Drop the Bundle Into the GPT

Once you have the .zip or .tar.gz:

  1. Open the Diagnostic Analysis GPT
  2. Upload the archive
  3. Ask for:
    • Highlights (default)
    • Or Deep Dive
    • Or Comparison (if you upload two bundles)

The GPT automatically:

  • Extracts the bundle
  • Parses the relevant files
  • Normalizes ARM vs x86 differences
  • Flags anomalies
  • Produces a Discord-ready summary

You don’t need to tell it what platform it is — the data speaks for itself.


What You Get Back

A typical output includes:

  • System overview
  • GPU / NIC / Storage topology
  • PCIe bottlenecks ranked by severity
  • Things that look wrong (and why)
  • What’s expected behavior vs misconfiguration
  • Concrete repro commands to validate the findings yourself

It’s the difference between:

“This feels slow”

and:

“Your NIC is Gen5 x4, your GPU doesn’t use PCIe for data, and your workload is PCIe-bound — not compute-bound.”

Why a GPT Instead of a Script Alone?

Because interpretation is the hard part.

Raw outputs don’t tell you:

  • Which constraints are intentional
  • Which ones are bugs
  • Which ones matter
  • And which ones are just noise

The GPT encodes:

  • Architectural knowledge
  • Vendor quirks
  • Real-world performance implications

And in early testing, it does it consistently, every time.


Who This Is For

This workflow is ideal if you:

  • Debug performance issues?
  • Review hardware?
  • Build AI or HPC systems?
  • Work with PCIe Gen5/Gen6 platforms?
  • Deal with networking at 200G+ speeds?
  • Want answers without spending hours in lspci

Try It Yourself

https://chatgpt.com/g/g-693ccd5961008191be1c44b2945adf2e-diagzip-analyzer

Run the script.
Upload the bundle.
Get answers.

If you want to understand what your hardware is actually doing, not what it claims to do, this is the fastest way I know how to do it.


#!/usr/bin/env bash
# diag_bundle.sh - Portable Linux hardware + PCIe + GPU + NIC + storage diagnostic bundle
# Works across x86_64 / aarch64 (including NVIDIA DGX-style images) with graceful degradation.
# Output: a timestamped directory plus a compressed .tar.gz bundle.

set -euo pipefail

HOST="$(hostname -s 2>/dev/null || hostname)"
TS="$(date +%Y%m%d_%H%M%S)"
OUT="diag_${HOST}_${TS}"

# Allow user override: OUTDIR=/path ./diag_bundle.sh
OUT="${OUTDIR:-$OUT}"

mkdir -p "$OUT"

log() { echo "[*] $*"; }
have() { command -v "$1" >/dev/null 2>&1; }

run() {
  # run "cmd..." > file ; never fail the whole script
  local outfile="$1"; shift
  {
    echo "==== CMD: $* ===="
    "$@" 2>&1 || true
  } > "$outfile"
}

run_sh() {
  # run a shell pipeline safely
  local outfile="$1"; shift
  {
    echo "==== SH: $* ===="
    bash -lc "$*" 2>&1 || true
  } > "$outfile"
}

log "Collecting diagnostics into: $OUT"

###############################################################################
# 0) System / OS / kernel
###############################################################################
log "System / OS / kernel"
run "$OUT/uname.txt" uname -a
run "$OUT/os_release.txt" bash -lc 'cat /etc/os-release 2>/dev/null || true; cat /etc/issue 2>/dev/null || true'
run "$OUT/cmdline.txt" bash -lc 'cat /proc/cmdline 2>/dev/null || true'
run "$OUT/uptime.txt" bash -lc 'uptime; who -b 2>/dev/null || true'
run "$OUT/limits.txt" bash -lc 'ulimit -a 2>/dev/null || true'

###############################################################################
# 1) CPU / memory / NUMA / topology
###############################################################################
log "CPU / memory / NUMA"
have lscpu && run "$OUT/lscpu.txt" lscpu || run "$OUT/lscpu.txt" bash -lc 'echo "lscpu not found"'
run "$OUT/meminfo.txt" bash -lc 'cat /proc/meminfo 2>/dev/null || true'
have numactl && run "$OUT/numa.txt" bash -lc 'numactl --hardware; echo; numactl --show' || run "$OUT/numa.txt" bash -lc 'echo "numactl not found"'
have lstopo && run "$OUT/hwloc_lstopo.txt" lstopo --no-io || true
have dmidecode && run "$OUT/dmidecode.txt" bash -lc 'dmidecode -t system -t baseboard -t bios 2>/dev/null || true' || true
run "$OUT/cpu_governor.txt" bash -lc 'for p in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do [ -f "$p" ] && echo "$p: $(cat "$p")"; done'

###############################################################################
# 2) PCIe inventory + topology + drivers
###############################################################################
log "PCIe inventory + topology + drivers"

# Full lspci dumps if present
if have lspci; then
  run "$OUT/pci_tree.txt" bash -lc 'lspci -tvnn'
  run "$OUT/pci_full.txt" bash -lc 'lspci -nnvvv'
  run "$OUT/pci_drivers.txt" bash -lc 'lspci -k'
else
  run "$OUT/pci_tree.txt" bash -lc 'echo "lspci not found"; ls -1 /sys/bus/pci/devices 2>/dev/null || true'
  run "$OUT/pci_full.txt" bash -lc 'echo "lspci not found"'
  run "$OUT/pci_drivers.txt" bash -lc 'echo "lspci not found"'
fi

# Sysfs link speeds/widths for ALL PCI devices
mkdir -p "$OUT/pci_links"
for devpath in /sys/bus/pci/devices/*; do
  dev="$(basename "$devpath")"
  {
    echo "=== $dev ==="
    for f in max_link_speed max_link_width current_link_speed current_link_width; do
      if [ -f "$devpath/$f" ]; then
        echo "$f = $(cat "$devpath/$f")"
      fi
    done
    # NUMA node (if present)
    if [ -f "$devpath/numa_node" ]; then
      echo "numa_node = $(cat "$devpath/numa_node")"
    fi
    # Driver
    if [ -L "$devpath/driver" ]; then
      echo "driver = $(basename "$(readlink "$devpath/driver")")"
    fi
    # Vendor/device IDs
    for f in vendor device subsystem_vendor subsystem_device class; do
      [ -f "$devpath/$f" ] && echo "$f = $(cat "$devpath/$f")"
    done
  } > "$OUT/pci_links/$dev.txt"
done

# Optional: capture PCI config space (small + safe)
mkdir -p "$OUT/pci_config"
for devpath in /sys/bus/pci/devices/*; do
  dev="$(basename "$devpath")"
  if [ -r "$devpath/config" ]; then
    # Limit to first 4096 bytes; some kernels expose 256/4k config
    dd if="$devpath/config" of="$OUT/pci_config/$dev.bin" bs=1 count=4096 status=none 2>/dev/null || true
  fi
done

###############################################################################
# 3) PCIe capabilities (AER/DPC/ACS) if lspci present
###############################################################################
log "PCIe capability blocks (AER/DPC/ACS)"
mkdir -p "$OUT/pci_caps"
if have lspci; then
  for dev in $(ls /sys/bus/pci/devices); do
    run_sh "$OUT/pci_caps/$dev.txt" "lspci -vvv -s $dev | sed -n '1,220p'; echo; lspci -vvv -s $dev | grep -A25 -Ei 'Advanced Error Reporting|Downstream Port Containment|Access Control Services|AER|DPC|ACS' || true"
  done
else
  run "$OUT/pci_caps/README.txt" bash -lc 'echo "lspci not available; pci_caps limited to sysfs links in pci_links/"'
fi

###############################################################################
# 4) GPU / accelerator inventory
###############################################################################
log "GPU / accelerator inventory"

# NVIDIA
if have nvidia-smi; then
  run "$OUT/nvidia_smi_q.txt" nvidia-smi -q
  run "$OUT/nvidia_topo.txt" nvidia-smi topo -m
  run "$OUT/nvidia_smi_l.txt" nvidia-smi -L
else
  run "$OUT/nvidia_smi_q.txt" bash -lc 'echo "nvidia-smi not found"'
  run "$OUT/nvidia_topo.txt" bash -lc 'echo "nvidia-smi not found"'
  run "$OUT/nvidia_smi_l.txt" bash -lc 'echo "nvidia-smi not found"'
fi

# AMD ROCm
if have rocm-smi; then
  run "$OUT/rocm_smi.txt" rocm-smi -a
elif have amd-smi; then
  run "$OUT/amd_smi.txt" amd-smi list
else
  run "$OUT/rocm_smi.txt" bash -lc 'echo "rocm-smi/amd-smi not found"'
fi

# Generic GPU (useful on Intel iGPU / headless nodes)
have lshw && run "$OUT/lshw_display.txt" bash -lc 'lshw -C display 2>/dev/null || true' || true
have lsmod && run "$OUT/lsmod_gpu.txt" bash -lc "lsmod | egrep -i 'nvidia|amdgpu|i915|nouveau|mlx5|irdma|bnxt|ice|ixgbe|igc|r8169|r8125|r8126' || true" || true

###############################################################################
# 5) Networking: interfaces, drivers, ethtool capabilities, stats
###############################################################################
log "Networking"
run "$OUT/ip_link.txt" ip link show
run "$OUT/ip_addr.txt" ip -br addr
run "$OUT/ip_route.txt" ip route
have ss && run "$OUT/ss.txt" ss -s || true

mkdir -p "$OUT/ethtool"
if have ethtool; then
  for iface in $(ls /sys/class/net); do
    run "$OUT/ethtool/${iface}.txt" bash -lc "echo '=== ethtool $iface ==='; ethtool $iface; echo; echo '=== ethtool -i $iface ==='; ethtool -i $iface || true; echo; echo '=== ethtool -k $iface ==='; ethtool -k $iface || true; echo; echo '=== ethtool --show-fec $iface ==='; ethtool --show-fec $iface 2>/dev/null || true; echo; echo '=== ethtool -S $iface (truncated) ==='; ethtool -S $iface 2>/dev/null | head -n 200 || true"
  done
else
  run "$OUT/ethtool/README.txt" bash -lc 'echo "ethtool not found"'
fi

# RDMA device inventory (if present)
mkdir -p "$OUT/rdma"
if [ -d /sys/class/infiniband ]; then
  run "$OUT/rdma/ibv_devices.txt" bash -lc 'ls -l /sys/class/infiniband; echo; for d in /sys/class/infiniband/*; do echo "=== $(basename "$d") ==="; ls -l "$d"; done'
  have ibv_devinfo && run "$OUT/rdma/ibv_devinfo.txt" ibv_devinfo || true
else
  run "$OUT/rdma/README.txt" bash -lc 'echo "No /sys/class/infiniband found (RDMA not configured or not present)"'
fi

###############################################################################
# 6) Storage: NVMe, block devices, filesystems, SMART-ish hints
###############################################################################
log "Storage"
run "$OUT/lsblk.txt" lsblk -o NAME,TYPE,SIZE,MODEL,SERIAL,TRAN,ROTA,MOUNTPOINT,FSTYPE -e7
run "$OUT/df.txt" df -hT

if have nvme; then
  run "$OUT/nvme_list.txt" nvme list
  # id-ctrl for namespaces/devices if present
  mkdir -p "$OUT/nvme"
  for n in /dev/nvme*n1 /dev/nvme*; do
    [ -e "$n" ] || continue
    run "$OUT/nvme/$(basename "$n")_idctrl.txt" nvme id-ctrl "$n"
    run "$OUT/nvme/$(basename "$n")_smart.txt" nvme smart-log "$n"
  done
else
  run "$OUT/nvme_list.txt" bash -lc 'echo "nvme-cli not found"'
fi

# SATA / generic SMART (if smartctl exists)
if have smartctl; then
  run "$OUT/smartctl_scan.txt" smartctl --scan
  mkdir -p "$OUT/smartctl"
  while read -r dev rest; do
    [[ "$dev" =~ ^/dev/ ]] || continue
    run "$OUT/smartctl/$(basename "$dev")_info.txt" smartctl -i "$dev"
  done < <(smartctl --scan 2>/dev/null | awk '{print $1}' | sort -u)
else
  run "$OUT/smartctl_scan.txt" bash -lc 'echo "smartctl not found"'
fi

###############################################################################
# 7) Kernel logs: handle restricted dmesg by preferring journalctl
###############################################################################
log "Kernel logs"
mkdir -p "$OUT/logs"

# dmesg may be restricted; capture if possible
if dmesg >/dev/null 2>&1; then
  run "$OUT/logs/dmesg.txt" dmesg -T
else
  run "$OUT/logs/dmesg.txt" bash -lc 'echo "dmesg not permitted (kernel.dmesg_restrict or lockdown)"'
fi

# journalctl kernel logs (often works even when dmesg is restricted, if user has permission)
if have journalctl; then
  run "$OUT/logs/journal_kernel_err.txt" bash -lc 'journalctl -k -p err --no-pager 2>/dev/null || true'
  run "$OUT/logs/journal_kernel_warn.txt" bash -lc 'journalctl -k -p warning --no-pager 2>/dev/null || true'
  run "$OUT/logs/journal_kernel_pcie.txt" bash -lc "journalctl -k --no-pager 2>/dev/null | egrep -i 'pcie|aer|dpc|acs|link down|training|nvme|mlx|nvidia|amdgpu|i915|iommu|dma' || true"
else
  run "$OUT/logs/journalctl.txt" bash -lc 'echo "journalctl not found"'
fi

###############################################################################
# 8) IOMMU / DMA / hugepages / misc
###############################################################################
log "IOMMU / DMA / hugepages / misc"
run "$OUT/iommu.txt" bash -lc 'dmesg 2>/dev/null | egrep -i "iommu|dma remap|smmu|amd-vi|intel-iommu" || true; echo; cat /proc/cmdline 2>/dev/null || true'
run "$OUT/hugepages.txt" bash -lc 'cat /proc/meminfo | egrep -i "Huge|AnonHuge|ShmemHuge" || true; grep -R . /sys/kernel/mm/hugepages/*/nr_hugepages 2>/dev/null || true'

###############################################################################
# 9) Package it up
###############################################################################
log "Packaging bundle"

TAR="${OUT}.tar.gz"
tar -czf "$TAR" "$OUT"

ZIP="${OUT}.zip"
if have zip; then
  # -r recurse, -q quiet, -9 best compression
  (cd "$(dirname "$OUT")" && zip -rq9 "$(basename "$ZIP")" "$(basename "$OUT")") || true
else
  # Try python3 zipfile as a fallback (common on many images)
  if have python3; then
    python3 - <<'PY' || true
import os, zipfile
out = os.environ.get("OUT")
zip_path = f"{out}.zip"
def add_dir(z, d):
    for root, dirs, files in os.walk(d):
        for f in files:
            p = os.path.join(root, f)
            arc = os.path.relpath(p, os.path.dirname(d))
            z.write(p, arc)
if out and os.path.isdir(out):
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=9) as z:
        add_dir(z, out)
PY
  else
    echo "[!] zip not found and python3 not found; skipping .zip creation" >> "$OUT/_warnings.txt"
  fi
fi

log "Done."
echo
echo "Bundle directory: $OUT/"
echo "Compressed archive (tar.gz): $TAR"
echo "Compressed archive (zip): $ZIP"

Read more