eeckwrk99 Apprentice


Joined: 14 Mar 2021 Posts: 264 Location: Gentoo forums
|
Posted: Wed May 21, 2025 6:51 pm Post subject: [SOLVED] NVMe drive stops working |
|
|
I've been using the following setup for years without any issue:
- Gentoo installed on a Samsung 840 EVO SSD (/dev/sda)
- Arch Linux installed on a Samsung SSD 970 EVO Plus NVMe (/dev/nvme0n1):
Code: | fdisk -l
Disk /dev/sda: 232.89 GiB, 250059350016 bytes, 488397168 sectors
Disk model: Samsung SSD 840
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: ...
Device Start End Sectors Size Type
/dev/sda1 2048 1050623 1048576 512M EFI System
/dev/sda2 1050624 488396799 487346176 232.4G Linux LVM
Disk /dev/nvme0n1: 232.89 GiB, 250059350016 bytes, 488397168 sectors
Disk model: Samsung SSD 970 EVO Plus 250GB
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: ...
Device Start End Sectors Size Type
/dev/nvme0n1p1 2048 1050623 1048576 512M EFI System
/dev/nvme0n1p2 1050624 488396799 487346176 232.4G Linux LVM |
Gentoo is my main distro and I can chroot into my Arch Linux install by adding it to my /etc/fstab file:
Code: | % cat /etc/fstab
# <file system> <dir> <type> <options> <dump> <pass>
# /dev/sda1 LABEL=Gentoo-Boot
UUID=... /boot vfat rw,noatime,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,errors=remount-ro 0 2
# /dev/mapper/vg2-gentoo_swap LABEL=Gentoo-Swap
UUID=... none swap defaults 0 0
# /dev/mapper/vg2-gentoo_root LABEL=Gentoo-Root
UUID=... / ext4 rw,noatime 0 1
# /dev/mapper/vg2-gentoo_home LABEL=Gentoo-Home
UUID=... /home ext4 rw,noatime 0 2
tmpfs /var/tmp/portage tmpfs size=16G,uid=portage,gid=portage,mode=775 0 0
# Arch Linux
/dev/mapper/vg1-arch_root /media/Arch ext4 defaults,nofail 0 2
/dev/mapper/vg1-arch_home /media/Arch/home ext4 defaults,nofail 0 2
/dev/nvme0n1p1 /media/Arch/boot vfat defaults,noatime 0 2 |
For the past few weeks, the NVMe drive has been dropping out completely during relatively intensive I/O operations such as:
- running VMs (using QEMU/KVM on Gentoo) that are located on the NVMe
- updating Arch Linux kernel and nvidia-dkms package from within the Arch chroot
Code: | dmesg
[82593.041535] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[82593.041545] nvme nvme0: Does your device have a faulty power saving mode enabled?
[82593.041548] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
[82593.118217] nvme 0000:02:00.0: enabling device (0000 -> 0002)
[82593.118404] nvme nvme0: Disabling device after reset failure: -19
[83446.935928] usb 1-11: USB disconnect, device number 6
[83476.356673] EXT4-fs error (device dm-7): __ext4_find_entry:1639: inode #3935576: comm zsh: reading directory lblock 0
[83476.356728] EXT4-fs error (device dm-7): __ext4_find_entry:1639: inode #3952462: comm zsh: reading directory lblock 0
[83476.561519] EXT4-fs error (device dm-7): __ext4_find_entry:1639: inode #3935576: comm zsh: reading directory lblock 0
[83476.561589] EXT4-fs error (device dm-7): __ext4_find_entry:1639: inode #3952462: comm zsh: reading directory lblock 0
[83476.612580] EXT4-fs error (device dm-7): __ext4_find_entry:1639: inode #3935576: comm zsh: reading directory lblock 0
[83476.612617] EXT4-fs error (device dm-7): __ext4_find_entry:1639: inode #3952462: comm zsh: reading directory lblock 0
[83481.097260] EXT4-fs warning (device dm-6): ext4_end_bio:342: I/O error 10 writing to inode 525289 starting block 8956052)
[83481.097294] Buffer I/O error on device dm-6, logical block 8956052
[83481.758290] Aborting journal on device dm-7-8.
[83481.758315] Buffer I/O error on dev dm-7, logical block 21528576, lost sync page write
[83481.758330] JBD2: I/O error when updating journal superblock for dm-7-8.
[83482.611537] Aborting journal on device dm-6-8.
[83482.611570] Buffer I/O error on dev dm-6, logical block 6324224, lost sync page write
[83482.611580] JBD2: I/O error when updating journal superblock for dm-6-8.
[83483.479226] EXT4-fs error (device dm-6): ext4_journal_check_start:84: comm cp: Detected aborted journal
[83483.479268] Buffer I/O error on dev dm-6, logical block 0, lost sync page write
[83483.479279] EXT4-fs (dm-6): I/O error while writing superblock
[83483.479281] EXT4-fs (dm-6): Remounting filesystem read-only
[83587.609753] EXT4-fs warning (device dm-6): htree_dirblock_to_tree:1083: inode #2: lblock 0: comm zsh: error -5 reading directory block
[83587.625935] EXT4-fs warning (device dm-6): htree_dirblock_to_tree:1083: inode #2: lblock 0: comm zsh: error -5 reading directory block
[83587.641763] EXT4-fs warning (device dm-6): htree_dirblock_to_tree:1083: inode #2: lblock 0: comm zsh: error -5 reading directory block
[83588.251035] EXT4-fs warning (device dm-7): htree_dirblock_to_tree:1083: inode #2: lblock 0: comm lsd: error -5 reading directory block
[83667.302475] Buffer I/O error on dev dm-4, logical block 0, async page read
[83667.302485] Buffer I/O error on dev dm-4, logical block 0, async page read
[83667.302677] Buffer I/O error on dev dm-5, logical block 0, async page read
[83667.302684] Buffer I/O error on dev dm-5, logical block 0, async page read |
Code: | journalctl
May 21 11:37:15 gentoo-desktop kernel: EXT4-fs error (device dm-7): __ext4_find_entry:1639: inode #3935576: comm zsh: reading directory lblock 0
May 21 11:37:15 gentoo-desktop kernel: EXT4-fs error (device dm-7): __ext4_find_entry:1639: inode #3952462: comm zsh: reading directory lblock 0
May 21 11:37:16 gentoo-desktop kernel: EXT4-fs error (device dm-7): __ext4_find_entry:1639: inode #3935576: comm zsh: reading directory lblock 0
May 21 11:37:16 gentoo-desktop kernel: EXT4-fs error (device dm-7): __ext4_find_entry:1639: inode #3952462: comm zsh: reading directory lblock 0
May 21 11:37:16 gentoo-desktop kernel: EXT4-fs error (device dm-7): __ext4_find_entry:1639: inode #3935576: comm zsh: reading directory lblock 0
May 21 11:37:16 gentoo-desktop kernel: EXT4-fs error (device dm-7): __ext4_find_entry:1639: inode #3952462: comm zsh: reading directory lblock 0 |
At this point, any command targeting /media/Arch fails:
Code: | ls /media/Arch/
ls: reading directory '/media/Arch/': Input/output error |
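Going by the timestamps in the dmesg excerpt, the controller went down first (at about 82593 s) and the EXT4 errors only appeared roughly fifteen minutes later, when something next touched the mounted filesystems. A CSTS value of 0xffffffff typically means register reads are returning all ones, i.e. the device is no longer answering on the bus. A minimal filter for spotting those controller events in a log, assuming the message wording matches the excerpt (it may differ between kernel versions); the function name is made up for illustration:

```shell
#!/bin/sh
# Print NVMe controller reset/removal messages from a kernel log on stdin.
# The two patterns are copied from the dmesg excerpt in this thread.
nvme_reset_events() {
    grep -E 'nvme nvme[0-9]+: (controller is down|Disabling device after reset failure)'
}

# Typical use on a live system (needs permission to read the kernel log):
#   dmesg | nvme_reset_events
#   journalctl -k -b -1 | nvme_reset_events   # kernel messages, previous boot
```

If this prints nothing while the filesystem errors are present, the cause is more likely to sit above the controller (LVM, LUKS or the filesystem itself).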
I then reboot into my Arch system. After unlocking the LUKS container I get "recovering journal" and "Clearing orphaned inode" messages, and everything seems to work fine afterwards.
So far I can only reproduce the issue by running a VM or by updating the Arch kernel or the nvidia-dkms package.
Writing a 10GB zero-filled file doesn't seem to cause any harm (run while the disk is working normally, of course):
Code: | dd bs=1M count=10240 if=/dev/zero of=/media/Arch/file_10GB |
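One caveat about the dd test: without a flush, dd can return while part of the data is still in the page cache (this machine has 32 GB of RAM), so the drive may see a gentler write pattern than a VM produces. A hedged variant that uses pseudo-random data and forces a flush before dd exits; the function name and the output file name are hypothetical, and the flags assume GNU dd:

```shell
#!/bin/sh
# Write SIZE_MIB MiB of pseudo-random data to DIR/io_test.bin and flush it
# to the device before dd exits (conv=fsync). Names are illustrative only.
write_test_file() {
    dir=$1
    size_mib=$2
    dd if=/dev/urandom of="$dir/io_test.bin" bs=1M count="$size_mib" \
        iflag=fullblock conv=fsync status=none
}

# Example (writes 10 GiB, so check free space first):
#   write_test_file /media/Arch 10240
```

This is still a single sequential stream, so it may well not trigger the fault either; it only rules out the page cache as the reason the original test passed.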
Output of various commands run from a live Arch Linux ISO just after the issue occurred:
Code: | fsck -a /dev/nvme0n1p1
fsck from util-linux 2.41
fsck.fat 4.2 (2021-01-31)
There are differences between boot sector and its backup.
This is mostly harmless. Differences: (offset:original/backup)
65:01/00
Not automatically fixing this.
Dirty bit is set. Fs was not properly unmounted and some data may be corrupt.
Automatically removing dirty bit.
*** Filesystem was changed ***
Writing changes.
/dev/nvme0n1p1: 625 files, 21098/130812 clusters |
Code: | fsck -a /dev/mapper/vg1-arch_root
fsck from util-linux 2.41
Arch-Root: recovering journal
Arch-Root: Clearing orphaned inode 2245369 (uid=1000, gid=984, mode=010644, size=0)
Arch-Root: clean, 403794/3276800 files, 4535520/13107200 blocks |
Code: | smartctl -H /dev/nvme0n1
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.14.4-arch1-2] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED |
Code: | smartctl --all /dev/nvme0n1
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.14.4-arch1-2] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 970 EVO Plus 250GB
Serial Number: S4EUNF0M753005A
Firmware Version: 2B2QEXM7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 250,059,350,016 [250 GB]
Unallocated NVM Capacity: 0
Controller ID: 4
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 250,059,350,016 [250 GB]
Namespace 1 Utilization: 250,058,321,920 [250 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 5791b395da
Local Time is: Wed May 21 11:16:15 2025 UTC
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03): S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 85 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 7.80W - - 0 0 0 0 0 0
1 + 6.00W - - 1 1 1 1 0 0
2 + 3.40W - - 2 2 2 2 0 0
3 - 0.0700W - - 3 3 3 3 210 1200
4 - 0.0100W - - 4 4 4 4 2000 8000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 41 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 19%
Data Units Read: 29,539,438 [15.1 TB]
Data Units Written: 34,870,455 [17.8 TB]
Host Read Commands: 355,171,832
Host Write Commands: 634,976,535
Controller Busy Time: 3,589
Power Cycles: 4,176
Power On Hours: 2,703
Unsafe Shutdowns: 568
Media and Data Integrity Errors: 0
Error Information Log Entries: 3,930
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 41 Celsius
Temperature Sensor 2: 45 Celsius
Error Information (NVMe Log 0x01, 16 of 64 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS Message
0 3930 0 0x0004 0x4004 - 0 0 - Invalid Field in Command
Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
Num Test_Description Status Power_on_Hours Failing_LBA NSID Seg SCT Code
0 Short Completed without error 2696 - - - - -
1 Short Aborted: Controller Reset 2694 - - - - -
2 Short Completed without error 2694 - - - - - |
The drive had been running fine from late 2019 until now.
I don't think it has anything to do with the kernel: I've been running 6.12.21 since 2025-03-31, and the issue only started in late April or early May (I can't remember exactly):
Code: | emlop l -n 5 sys-kernel/gentoo-sources
2024-12-26 11:04:16 1:27 sys-kernel/gentoo-sources-6.6.67
2025-01-30 22:23:17 1:29 sys-kernel/gentoo-sources-6.6.74
2025-02-24 12:05:48 1:24 sys-kernel/gentoo-sources-6.12.16
2025-02-24 19:08:53 1:29 sys-kernel/gentoo-sources-6.12.16
2025-03-31 21:52:17 1:28 sys-kernel/gentoo-sources-6.12.21 |
The only recent suspicious update was sys-libs/libnvme but after downgrading to 1.11.1-r1, the issue occurred again (so I updated to 1.12-r1 again today):
Code: | emlop l -n 5 sys-libs/libnvme
2025-04-21 13:10:13 5:16 sys-libs/libnvme-1.11.1-r1
2025-04-22 14:35:41 12 sys-libs/libnvme-1.12-r1
2025-05-01 15:55:28 30 sys-libs/libnvme-1.12-r1
2025-05-19 18:55:17 13 sys-libs/libnvme-1.11.1-r1
2025-05-21 13:41:22 13 sys-libs/libnvme-1.12-r1 |
I haven't changed any BIOS setting. No recent hardware change.
Code: | emerge --info
Portage 3.0.67 (python 3.13.3-final-0, default/linux/amd64/23.0/desktop/systemd, gcc-14, glibc-2.40-r8, 6.12.21-gentoo-custom x86_64)
=================================================================
System uname: Linux-6.12.21-gentoo-custom-x86_64-Intel-R-_Core-TM-_i7-5820K_CPU_@_3.30GHz-with-glibc2.40
KiB Mem: 32772564 total, 6630924 free
KiB Swap: 33554428 total, 32879444 free
Timestamp of repository gentoo: Wed, 21 May 2025 13:09:31 +0000
Head commit of repository gentoo: 8b464f8a58dd7daa1fd2dfa5a88640206e81c6fa
Timestamp of repository guru: Tue, 20 May 2025 17:55:01 +0000
Head commit of repository guru: 538f7a4a0a1a700a05584ff741af562854258f2b
sh bash 5.2_p37
ld GNU ld (Gentoo 2.44 p1) 2.44.0
app-misc/pax-utils: 1.3.8::gentoo
app-shells/bash: 5.2_p37::gentoo
dev-build/autoconf: 2.72-r1::gentoo
dev-build/automake: 1.17-r1::gentoo
dev-build/cmake: 3.31.5::gentoo
dev-build/libtool: 2.5.4::gentoo
dev-build/make: 4.4.1-r100::gentoo
dev-build/meson: 1.7.0::gentoo
dev-lang/perl: 5.40.2::gentoo
dev-lang/python: 3.13.3::gentoo
dev-lang/rust-bin: 1.86.0-r1::gentoo, 1.87.0::gentoo
llvm-core/clang: 19.1.7::gentoo
llvm-core/llvm: 19.1.7::gentoo
sys-apps/baselayout: 2.17::gentoo
sys-apps/sandbox: 2.46::gentoo
sys-apps/systemd: 256.10::gentoo
sys-devel/binutils: 2.44-r1::gentoo
sys-devel/binutils-config: 5.5.2::gentoo
sys-devel/gcc: 14.2.1_p20241221::gentoo
sys-devel/gcc-config: 2.12.1::gentoo
sys-kernel/linux-headers: 6.12::gentoo (virtual/os-headers)
sys-libs/glibc: 2.40-r8::gentoo
Repositories:
gentoo
location: /var/db/repos/gentoo
sync-type: git
sync-uri: https://github.com/gentoo-mirror/gentoo.git
priority: -1000
volatile: False
guru
location: /var/db/repos/guru
sync-type: git
sync-uri: https://github.com/gentoo-mirror/guru.git
masters: gentoo
volatile: False
ABI="amd64"
ABI_X86="64"
ACCEPT_KEYWORDS="amd64"
ACCEPT_LICENSE="*"
ACCEPT_PROPERTIES="*"
ACCEPT_RESTRICT="*"
ADA_TARGET="gcc_14"
APACHE2_MODULES="authn_core authz_core socache_shmcb unixd actions alias auth_basic authn_anon authn_dbm authn_file authz_dbm authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir env expires ext_filter file_cache filter headers include info log_config logio mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias"
ARCH="amd64"
AUTOCLEAN="no"
BINPKG_COMPRESS="bzip2"
BINPKG_FORMAT="xpak"
BINPKG_GPG_SIGNING_BASE_COMMAND="/usr/bin/flock /run/lock/portage-binpkg-gpg.lock /usr/bin/gpg --sign --armor [PORTAGE_CONFIG]"
BINPKG_GPG_SIGNING_DIGEST="SHA512"
BINPKG_GPG_VERIFY_BASE_COMMAND="/usr/bin/gpg --verify --batch --no-tty --no-auto-check-trustdb --status-fd 2 [PORTAGE_CONFIG] [SIGNATURE]"
BINPKG_GPG_VERIFY_GPG_HOME="/etc/portage/gnupg"
BOOTSTRAP_USE="unicode pkg-config split-usr xml python_targets_python3_13 python_single_target_python3_13 multilib zstd cet systemd sysv-utils udev"
BROOT=""
CALLIGRA_FEATURES="karbon sheets words"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-march=haswell -O2 -pipe"
CFLAGS_amd64="-m64"
CFLAGS_x32="-mx32"
CFLAGS_x86="-m32 -mfpmath=sse"
CHOST="x86_64-pc-linux-gnu"
CHOST_amd64="x86_64-pc-linux-gnu"
CHOST_x32="x86_64-pc-linux-gnux32"
CHOST_x86="i686-pc-linux-gnu"
CLEAN_DELAY="5"
COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog"
COLLISION_IGNORE="/boot/dtbs/* /lib/modules/*"
COMMON_FLAGS="-march=haswell -O2 -pipe"
CONFIG_PROTECT="/etc /usr/lib64/libreoffice/program/sofficerc /usr/share/config /usr/share/gnupg/qualified.txt"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/dconf /etc/env.d /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/sandbox.d"
CPU_FLAGS_X86="aes avx avx2 f16c fma3 mmx mmxext pclmul popcnt rdrand sse sse2 sse3 sse4_1 sse4_2 ssse3"
CXXFLAGS="-march=haswell -O2 -pipe"
DEFAULT_ABI="amd64"
DISPLAY=":0"
DISTDIR="/var/cache/distfiles"
EDITOR="nvim"
ELIBC="glibc"
EMERGE_DEFAULT_OPTS=" -j12 -l10.8 --alert --ask --keep-going=y --misspell-suggestions=y --quiet --quiet-build=y --verbose"
EMERGE_WARNING_DELAY="10"
ENV_UNSET="CARGO_HOME DBUS_SESSION_BUS_ADDRESS DISPLAY GDK_PIXBUF_MODULE_FILE GOBIN GOPATH PERL5LIB PERL5OPT PERLPREFIX PERL_CORE PERL_MB_OPT PERL_MM_OPT XAUTHORITY XDG_CACHE_HOME XDG_CONFIG_HOME XDG_DATA_HOME XDG_RUNTIME_DIR XDG_STATE_HOME"
EPREFIX=""
EROOT="/"
ESYSROOT="/"
FCFLAGS="-march=haswell -O2 -pipe"
FEATURES="assume-digests binpkg-docompress binpkg-dostrip binpkg-logs buildpkg-live candy config-protect-if-modified distlocks ebuild-locks fixlafiles ipc-sandbox merge-sync merge-wait multilib-strict network-sandbox news parallel-fetch parallel-install pid-sandbox pkgdir-index-trusted preserve-libs protect-owned qa-unresolved-soname-deps sandbox strict unknown-features-warn unmerge-logs unmerge-orphans userfetch userpriv usersandbox usersync xattr"
FETCHCOMMAND="wget -t 3 -T 60 --passive-ftp -O "${DISTDIR}/${FILE}" "${URI}""
FETCHCOMMAND_RSYNC="rsync -LtvP "${URI}" "${DISTDIR}/${FILE}""
FETCHCOMMAND_SFTP="bash -c "x=\${2#sftp://} ; host=\${x%%/*} ; port=\${host##*:} ; host=\${host%:*} ; [[ \${host} = \${port} ]] && port= ; eval \"declare -a ssh_opts=(\${3})\" ; exec sftp \${port:+-P \${port}} \"\${ssh_opts[@]}\" \"\${host}:/\${x#*/}\" \"\$1\"" sftp "${DISTDIR}/${FILE}" "${URI}" "${PORTAGE_SSH_OPTS}""
FETCHCOMMAND_SSH="bash -c "x=\${2#ssh://} ; host=\${x%%/*} ; port=\${host##*:} ; host=\${host%:*} ; [[ \${host} = \${port} ]] && port= ; exec rsync --rsh=\"ssh \${port:+-p\${port}} \${3}\" -avP \"\${host}:/\${x#*/}\" \"\$1\"" rsync "${DISTDIR}/${FILE}" "${URI}" "${PORTAGE_SSH_OPTS}""
FFLAGS="-march=haswell -O2 -pipe"
GCC_SPECS=""
GENTOO_MIRRORS="http://distfiles.gentoo.org"
GPG_VERIFY_GROUP_DROP="nogroup"
GPG_VERIFY_USER_DROP="nobody"
GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock greis isync itrax navcom oceanserver oncore rtcm104v2 rtcm104v3 sirf skytraq superstar2 tsip tripmate tnt ublox"
GRUB_PLATFORMS="efi-64"
GSETTINGS_BACKEND="dconf"
GUILE_SINGLE_TARGET="3-0"
GUILE_TARGETS="3-0"
HOME="/root"
INFOPATH="/usr/share/gcc-data/x86_64-pc-linux-gnu/14/info:/usr/share/binutils-data/x86_64-pc-linux-gnu/2.44/info:/usr/share/autoconf-2.72/info:/usr/share/automake-1.17/info:/usr/share/info"
INPUT_DEVICES="libinput"
IUSE_IMPLICIT="abi_x86_64 prefix prefix-guest prefix-stack"
KERNEL="linux"
L10N="en en-US"
LANG="en_US.UTF-8"
LCD_DEVICES="bayrad cfontz glk hd44780 lb216 lcdm001 mtxorb text"
LC_COLLATE="C.UTF-8"
LC_MESSAGES="C"
LC_TIME="en_US.UTF-8"
LDFLAGS="-Wl,-O1 -Wl,--as-needed -Wl,-z,pack-relative-relocs"
LDFLAGS_amd64="-m elf_x86_64"
LDFLAGS_x32="-m elf32_x86_64"
LDFLAGS_x86="-m elf_i386"
LESS="-sFRiMX --shift 5"
LESSOPEN="|lesspipeno %s"
LEX="flex"
LIBDIR_amd64="lib64"
LIBDIR_x32="libx32"
LIBDIR_x86="lib"
LOGNAME="root"
LUA_SINGLE_TARGET="lua5-1"
LUA_TARGETS="lua5-1"
LV2_PATH="/usr/lib64/lv2"
MAKEOPTS="-j12 -l10.8"
MANPAGER="manpager"
MANPATH="/usr/share/gcc-data/x86_64-pc-linux-gnu/14/man:/usr/share/binutils-data/x86_64-pc-linux-gnu/2.44/man:/usr/local/share/man:/usr/share/man:/usr/lib/rust/man-bin-1.86.0:/usr/lib/rust/man-bin-1.87.0:/usr/lib/llvm/19/share/man"
MULTILIB_ABIS="amd64 x86"
MULTILIB_STRICT_DENY="64-bit.*shared object"
MULTILIB_STRICT_DIRS="/lib32 /lib /usr/lib32 /usr/lib /usr/kde/*/lib32 /usr/kde/*/lib /usr/qt/*/lib32 /usr/qt/*/lib /usr/X11R6/lib32 /usr/X11R6/lib"
MULTILIB_STRICT_EXEMPT="(perl5|gcc|binutils|eclipse-3|debug|portage|udev|systemd|clang|python-exec|llvm)"
NPM_CONFIG_GLOBALCONFIG="/etc/npm/npmrc"
OFFICE_IMPLEMENTATION="libreoffice"
OLDPWD="/root"
PAGER="/usr/bin/less"
PATH="/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/bin:/usr/lib/llvm/19/bin"
PERL_FEATURES="ithreads"
PHP_TARGETS="php8-2"
PKGDIR="/var/cache/binpkgs"
PORTAGE_ARCHLIST="alpha amd64 amd64-linux arm arm-linux arm64 arm64-linux arm64-macos hppa loong m68k mips ppc ppc-macos ppc64 ppc64-linux riscv riscv-linux s390 sparc x64-macos x64-solaris x86 x86-linux x86-macos"
PORTAGE_BIN_PATH="/usr/lib/portage/python3.13"
PORTAGE_COMPRESS_EXCLUDE_SUFFIXES="css gif htm[l]? jp[e]?g js pdf png"
PORTAGE_CONFIGROOT="/"
PORTAGE_DEBUG="0"
PORTAGE_DEPCACHEDIR="/var/cache/edb/dep"
PORTAGE_ELOG_CLASSES="log warn error"
PORTAGE_ELOG_MAILFROM="portage@localhost"
PORTAGE_ELOG_MAILSUBJECT="[portage] ebuild log for ${PACKAGE} on ${HOST}"
PORTAGE_ELOG_MAILURI="root"
PORTAGE_ELOG_SYSTEM="save_summary:log,warn,error,qa echo"
PORTAGE_FETCH_CHECKSUM_TRY_MIRRORS="5"
PORTAGE_FETCH_RESUME_MIN_SIZE="350K"
PORTAGE_GID="250"
PORTAGE_GPG_SIGNING_COMMAND="gpg --sign --digest-algo SHA256 --clearsign --yes --default-key "${PORTAGE_GPG_KEY}" --homedir "${PORTAGE_GPG_DIR}" "${FILE}""
PORTAGE_GRPNAME="portage"
PORTAGE_INST_GID="0"
PORTAGE_INST_UID="0"
PORTAGE_INTERNAL_CALLER="1"
PORTAGE_LOGDIR_CLEAN="find "${PORTAGE_LOGDIR}" -type f ! -name "summary.log*" -mtime +7 -delete"
PORTAGE_OVERRIDE_EPREFIX=""
PORTAGE_PYM_PATH="/usr/lib/python3.13/site-packages"
PORTAGE_PYTHONPATH="/usr/lib/python3.13/site-packages"
PORTAGE_QUIET="1"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --omit-dir-times --compress --force --whole-file --delete --stats --human-readable --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages --exclude=/.git"
PORTAGE_RSYNC_RETRIES="-1"
PORTAGE_SCHEDULING_POLICY="idle"
PORTAGE_SYNC_STALE="30"
PORTAGE_TMPDIR="/var/tmp"
PORTAGE_TRUST_HELPER="/usr/bin/getuto"
PORTAGE_USERNAME="portage"
PORTAGE_VERBOSE="1"
PORTAGE_WORKDIR_MODE="0700"
PORTAGE_XATTR_EXCLUDE="bcachefs.* bcachefs_effective.* btrfs.* security.evm security.ima security.selinux system.nfs4_acl user.apache_handler user.Beagle.* user.dublincore.* user.mime_encoding user.xdg.*"
POSTGRES_TARGETS="postgres17"
PROFILE_ONLY_VARIABLES="ARCH ELIBC IUSE_IMPLICIT KERNEL USE_EXPAND_IMPLICIT USE_EXPAND_UNPREFIXED USE_EXPAND_VALUES_ARCH USE_EXPAND_VALUES_ELIBC USE_EXPAND_VALUES_KERNEL"
PWD="/root"
PYTHONDONTWRITEBYTECODE="1"
PYTHON_SINGLE_TARGET="python3_13"
PYTHON_TARGETS="python3_13"
QT_QPA_PLATFORMTHEME="qt5ct"
RESUMECOMMAND="wget -c -t 3 -T 60 --passive-ftp -O "${DISTDIR}/${FILE}" "${URI}""
RESUMECOMMAND_RSYNC="rsync -LtvP "${URI}" "${DISTDIR}/${FILE}""
RESUMECOMMAND_SSH="bash -c "x=\${2#ssh://} ; host=\${x%%/*} ; port=\${host##*:} ; host=\${host%:*} ; [[ \${host} = \${port} ]] && port= ; exec rsync --rsh=\"ssh \${port:+-p\${port}} \${3}\" -avP \"\${host}:/\${x#*/}\" \"\$1\"" rsync "${DISTDIR}/${FILE}" "${URI}" "${PORTAGE_SSH_OPTS}""
ROOT="/"
ROOTPATH="/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/bin:/usr/lib/llvm/19/bin"
RPMDIR="/var/cache/rpm"
RUBY_TARGETS="ruby32"
SHELL="/bin/zsh"
SHLVL="1"
SYMLINK_LIB="no"
SYSROOT="/"
TERM="xterm-kitty"
TWISTED_DISABLE_WRITING_OF_PLUGIN_CACHE="1"
UNINSTALL_IGNORE="/boot/dtbs/* /lib/modules/* /var/run /var/lock /bin /lib /lib32 /lib64 /libx32 /sbin /usr/sbin /usr/lib/modules/*"
USE="X a52 aac acl acpi alsa amd64 branding bzip2 cairo cdda cdr cet crypt dbus dri dts dvd dvdr encode exif flac gdbm gif gpm gtk gui iconv icu ipv6 jpeg lcms libnotify libtirpc mad mng mp3 mp4 mpeg multilib ncurses nls ogg opengl openmp pam pango pcre pdf png policykit ppds pulseaudio qml qt5 qt6 readline sdl seccomp sound spell ssl startup-notification svg systemd test-rust tiff truetype udev udisks unicode upower usb vorbis vulkan wayland wxwidgets x264 xattr xcb xft xml xv xvid zlib" ABI_X86="64" ADA_TARGET="gcc_14" APACHE2_MODULES="authn_core authz_core socache_shmcb unixd actions alias auth_basic authn_anon authn_dbm authn_file authz_dbm authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir env expires ext_filter file_cache filter headers include info log_config logio mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" CALLIGRA_FEATURES="karbon sheets words" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" CPU_FLAGS_X86="aes avx avx2 f16c fma3 mmx mmxext pclmul popcnt rdrand sse sse2 sse3 sse4_1 sse4_2 ssse3" ELIBC="glibc" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock greis isync itrax navcom oceanserver oncore rtcm104v2 rtcm104v3 sirf skytraq superstar2 tsip tripmate tnt ublox" GRUB_PLATFORMS="efi-64" GUILE_SINGLE_TARGET="3-0" GUILE_TARGETS="3-0" INPUT_DEVICES="libinput" KERNEL="linux" L10N="en en-US" LCD_DEVICES="bayrad cfontz glk hd44780 lb216 lcdm001 mtxorb text" LUA_SINGLE_TARGET="lua5-1" LUA_TARGETS="lua5-1" OFFICE_IMPLEMENTATION="libreoffice" PERL_FEATURES="ithreads" PHP_TARGETS="php8-2" POSTGRES_TARGETS="postgres17" PYTHON_SINGLE_TARGET="python3_13" PYTHON_TARGETS="python3_13" RUBY_TARGETS="ruby32" VIDEO_CARDS="nvidia" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipp2p iface geoip fuzzy condition tarpit sysrq proto logmark ipmark dhcpmac delude chaos account"
USER="root"
USERLAND="GNU"
USE_EXPAND="ABI_MIPS ABI_S390 ABI_X86 ADA_TARGET ALSA_CARDS AMDGPU_TARGETS APACHE2_MODULES APACHE2_MPMS CALLIGRA_FEATURES CAMERAS COLLECTD_PLUGINS CPU_FLAGS_ARM CPU_FLAGS_PPC CPU_FLAGS_X86 CURL_QUIC CURL_SSL ELIBC FFTOOLS GPSD_PROTOCOLS GRUB_PLATFORMS GUILE_SINGLE_TARGET GUILE_TARGETS INPUT_DEVICES KERNEL L10N LCD_DEVICES LIBREOFFICE_EXTENSIONS LLVM_SLOT LLVM_TARGETS LUA_SINGLE_TARGET LUA_TARGETS NGINX_MODULES_HTTP NGINX_MODULES_MAIL NGINX_MODULES_STREAM OFFICE_IMPLEMENTATION OPENMPI_FABRICS OPENMPI_OFED_FEATURES OPENMPI_RM PERL_FEATURES PHP_TARGETS POSTGRES_TARGETS PYTHON_SINGLE_TARGET PYTHON_TARGETS QEMU_SOFTMMU_TARGETS QEMU_USER_TARGETS RUBY_TARGETS SANE_BACKENDS UWSGI_PLUGINS VIDEO_CARDS VOICEMAIL_STORAGE XTABLES_ADDONS"
USE_EXPAND_HIDDEN="ABI_MIPS ABI_S390 CPU_FLAGS_ARM CPU_FLAGS_PPC ELIBC KERNEL"
USE_EXPAND_IMPLICIT="ARCH ELIBC KERNEL"
USE_EXPAND_UNPREFIXED="ARCH"
USE_EXPAND_VALUES_ARCH="alpha amd64 amd64-linux arm arm64 arm64-macos hppa loong m68k mips ppc ppc64 ppc64-linux ppc-macos riscv s390 sparc x64-macos x64-solaris x86 x86-linux"
USE_EXPAND_VALUES_ELIBC="bionic Darwin glibc mingw musl SunOS"
USE_EXPAND_VALUES_KERNEL="Darwin linux SunOS"
USE_ORDER="env:pkg:conf:defaults:pkginternal:features:repo:env.d"
VIDEO_CARDS="nvidia"
XAUTHORITY="/root/.xauthW7EE58"
XDG_CONFIG_DIRS="/etc/xdg"
XDG_DATA_DIRS="/usr/local/share:/usr/share"
XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipp2p iface geoip fuzzy condition tarpit sysrq proto logmark ipmark dhcpmac delude chaos account"
ac_cv_c_undeclared_builtin_options="none needed"
enable_year2038="no"
gl_cv_compiler_check_decl_option="-Werror=implicit-function-declaration"
gl_cv_func_getcwd_path_max="yes" |
Any suggestion on how to troubleshoot this?
Note that I suspend my system at least once a day (while booted from Gentoo).
I might try to either shut down the system (or stop suspending) or use Arch for a few days to see if the issue occurs as well.
An NVIDIA issue keeps me from resuming from suspend with 570 drivers on Arch though, so if the issue has anything to do with suspend, I'd have to use 550 from the AUR.
Thanks.
Last edited by eeckwrk99 on Fri Jun 06, 2025 12:41 pm; edited 1 time in total |
 |
NeddySeagoon Administrator


Joined: 05 Jul 2003 Posts: 55341 Location: 56N 3W
|
Posted: Wed May 21, 2025 7:05 pm Post subject: |
|
|
eeckwrk99,
smartmontools is not well adapted to nvme. Try Code: | nvme smart-log /dev/nvme0n1 |
You will need sys-apps/nvme-cli _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
 |
eeckwrk99 Apprentice


Joined: 14 Mar 2021 Posts: 264 Location: Gentoo forums
|
Posted: Wed May 21, 2025 7:08 pm Post subject: |
|
|
NeddySeagoon wrote: | eeckwrk99,
smartmontools is not well adapted to nvme. Try Code: | nvme smart-log /dev/nvme0n1 |
You will need sys-apps/nvme-cli |
Thanks for the heads-up.
Code: | nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0
temperature : 104 °F (313 K)
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 19%
endurance group critical warning summary: 0
Data Units Read : 29543659 (15.13 TB)
Data Units Written : 34870990 (17.85 TB)
host_read_commands : 355235692
host_write_commands : 634996803
controller_busy_time : 3589
power_cycles : 4176
power_on_hours : 2703
unsafe_shutdowns : 568
media_errors : 0
num_err_log_entries : 3932
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 104 °F (313 K)
Temperature Sensor 2 : 109 °F (316 K)
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0 |
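This nvme-cli build prints temperatures in °F and Kelvin, while smartctl and the later posts use °C. A small sketch for converting the Kelvin values on the fly, assuming the "(NNN K)" layout shown above; the function name is hypothetical, and the result is rounded to whole degrees by subtracting 273:

```shell
#!/bin/sh
# Read `nvme smart-log` text on stdin and print every line that carries a
# Kelvin reading, converted to whole degrees Celsius (K - 273).
smart_log_celsius() {
    awk -F' *: *' '/ K\)/ {
        k = $2
        sub(/^.*\(/, "", k)    # drop everything up to the opening "("
        sub(/ K\).*$/, "", k)  # drop the " K)" suffix
        printf "%s: %d C\n", $1, k - 273
    }'
}

# Example:
#   nvme smart-log /dev/nvme0n1 | smart_log_celsius
```

For the composite reading above, 313 K comes out as 40 C, which agrees with the 104 °F shown.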
 |
zen_desu Apprentice

Joined: 25 Oct 2024 Posts: 298
|
Posted: Wed May 21, 2025 7:10 pm Post subject: |
|
|
My guess is a bad power supply or even a bad cable. NVMe drives can draw a lot of power in bursts, and this could make the voltage sag enough that the drive goes offline. There is also some chance that a kernel update added new power-saving features that are too aggressive.
I had a similar issue, which I could reproduce by running a filesystem scrub on all of my NVMe drives: one would go down first every time, and the cause was a bad 24-pin cable. _________________ µgRD dev
Wiki writer |
 |
NeddySeagoon Administrator


Joined: 05 Jul 2003 Posts: 55341 Location: 56N 3W
|
Posted: Wed May 21, 2025 7:20 pm Post subject: |
|
|
eeckwrk99,
On TLC flash media, you are usually guaranteed around 600 erase cycles.
With a 256G drive, that's roughly 150TB written before your warranty expires.
You have 17.85 TB written, so you are well within the write life.
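The 17.85 TB figure can be reproduced from the raw counter: NVMe reports Data Units in thousands of 512-byte blocks, so one unit is 512,000 bytes. A small sketch of that arithmetic; the function names are hypothetical, and the write budget passed as the second argument is whatever endurance figure you trust for the drive:

```shell
#!/bin/sh
# Convert an NVMe "Data Units Written" counter to decimal terabytes.
# One data unit is 1000 blocks of 512 bytes, i.e. 512,000 bytes.
data_units_to_tb() {
    awk -v units="$1" 'BEGIN { printf "%.2f\n", units * 512000 / 1e12 }'
}

# Percentage of a write budget (given in TB) consumed so far.
budget_used_pct() {
    awk -v units="$1" -v budget_tb="$2" \
        'BEGIN { printf "%.1f\n", units * 512000 / 1e12 / budget_tb * 100 }'
}
```

`data_units_to_tb 34870990` prints 17.85, matching the smart-log output. Note that the drive's own percentage_used value is the controller's estimate and need not agree with a simple bytes-written ratio.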
Your dmesg includes the text
Code: | [82593.041548] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug |
Add that mouthful to your kernel command line to disable power savings and see what happens. |
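How the parameters get onto the command line depends on the bootloader. A minimal sketch, assuming GRUB with its settings in /etc/default/grub (GRUB_PLATFORMS is set in the emerge --info output, but the bootloader setup is still an assumption here). The helper name is hypothetical, it relies on GNU sed, and it edits whichever file it is given, so try it on a copy first:

```shell
#!/bin/sh
# Hypothetical helper: append kernel parameters to the GRUB_CMDLINE_LINUX
# line of a GRUB defaults file (normally /etc/default/grub).
add_kernel_params() {
    file=$1
    shift
    # Make sure the variable exists so there is a line to extend.
    grep -q '^GRUB_CMDLINE_LINUX=' "$file" || echo 'GRUB_CMDLINE_LINUX=""' >> "$file"
    for param in "$@"; do
        # Skip parameters already present on that line (fixed-string match).
        if grep '^GRUB_CMDLINE_LINUX=' "$file" | grep -qF -- "$param"; then
            continue
        fi
        # Insert the parameter just before the closing double quote.
        sed -i "/^GRUB_CMDLINE_LINUX=/ s|\"\$| $param\"|" "$file"
    done
}

# Example, run as root against a copy of the real file:
#   add_kernel_params ./grub.copy \
#       nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off
```

Afterwards the GRUB configuration has to be regenerated, typically with `grub-mkconfig -o /boot/grub/grub.cfg`, and the machine rebooted for the parameters to take effect.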
 |
Anon-E-moose Watchman


Joined: 23 May 2008 Posts: 6301 Location: Dallas area
|
Posted: Wed May 21, 2025 7:43 pm Post subject: |
|
|
Find the hwmon entry for the device with "cat /sys/class/hwmon/*/name", note which directory it is under, and monitor the temperature while doing lots of I/O.
If temps stay fine over the long term, then it could very well be either power or hardware (motherboard and/or the NVMe itself) _________________ UM780, 6.14 zen kernel, gcc 13, openrc, wayland |
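The hwmon lookup described above can be scripted. A minimal sketch, assuming the kernel exposes the drive through an hwmon device whose name file contains "nvme" and that readings are in millidegrees Celsius, as is usual for sysfs temperature inputs; the function name is hypothetical and the base directory is a parameter so the logic can be tried against a test tree:

```shell
#!/bin/sh
# Print every temperature reading of hwmon devices named "nvme".
# BASE defaults to the real sysfs location.
nvme_temps() {
    base=${1:-/sys/class/hwmon}
    for dir in "$base"/hwmon*; do
        [ -r "$dir/name" ] || continue
        [ "$(cat "$dir/name")" = "nvme" ] || continue
        for input in "$dir"/temp*_input; do
            [ -r "$input" ] || continue
            # Use the matching *_label file when present, else the file name.
            label_file=${input%_input}_label
            label=$(cat "$label_file" 2>/dev/null || basename "$input")
            # sysfs reports millidegrees Celsius; print whole degrees.
            echo "$label: $(( $(cat "$input") / 1000 )) C"
        done
    done
}

# Example: poll every two seconds while the I/O load is running.
#   while sleep 2; do nvme_temps; done
```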
 |
eeckwrk99 Apprentice


Joined: 14 Mar 2021 Posts: 264 Location: Gentoo forums
|
Posted: Wed May 21, 2025 8:47 pm Post subject: |
|
|
Thanks for the input, everyone.
For now, I've added the suggested options to the kernel command line:
Code: | cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-6.12.21-gentoo-custom root=/dev/mapper/vg2-gentoo_root ro rd.luks.uuid=luks-... rd.lvm.lv=vg2/gentoo_root rd.lvm.lv=vg2/gentoo_swap root=... resume=UUID=... rd.luks.options=password-echo=no nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off quiet rw |
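Checking /proc/cmdline by eye works, but it is easy to miss a typo in a long line. A small sketch that reports which expected parameters are absent; the function name is hypothetical and the file is passed as an argument so the same check can be run on a saved copy:

```shell
#!/bin/sh
# Report which of the given parameters are absent from a kernel command line.
# CMDLINE_FILE is normally /proc/cmdline.
missing_kernel_params() {
    cmdline_file=$1
    shift
    for param in "$@"; do
        # Pad with spaces so only whole-word matches count.
        case " $(cat "$cmdline_file") " in
            *" $param "*) ;;
            *) echo "$param" ;;
        esac
    done
}

# Example:
#   missing_kernel_params /proc/cmdline \
#       nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off
```

No output means all parameters are present. The value the NVMe driver actually picked up should also be readable from /sys/module/nvme_core/parameters/default_ps_max_latency_us.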
I still have my 6.1 and 6.6 .config files, so trying those kernel versions is still an option.
I've just fired up two VMs for about half an hour (after adding the above kernel parameters) and did my usual stuff on them while monitoring the NVMe temperatures with sys-process/bottom. The latter reported between 50°C and 55°C for "Composite" and "Sensor 1". "Sensor 2" temp varied between 60°C and 81°C, with an average of 67-70°C I'd say.
I noticed that the issue was most likely to occur after a resume from suspend to RAM so I'll see how it goes and report back. |
 |
eeckwrk99 Apprentice


Joined: 14 Mar 2021 Posts: 264 Location: Gentoo forums
|
Posted: Fri Jun 06, 2025 12:41 pm Post subject: |
|
|
After 15+ days with
Code: | nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off | added to the kernel parameters, the issue never occurred again.
Marking as solved. |
 |