Gentoo Forums
[SOLVED] NVMe drive stops working

eeckwrk99
Posted: Wed May 21, 2025 6:51 pm

I've been using the following setup for years without any issue:

- Gentoo installed on a Samsung 840 EVO SSD (/dev/sda)
- Arch Linux installed on a Samsung SSD 970 EVO Plus NVMe (/dev/nvme0n1):

Code:
fdisk -l

Disk /dev/sda: 232.89 GiB, 250059350016 bytes, 488397168 sectors
Disk model: Samsung SSD 840
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: ...

Device       Start       End   Sectors   Size Type
/dev/sda1     2048   1050623   1048576   512M EFI System
/dev/sda2  1050624 488396799 487346176 232.4G Linux LVM



Disk /dev/nvme0n1: 232.89 GiB, 250059350016 bytes, 488397168 sectors
Disk model: Samsung SSD 970 EVO Plus 250GB
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: ...

Device           Start       End   Sectors   Size Type
/dev/nvme0n1p1    2048   1050623   1048576   512M EFI System
/dev/nvme0n1p2 1050624 488396799 487346176 232.4G Linux LVM


Gentoo is my main distro and I can chroot into my Arch Linux install by adding it to my /etc/fstab file:

Code:
% cat /etc/fstab

# <file system> <dir> <type> <options> <dump> <pass>

# /dev/sda1 LABEL=Gentoo-Boot
UUID=...   /boot   vfat    rw,noatime,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,errors=remount-ro   0 2

# /dev/mapper/vg2-gentoo_swap LABEL=Gentoo-Swap
UUID=...   none    swap    defaults      0 0

# /dev/mapper/vg2-gentoo_root LABEL=Gentoo-Root
UUID=...   /       ext4    rw,noatime    0 1

# /dev/mapper/vg2-gentoo_home LABEL=Gentoo-Home
UUID=...   /home   ext4    rw,noatime    0 2

tmpfs        /var/tmp/portage    tmpfs   size=16G,uid=portage,gid=portage,mode=775    0 0

# Arch Linux
/dev/mapper/vg1-arch_root     /media/Arch                  ext4    defaults,nofail     0 2
/dev/mapper/vg1-arch_home     /media/Arch/home             ext4    defaults,nofail     0 2
/dev/nvme0n1p1                /media/Arch/boot             vfat    defaults,noatime    0 2


For the past few weeks, the NVMe drive has completely stopped working during relatively intensive I/O operations such as:
- running VMs (using QEMU/KVM on Gentoo) that are located on the NVMe
- updating Arch Linux kernel and nvidia-dkms package from within the Arch chroot

Code:
dmesg

[82593.041535] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[82593.041545] nvme nvme0: Does your device have a faulty power saving mode enabled?
[82593.041548] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
[82593.118217] nvme 0000:02:00.0: enabling device (0000 -> 0002)
[82593.118404] nvme nvme0: Disabling device after reset failure: -19
[83446.935928] usb 1-11: USB disconnect, device number 6
[83476.356673] EXT4-fs error (device dm-7): __ext4_find_entry:1639: inode #3935576: comm zsh: reading directory lblock 0
[83476.356728] EXT4-fs error (device dm-7): __ext4_find_entry:1639: inode #3952462: comm zsh: reading directory lblock 0
[83476.561519] EXT4-fs error (device dm-7): __ext4_find_entry:1639: inode #3935576: comm zsh: reading directory lblock 0
[83476.561589] EXT4-fs error (device dm-7): __ext4_find_entry:1639: inode #3952462: comm zsh: reading directory lblock 0
[83476.612580] EXT4-fs error (device dm-7): __ext4_find_entry:1639: inode #3935576: comm zsh: reading directory lblock 0
[83476.612617] EXT4-fs error (device dm-7): __ext4_find_entry:1639: inode #3952462: comm zsh: reading directory lblock 0
[83481.097260] EXT4-fs warning (device dm-6): ext4_end_bio:342: I/O error 10 writing to inode 525289 starting block 8956052)
[83481.097294] Buffer I/O error on device dm-6, logical block 8956052
[83481.758290] Aborting journal on device dm-7-8.
[83481.758315] Buffer I/O error on dev dm-7, logical block 21528576, lost sync page write
[83481.758330] JBD2: I/O error when updating journal superblock for dm-7-8.
[83482.611537] Aborting journal on device dm-6-8.
[83482.611570] Buffer I/O error on dev dm-6, logical block 6324224, lost sync page write
[83482.611580] JBD2: I/O error when updating journal superblock for dm-6-8.
[83483.479226] EXT4-fs error (device dm-6): ext4_journal_check_start:84: comm cp: Detected aborted journal
[83483.479268] Buffer I/O error on dev dm-6, logical block 0, lost sync page write
[83483.479279] EXT4-fs (dm-6): I/O error while writing superblock
[83483.479281] EXT4-fs (dm-6): Remounting filesystem read-only
[83587.609753] EXT4-fs warning (device dm-6): htree_dirblock_to_tree:1083: inode #2: lblock 0: comm zsh: error -5 reading directory block
[83587.625935] EXT4-fs warning (device dm-6): htree_dirblock_to_tree:1083: inode #2: lblock 0: comm zsh: error -5 reading directory block
[83587.641763] EXT4-fs warning (device dm-6): htree_dirblock_to_tree:1083: inode #2: lblock 0: comm zsh: error -5 reading directory block
[83588.251035] EXT4-fs warning (device dm-7): htree_dirblock_to_tree:1083: inode #2: lblock 0: comm lsd: error -5 reading directory block
[83667.302475] Buffer I/O error on dev dm-4, logical block 0, async page read
[83667.302485] Buffer I/O error on dev dm-4, logical block 0, async page read
[83667.302677] Buffer I/O error on dev dm-5, logical block 0, async page read
[83667.302684] Buffer I/O error on dev dm-5, logical block 0, async page read
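
The first dmesg line above is the key one: the controller status register (CSTS) read back as 0xffffffff. An all-ones value commonly indicates that reads from the device are no longer being answered, and the -19 in the "reset failure" line corresponds to ENODEV. A minimal sketch for spotting such events in a kernel log, assuming the message format shown above (`nvme_resets` is a hypothetical helper name, not an existing tool):

```shell
# Read kernel log text on stdin and report NVMe "controller is down" events.
# CSTS=0xffffffff means every bit of the controller status register read as 1.
nvme_resets() {
    grep -E 'nvme nvme[0-9]+: controller is down' | while IFS= read -r line; do
        # Pull out the hexadecimal CSTS value, if the line has one
        csts=$(printf '%s\n' "$line" | sed -n 's/.*CSTS=\(0x[0-9a-fA-F]*\).*/\1/p')
        if [ "$csts" = "0xffffffff" ]; then
            echo "all-ones CSTS: $line"
        else
            echo "CSTS=$csts: $line"
        fi
    done
}

# Sample line taken from the dmesg output above
printf '%s\n' '[82593.041535] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10' | nvme_resets
```

On a live system the same function could be fed with `dmesg | nvme_resets` or `journalctl -k | nvme_resets`.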


Code:
journalctl

May 21 11:37:15 gentoo-desktop kernel: EXT4-fs error (device dm-7): __ext4_find_entry:1639: inode #3935576: comm zsh: reading directory lblock 0
May 21 11:37:15 gentoo-desktop kernel: EXT4-fs error (device dm-7): __ext4_find_entry:1639: inode #3952462: comm zsh: reading directory lblock 0
May 21 11:37:16 gentoo-desktop kernel: EXT4-fs error (device dm-7): __ext4_find_entry:1639: inode #3935576: comm zsh: reading directory lblock 0
May 21 11:37:16 gentoo-desktop kernel: EXT4-fs error (device dm-7): __ext4_find_entry:1639: inode #3952462: comm zsh: reading directory lblock 0
May 21 11:37:16 gentoo-desktop kernel: EXT4-fs error (device dm-7): __ext4_find_entry:1639: inode #3935576: comm zsh: reading directory lblock 0
May 21 11:37:16 gentoo-desktop kernel: EXT4-fs error (device dm-7): __ext4_find_entry:1639: inode #3952462: comm zsh: reading directory lblock 0



At this point, any command targeting /media/Arch fails:

Code:
ls /media/Arch/
ls: reading directory '/media/Arch/': Input/output error


Then I reboot into my Arch system. I get "recovering journal" and "Clearing orphaned inode" messages after unlocking the LUKS container, and everything seems to work just fine.

I couldn't reproduce the issue by any method other than running a VM or updating the Arch kernel or the nvidia-dkms package.

Writing a 10GB file of zeros doesn't seem to cause any harm (I ran this while the disk was working normally, of course):

Code:
dd bs=1M count=10240 if=/dev/zero of=/media/Arch/file_10GB



Output of various commands run from a live Arch Linux ISO just after the issue occurred:

Code:
fsck -a /dev/nvme0n1p1

fsck from util-linux 2.41
fsck.fat 4.2 (2021-01-31)
There are differences between boot sector and its backup.
This is mostly harmless. Differences: (offset:original/backup)
  65:01/00
  Not automatically fixing this.
Dirty bit is set. Fs was not properly unmounted and some data may be corrupt.
 Automatically removing dirty bit.

*** Filesystem was changed ***
Writing changes.
/dev/nvme0n1p1: 625 files, 21098/130812 clusters


Code:
fsck -a /dev/mapper/vg1-arch_root

fsck from util-linux 2.41
Arch-Root: recovering journal
Arch-Root: Clearing orphaned inode 2245369 (uid=1000, gid=984, mode=010644, size=0)
Arch-Root: clean, 403794/3276800 files, 4535520/13107200 blocks


Code:
smartctl -H /dev/nvme0n1

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.14.4-arch1-2] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED


Code:
smartctl --all /dev/nvme0n1
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.14.4-arch1-2] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO Plus 250GB
Serial Number:                      S4EUNF0M753005A
Firmware Version:                   2B2QEXM7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 250,059,350,016 [250 GB]
Unallocated NVM Capacity:           0
Controller ID:                      4
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          250,059,350,016 [250 GB]
Namespace 1 Utilization:            250,058,321,920 [250 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5791b395da
Local Time is:                      Wed May 21 11:16:15 2025 UTC
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.80W       -        -    0  0  0  0        0       0
 1 +     6.00W       -        -    1  1  1  1        0       0
 2 +     3.40W       -        -    2  2  2  2        0       0
 3 -   0.0700W       -        -    3  3  3  3      210    1200
 4 -   0.0100W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        41 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    19%
Data Units Read:                    29,539,438 [15.1 TB]
Data Units Written:                 34,870,455 [17.8 TB]
Host Read Commands:                 355,171,832
Host Write Commands:                634,976,535
Controller Busy Time:               3,589
Power Cycles:                       4,176
Power On Hours:                     2,703
Unsafe Shutdowns:                   568
Media and Data Integrity Errors:    0
Error Information Log Entries:      3,930
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               41 Celsius
Temperature Sensor 2:               45 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0       3930     0  0x0004  0x4004      -            0     0     -  Invalid Field in Command

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID Seg SCT Code
 0   Short             Completed without error                2696            -     -   -   -    -
 1   Short             Aborted: Controller Reset              2694            -     -   -   -    -
 2   Short             Completed without error                2694            -     -   -   -    -



The drive had been running just fine from late 2019 until now.

I don't think it has anything to do with the kernel: I've been running 6.12.21 since 2025-03-31, and the issue started to occur in late April or early May (I cannot remember exactly):

Code:
emlop l -n 5 sys-kernel/gentoo-sources

2024-12-26 11:04:16  1:27 sys-kernel/gentoo-sources-6.6.67
2025-01-30 22:23:17  1:29 sys-kernel/gentoo-sources-6.6.74
2025-02-24 12:05:48  1:24 sys-kernel/gentoo-sources-6.12.16
2025-02-24 19:08:53  1:29 sys-kernel/gentoo-sources-6.12.16
2025-03-31 21:52:17  1:28 sys-kernel/gentoo-sources-6.12.21


The only recent suspicious update was sys-libs/libnvme but after downgrading to 1.11.1-r1, the issue occurred again (so I updated to 1.12-r1 again today):

Code:
emlop l -n 5 sys-libs/libnvme

2025-04-21 13:10:13  5:16 sys-libs/libnvme-1.11.1-r1
2025-04-22 14:35:41    12 sys-libs/libnvme-1.12-r1
2025-05-01 15:55:28    30 sys-libs/libnvme-1.12-r1
2025-05-19 18:55:17    13 sys-libs/libnvme-1.11.1-r1
2025-05-21 13:41:22    13 sys-libs/libnvme-1.12-r1


I haven't changed any BIOS setting. No recent hardware change.

Code:
emerge --info

Portage 3.0.67 (python 3.13.3-final-0, default/linux/amd64/23.0/desktop/systemd, gcc-14, glibc-2.40-r8, 6.12.21-gentoo-custom x86_64)
=================================================================
System uname: Linux-6.12.21-gentoo-custom-x86_64-Intel-R-_Core-TM-_i7-5820K_CPU_@_3.30GHz-with-glibc2.40
KiB Mem:    32772564 total,   6630924 free
KiB Swap:   33554428 total,  32879444 free
Timestamp of repository gentoo: Wed, 21 May 2025 13:09:31 +0000
Head commit of repository gentoo: 8b464f8a58dd7daa1fd2dfa5a88640206e81c6fa

Timestamp of repository guru: Tue, 20 May 2025 17:55:01 +0000
Head commit of repository guru: 538f7a4a0a1a700a05584ff741af562854258f2b

sh bash 5.2_p37
ld GNU ld (Gentoo 2.44 p1) 2.44.0
app-misc/pax-utils:        1.3.8::gentoo
app-shells/bash:           5.2_p37::gentoo
dev-build/autoconf:        2.72-r1::gentoo
dev-build/automake:        1.17-r1::gentoo
dev-build/cmake:           3.31.5::gentoo
dev-build/libtool:         2.5.4::gentoo
dev-build/make:            4.4.1-r100::gentoo
dev-build/meson:           1.7.0::gentoo
dev-lang/perl:             5.40.2::gentoo
dev-lang/python:           3.13.3::gentoo
dev-lang/rust-bin:         1.86.0-r1::gentoo, 1.87.0::gentoo
llvm-core/clang:           19.1.7::gentoo
llvm-core/llvm:            19.1.7::gentoo
sys-apps/baselayout:       2.17::gentoo
sys-apps/sandbox:          2.46::gentoo
sys-apps/systemd:          256.10::gentoo
sys-devel/binutils:        2.44-r1::gentoo
sys-devel/binutils-config: 5.5.2::gentoo
sys-devel/gcc:             14.2.1_p20241221::gentoo
sys-devel/gcc-config:      2.12.1::gentoo
sys-kernel/linux-headers:  6.12::gentoo (virtual/os-headers)
sys-libs/glibc:            2.40-r8::gentoo
Repositories:

gentoo
    location: /var/db/repos/gentoo
    sync-type: git
    sync-uri: https://github.com/gentoo-mirror/gentoo.git
    priority: -1000
    volatile: False

guru
    location: /var/db/repos/guru
    sync-type: git
    sync-uri: https://github.com/gentoo-mirror/guru.git
    masters: gentoo
    volatile: False

ABI="amd64"
ABI_X86="64"
ACCEPT_KEYWORDS="amd64"
ACCEPT_LICENSE="*"
ACCEPT_PROPERTIES="*"
ACCEPT_RESTRICT="*"
ADA_TARGET="gcc_14"
APACHE2_MODULES="authn_core authz_core socache_shmcb unixd actions alias auth_basic authn_anon authn_dbm authn_file authz_dbm authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir env expires ext_filter file_cache filter headers include info log_config logio mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias"
ARCH="amd64"
AUTOCLEAN="no"
BINPKG_COMPRESS="bzip2"
BINPKG_FORMAT="xpak"
BINPKG_GPG_SIGNING_BASE_COMMAND="/usr/bin/flock /run/lock/portage-binpkg-gpg.lock /usr/bin/gpg --sign --armor [PORTAGE_CONFIG]"
BINPKG_GPG_SIGNING_DIGEST="SHA512"
BINPKG_GPG_VERIFY_BASE_COMMAND="/usr/bin/gpg --verify --batch --no-tty --no-auto-check-trustdb --status-fd 2 [PORTAGE_CONFIG] [SIGNATURE]"
BINPKG_GPG_VERIFY_GPG_HOME="/etc/portage/gnupg"
BOOTSTRAP_USE="unicode pkg-config split-usr xml python_targets_python3_13 python_single_target_python3_13 multilib zstd cet systemd sysv-utils udev"
BROOT=""
CALLIGRA_FEATURES="karbon sheets words"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-march=haswell -O2 -pipe"
CFLAGS_amd64="-m64"
CFLAGS_x32="-mx32"
CFLAGS_x86="-m32 -mfpmath=sse"
CHOST="x86_64-pc-linux-gnu"
CHOST_amd64="x86_64-pc-linux-gnu"
CHOST_x32="x86_64-pc-linux-gnux32"
CHOST_x86="i686-pc-linux-gnu"
CLEAN_DELAY="5"
COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog"
COLLISION_IGNORE="/boot/dtbs/* /lib/modules/*"
COMMON_FLAGS="-march=haswell -O2 -pipe"
CONFIG_PROTECT="/etc /usr/lib64/libreoffice/program/sofficerc /usr/share/config /usr/share/gnupg/qualified.txt"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/dconf /etc/env.d /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/sandbox.d"
CPU_FLAGS_X86="aes avx avx2 f16c fma3 mmx mmxext pclmul popcnt rdrand sse sse2 sse3 sse4_1 sse4_2 ssse3"
CXXFLAGS="-march=haswell -O2 -pipe"
DEFAULT_ABI="amd64"
DISPLAY=":0"
DISTDIR="/var/cache/distfiles"
EDITOR="nvim"
ELIBC="glibc"
EMERGE_DEFAULT_OPTS=" -j12 -l10.8 --alert --ask --keep-going=y --misspell-suggestions=y --quiet --quiet-build=y --verbose"
EMERGE_WARNING_DELAY="10"
ENV_UNSET="CARGO_HOME DBUS_SESSION_BUS_ADDRESS DISPLAY GDK_PIXBUF_MODULE_FILE GOBIN GOPATH PERL5LIB PERL5OPT PERLPREFIX PERL_CORE PERL_MB_OPT PERL_MM_OPT XAUTHORITY XDG_CACHE_HOME XDG_CONFIG_HOME XDG_DATA_HOME XDG_RUNTIME_DIR XDG_STATE_HOME"
EPREFIX=""
EROOT="/"
ESYSROOT="/"
FCFLAGS="-march=haswell -O2 -pipe"
FEATURES="assume-digests binpkg-docompress binpkg-dostrip binpkg-logs buildpkg-live candy config-protect-if-modified distlocks ebuild-locks fixlafiles ipc-sandbox merge-sync merge-wait multilib-strict network-sandbox news parallel-fetch parallel-install pid-sandbox pkgdir-index-trusted preserve-libs protect-owned qa-unresolved-soname-deps sandbox strict unknown-features-warn unmerge-logs unmerge-orphans userfetch userpriv usersandbox usersync xattr"
FETCHCOMMAND="wget -t 3 -T 60 --passive-ftp -O "${DISTDIR}/${FILE}" "${URI}""
FETCHCOMMAND_RSYNC="rsync -LtvP "${URI}" "${DISTDIR}/${FILE}""
FETCHCOMMAND_SFTP="bash -c "x=\${2#sftp://} ; host=\${x%%/*} ; port=\${host##*:} ; host=\${host%:*} ; [[ \${host} = \${port} ]] && port= ; eval \"declare -a ssh_opts=(\${3})\" ; exec sftp \${port:+-P \${port}} \"\${ssh_opts[@]}\" \"\${host}:/\${x#*/}\" \"\$1\"" sftp "${DISTDIR}/${FILE}" "${URI}" "${PORTAGE_SSH_OPTS}""
FETCHCOMMAND_SSH="bash -c "x=\${2#ssh://} ; host=\${x%%/*} ; port=\${host##*:} ; host=\${host%:*} ; [[ \${host} = \${port} ]] && port= ; exec rsync --rsh=\"ssh \${port:+-p\${port}} \${3}\" -avP \"\${host}:/\${x#*/}\" \"\$1\"" rsync "${DISTDIR}/${FILE}" "${URI}" "${PORTAGE_SSH_OPTS}""
FFLAGS="-march=haswell -O2 -pipe"
GCC_SPECS=""
GENTOO_MIRRORS="http://distfiles.gentoo.org"
GPG_VERIFY_GROUP_DROP="nogroup"
GPG_VERIFY_USER_DROP="nobody"
GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock greis isync itrax navcom oceanserver oncore rtcm104v2 rtcm104v3 sirf skytraq superstar2 tsip tripmate tnt ublox"
GRUB_PLATFORMS="efi-64"
GSETTINGS_BACKEND="dconf"
GUILE_SINGLE_TARGET="3-0"
GUILE_TARGETS="3-0"
HOME="/root"
INFOPATH="/usr/share/gcc-data/x86_64-pc-linux-gnu/14/info:/usr/share/binutils-data/x86_64-pc-linux-gnu/2.44/info:/usr/share/autoconf-2.72/info:/usr/share/automake-1.17/info:/usr/share/info"
INPUT_DEVICES="libinput"
IUSE_IMPLICIT="abi_x86_64 prefix prefix-guest prefix-stack"
KERNEL="linux"
L10N="en en-US"
LANG="en_US.UTF-8"
LCD_DEVICES="bayrad cfontz glk hd44780 lb216 lcdm001 mtxorb text"
LC_COLLATE="C.UTF-8"
LC_MESSAGES="C"
LC_TIME="en_US.UTF-8"
LDFLAGS="-Wl,-O1 -Wl,--as-needed -Wl,-z,pack-relative-relocs"
LDFLAGS_amd64="-m elf_x86_64"
LDFLAGS_x32="-m elf32_x86_64"
LDFLAGS_x86="-m elf_i386"
LESS="-sFRiMX --shift 5"
LESSOPEN="|lesspipeno %s"
LEX="flex"
LIBDIR_amd64="lib64"
LIBDIR_x32="libx32"
LIBDIR_x86="lib"
LOGNAME="root"
LUA_SINGLE_TARGET="lua5-1"
LUA_TARGETS="lua5-1"
LV2_PATH="/usr/lib64/lv2"
MAKEOPTS="-j12 -l10.8"
MANPAGER="manpager"
MANPATH="/usr/share/gcc-data/x86_64-pc-linux-gnu/14/man:/usr/share/binutils-data/x86_64-pc-linux-gnu/2.44/man:/usr/local/share/man:/usr/share/man:/usr/lib/rust/man-bin-1.86.0:/usr/lib/rust/man-bin-1.87.0:/usr/lib/llvm/19/share/man"
MULTILIB_ABIS="amd64 x86"
MULTILIB_STRICT_DENY="64-bit.*shared object"
MULTILIB_STRICT_DIRS="/lib32 /lib /usr/lib32 /usr/lib /usr/kde/*/lib32 /usr/kde/*/lib /usr/qt/*/lib32 /usr/qt/*/lib /usr/X11R6/lib32 /usr/X11R6/lib"
MULTILIB_STRICT_EXEMPT="(perl5|gcc|binutils|eclipse-3|debug|portage|udev|systemd|clang|python-exec|llvm)"
NPM_CONFIG_GLOBALCONFIG="/etc/npm/npmrc"
OFFICE_IMPLEMENTATION="libreoffice"
OLDPWD="/root"
PAGER="/usr/bin/less"
PATH="/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/bin:/usr/lib/llvm/19/bin"
PERL_FEATURES="ithreads"
PHP_TARGETS="php8-2"
PKGDIR="/var/cache/binpkgs"
PORTAGE_ARCHLIST="alpha amd64 amd64-linux arm arm-linux arm64 arm64-linux arm64-macos hppa loong m68k mips ppc ppc-macos ppc64 ppc64-linux riscv riscv-linux s390 sparc x64-macos x64-solaris x86 x86-linux x86-macos"
PORTAGE_BIN_PATH="/usr/lib/portage/python3.13"
PORTAGE_COMPRESS_EXCLUDE_SUFFIXES="css gif htm[l]? jp[e]?g js pdf png"
PORTAGE_CONFIGROOT="/"
PORTAGE_DEBUG="0"
PORTAGE_DEPCACHEDIR="/var/cache/edb/dep"
PORTAGE_ELOG_CLASSES="log warn error"
PORTAGE_ELOG_MAILFROM="portage@localhost"
PORTAGE_ELOG_MAILSUBJECT="[portage] ebuild log for ${PACKAGE} on ${HOST}"
PORTAGE_ELOG_MAILURI="root"
PORTAGE_ELOG_SYSTEM="save_summary:log,warn,error,qa echo"
PORTAGE_FETCH_CHECKSUM_TRY_MIRRORS="5"
PORTAGE_FETCH_RESUME_MIN_SIZE="350K"
PORTAGE_GID="250"
PORTAGE_GPG_SIGNING_COMMAND="gpg --sign --digest-algo SHA256 --clearsign --yes --default-key "${PORTAGE_GPG_KEY}" --homedir "${PORTAGE_GPG_DIR}" "${FILE}""
PORTAGE_GRPNAME="portage"
PORTAGE_INST_GID="0"
PORTAGE_INST_UID="0"
PORTAGE_INTERNAL_CALLER="1"
PORTAGE_LOGDIR_CLEAN="find "${PORTAGE_LOGDIR}" -type f ! -name "summary.log*" -mtime +7 -delete"
PORTAGE_OVERRIDE_EPREFIX=""
PORTAGE_PYM_PATH="/usr/lib/python3.13/site-packages"
PORTAGE_PYTHONPATH="/usr/lib/python3.13/site-packages"
PORTAGE_QUIET="1"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --omit-dir-times --compress --force --whole-file --delete --stats --human-readable --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages --exclude=/.git"
PORTAGE_RSYNC_RETRIES="-1"
PORTAGE_SCHEDULING_POLICY="idle"
PORTAGE_SYNC_STALE="30"
PORTAGE_TMPDIR="/var/tmp"
PORTAGE_TRUST_HELPER="/usr/bin/getuto"
PORTAGE_USERNAME="portage"
PORTAGE_VERBOSE="1"
PORTAGE_WORKDIR_MODE="0700"
PORTAGE_XATTR_EXCLUDE="bcachefs.* bcachefs_effective.*    btrfs.* security.evm security.ima    security.selinux system.nfs4_acl user.apache_handler    user.Beagle.* user.dublincore.* user.mime_encoding user.xdg.*"
POSTGRES_TARGETS="postgres17"
PROFILE_ONLY_VARIABLES="ARCH ELIBC IUSE_IMPLICIT KERNEL USE_EXPAND_IMPLICIT USE_EXPAND_UNPREFIXED USE_EXPAND_VALUES_ARCH USE_EXPAND_VALUES_ELIBC USE_EXPAND_VALUES_KERNEL"
PWD="/root"
PYTHONDONTWRITEBYTECODE="1"
PYTHON_SINGLE_TARGET="python3_13"
PYTHON_TARGETS="python3_13"
QT_QPA_PLATFORMTHEME="qt5ct"
RESUMECOMMAND="wget -c -t 3 -T 60 --passive-ftp -O "${DISTDIR}/${FILE}" "${URI}""
RESUMECOMMAND_RSYNC="rsync -LtvP "${URI}" "${DISTDIR}/${FILE}""
RESUMECOMMAND_SSH="bash -c "x=\${2#ssh://} ; host=\${x%%/*} ; port=\${host##*:} ; host=\${host%:*} ; [[ \${host} = \${port} ]] && port= ; exec rsync --rsh=\"ssh \${port:+-p\${port}} \${3}\" -avP \"\${host}:/\${x#*/}\" \"\$1\"" rsync "${DISTDIR}/${FILE}" "${URI}" "${PORTAGE_SSH_OPTS}""
ROOT="/"
ROOTPATH="/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/bin:/usr/lib/llvm/19/bin"
RPMDIR="/var/cache/rpm"
RUBY_TARGETS="ruby32"
SHELL="/bin/zsh"
SHLVL="1"
SYMLINK_LIB="no"
SYSROOT="/"
TERM="xterm-kitty"
TWISTED_DISABLE_WRITING_OF_PLUGIN_CACHE="1"
UNINSTALL_IGNORE="/boot/dtbs/* /lib/modules/* /var/run /var/lock /bin /lib /lib32 /lib64 /libx32 /sbin /usr/sbin /usr/lib/modules/*"
USE="X a52 aac acl acpi alsa amd64 branding bzip2 cairo cdda cdr cet crypt dbus dri dts dvd dvdr encode exif flac gdbm gif gpm gtk gui iconv icu ipv6 jpeg lcms libnotify libtirpc mad mng mp3 mp4 mpeg multilib ncurses nls ogg opengl openmp pam pango pcre pdf png policykit ppds pulseaudio qml qt5 qt6 readline sdl seccomp sound spell ssl startup-notification svg systemd test-rust tiff truetype udev udisks unicode upower usb vorbis vulkan wayland wxwidgets x264 xattr xcb xft xml xv xvid zlib" ABI_X86="64" ADA_TARGET="gcc_14" APACHE2_MODULES="authn_core authz_core socache_shmcb unixd actions alias auth_basic authn_anon authn_dbm authn_file authz_dbm authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir env expires ext_filter file_cache filter headers include info log_config logio mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" CALLIGRA_FEATURES="karbon sheets words" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" CPU_FLAGS_X86="aes avx avx2 f16c fma3 mmx mmxext pclmul popcnt rdrand sse sse2 sse3 sse4_1 sse4_2 ssse3" ELIBC="glibc" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock greis isync itrax navcom oceanserver oncore rtcm104v2 rtcm104v3 sirf skytraq superstar2 tsip tripmate tnt ublox" GRUB_PLATFORMS="efi-64" GUILE_SINGLE_TARGET="3-0" GUILE_TARGETS="3-0" INPUT_DEVICES="libinput" KERNEL="linux" L10N="en en-US" LCD_DEVICES="bayrad cfontz glk hd44780 lb216 lcdm001 mtxorb text" LUA_SINGLE_TARGET="lua5-1" LUA_TARGETS="lua5-1" OFFICE_IMPLEMENTATION="libreoffice" PERL_FEATURES="ithreads" PHP_TARGETS="php8-2" POSTGRES_TARGETS="postgres17" PYTHON_SINGLE_TARGET="python3_13" PYTHON_TARGETS="python3_13" RUBY_TARGETS="ruby32" VIDEO_CARDS="nvidia" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipp2p iface geoip fuzzy condition tarpit sysrq proto logmark ipmark dhcpmac delude chaos account"
USER="root"
USERLAND="GNU"
USE_EXPAND="ABI_MIPS ABI_S390 ABI_X86 ADA_TARGET ALSA_CARDS AMDGPU_TARGETS APACHE2_MODULES APACHE2_MPMS CALLIGRA_FEATURES CAMERAS COLLECTD_PLUGINS CPU_FLAGS_ARM CPU_FLAGS_PPC CPU_FLAGS_X86 CURL_QUIC CURL_SSL ELIBC FFTOOLS GPSD_PROTOCOLS GRUB_PLATFORMS GUILE_SINGLE_TARGET GUILE_TARGETS INPUT_DEVICES KERNEL L10N LCD_DEVICES LIBREOFFICE_EXTENSIONS LLVM_SLOT LLVM_TARGETS LUA_SINGLE_TARGET LUA_TARGETS NGINX_MODULES_HTTP NGINX_MODULES_MAIL NGINX_MODULES_STREAM OFFICE_IMPLEMENTATION OPENMPI_FABRICS OPENMPI_OFED_FEATURES OPENMPI_RM PERL_FEATURES PHP_TARGETS POSTGRES_TARGETS PYTHON_SINGLE_TARGET PYTHON_TARGETS QEMU_SOFTMMU_TARGETS QEMU_USER_TARGETS RUBY_TARGETS SANE_BACKENDS UWSGI_PLUGINS VIDEO_CARDS VOICEMAIL_STORAGE XTABLES_ADDONS"
USE_EXPAND_HIDDEN="ABI_MIPS ABI_S390 CPU_FLAGS_ARM CPU_FLAGS_PPC ELIBC KERNEL"
USE_EXPAND_IMPLICIT="ARCH ELIBC KERNEL"
USE_EXPAND_UNPREFIXED="ARCH"
USE_EXPAND_VALUES_ARCH="alpha amd64 amd64-linux arm arm64 arm64-macos hppa loong m68k mips ppc ppc64 ppc64-linux ppc-macos riscv s390 sparc x64-macos x64-solaris x86 x86-linux"
USE_EXPAND_VALUES_ELIBC="bionic Darwin glibc mingw musl SunOS"
USE_EXPAND_VALUES_KERNEL="Darwin linux SunOS"
USE_ORDER="env:pkg:conf:defaults:pkginternal:features:repo:env.d"
VIDEO_CARDS="nvidia"
XAUTHORITY="/root/.xauthW7EE58"
XDG_CONFIG_DIRS="/etc/xdg"
XDG_DATA_DIRS="/usr/local/share:/usr/share"
XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipp2p iface geoip fuzzy condition tarpit sysrq proto logmark ipmark dhcpmac delude chaos account"
ac_cv_c_undeclared_builtin_options="none needed"
enable_year2038="no"
gl_cv_compiler_check_decl_option="-Werror=implicit-function-declaration"
gl_cv_func_getcwd_path_max="yes"


Any suggestions on how to troubleshoot this?

Note that I suspend my system at least once a day (while booted into Gentoo).
I might try either shutting down the system (that is, no longer suspending it) or using Arch for a few days, to see whether the issue occurs there as well.
An NVIDIA issue keeps me from resuming from suspend with 570 drivers on Arch though, so if the issue has anything to do with suspend, I'd have to use 550 from the AUR.

Thanks.


Last edited by eeckwrk99 on Fri Jun 06, 2025 12:41 pm; edited 1 time in total

NeddySeagoon
Posted: Wed May 21, 2025 7:05 pm

eeckwrk99,

smartmontools is not well adapted to nvme. Try
Code:
nvme smart-log /dev/nvme0n1

You will need sys-apps/nvme-cli
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

eeckwrk99
Posted: Wed May 21, 2025 7:08 pm

NeddySeagoon wrote:
eeckwrk99,

smartmontools is not well adapted to nvme. Try
Code:
nvme smart-log /dev/nvme0n1

You will need sys-apps/nvme-cli


Thanks for the heads-up.

Code:
nvme smart-log /dev/nvme0n1

Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning         : 0
temperature            : 104 °F (313 K)
available_spare            : 100%
available_spare_threshold      : 10%
percentage_used            : 19%
endurance group critical warning summary: 0
Data Units Read            : 29543659 (15.13 TB)
Data Units Written         : 34870990 (17.85 TB)
host_read_commands         : 355235692
host_write_commands         : 634996803
controller_busy_time         : 3589
power_cycles            : 4176
power_on_hours            : 2703
unsafe_shutdowns         : 568
media_errors            : 0
num_err_log_entries         : 3932
Warning Temperature Time      : 0
Critical Composite Temperature Time   : 0
Temperature Sensor 1         : 104 °F (313 K)
Temperature Sensor 2         : 109 °F (316 K)
Thermal Management T1 Trans Count   : 0
Thermal Management T2 Trans Count   : 0
Thermal Management T1 Total Time   : 0
Thermal Management T2 Total Time   : 0

zen_desu
Posted: Wed May 21, 2025 7:10 pm

My guess is a bad power supply or even a bad cable. NVMe drives can draw lots of power in bursts, and this could make the voltage sag enough that the drive goes offline. There is also some chance that a kernel update added new power-saving features that are too aggressive.

I had a similar issue, which I could reproduce by running a filesystem scrub on all of my NVMe drives: one would go down first every time, and the cause was a bad 24-pin cable.
_________________
µgRD dev
Wiki writer

NeddySeagoon
Posted: Wed May 21, 2025 7:20 pm

eeckwrk99,

On TLC flash media, you are usually guaranteed about 600 erase cycles.
With a 250 GB drive, that's roughly 150 TB written before your warranty expires.
You have 17.85 TB written, so you are well within the write life.
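
The endurance estimate is plain arithmetic: capacity times guaranteed erase cycles. A rough sketch, assuming the 600-cycle rule of thumb quoted above, decimal units, and the 250 GB capacity reported by smartctl (`endurance_tb` and `used_percent` are hypothetical helper names):

```shell
# Rough write-endurance estimate: capacity (GB) x guaranteed erase cycles.
# The 600-cycle figure is the rule of thumb from the post, not a datasheet value.
endurance_tb() {
    capacity_gb=$1
    cycles=$2
    # Integer TB, decimal units (1 TB = 1000 GB)
    echo $(( capacity_gb * cycles / 1000 ))
}

# Share of that write budget already consumed, in whole percent.
used_percent() {
    written_tb=$1   # whole TB written so far
    budget_tb=$2    # result of endurance_tb
    echo $(( written_tb * 100 / budget_tb ))
}

endurance_tb 250 600   # prints 150
used_percent 18 150    # 17.85 TB rounded up to 18; prints 12
```

With these assumptions the drive has used only about an eighth of its write budget, so wear-out is an unlikely explanation.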

Your dmesg includes the text
Code:
[82593.041548] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug

Add that mouthful to your kernel command line to disable power savings and see what happens.
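
How the parameters get onto the command line depends on the bootloader. A minimal sketch, assuming GRUB with its settings in /etc/default/grub and the generated config at /boot/grub/grub.cfg (both are assumptions about this system, and `append_params` is a hypothetical helper name):

```shell
# Sketch only: assumes GRUB; other bootloaders keep the command line elsewhere.
PARAMS='nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off'

# Print a GRUB_CMDLINE_LINUX line with the parameters appended to the
# existing value (passed in without its surrounding quotes).
append_params() {
    current=$1
    if [ -n "$current" ]; then
        printf 'GRUB_CMDLINE_LINUX="%s %s"\n' "$current" "$PARAMS"
    else
        printf 'GRUB_CMDLINE_LINUX="%s"\n' "$PARAMS"
    fi
}

append_params 'quiet'
# After putting the printed line into /etc/default/grub, regenerate the
# config as root and reboot:
#   grub-mkconfig -o /boot/grub/grub.cfg
```

The helper only prints the line, so nothing is modified until the file is edited by hand.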
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

Anon-E-moose
Posted: Wed May 21, 2025 7:43 pm

Find the hwmon entry for the device with "cat /sys/class/hwmon/*/name", note which directory it is under, and monitor the temperature while doing lots of I/O.

If temperatures are fine (long term), then it could very well be either power or hardware (the motherboard and/or the NVMe drive itself).
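
The hwmon lookup can be scripted. A minimal sketch, assuming the kernel exposes the drive as an hwmon device named "nvme" with `temp*_input` files in millidegrees Celsius and optional `temp*_label` files (`nvme_temps` is a hypothetical helper name):

```shell
# Print the temperature sensors of every NVMe hwmon device found under a
# base directory (default: /sys/class/hwmon).
nvme_temps() {
    base=${1:-/sys/class/hwmon}
    for dir in "$base"/hwmon*; do
        [ -r "$dir/name" ] || continue
        [ "$(cat "$dir/name")" = "nvme" ] || continue
        for input in "$dir"/temp*_input; do
            [ -r "$input" ] || continue
            # Skip sensors that cannot be read right now
            milli=$(cat "$input" 2>/dev/null) || continue
            label_file=${input%_input}_label
            if [ -r "$label_file" ]; then
                label=$(cat "$label_file")
            else
                label=$(basename "${input%_input}")
            fi
            # sysfs reports millidegrees; print whole degrees
            echo "$label: $(( milli / 1000 )) C"
        done
    done
    return 0
}

nvme_temps
```

Running it in a loop while the VMs are busy, for example `while sleep 2; do nvme_temps; done`, shows whether the temperature climbs before the drive drops out.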
_________________
UM780, 6.14 zen kernel, gcc 13, openrc, wayland

eeckwrk99
Posted: Wed May 21, 2025 8:47 pm

Thanks for the input, everyone.

For now, I've added the suggested options to the kernel command line:

Code:
cat /proc/cmdline

BOOT_IMAGE=/vmlinuz-6.12.21-gentoo-custom root=/dev/mapper/vg2-gentoo_root ro rd.luks.uuid=luks-... rd.lvm.lv=vg2/gentoo_root rd.lvm.lv=vg2/gentoo_swap root=... resume=UUID=... rd.luks.options=password-echo=no nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off quiet rw
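
To confirm that all three parameters actually reached the running kernel, they can be compared against /proc/cmdline. A minimal sketch (`missing_params` is a hypothetical helper name, not an existing tool):

```shell
# List which of the three power-saving overrides are absent from a kernel
# command line. Reads /proc/cmdline unless a string is passed in.
missing_params() {
    cmdline=${1:-$(cat /proc/cmdline)}
    for want in nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off; do
        # Pad with spaces so only whole parameters match
        case " $cmdline " in
            *" $want "*) ;;
            *) echo "$want" ;;
        esac
    done
}

# Example with an incomplete command line: reports the two missing parameters
missing_params 'quiet pcie_aspm=off'
```

Called without an argument it checks the running system, and empty output means nothing is missing. As far as I know, the effective NVMe latency limit can also be read from /sys/module/nvme_core/parameters/default_ps_max_latency_us.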


I still have my 6.1 and 6.6 .config files, so trying these kernel versions is still an option.

I've just fired up two VMs for about half an hour (after adding the above kernel parameters) and did my usual stuff on them while monitoring the NVMe temperatures with sys-process/bottom. The latter reported between 50°C and 55°C for "Composite" and "Sensor 1". "Sensor 2" temp varied between 60°C and 81°C, with an average of 67-70°C I'd say.

I noticed that the issue was most likely to occur after a resume from suspend to RAM so I'll see how it goes and report back.

eeckwrk99
Posted: Fri Jun 06, 2025 12:41 pm

After 15+ days with
Code:
nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off
added to the kernel parameters, the issue has not occurred again.

Marking as solved.

All times are GMT