Gentoo Forums
malfunctioning standbyscript: any ideas?
jpsollie
Guru


Joined: 17 Aug 2013
Posts: 324

Posted: Wed Apr 16, 2025 7:02 am    Post subject: malfunctioning standbyscript: any ideas?

To make my system more energy-efficient, I wrote a standby script that suspends my Gentoo-based NAS.
The idea is "send a WoL packet when you need the NAS, otherwise simply keep it suspended".
Before suspending, the script is meant to unmount all filesystems on software RAID and stop the RAID devices;
this avoids any "device kicked out" at resume, which is pretty annoying (and requires rebuilding the whole array).
After resume, the script is meant to:
"carefully wait for all SCSI devices to be up and running again (20 seconds), then carefully try to reassemble the arrays, and remount the filesystems on these md arrays if they were mounted previously"
So, here's the script:
Code:

#!/bin/bash
#umount filesystems having a /dev/md device or in /mnt
mp=$(cat /etc/fstab | grep -E "(/mnt|/dev/md)" | cut -d$'\t' -f2)
wasmounted=()
for i in $mp; do if [[ "$(grep $i /proc/mounts)" ]]; then umount $i; wasmounted+=("$i"); fi; done
#stop md raid arrays
for i in /dev/md[0-9]; do mdadm --stop $i; done
# print date for debugging,
# sleep and wait 20s after wakeup, so all sd* devices on the scsi bus have been reattached properly
date
echo mem > /sys/power/state
sleep 20
#reassemble raid arrays
for i in $(cat /etc/mdadm.conf | grep '^ARRAY' | cut -d' ' -f2 ); do mdadm --assemble --scan --no-degraded $i; done
#remount devices
for i in ${wasmounted[@]}; do mount $i; done

When launched from an SSH session, it works perfectly: if I send a WoL packet to the NAS, it wakes up and everything is the way it should be.
But ... from crontab, it does not!
I attach a dmesg covering 3 attempts by this script to suspend the system:
https://pastebin.com/E8Ndwr2x
The first (and failed) suspend script execution is at line 2402, ending resume at line :
Code:

[26419.594009] [ T9148] bcachefs (b1fe4470-e6b8-4aab-a557-998404619502): clean shutdown complete, journal seq 1853520
...
[26559.452664] [ T9703] bcachefs (fde6c4aa-7e4c-4429-b7e0-98a10224feb4): delete_dead_inodes... done

The second one (where I had to wake up the system via a WoL packet) (lines 3171 -> 3925):
Code:

[55650.925550] [T14940] bcachefs (b1fe4470-e6b8-4aab-a557-998404619502): clean shutdown complete, journal seq 1853525
...
[55818.422715] [T15559] bcachefs (fde6c4aa-7e4c-4429-b7e0-98a10224feb4): delete_dead_inodes... done

And the third one (where I left > 5 minutes between shutdown and resume, just to be sure) (lines 3936 -> 4662):
Code:

[56105.726554] [T15679] bcachefs (b1fe4470-e6b8-4aab-a557-998404619502): clean shutdown complete, journal seq 1853526
...
[56169.457662] [T16289] bcachefs (fde6c4aa-7e4c-4429-b7e0-98a10224feb4): delete_dead_inodes... done


Thinking "hey, this works properly", I put it in /etc/crontab:
Code:

# Global variables
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
HOME=/

# For details see man 5 crontab

# Example of job definition:
# .---------------- minute (0 - 59)
# |  .------------- hour (0 - 23)
# |  |  .---------- day of month (1 - 31)
# |  |  |  .------- month (1 - 12) OR jan,feb,mar,apr ...
# |  |  |  |  .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# |  |  |  |  |
# *  *  *  *  * user-name  command to be executed
*/5 * * * * root anacron -s
30 23 * * * root /opt/standbyscript.sh
...


... and in the morning, I saw that the suspend operation (1st attempt in my dmesg log) had failed: the system was completely up and running.
Does anybody have an idea why the command doesn't work from crontab but does work when executed via SSH?
Any idea would be welcome, even the weirdest one!

*EDIT* To check how long it stayed suspended, I checked my dhcpd log (its log level is not very useful, but it is what it currently is):
Code:

Apr 15 23:23:10 linuxserver dhcpd[4032]: DHCPACK on 192.168.177.102 to 10:92:66:88:ed:d6 via eth0
Apr 15 23:29:11 linuxserver dhcpd[4032]: DHCPDISCOVER from 00:68:eb:37:78:0f (portablejp) via eth0
Apr 15 23:29:11 linuxserver dhcpd[4032]: DHCPOFFER on 192.168.177.84 to 00:68:eb:37:78:0f (portablejp) via eth0
Apr 15 23:29:11 linuxserver dhcpd[4032]: DHCPREQUEST for 192.168.177.84 (192.168.177.29) from 00:68:eb:37:78:0f (portablejp) via eth0
Apr 15 23:29:11 linuxserver dhcpd[4032]: DHCPACK on 192.168.177.84 to 00:68:eb:37:78:0f (portablejp) via eth0
Apr 15 23:29:11 linuxserver dhcpd[4032]: bind update on 192.168.177.84 from costadelsollie rejected: incoming update is less critical than outgoing update
Apr 15 23:29:11 linuxserver dhcpd[4032]: Added new forward map from portablejp.costadelsollie.home.arpa to 192.168.177.84
Apr 15 23:29:11 linuxserver dhcpd[4032]: Added reverse map from 84.177.168.192.in-addr.arpa. to portablejp.costadelsollie.home.arpa
Apr 15 23:29:13 linuxserver dhcpd[4094]: Information-request message from fe80::4ac:53e7:ebcc:d185 port 546, transaction ID 0x5A9C1100
Apr 15 23:29:13 linuxserver dhcpd[4094]: Sending Reply to fe80::4ac:53e7:ebcc:d185 port 546
Apr 15 23:59:50 linuxserver dhcpd[4032]: timeout waiting for failover peer costadelsollie
Apr 15 23:59:50 linuxserver dhcpd[4032]: peer costadelsollie: disconnected
Apr 15 23:59:51 linuxserver dhcpd[4032]: failover peer costadelsollie: I move from normal to communications-interrupted
Apr 15 23:59:51 linuxserver dhcpd[4032]: DHCPDISCOVER from dc:e5:5b:6a:06:5a via eth0
Apr 15 23:59:51 linuxserver dhcpd[4032]: DHCPOFFER on 192.168.177.176 to dc:e5:5b:6a:06:5a via eth0
Apr 15 23:59:51 linuxserver dhcpd[4032]: DHCPREQUEST for 192.168.177.176 (192.168.177.30) from dc:e5:5b:6a:06:5a via eth0
Apr 15 23:59:51 linuxserver dhcpd[4032]: DHCPACK on 192.168.177.176 to dc:e5:5b:6a:06:5a via eth0
Apr 15 23:59:51 linuxserver dhcpd[4032]: failover peer costadelsollie: peer moves from normal to communications-interrupted
Apr 15 23:59:51 linuxserver dhcpd[4032]: failover peer costadelsollie: I move from communications-interrupted to normal
Apr 15 23:59:51 linuxserver dhcpd[4032]: balancing pool 556faf3570c0 192.168.177.0/24  total 121  free 45  backup 67  lts 11  max-own (+/-)11
Apr 15 23:59:51 linuxserver dhcpd[4032]: balanced pool 556faf3570c0 192.168.177.0/24  total 121  free 45  backup 67  lts 11  max-misbal 17
Apr 15 23:59:51 linuxserver dhcpd[4032]: Sending updates to costadelsollie.
Apr 15 23:59:51 linuxserver dhcpd[4032]: failover peer costadelsollie: peer moves from communications-interrupted to normal
Apr 15 23:59:51 linuxserver dhcpd[4032]: failover peer costadelsollie: Both servers normal


So it looks like the system went to sleep for about half an hour ... and then woke up.
So, what happened here?
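
For anyone who wants to put a number on that gap: a minimal sketch, assuming GNU date, that subtracts the last timestamp before the silence in the dhcpd log from the first one after it. The two timestamps are copied from the log above; the variable names are only illustrative.

```shell
#!/bin/bash
# Timestamps copied from the dhcpd log above: the last line before the
# silence and the first line after it.
before="Apr 15 23:29:13"
after="Apr 15 23:59:50"

# GNU date is assumed here; -u avoids any DST surprises.
# The log carries no year, so date fills in the current one.
gap=$(( $(date -u -d "$after" +%s) - $(date -u -d "$before" +%s) ))

printf 'log silent for %d min %d s\n' "$((gap / 60))" "$((gap % 60))"
```

With these two timestamps the gap works out to 30 min 37 s. Note that a silent dhcpd log only suggests the machine was suspended; it does not prove it.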

*EDIT2*:
Thinking "it may be an event triggered after half an hour, or every hour on the hour (0:00, 1:00, ...)", I put the system to sleep at 9:25 and woke it up via WoL at 10:10, so it was probably not a timeout issue.
Running inside screen:
Code:

linuxserver /var/log # sleep 30; echo "beginning standby"; date; /opt/standbyscript.sh; echo "ending standby"; date;
beginning standby
wo 16 apr 2025 09:25:34 CEST
mdadm: stopped /dev/md0
mdadm: stopped /dev/md1
mdadm: stopped /dev/md2
mdadm: stopped /dev/md3
wo 16 apr 2025 09:27:03 CEST
mdadm: /dev/md0 has been started with 4 drives and 1 spare and 1 journal.
mdadm: /dev/md2 has been started with 5 drives.
mdadm: /dev/md3 has been started with 6 drives and 5 spares and 1 journal.
mdadm: /dev/md1 has been started with 10 drives and 2 spares and 1 journal.
ending standby
wo 16 apr 2025 10:10:29 CEST

_________________
The power of Gentoo optimization (not overclocked): [img]https://www.passmark.com/baselines/V10/images/503714802842.png[/img]
RumpletonBongworth
Tux's lil' helper


Joined: 17 Jun 2024
Posts: 104

Posted: Thu May 08, 2025 2:35 pm

I can discern no obvious reason for the script to behave differently where executed by crond. That being said, I have a few suggestions. Perhaps one of them will help you to get to the bottom of the matter.

Firstly, you can trace your script's execution by enabling xtrace.

Code:
#!/bin/bash
PS4='+$BASH_SOURCE:$LINENO:$FUNCNAME: '
set -x

The resulting diagnostic messages will be conveyed to STDERR. Owing to the fact that your crontab defines MAILTO=root, crond will attempt to deliver this output to the root user's mailbox by invoking sendmail(1). However, if you do not have a working sendmail implementation, or if it has not been configured correctly, this output may be lost, or will perhaps end up as a "dead.letter" in the home directory of the applicable user. Still, you are free to have your script direct its STDERR to wherever you please. For example:

Code:
exec 2>>"$HOME/standbyscript.log"

Alternatively, both STDOUT and STDERR:
Code:
exec >>"$HOME/standbyscript.log" 2>&1

Secondly, it might be interesting to determine whether there is any difference in behaviour in the case that your script is backgrounded and disowned by the initial SHELL that crond spawns:

Code:
30 23 * * * root /opt/standbyscript.sh & disown

Thirdly, your script has several defects. You may determine what most of them are by evaluating it at shellcheck.net. Here is a rewrite that should be a little more robust, and which also introduces (some) error checking. In turn, that may help you to debug more effectively.
Code:
#!/bin/bash

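# md_devs: md arrays to stop; mounted: mount points to unmount and later remount
declare -a md_devs mounted

# findmnt -r hex-escapes unsafe characters (e.g. a space becomes \x20)
while read -r source target; do
   # decode those escapes back into the literal path
   printf -v target %b "$target"
   if [[ $source == /dev/md+([0-9]) ]]; then
      md_devs+=("$source")
   elif [[ $target != /mnt/* ]]; then
      # neither on an md device nor under /mnt: leave it alone
      continue
   fi
   mounted+=("$target")
done < <(findmnt --real -rno source,target)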

for i in "${mounted[@]}"; do
   umount "$i" || exit
done

for i in "${md_devs[@]}"; do
   mdadm --stop "$i" || exit
done

date
echo mem > /sys/power/state
sleep 20

for i in "${md_devs[@]}"; do
   mdadm --assemble --scan --no-degraded "$i" || exit
done

for i in "${mounted[@]}"; do
   mount "$i" || exit
done
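
To illustrate one class of defect that shellcheck flags in the original script (unquoted expansions), here is a small sketch. The mount point is hypothetical, written the way findmnt -r would print a path containing a space. It decodes the escape as the rewrite does, then counts how many words a loop sees with and without quoting.

```shell
#!/bin/bash
# Hypothetical mount point, as "findmnt -r" would print a path with a space.
raw='/mnt/my\x20data'

# Decode the hex escape the same way the rewrite does.
printf -v target %b "$raw"

# Unquoted expansion: the shell splits the path into two words.
unquoted=0
for i in $target; do unquoted=$((unquoted + 1)); done

# Quoted array expansion: the path stays one word.
mounted=("$target")
quoted=0
for i in "${mounted[@]}"; do quoted=$((quoted + 1)); done

printf 'unquoted: %d word(s), quoted: %d word(s)\n' "$unquoted" "$quoted"
```

The unquoted loop sees two words and the quoted one sees one, which is why the rewrite keeps targets in an array and expands it as "${mounted[@]}".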
szatox
Advocate


Joined: 27 Aug 2013
Posts: 3650

Posted: Thu May 08, 2025 4:18 pm

Quote:
I can discern no obvious reason for the script to behave differently where executed by crond.
Cron runs its jobs with a very minimal env. This is a pretty damn good reason for things to break. Notably PATH doesn't cover all directories you might expect.

jpsollie, if your script prints anything at all to stdout or stderr, cron should collect it and email to its owner account (or whatever address you defined). Check that mail, or modify your script (or crontab) to log it to a file to find out what's wrong.

You might also try adding -l at the end of your shebang. If it's about env, going through a proper setup might be all you need to correct it.
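
If you want to see whether the environment really differs, one rough way is to snapshot it in both contexts and compare. This is only a sketch: the function names and the default of taking the file name as the first argument are made up for illustration, and values containing newlines would confuse the comparison.

```shell
#!/bin/bash
# snapshot_env FILE: save the current environment, sorted, one VAR=value per line.
snapshot_env() {
   env | LC_ALL=C sort > "$1"
}

# compare_env A B: print only the lines that appear in one snapshot but not the other.
compare_env() {
   LC_ALL=C comm -3 "$1" "$2"
}

# Hypothetical usage: run once from an SSH session and once from a cron job,
# each time with a different file name, then compare the two files.
if [[ -n ${1-} ]]; then
   snapshot_env "$1"
fi
```

Lines unique to the first snapshot are printed flush left, lines unique to the second are indented by a tab.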
_________________
Make Computing Fun Again
RumpletonBongworth
Tux's lil' helper


Joined: 17 Jun 2024
Posts: 104

Posted: Thu May 08, 2025 4:25 pm

szatox wrote:
Quote:
I can discern no obvious reason for the script to behave differently where executed by crond.
Cron runs its jobs with a very minimal env. This is a pretty damn good reason for things to break. Notably PATH doesn't cover all directories you might expect.

The provided crontab(5) defines a reasonable default PATH.

Code:
PATH=/sbin:/bin:/usr/sbin:/usr/bin

None of the utilities executed by the script should be especially sensitive to the leaner environment created by crond.
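
That can be checked rather than taken on trust. A minimal sketch, assuming the PATH from the posted crontab and the external utilities named in the scripts above; resolve_in_path is a hypothetical helper, not an existing tool.

```shell
#!/bin/bash
# resolve_in_path PATHVALUE CMD: print where CMD resolves under PATHVALUE,
# or return non-zero if it is not found there.
resolve_in_path() {
   ( PATH=$1; command -v "$2" )
}

# PATH copied from the crontab in the first post.
cron_path=/sbin:/bin:/usr/sbin:/usr/bin

# External utilities called by the standby script.
for cmd in findmnt umount mdadm date sleep mount; do
   if where=$(resolve_in_path "$cron_path" "$cmd"); then
      printf '%-8s %s\n' "$cmd" "$where"
   else
      printf '%-8s NOT FOUND in %s\n' "$cmd" "$cron_path"
   fi
done
```

Any line reporting NOT FOUND would point at a PATH problem; if every utility resolves, the cause presumably lies elsewhere.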
All times are GMT