Ubuntu 20.04 LTS freezes frequently

#1

Post by starrysky76 » 2020-08-01 9:30

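My Ubuntu 20.04 LTS machine freezes after running deep learning programs for a while.
CPU: Intel Core i7-10700K
GPU: TITAN RTX
GPU driver: NVRM version: NVIDIA UNIX x86_64 Kernel Module 440.95.01 Thu May 28 07:03:08 UTC 2020
GCC version: gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2)
(Before last night's freeze, under Additional Drivers in Software & Updates, I switched from "NVIDIA driver metapackage from nvidia-driver-440 (proprietary, tested)" to "NVIDIA server driver metapackage from nvidia-driver-440 (proprietary)".)
It happens about once or twice a week. I found the most recent freeze this morning: the screen was stuck at about 1:40 last night. During a freeze the mouse, keyboard, and screen are all unresponsive. The "REISUB" trick sometimes works and sometimes doesn't; I tried it this morning and it did nothing. Below are the logs from around 1:40 last night. Could you take a look and tell me what the problem might be?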


8月 01 01:39:21 pjc-MS-7C71 anacron[830]: Job `cron.daily' started
8月 01 01:39:21 pjc-MS-7C71 anacron[3763]: Updated timestamp for job `cron.daily' to 2020-08-01
8月 01 01:39:21 pjc-MS-7C71 cracklib[3790]: no dictionary update necessary.
8月 01 01:40:02 pjc-MS-7C71 anacron[830]: Job `cron.daily' terminated
8月 01 01:40:02 pjc-MS-7C71 anacron[830]: Normal exit (1 job run)
8月 01 01:40:02 pjc-MS-7C71 systemd[1]: anacron.service: Succeeded.
8月 01 01:40:03 pjc-MS-7C71 PackageKit[1318]: daemon quit
8月 01 01:40:03 pjc-MS-7C71 systemd[1]: packagekit.service: Succeeded.
8月 01 01:40:41 pjc-MS-7C71 kernel: invalid opcode: 0000 [#1] SMP NOPTI
8月 01 01:40:41 pjc-MS-7C71 kernel: CPU: 12 PID: 1896 Comm: gnome-shell Tainted: P OE 5.4.0-42-generic #46-Ubuntu
8月 01 01:40:41 pjc-MS-7C71 kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C71/MEG Z490 UNIFY (MS-7C71), BIOS A.21 0>
8月 01 01:40:41 pjc-MS-7C71 kernel: RIP: 0010:policy_node+0x29/0x40
8月 01 01:40:41 pjc-MS-7C71 kernel: Code: 00 0f 1f 44 00 00 55 89 d0 0f b7 56 04 48 89 e5 66 83 fa 01 74 14 66 83 fa 02 74 02 5d c3>
8月 01 01:40:41 pjc-MS-7C71 kernel: RSP: 0000:ffffc16e83693d20 EFLAGS: 00010246
8月 01 01:40:41 pjc-MS-7C71 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00002cd1c1859000
8月 01 01:40:41 pjc-MS-7C71 kernel: RDX: 0000000000000001 RSI: ffffffff8d6c8200 RDI: 0000000000100dca
8月 01 01:40:41 pjc-MS-7C71 kernel: RBP: ffffc16e83693d20 R08: ffffffff8d6c8200 R09: 0000000000000000
8月 01 01:40:41 pjc-MS-7C71 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff8d6c8200
8月 01 01:40:41 pjc-MS-7C71 kernel: R13: 0000000000100dca R14: 0000000000000000 R15: 00002cd1c1859000
8月 01 01:40:41 pjc-MS-7C71 kernel: FS: 00007ff4a2558cc0(0000) GS:ffff9d765db00000(0000) knlGS:0000000000000000
8月 01 01:40:41 pjc-MS-7C71 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
8月 01 01:40:41 pjc-MS-7C71 kernel: CR2: 00002cd1c1859020 CR3: 0000000fb9712004 CR4: 00000000007606e0
8月 01 01:40:41 pjc-MS-7C71 kernel: PKRU: 55555554
8月 01 01:40:41 pjc-MS-7C71 kernel: Call Trace:
8月 01 01:40:41 pjc-MS-7C71 kernel: alloc_pages_vma+0x6f/0x200
8月 01 01:40:41 pjc-MS-7C71 kernel: do_anonymous_page+0x118/0x650
8月 01 01:40:41 pjc-MS-7C71 kernel: __handle_mm_fault+0x760/0x7a0
8月 01 01:40:41 pjc-MS-7C71 kernel: handle_mm_fault+0xca/0x200
8月 01 01:40:41 pjc-MS-7C71 kernel: do_user_addr_fault+0x1f9/0x450
8月 01 01:40:41 pjc-MS-7C71 kernel: __do_page_fault+0x58/0x90
8月 01 01:40:41 pjc-MS-7C71 kernel: do_page_fault+0x2c/0xe0
8月 01 01:40:41 pjc-MS-7C71 kernel: page_fault+0x34/0x40
8月 01 01:40:41 pjc-MS-7C71 kernel: RIP: 0033:0x7ff4a7a1d6d9
8月 01 01:40:41 pjc-MS-7C71 kernel: Code: 48 81 ea 80 00 00 00 c5 fd 7f 07 c5 fd 7f 4f 20 c5 fd 7f 57 40 c5 fd 7f 5f 60 48 81 c7 80>
8月 01 01:40:41 pjc-MS-7C71 kernel: RSP: 002b:00007fff332dc2f8 EFLAGS: 00010283
8月 01 01:40:41 pjc-MS-7C71 kernel: RAX: 00002cd1c1858cc0 RBX: 00007fff332dc808 RCX: 00002cd1c1859020
8月 01 01:40:41 pjc-MS-7C71 kernel: RDX: 0000000000000060 RSI: 0000564859cd2ea0 RDI: 00002cd1c1858fe0
8月 01 01:40:41 pjc-MS-7C71 kernel: RBP: 00007fff332dc340 R08: ffffffffffffffe0 R09: 0000000000000001
8月 01 01:40:41 pjc-MS-7C71 kernel: R10: 00007fff332dc808 R11: 00002cd1c1858cc0 R12: 00002cd1c1858cc0
8月 01 01:40:41 pjc-MS-7C71 kernel: R13: 0000564856a5cce0 R14: 000030d815907c40 R15: 00002cd1c1858cb0
8月 01 01:40:41 pjc-MS-7C71 kernel: Modules linked in: rfcomm ccm cmac algif_hash algif_skcipher af_alg bnep snd_hda_codec_hdmi int>
8月 01 01:40:41 pjc-MS-7C71 kernel: ecc ucsi_ccg ipmi_msghandler typec_ucsi snd fb_sys_fops typec syscopyarea sysfillrect sysimgbl>
8月 01 01:40:41 pjc-MS-7C71 kernel: ---[ end trace 10a27326485a922b ]---
8月 01 01:40:41 pjc-MS-7C71 kernel: RIP: 0010:policy_node+0x29/0x40
8月 01 01:40:41 pjc-MS-7C71 kernel: Code: 00 0f 1f 44 00 00 55 89 d0 0f b7 56 04 48 89 e5 66 83 fa 01 74 14 66 83 fa 02 74 02 5d c3>
8月 01 01:40:41 pjc-MS-7C71 kernel: RSP: 0000:ffffc16e83693d20 EFLAGS: 00010246
8月 01 01:40:41 pjc-MS-7C71 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00002cd1c1859000
8月 01 01:40:41 pjc-MS-7C71 kernel: RDX: 0000000000000001 RSI: ffffffff8d6c8200 RDI: 0000000000100dca
8月 01 01:40:41 pjc-MS-7C71 kernel: RBP: ffffc16e83693d20 R08: ffffffff8d6c8200 R09: 0000000000000000
8月 01 01:40:41 pjc-MS-7C71 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff8d6c8200
8月 01 01:40:41 pjc-MS-7C71 kernel: R13: 0000000000100dca R14: 0000000000000000 R15: 00002cd1c1859000
8月 01 01:40:41 pjc-MS-7C71 kernel: FS: 00007ff4a2558cc0(0000) GS:ffff9d765db00000(0000) knlGS:0000000000000000
8月 01 01:40:41 pjc-MS-7C71 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
8月 01 01:40:41 pjc-MS-7C71 kernel: CR2: 00002cd1c1859020 CR3: 0000000fb9712004 CR4: 00000000007606e0
8月 01 01:40:41 pjc-MS-7C71 kernel: PKRU: 55555554
8月 01 01:40:43 pjc-MS-7C71 kernel: mce: [Hardware Error]: Machine check events logged

8月 01 01:43:22 pjc-MS-7C71 kernel: mce: [Hardware Error]: Machine check events logged
8月 01 01:43:28 pjc-MS-7C71 kernel: mce: [Hardware Error]: Machine check events logged


8月 01 01:45:04 pjc-MS-7C71 dbus-daemon[1609]: [session uid=1000 pid=1609] Activating via systemd: service name='org.freedesktop.Tracker1' unit='tracker-store.service' requested by ':1.1' (uid=1000 pid=1599 comm="/usr/libexec/tracker-miner-fs " label="unconfined")
8月 01 01:45:04 pjc-MS-7C71 systemd[1587]: Starting Tracker metadata database store and lookup manager...
8月 01 01:45:04 pjc-MS-7C71 dbus-daemon[1609]: [session uid=1000 pid=1609] Successfully activated service 'org.freedesktop.Tracker1'
8月 01 01:45:04 pjc-MS-7C71 systemd[1587]: Started Tracker metadata database store and lookup manager.
8月 01 01:45:04 pjc-MS-7C71 variety.desktop[4094]: No such schema “org.cinnamon.desktop.background”
8月 01 01:45:04 pjc-MS-7C71 dbus-daemon[1609]: [session uid=1000 pid=1609] Activating via systemd: service name='org.freedesktop.Tracker1.Miner.Extract' unit='tracker-extract.service' requested by ':1.1' (uid=1000 pid=1599 comm="/usr/libexec/tracker-miner-fs " label="unconfined")
8月 01 01:45:04 pjc-MS-7C71 systemd[1587]: Starting Tracker metadata extractor...
8月 01 01:45:04 pjc-MS-7C71 tracker-extract[4101]: Set scheduler policy to SCHED_IDLE
8月 01 01:45:04 pjc-MS-7C71 tracker-extract[4101]: Setting priority nice level to 19
8月 01 01:45:04 pjc-MS-7C71 dbus-daemon[1609]: [session uid=1000 pid=1609] Successfully activated service 'org.freedesktop.Tracker1.Miner.Extract'
8月 01 01:45:04 pjc-MS-7C71 systemd[1587]: Started Tracker metadata extractor.
8月 01 01:45:14 pjc-MS-7C71 systemd[1587]: tracker-extract.service: Succeeded.
8月 01 01:45:24 pjc-MS-7C71 kernel: mce: [Hardware Error]: Machine check events logged
8月 01 01:45:34 pjc-MS-7C71 tracker-store[4072]: OK
8月 01 01:45:34 pjc-MS-7C71 systemd[1587]: tracker-store.service: Succeeded.
8月 01 01:45:44 pjc-MS-7C71 kernel: mce: [Hardware Error]: Machine check events logged



8月 01 01:47:03 pjc-MS-7C71 kernel: mce: [Hardware Error]: Machine check events logged
8月 01 01:48:02 pjc-MS-7C71 kernel: mce: [Hardware Error]: Machine check events logged
8月 01 01:48:13 pjc-MS-7C71 kernel: mce: [Hardware Error]: Machine check events logged
8月 01 01:49:09 pjc-MS-7C71 kernel: mce: [Hardware Error]: Machine check events logged
8月 01 01:49:21 pjc-MS-7C71 systemd[1]: Starting Cleanup of Temporary Directories...
8月 01 01:49:21 pjc-MS-7C71 systemd[1]: systemd-tmpfiles-clean.service: Succeeded.
8月 01 01:49:21 pjc-MS-7C71 systemd[1]: Finished Cleanup of Temporary Directories.


8月 01 01:50:20 pjc-MS-7C71 kernel: mce: [Hardware Error]: Machine check events logged
8月 01 01:50:23 pjc-MS-7C71 kernel: mce: [Hardware Error]: Machine check events logged
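
The excerpt above contains one kernel oops (`invalid opcode ... [#1]`) followed by repeated `mce: [Hardware Error]` lines. A minimal sketch for tallying them from saved journal text, assuming the line layout shown above; `LINE_RE` and `summarize` are names made up for this illustration, not part of any tool.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt above:
# "<month> <day> HH:MM:SS <host> <source>[pid]: <message>"
LINE_RE = re.compile(
    r"^\S+\s+\d+\s+(\d{2}:\d{2}):\d{2}\s+\S+\s+([^:\[\s]+)(?:\[\d+\])?:\s*(.*)$"
)


def summarize(lines):
    """Return (first_oops_message, mce_count_per_minute) for journal lines."""
    first_oops = None
    mce_per_minute = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or match.group(2) != "kernel":
            continue  # skip blank lines and non-kernel sources
        minute, _source, message = match.groups()
        # Kernel oops headers carry a counter such as "[#1]".
        if first_oops is None and re.search(r"\[#\d+\]", message):
            first_oops = message
        if "mce: [Hardware Error]" in message:
            mce_per_minute[minute] += 1
    return first_oops, dict(mce_per_minute)


sample = """\
8月 01 01:40:41 pjc-MS-7C71 kernel: invalid opcode: 0000 [#1] SMP NOPTI
8月 01 01:40:43 pjc-MS-7C71 kernel: mce: [Hardware Error]: Machine check events logged
8月 01 01:43:22 pjc-MS-7C71 kernel: mce: [Hardware Error]: Machine check events logged
8月 01 01:43:28 pjc-MS-7C71 kernel: mce: [Hardware Error]: Machine check events logged
""".splitlines()

oops, mce = summarize(sample)
print(oops)
print(mce)
```

The `mce` lines only say that machine-check events were logged, not what they were, so a count like this only shows how often they occur.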

#2

Post by starrysky76 » 2020-08-01 9:44

Here is the relevant system information:
pjc@pjc-MS-7C71:~$ sudo cat /etc/os-release
[sudo] password for pjc:
NAME="Ubuntu"
VERSION="20.04.1 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.1 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and- ... acy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
pjc@pjc-MS-7C71:~$ sudo lshw -numeric -class video
*-display
description: VGA compatible controller
product: TU102 [TITAN RTX] [10DE:1E02]
vendor: NVIDIA Corporation [10DE]
physical id: 0
bus info: pci@0000:01:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
configuration: driver=nvidia latency=0
resources: irq:205 memory:b3000000-b3ffffff memory:a0000000-afffffff memory:b0000000-b1ffffff ioport:4000(size=128) memory:b4000000-b407ffff
pjc@pjc-MS-7C71:~$ lsmod | grep -P "(video|drm)"
nvidia_drm 45056 6
nvidia_modeset 1114112 9 nvidia_drm
drm_kms_helper 184320 1 nvidia_drm
fb_sys_fops 16384 1 drm_kms_helper
syscopyarea 16384 1 drm_kms_helper
sysfillrect 16384 1 drm_kms_helper
sysimgblt 16384 1 drm_kms_helper
drm 491520 9 drm_kms_helper,nvidia_drm
video 49152 0
pjc@pjc-MS-7C71:~$ echo $DESKTOP_SESSION
ubuntu
pjc@pjc-MS-7C71:~$ echo $XDG_SESSION_TYPE
x11

#3

Post by starrysky76 » 2020-08-01 10:10

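Thanks for the suggestion. I have tried that before, but after uninstalling the proprietary driver, CUDA and cuDNN stop working and my PyTorch programs throw errors. Web pages also show screen tearing in the browser.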
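
For reference, a minimal sketch for checking whether PyTorch can still see CUDA and cuDNN after a driver change. It assumes PyTorch may or may not be importable in the current environment; `cuda_status` is a hypothetical helper name used only here.

```python
def cuda_status():
    """Report what PyTorch can see on this machine as a plain dict."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {
            "torch_installed": False,
            "cuda_available": False,
            "cudnn_available": False,
            "device": None,
        }
    cuda_ok = torch.cuda.is_available()
    return {
        "torch_installed": True,
        "cuda_available": cuda_ok,
        "cudnn_available": torch.backends.cudnn.is_available(),
        # Only query the device name when a CUDA device is usable.
        "device": torch.cuda.get_device_name(0) if cuda_ok else None,
    }


status = cuda_status()
print(status)
```

If `cuda_available` comes back `False` while the proprietary driver is removed, that would match the errors described above.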

#4

Post by xenomorph0525 » 2020-08-01 12:21

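My freezes on 20.04 were caused by a conflict between the input method and a particular application. After switching to Fcitx the problem was half solved: now only that application hangs. You might consider switching your input method or your input method framework (the culprit is probably the input method itself, but switching frameworks may get you halfway there).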