Strange behavior when running algorithms

Vulnerability scanning, gateways, firewalls, patch upgrades, data backup and migration, system troubleshooting
lqfie
Posts: 18
Registered: 2017-08-01 18:31
System: win10
Thanks given: 0
Thanks received: 0

Strange behavior when running algorithms

#1

Post by lqfie » 2018-08-07 12:23

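System: Linux titanxp-desktop 4.15.0-30-generic #32~16.04.1-Ubuntu SMP Thu Jul 26 20:25:39 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
RAM: 32 GB
Disks: 250 GB SSD, 3 TB SATA
GPU: NVIDIA TiTanXP, reference design with a blower fan (later converted to water cooling)
GPU driver: NVIDIA-Linux-x86_64-390.77.run
Installed software: CUDA 9.0, cudnn9.0, the Python build of TensorFlow

Symptoms:
1. The machine frequently hangs while running the model/algorithm. When it hangs, the router connected to the TiTanXP machine hangs as well, i.e. the network goes down. After a reboot the network recovers.
2. nvidia-smi showed a recorded GPU temperature of 89°C (while the card still had the blower fan). After switching to water cooling and reinstalling the system, the temperature dropped to 59°C, but the machine still frequently hangs when running the algorithm.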
Fri Aug  3 18:23:01 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.77                 Driver Version: 390.77                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:01:00.0  On |                  N/A |
| 35%   59C    P2   238W / 250W |  11773MiB / 12194MiB |     40%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1861      G   /usr/lib/xorg/Xorg                           103MiB |
|    0      2398      G   /opt/teamviewer/tv_bin/TeamViewer              3MiB |
|    0      2616      G   compiz                                       143MiB |
|    0     11747      C   python3                                    11519MiB |
+-----------------------------------------------------------------------------+

3. Output of dmesg | grep NV:
[ 0.004000] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[ 0.618965] rtc_cmos 00:04: alarms up to one month, y3k, 242 bytes nvram, hpet irqs
[ 1.902320] nvidia: loading out-of-tree module taints kernel.
[ 1.902324] nvidia: module license 'NVIDIA' taints kernel.
[ 1.904907] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 1.909199] nvidia-nvlink: Nvlink Core is being initialized, major device number 238
[ 1.909383] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 1.936579] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 390.77 Tue Jul 10 22:10:46 PDT 2018
[ 1.942002] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 1.942003] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
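4. Today, after the machine hung while running the algorithm and I rebooted it, the driver was gone. On the login screen, entering my password logged me straight back out to the login screen, so I had to reinstall the driver.


Could anyone advise what the problem might be, and how I should track down what is causing these frequent hangs?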
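
Since a single nvidia-smi snapshot only shows the state at one moment, one way to see what the GPU was doing right before a hang is to log its readings continuously. The following is a minimal sketch, not a definitive tool: it assumes the `nvidia-smi --query-gpu` interface is available with this driver, and the choice of fields, the file name `gpu_log.csv`, and the one-second interval are all illustrative assumptions.

```python
import csv
import io
import os
import subprocess
import time

# Fields requested from `nvidia-smi --query-gpu`; this selection is an assumption.
FIELDS = ["timestamp", "temperature.gpu", "power.draw", "utilization.gpu", "memory.used"]


def parse_row(line):
    """Turn one CSV line printed by nvidia-smi into a dict keyed by field name."""
    values = next(csv.reader(io.StringIO(line), skipinitialspace=True))
    return dict(zip(FIELDS, values))


def sample():
    """Query nvidia-smi once and return the parsed rows (one per GPU)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        stdout=subprocess.PIPE, universal_newlines=True, check=True,
    ).stdout
    return [parse_row(line) for line in out.splitlines() if line.strip()]


def log_forever(path="gpu_log.csv", interval_s=1.0):
    """Append samples to `path`, syncing each write so a hard hang loses little."""
    with open(path, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        while True:
            for row in sample():
                writer.writerow(row)
            fh.flush()
            os.fsync(fh.fileno())  # push the data to disk before the next sample
            time.sleep(interval_s)
```

Calling `log_forever()` in a separate terminal before starting the training job would leave a file whose last few rows show temperature, power draw and memory use just before the machine stopped responding.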
poloshiao
Forum moderator
Posts: 18197
Registered: 2009-08-04 16:33
Thanks given: 21
Thanks received: 1926

Re: Strange behavior when running algorithms

#2

Post by poloshiao » 2018-08-07 13:44

GPU: NVIDIA TiTanXP, reference design with a blower fan (later converted to water cooling)
GPU driver: NVIDIA-Linux-x86_64-390.77.run
http://www.nvidia.com/download/driverRe ... 6120/en-us
Linux x64 (AMD64/EM64T) Display Driver
Operating System: Linux 64-bit
Supported products
TiTanXP is not listed

Re: Strange behavior when running algorithms

#3

Post by lqfie » 2018-08-07 14:13

poloshiao wrote:
2018-08-07 13:44
GPU: NVIDIA TiTanXP, reference design with a blower fan (later converted to water cooling)
GPU driver: NVIDIA-Linux-x86_64-390.77.run
http://www.nvidia.com/download/driverRe ... 6120/en-us
Linux x64 (AMD64/EM64T) Display Driver
Operating System: Linux 64-bit
Supported products
TiTanXP is not listed
How could it not be listed?
Product Type: TITAN
Product Series: NVIDIA TITAN series
Product: NVIDIA TITAN Xp (it is the second entry)
Operating System: linux 64-bit
Language: english(us)

Re: Strange behavior when running algorithms

#4

Post by poloshiao » 2018-08-07 15:31

NVIDIA TiTanXP
Product: NVIDIA TITAN Xp (it is the second entry)
Sorry, I left out a space (which is why my search found nothing).

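Re: Strange behavior when running algorithms

#5

Post by lqfie » 2018-08-07 15:36

poloshiao wrote:
2018-08-07 15:31
NVIDIA TiTanXP
Product: NVIDIA TITAN Xp (it is the second entry)
Sorry, I left out a space (which is why my search found nothing).
What do you think the likely cause is, and where should I start troubleshooting?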
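
One possible starting point: dmesg run after a reboot only shows messages from the current boot, so anything the kernel logged just before the hang has to be read from a log file that persists across reboots. The NVIDIA driver reports GPU faults as kernel-log lines containing "NVRM: Xid". The sketch below searches a log for such lines. The default path `/var/log/kern.log` is an assumption about how this Ubuntu system is configured, and the function names are hypothetical.

```python
import re

# Matches the fault lines the NVIDIA kernel driver writes, which have the
# general shape "NVRM: Xid (<device>): <code>, <details>".
XID_RE = re.compile(r"NVRM: Xid \((?P<dev>[^)]*)\): (?P<code>\d+)")


def find_xid_events(lines):
    """Return (device, xid_code, full_line) for every Xid line in `lines`."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group("dev"), int(match.group("code")), line.rstrip("\n"))
            )
    return events


def scan_file(path="/var/log/kern.log"):
    """Scan a persisted kernel log; the default path is an assumption."""
    with open(path, errors="replace") as fh:
        return find_xid_events(fh)
```

If `scan_file()` returns nothing for the period before a hang, that does not rule out the GPU: a hard lockup can stop the machine before anything is written to disk.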

Re: Strange behavior when running algorithms

#6

Post by lqfie » 2018-08-07 15:46

lspci -knn

00:00.0 Host bridge [0600]: Intel Corporation Device [8086:591f] (rev 05)
Subsystem: ASRock Incorporation Device [1849:591f]
00:01.0 PCI bridge [0604]: Intel Corporation Sky Lake PCIe Controller (x16) [8086:1901] (rev 05)
Kernel driver in use: pcieport
Kernel modules: shpchp
00:14.0 USB controller [0c03]: Intel Corporation Device [8086:a2af]
Subsystem: ASRock Incorporation Device [1849:a2af]
Kernel driver in use: xhci_hcd
00:14.2 Signal processing controller [1180]: Intel Corporation Device [8086:a2b1]
Subsystem: ASRock Incorporation Device [1849:a2b1]
00:16.0 Communication controller [0780]: Intel Corporation Device [8086:a2ba]
Subsystem: ASRock Incorporation Device [1849:a2ba]
Kernel driver in use: mei_me
Kernel modules: mei_me
00:17.0 SATA controller [0106]: Intel Corporation Device [8086:a282]
Subsystem: ASRock Incorporation Device [1849:a282]
Kernel driver in use: ahci
Kernel modules: ahci
00:1b.0 PCI bridge [0604]: Intel Corporation Device [8086:a2e7] (rev f0)
Kernel driver in use: pcieport
Kernel modules: shpchp
00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:a290] (rev f0)
Kernel driver in use: pcieport
Kernel modules: shpchp
00:1c.4 PCI bridge [0604]: Intel Corporation Device [8086:a294] (rev f0)
Kernel driver in use: pcieport
Kernel modules: shpchp
00:1d.0 PCI bridge [0604]: Intel Corporation Device [8086:a298] (rev f0)
Kernel driver in use: pcieport
Kernel modules: shpchp
00:1f.0 ISA bridge [0601]: Intel Corporation Device [8086:a2c5]
Subsystem: ASRock Incorporation Device [1849:a2c5]
00:1f.2 Memory controller [0580]: Intel Corporation Device [8086:a2a1]
Subsystem: ASRock Incorporation Device [1849:a2a1]
00:1f.3 Audio device [0403]: Intel Corporation Device [8086:a2f0]
Subsystem: ASRock Incorporation Device [1849:7893]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
00:1f.4 SMBus [0c05]: Intel Corporation Device [8086:a2a3]
Subsystem: ASRock Incorporation Device [1849:a2a3]
Kernel modules: i2c_i801
00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (2) I219-V [8086:15b8]
Subsystem: ASRock Incorporation Ethernet Connection (2) I219-V [1849:15b8]
Kernel driver in use: e1000e
Kernel modules: e1000e
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1b02] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:11df]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10ef] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:11df]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel


lshw -numeric -class video

*-display
description: VGA compatible controller
product: NVIDIA Corporation [10DE:1B02]
vendor: NVIDIA Corporation [10DE]
physical id: 0
bus info: pci@0000:01:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
configuration: driver=nvidia latency=0
resources: irq:126 memory:de000000-deffffff memory:c0000000-cfffffff memory:d0000000-d1ffffff ioport:e000(size=128) memory:c0000-dffff


ubuntu-drivers devices

== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
manual_install: True
vendor : NVIDIA Corporation
modalias : pci:v000010DEd00001B02sv000010DEsd000011DFbc03sc00i00
driver : nvidia-384 - distro non-free recommended
driver : xserver-xorg-video-nouveau - distro free builtin


efibootmgr -v

efibootmgr: EFI variables are not supported on this system
onlylove
Forum moderator
Posts: 4412
Registered: 2007-01-14 16:23
Thanks given: 0
Thanks received: 99

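Re: Strange behavior when running algorithms

#7

Post by onlylove » 2018-08-07 15:48

The router hangs too? Does your algorithm depend on the network? Does the router hang before or after the machine does?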

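Re: Strange behavior when running algorithms

#8

Post by lqfie » 2018-08-07 15:56

onlylove wrote:
2018-08-07 15:48
The router hangs too? Does your algorithm depend on the network? Does the router hang before or after the machine does?
I really don't understand this part. The algorithm should not depend on the network.
I'm not sure whether the router hangs before or after. But whenever the other computers connected to this router cannot get online, it is definitely because the TITANxp machine has hung; rebooting that Ubuntu machine brings the network back.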
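
To find out which fails first, one option is to probe both the router and the GPU machine from a second computer on the same router and record timestamps. The sketch below is only an illustration: it assumes a Linux `ping` with the `-c` and `-W` options on the probing computer, and the addresses passed to `watch` would be whatever the router and the GPU machine actually use (none are given in this thread).

```python
import subprocess
import time


def is_reachable(host, timeout_s=1):
    """Send one ICMP echo with the system `ping`; True if a reply came back."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def first_failure(samples):
    """Given [(timestamp, ok), ...], return the first timestamp with ok False."""
    for timestamp, ok in samples:
        if not ok:
            return timestamp
    return None


def watch(hosts, interval_s=1.0, rounds=60):
    """Probe every host once per round and collect (timestamp, ok) per host."""
    history = {host: [] for host in hosts}
    for _ in range(rounds):
        now = time.time()
        for host in hosts:
            history[host].append((now, is_reachable(host)))
        time.sleep(interval_s)
    return history
```

Comparing `first_failure(history[router])` with `first_failure(history[gpu_machine])` gives a rough ordering, limited by the probing interval.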

Re: Strange behavior when running algorithms

#9

Post by poloshiao » 2018-08-07 15:59

GPU driver: NVIDIA-Linux-x86_64-390.77.run
1. For installing the NVIDIA proprietary driver, see
1-1. viewtopic.php?p=3208425#p3208425

2. For installing the driver from the NVIDIA website, see
2-1. http://us.download.nvidia.com/XFree86/L ... index.html
NVIDIA Accelerated Linux Graphics Driver README and Installation Guide
2-2. https://help.ubuntu.com/community/NvidiaManual
NvidiaManual

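Re: Strange behavior when running algorithms

#10

Post by lqfie » 2018-08-08 19:05

poloshiao wrote:
2018-08-07 15:59
GPU driver: NVIDIA-Linux-x86_64-390.77.run
1. For installing the NVIDIA proprietary driver, see
1-1. viewtopic.php?p=3208425#p3208425

2. For installing the driver from the NVIDIA website, see
2-1. http://us.download.nvidia.com/XFree86/L ... index.html
NVIDIA Accelerated Linux Graphics Driver README and Installation Guide
2-2. https://help.ubuntu.com/community/NvidiaManual
NvidiaManual
I don't understand what you mean. I did download the driver from the NVIDIA website. The site only offers a generic Linux 64-bit package; there is no version specific to Ubuntu 16.04.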
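
For the "driver was gone after reboot" symptom, a quick first check is whether the nvidia kernel module is loaded at all. A module installed from the .run package is built for a specific kernel, so one possibility (an assumption, not something this thread confirms) is that the machine booted a kernel for which no module had been built, for example after a kernel update without a rebuild through DKMS. The sketch below only reads `/proc/modules` and `/proc/driver/nvidia/version`; the function names are hypothetical.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules content."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def nvidia_status(modules_path="/proc/modules",
                  version_path="/proc/driver/nvidia/version"):
    """Report whether the nvidia module is loaded and, if so, its version line."""
    with open(modules_path) as fh:
        loaded = "nvidia" in loaded_modules(fh.read())
    version = None
    if loaded:
        try:
            with open(version_path) as fh:
                version = fh.readline().strip()
        except OSError:
            pass  # module present but the version file could not be read
    return loaded, version
```

If `nvidia_status()` reports the module as not loaded, comparing the output of `uname -r` with the kernel that was running when the driver was installed would be a reasonable next step.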

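Re: Strange behavior when running algorithms

#11

Post by lqfie » 2018-08-10 15:55

onlylove wrote:
2018-08-07 15:48
The router hangs too? Does your algorithm depend on the network? Does the router hang before or after the machine does?
The router itself does not hang, but the network stalls. Whenever the titanxp machine running the algorithm hangs, none of the computers connected to the router can get online. Unplugging the titanxp machine's network cable (or rebooting the titanxp server) lets the other computers get online again.

There is no reason for the algorithm to depend on the network, so this behavior is very puzzling.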

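Re: Strange behavior when running algorithms

#12

Post by lqfie » 2018-08-13 11:03

Over the weekend I added CPU temperature monitoring. While the algorithm was running, the CPU temperature peaked at 75°C and averaged 68°C. I plan to put the CPU on water cooling first, like the GPU, and see what happens.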
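
The post does not say how the CPU temperature was monitored. As one hedged illustration of how a peak and an average like these could be collected, the sketch below reads the kernel's hwmon sensor files, which report millidegrees Celsius. The glob pattern is an assumption and may also pick up sensors that are not on the CPU; the function names are hypothetical.

```python
import glob
import time


def read_temps_c(pattern="/sys/class/hwmon/hwmon*/temp*_input"):
    """Read every matching sensor file and convert millidegrees to Celsius."""
    temps = []
    for path in glob.glob(pattern):
        try:
            with open(path) as fh:
                temps.append(int(fh.read().strip()) / 1000.0)
        except (OSError, ValueError):
            continue  # sensor vanished or returned something unparsable
    return temps


def summarize(samples):
    """Return (max, mean) of a non-empty list of temperatures in Celsius."""
    return max(samples), sum(samples) / len(samples)


def monitor(duration_s=60, interval_s=1.0):
    """Record the hottest sensor reading per interval, then summarize."""
    hottest = []
    end = time.time() + duration_s
    while time.time() < end:
        temps = read_temps_c()
        if temps:
            hottest.append(max(temps))
        time.sleep(interval_s)
    return summarize(hottest) if hottest else None
```

Running `monitor()` alongside the training job returns a `(max, mean)` pair in °C for the sampled period, or `None` if no sensor could be read.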