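I have recently been setting up a cluster; all of the servers run Ubuntu 14.04 LTS.
On the master node, gridengine-client, gridengine-common, gridengine-master, and gridengine-qmon are installed and configured. On the compute nodes, gridengine-client and gridengine-common are configured as well. However, among the compute nodes, qhost shows memory, load, and other information only for node1; for every other node each field is just "-", as follows: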
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
master-ubuntu14server   lx26-amd64      4  0.01   31.4G  292.0M     0.0     0.0
node1                   lx26-amd64     12  0.01   62.9G  439.6M     0.0     0.0
node2                   -               -     -       -       -       -       -
node3                   -               -     -       -       -       -       -
node4                   -               -     -       -       -       -       -
node5                   -               -     -       -       -       -       -
node6                   -               -     -       -       -       -       -
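To make it easy to see which hosts are not reporting, here is a minimal Python sketch that scans qhost-style output and lists the hosts whose LOAD column is "-". The function name hosts_without_load and the SAMPLE string are my own, made up for illustration; the parsing assumes the whitespace-separated column layout shown above and is not part of Grid Engine.

```python
def hosts_without_load(qhost_output):
    """Return hostnames whose LOAD column is "-", skipping the "global" row.

    Assumes the column order HOSTNAME ARCH NCPU LOAD ... as in the listing above.
    """
    missing = []
    for line in qhost_output.splitlines():
        fields = line.split()
        # Skip blank lines, the dashed separator, and the header row.
        if len(fields) < 4 or fields[0] == "HOSTNAME":
            continue
        host, load = fields[0], fields[3]
        if host != "global" and load == "-":
            missing.append(host)
    return missing


# Sample text copied from the qhost listing in this post.
SAMPLE = """\
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
-------------------------------------------------------------------------------
global - - - - - - -
master-ubuntu14server lx26-amd64 4 0.01 31.4G 292.0M 0.0 0.0
node1 lx26-amd64 12 0.01 62.9G 439.6M 0.0 0.0
node2 - - - - - - -
node3 - - - - - - -
node4 - - - - - - -
node5 - - - - - - -
node6 - - - - - - -
"""

print(hosts_without_load(SAMPLE))
```

On a machine where qhost is on the PATH, the same function could presumably be fed live output with subprocess.check_output(["qhost"], text=True) instead of the sample string.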
I put all of these nodes into the queue all.q and then ran qsub -cwd -l h=node1. The job goes into the Eqw state, and qstat -j shows the following:
queue instance "all.q@node4" dropped because it is overloaded: no value for complex attribute "np_load_avg"
queue instance "all.q@node6" dropped because it is overloaded: no value for complex attribute "np_load_avg"
queue instance "all.q@node3" dropped because it is overloaded: no value for complex attribute "np_load_avg"
queue instance "all.q@node5" dropped because it is overloaded: no value for complex attribute "np_load_avg"
queue instance "all.q@node2" dropped because it is overloaded: no value for complex attribute "np_load_avg"
Job is in error state
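I googled this error message, but the results gave me nothing useful for working out where the problem is.
If I put node1 alone into a separate queue, node1.q, and qsub a test script, it runs without any problem.
Please help! Has anyone run into this situation before? How can it be fixed?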