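Slurm
Introduction
Installing Slurm from the package repository
Head node configuration
1. First, make sure that the system time, users, and groups are synchronized across the cluster.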
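2. Install MUNGE

    apt install -y libmunge-dev libmunge2 munge

3. Generate the MUNGE key.
In testing, a key is generated by default when MUNGE is installed, so this step is optional unless you have specific security requirements.

    # Either read from /dev/random (may block while gathering entropy)
    dd if=/dev/random bs=1 count=1024 > /etc/munge/munge.key
    # or read from /dev/urandom (does not block)
    dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key

Set the key's ownership and permissions

    chown munge:munge /etc/munge/munge.key
    chmod 400 /etc/munge/munge.key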
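A quick check of the key file can catch permission mistakes before the munge service is started. The following is a minimal sketch, assuming GNU `stat` (as shipped on Debian); the function names `key_mode_is_400` and `key_size_bytes` are made up for this illustration and are not part of MUNGE.

```shell
# Hypothetical helpers for inspecting a MUNGE key file (names are illustrative).

# Succeeds only if the file's permission bits are exactly 400.
key_mode_is_400() {
    mode=$(stat -c '%a' "$1") || return 2
    [ "$mode" = "400" ]
}

# Prints the size of the file in bytes.
key_size_bytes() {
    wc -c < "$1" | tr -d ' '
}

# Example (run as root; the path is the one used in the step above):
# key_mode_is_400 /etc/munge/munge.key && echo "mode ok"
# key_size_bytes /etc/munge/munge.key
```

With the `dd` commands above (`bs=1 count=1024`), the reported size should be 1024 bytes.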
4. Edit the /etc/passwd file
On every node, change the munge user's entry to the following

    munge:x:501:501::/var/run/munge:/sbin/nologin
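Because the munge user must have the same UID and GID on every node, it can help to compare the IDs mechanically rather than by eye. The sketch below assumes the standard colon-separated passwd format; `passwd_uid_gid` is a hypothetical helper name, and `cn01` in the usage comment is just the compute node name used elsewhere in this post.

```shell
# Print "uid:gid" for each passwd-format line read from stdin.
# Fields 3 and 4 of a passwd entry are the numeric UID and GID.
passwd_uid_gid() {
    awk -F: '{ print $3 ":" $4 }'
}

# Example: compare the local munge entry with the one on another node.
# local_ids=$(getent passwd munge | passwd_uid_gid)
# remote_ids=$(ssh cn01 getent passwd munge | passwd_uid_gid)
# [ "$local_ids" = "$remote_ids" ] || echo "munge UID/GID mismatch" >&2
```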
5. Start munge

    systemctl start munge
    systemctl status munge

6. Test that munge is installed and working

    munge -n
    munge -n | unmunge
    munge -n | ssh somehost unmunge
    remunge
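7. Install SLURM from the repository on Debian 11

    apt install -y slurm-wlm slurm-wlm-doc

8. Create the slurm configuration file
Open /usr/share/doc/slurmctld/slurm-wlm-configurator.html in a browser, fill in the form to match your setup, and submit it to generate the corresponding configuration file.
Note: you can use the slurmd -C command to get a node's core count, memory size, and similar information.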
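As far as I know, `slurmd -C` prints the hardware description as a line starting with `NodeName=` in slurm.conf syntax, followed by an `UpTime=` line. Under that assumption, a small filter can turn the output into a node definition ready to paste into the configuration file. `node_definition_line` is a hypothetical helper name, and appending `State=UNKNOWN` simply mirrors the node lines used in this post.

```shell
# Read `slurmd -C` style output on stdin and print only the NodeName line,
# with State=UNKNOWN appended (other lines such as UpTime= are dropped).
node_definition_line() {
    sed -n 's/^NodeName=.*/& State=UNKNOWN/p'
}

# Example (run on the node being described):
# slurmd -C | node_definition_line
```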
When the configuration is done, place the file at /etc/slurm/slurm.conf.

    SlurmctldHost=mn1
    MpiDefault=none
    ProctrackType=proctrack/cgroup
    ReturnToService=1
    SlurmctldPidFile=/run/slurmctld.pid
    SlurmctldPort=6817
    SlurmdPidFile=/run/slurmd.pid
    SlurmdPort=6818
    SlurmdSpoolDir=/var/lib/slurm/slurmd
    SlurmUser=slurm
    StateSaveLocation=/var/lib/slurm/slurmctld
    SwitchType=switch/none
    TaskPlugin=task/affinity

    InactiveLimit=0
    KillWait=30
    MinJobAge=300
    SlurmctldTimeout=120
    SlurmdTimeout=300
    Waittime=0

    SchedulerType=sched/backfill
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core

    AccountingStorageType=accounting_storage/none
    AccountingStoreJobComment=YES
    ClusterName=cluster
    JobCompLoc=/tmp/slurm_job_completion.txt
    JobCompType=jobcomp/filetxt
    JobAcctGatherFrequency=30
    JobAcctGatherType=jobacct_gather/linux
    SlurmctldDebug=info
    SlurmctldLogFile=/var/log/slurm/slurmctld.log
    SlurmdDebug=info
    SlurmdLogFile=/var/log/slurm/slurmd.log

    NodeName=mn1 CPUs=4 RealMemory=1980 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN
    NodeName=cn01 CPUs=2 RealMemory=1981 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
    PartitionName=ptt1 Nodes=mn1,cn01 Default=YES MaxTime=INFINITE State=UP
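In the NodeName lines above, CPUs equals Sockets × CoresPerSocket × ThreadsPerCore (4 = 1 × 4 × 1 for mn1, 2 = 1 × 2 × 1 for cn01). When editing node definitions by hand it is easy to break this relationship, so here is a minimal sketch of a consistency check. The function name `node_cpus_consistent` is hypothetical, and the check only covers this simple layout where all four fields are given.

```shell
# Return 0 if CPUs == Sockets * CoresPerSocket * ThreadsPerCore
# for a single NodeName=... line passed as the first argument.
node_cpus_consistent() {
    line=$1
    cpus=0 sockets=1 cores=1 threads=1
    # Split the line on whitespace and pick out the fields we care about.
    for token in $line; do
        case $token in
            CPUs=*)           cpus=${token#CPUs=} ;;
            Sockets=*)        sockets=${token#Sockets=} ;;
            CoresPerSocket=*) cores=${token#CoresPerSocket=} ;;
            ThreadsPerCore=*) threads=${token#ThreadsPerCore=} ;;
        esac
    done
    [ "$cpus" -eq $((sockets * cores * threads)) ]
}

# Example: report inconsistent node lines in the configuration file.
# grep '^NodeName=' /etc/slurm/slurm.conf | while read -r l; do
#     node_cpus_consistent "$l" || echo "check: $l" >&2
# done
```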
When copying the Slurm configuration file to the compute nodes, change slurm to root in the User setting.
In testing, with the configuration above, a cgroup.conf file must also be placed in the same directory for Slurm to run properly.

    CgroupAutomount=yes
    CgroupMountpoint=/sys/fs/cgroup
    ConstrainCores=yes
    ConstrainDevices=yes
    ConstrainKmemSpace=no
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes

This completes the head node configuration.
Compute node configuration
1. Install MUNGE

    apt install -y libmunge-dev libmunge2 munge

2. Copy the head node's munge.key to each compute node

    scp /etc/munge/munge.key root@node:/etc/munge/munge.key

    chown munge:munge /etc/munge/munge.key
    chmod 400 /etc/munge/munge.key
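MUNGE only authenticates between nodes that hold an identical key, so after copying it is worth confirming that the file arrived unchanged. A minimal sketch, assuming `sha256sum` from coreutils is available on both machines; `file_sha256` is a hypothetical helper name and `root@node` is the placeholder host from the scp command above.

```shell
# Print only the SHA-256 hex digest of the given file.
file_sha256() {
    sha256sum "$1" | awk '{ print $1 }'
}

# Example (run on the head node as root):
# local_sum=$(file_sha256 /etc/munge/munge.key)
# remote_sum=$(ssh root@node sha256sum /etc/munge/munge.key | awk '{ print $1 }')
# [ "$local_sum" = "$remote_sum" ] && echo "keys match" || echo "keys differ" >&2
```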
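3. Install SLURM from the repository on Debian 11
4. Start SLURM on the compute nodes and the head node

    # slurmd runs on every node that executes jobs
    systemctl start slurmd
    # slurmctld runs on the head node
    systemctl start slurmctld

This completes the repository-based installation of SLURM on the cluster.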
Building and installing from source
Install dependencies

    sudo apt-get install git build-essential libssl-dev libmysql++-dev ffmpeg python3-pip openssh-server cmake -y

Install mysql-server
To add the MySQL APT repository to your system go to the repository download page and download the latest release package using the following command.

    wget https://dev.mysql.com/get/mysql-apt-config_0.8.24-1_all.deb

Install the release package.

    sudo apt install ./mysql-apt-config_0.8.24-1_all.deb

We're going to install MySQL version 8.0. In the configuration dialog, select OK by pressing Tab and hit Enter.
Now you can install MySQL.

    sudo apt update
    sudo apt install mysql-server

Once the installation is completed, the MySQL service will start automatically. To verify that the MySQL server is running, type:

    sudo systemctl status mysql

The output should show that the service is enabled and running:

    ● mysql.service - MySQL Community Server
         Loaded: loaded (/lib/systemd/system/mysql.service; enabled; vendor preset: enabled)
         Active: active (running) since Wed 2023-01-04 18:31:50 CST; 22s ago
           Docs: man:mysqld(8)
                 http://dev.mysql.com/doc/refman/en/using-systemd.html
       Main PID: 33914 (mysqld)
         Status: "Server is operational"
          Tasks: 39 (limit: 4674)
         Memory: 364.1M
            CPU: 1.109s
         CGroup: /system.slice/mysql.service
                 └─33914 /usr/sbin/mysqld

    Jan 04 18:31:49 debian systemd[1]: Starting MySQL Community Server...
    Jan 04 18:31:50 debian systemd[1]: Started MySQL Community Server.

Build munge
GitHub releases page: Releases · dun/munge

    wget https://github.com/dun/munge/releases/download/munge-0.5.15/munge-0.5.15.tar.xz
    tar xf ./munge-0.5.15.tar.xz && cd munge-0.5.15
    ./configure --prefix=/usr/local/munge/0.5.15
    make && sudo make install
    cd /usr/local/munge && sudo ln -s 0.5.15 latest && cd - && cd ..
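Every build in this post follows the same pattern: download an archive, unpack it, and `cd` into a directory named after the archive without its suffix. A common slip is to `cd` into the archive file name itself. The helper below is a minimal sketch of deriving the directory name; `source_dir_from_tarball` is a hypothetical name, and it assumes the archive unpacks into a directory with the same base name, which is the convention these projects follow but is not guaranteed in general.

```shell
# Print the expected source directory for a tarball path or URL,
# e.g. munge-0.5.15.tar.xz -> munge-0.5.15. Returns 1 for unknown suffixes.
source_dir_from_tarball() {
    archive=${1##*/}   # drop any leading path or URL prefix
    case $archive in
        *.tar.gz)  printf '%s\n' "${archive%.tar.gz}" ;;
        *.tar.xz)  printf '%s\n' "${archive%.tar.xz}" ;;
        *.tar.bz2) printf '%s\n' "${archive%.tar.bz2}" ;;
        *) return 1 ;;
    esac
}

# Example:
# tar xf munge-0.5.15.tar.xz && cd "$(source_dir_from_tarball munge-0.5.15.tar.xz)"
```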
Configure environment variables

    # sudo does not apply to a shell redirection, so write the file through tee
    sudo tee /etc/profile.d/munge.sh >/dev/null <<EOF
    export MUNGE_HOME=/usr/local/munge/latest
    export PATH=\$MUNGE_HOME/bin:\$PATH
    export PATH=\$MUNGE_HOME/sbin:\$PATH
    EOF

    source /etc/profile
Build pmix
Before building pmix, hwloc and libevent must be built first.
Build hwloc
Portable Hardware Locality (hwloc): Version 2.8 (open-mpi.org)

    wget https://download.open-mpi.org/release/hwloc/v2.8/hwloc-2.8.0.tar.gz && \
    tar zxf hwloc-2.8.0.tar.gz && cd hwloc-2.8.0 && \
    ./configure --prefix=/usr/local/hwloc/2.8.0 && \
    make && sudo make install && \
    cd /usr/local/hwloc && sudo ln -s 2.8.0 latest && cd - && cd .. && ls

Configure environment variables

    sudo tee /etc/profile.d/hwloc.sh >/dev/null <<EOF
    export HWLOC_HOME=/usr/local/hwloc/latest
    export PATH=\$HWLOC_HOME/bin:\$PATH
    export PATH=\$HWLOC_HOME/sbin:\$PATH
    EOF

    source /etc/profile
Build libevent
Releases · libevent/libevent (github.com)

    wget https://github.com/libevent/libevent/releases/download/release-2.1.12-stable/libevent-2.1.12-stable.tar.gz && \
    tar zxf libevent-2.1.12-stable.tar.gz && cd libevent-2.1.12-stable && \
    ./configure --prefix=/usr/local/libevent/2.1.12 && \
    make && sudo make install && \
    cd /usr/local/libevent && sudo ln -s 2.1.12 latest && cd - && cd .. && ls

Configure environment variables

    sudo tee /etc/profile.d/libevent.sh >/dev/null <<EOF
    export LIBEVENT_HOME=/usr/local/libevent/latest
    export PATH=\$LIBEVENT_HOME/bin:\$PATH
    export LD_LIBRARY_PATH=\$LIBEVENT_HOME/lib:\$LD_LIBRARY_PATH
    EOF

    source /etc/profile
Build pmix
Releases · openpmix/openpmix (github.com)

    wget https://github.com/openpmix/openpmix/releases/download/v4.2.2/pmix-4.2.2.tar.gz && \
    tar zxf pmix-4.2.2.tar.gz && cd pmix-4.2.2 && \
    ./configure --prefix=/usr/local/pmix/4.2.2 --with-libevent=/usr/local/libevent/latest --with-hwloc=/usr/local/hwloc/latest && \
    make && sudo make install && \
    cd /usr/local/pmix && sudo ln -s 4.2.2 latest && cd - && cd .. && ls

Set environment variables

    sudo tee /etc/profile.d/pmix.sh >/dev/null <<EOF
    export PMIX_HOME=/usr/local/pmix/latest
    export PATH=\$PMIX_HOME/bin:\$PATH
    export LD_LIBRARY_PATH=\$PMIX_HOME/lib:\$LD_LIBRARY_PATH
    EOF

    source /etc/profile
Build openmpi
Both slurm and openmpi depend on ucx, so ucx must be installed first.
Build ucx
Releases · openucx/ucx (github.com)

    wget https://github.com/openucx/ucx/releases/download/v1.13.1/ucx-1.13.1.tar.gz && \
    tar zxf ucx-1.13.1.tar.gz && cd ucx-1.13.1 && \
    ./configure --prefix=/usr/local/ucx/1.13.1 && \
    make && sudo make install && \
    cd /usr/local/ucx && sudo ln -s 1.13.1 latest && cd - && cd .. && ls

Configure environment variables

    sudo tee /etc/profile.d/ucx.sh >/dev/null <<EOF
    export UCX_HOME=/usr/local/ucx/latest
    export PATH=\$UCX_HOME/bin:\$PATH
    export LD_LIBRARY_PATH=\$UCX_HOME/lib:\$LD_LIBRARY_PATH
    EOF

    source /etc/profile
Build openmpi
Open MPI: Version 4.1 (open-mpi.org)

    wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.4.tar.gz && \
    tar zxf openmpi-4.1.4.tar.gz && cd openmpi-4.1.4 && \
    ./configure --prefix=/usr/local/openmpi/4.1.4 --with-pmix=/usr/local/pmix/latest --with-ucx=/usr/local/ucx/latest --with-hwloc=/usr/local/hwloc/latest --with-libevent=/usr/local/libevent/latest && \
    make && sudo make install && \
    cd /usr/local/openmpi && sudo ln -s 4.1.4 latest && cd - && cd .. && ls

Configure environment variables

    sudo tee /etc/profile.d/openmpi.sh >/dev/null <<EOF
    export OPENMPI_HOME=/usr/local/openmpi/latest
    export PATH=\$OPENMPI_HOME/bin:\$PATH
    export LD_LIBRARY_PATH=\$OPENMPI_HOME/lib:\$LD_LIBRARY_PATH
    EOF

    source /etc/profile
Build slurm
Index of /slurm (schedmd.com)

    wget https://download.schedmd.com/slurm/slurm-22.05.7.tar.bz2 && \
    tar xf slurm-22.05.7.tar.bz2 && cd slurm-22.05.7 && \
    ./configure --prefix=/usr/local/slurm/22.05.7 --with-pmix=/usr/local/pmix/latest --with-munge=/usr/local/munge/latest --with-hwloc=/usr/local/hwloc/latest --with-ucx=/usr/local/ucx/latest && \
    make && sudo make install && \
    cd /usr/local/slurm && sudo ln -s 22.05.7 latest && cd - && cd .. && ls

Configure environment variables

    sudo tee /etc/profile.d/slurm.sh >/dev/null <<EOF
    export SLURM_HOME=/usr/local/slurm/latest
    export PATH=\$SLURM_HOME/bin:\$PATH
    export PATH=\$SLURM_HOME/sbin:\$PATH
    export LD_LIBRARY_PATH=\$SLURM_HOME/lib:\$LD_LIBRARY_PATH
    export LD_LIBRARY_PATH=\$SLURM_HOME/lib/slurm:\$LD_LIBRARY_PATH
    EOF

    source /etc/profile
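Configuration
Configuring the database
Set up the database that slurmdbd needs. First, edit the MySQL configuration file. Note that skip-grant-tables makes the server start without its privilege system, so it should be removed again once the setup is finished.

    sudo tee -a /etc/mysql/my.cnf >/dev/null <<EOF
    [mysqld]
    innodb_buffer_pool_size=1024M
    innodb_log_file_size=64M
    innodb_lock_wait_timeout=900
    skip-grant-tables
    EOF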
Then start mysql and open a client session

    sudo systemctl start mysql
    sudo systemctl enable --now mysql

    sudo mysql
Create a dedicated slurm user and the required databases

    flush privileges;
    create user 'slurm'@'localhost' identified by '12345678'; # create the slurm user and set its password

    # create the base accounting database
    create database slurm_acct_db;
    # allow the slurm user to access it
    grant all on slurm_acct_db.* TO 'slurm'@'localhost';
    # create the job database
    create database slurm_job_db;
    # grant the slurm user access to it
    grant all on slurm_job_db.* TO 'slurm'@'localhost';
Configuring slurmdbd.conf
Start from the example configuration on the official site and modify it:

    ArchiveEvents=yes
    ArchiveJobs=yes
    ArchiveResvs=yes
    ArchiveSteps=no
    ArchiveSuspend=no
    ArchiveTXN=no
    ArchiveUsage=no
    AuthInfo=/usr/local/munge/latest/var/run/munge/munge.socket.2
    AuthType=auth/munge
    DbdHost=localhost
    DebugLevel=info
    PurgeEventAfter=1month
    PurgeJobAfter=12month
    PurgeResvAfter=1month
    PurgeStepAfter=1month
    PurgeSuspendAfter=1month
    PurgeTXNAfter=12month
    PurgeUsageAfter=24month
    LogFile=/var/log/slurmdbd.log
    PidFile=/var/run/slurmdbd.pid
    SlurmUser=root
    StoragePass=12345678
    StorageType=accounting_storage/mysql
    StorageUser=slurm
    StorageHost=localhost
    StoragePort=3306
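The StorageUser, StoragePass, and StoragePort values here have to agree with the MySQL account created earlier, so it can be handy to read individual settings back out of the file when debugging a connection failure. This is a minimal sketch: `conf_get` is a hypothetical helper, it matches keys case-sensitively (unlike Slurm itself, as far as I know), and the path in the usage comment is an assumption because this post does not say where slurmdbd.conf is stored.

```shell
# Print the value of KEY from a "Key=value" style file (last occurrence wins).
# Whitespace around the "=" is tolerated.
conf_get() {
    key=$1
    file=$2
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$file" | tail -n 1
}

# Example (the path is an assumption; adjust it to your installation):
# conf_get StorageUser /usr/local/slurm/latest/etc/slurmdbd.conf
# conf_get StoragePort /usr/local/slurm/latest/etc/slurmdbd.conf
```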
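Configuring slurm.conf
Generate this file with the Slurm System Configuration Tool on the official website. The output differs between Slurm versions. The more important settings are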