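Linpack Test Procedure
1. Install OpenBLAS
The Linpack test relies on a math library for its matrix operations; candidates include BLAS and VSIPL. Here we install OpenBLAS, which includes vector optimizations for Loongson.
Step 1: Download the community OpenBLAS source code.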
git clone https://github.com/OpenMathLib/OpenBLAS.git
cd OpenBLAS
Step 2: Build and install
make NO_LAPACK=1 USE_SIMPLE_THREADED_LEVEL3=1 NO_AFFINITY=0 HUGETLB_ALLOCATION=1
make install PREFIX=/opt/openblas
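The directory given by PREFIX=/opt/openblas at install time is needed later, when editing the Make file before compiling HPL.
Step 3: Set the OpenBLAS environment variable
Edit /etc/profile and add export LD_LIBRARY_PATH=/opt/openblas/lib:$LD_LIBRARY_PATH, then run the following command to apply it.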
source /etc/profile
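If /etc/profile is sourced more than once, the same directory is prepended to the variable repeatedly. The following is a minimal sketch of one way to avoid that; the helper name prepend_once is hypothetical and not part of OpenBLAS or the original procedure.

```shell
#!/bin/sh
# prepend_once LIST DIR
# Print LIST (a colon-separated path list) with DIR prepended,
# unless DIR is already one of its entries.
# NOTE: prepend_once is an illustrative helper, not a standard command.
prepend_once() {
    case ":$1:" in
        *":$2:"*)
            # DIR is already present: leave the list unchanged.
            printf '%s\n' "$1"
            ;;
        *)
            if [ -n "$1" ]; then
                printf '%s:%s\n' "$2" "$1"
            else
                printf '%s\n' "$2"
            fi
            ;;
    esac
}

# Usage with the install prefix chosen above.
LD_LIBRARY_PATH=$(prepend_once "${LD_LIBRARY_PATH:-}" /opt/openblas/lib)
export LD_LIBRARY_PATH
```

The same helper could be applied to PATH for the mpich bin directory.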
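2. Install mpich
Step 1: Install the MPI dependency
We use the mpich packages from the distribution repository. The required packages are mpich and mpich-devel; on a loongnix-server system, run: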
sudo yum install mpich mpich-devel
Step 2: Set the mpich environment variable
Edit /etc/profile and add export PATH=/usr/lib64/mpich/bin:$PATH, then run the following command for the change to take effect.
source /etc/profile
You can check where mpich is installed with rpm -ql mpich; this location is also needed later, when editing the Make file before compiling HPL.
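3. Compile HPL
Compiling HPL first requires a build configuration file named Make.arch_name, for example Make.Linux_la64. This file mainly specifies the locations of the libraries and header files that HPL depends on, which follow from the paths chosen when installing mpich and OpenBLAS above. Place the file in the HPL source directory and run make arch=arch_name to build, for example make arch=Linux_la64.
Step 1: Extract the HPL source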
cd /home/hpc
tar xf hpl-2.2.tar.gz
cd hpl-2.2
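Step 2: Create and configure the Make.Linux_la64 configuration file
The hpl-2.2/setup directory in the HPL source tree contains reference configuration files for several architectures, which can be used as a starting point. On Loongson platforms, the following configuration file can be used as a reference.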
# cat Make.Linux_la64
# ######################################################################
#
# ----------------------------------------------------------------------
# - shell --------------------------------------------------------------
# ----------------------------------------------------------------------
#
SHELL = /bin/sh
#
CD = cd
CP = cp
LN_S = ln -fs
MKDIR = mkdir -p
RM = /bin/rm -f
TOUCH = touch
#
# ----------------------------------------------------------------------
# - Platform identifier ------------------------------------------------
# ----------------------------------------------------------------------
#
ARCH = Linux_la64
#
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
# TODO: set TOPdir to the actual path of your HPL source directory
TOPdir = /home/hpc/hpl-2.2
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
#
HPLlib = $(LIBdir)/libhpl.a
#
# ----------------------------------------------------------------------
# - Message Passing library (MPI) --------------------------------------
# ----------------------------------------------------------------------
# MPinc tells the C compiler where to find the Message Passing library
# header files, MPlib is defined to be the name of the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
MPdir = /usr/lib64/mpich
MPinc = -I$(MPdir)/include
MPlib = $(MPdir)/lib/libmpi.so
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS or VSIPL) -----------------------------
# ----------------------------------------------------------------------
# LAinc tells the C compiler where to find the Linear Algebra library
# header files, LAlib is defined to be the name of the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
#
LAdir = /opt/openblas
LAinc = $(LAdir)/include
LAlib = $(LAdir)/lib/libopenblas.a
#
# ----------------------------------------------------------------------
# - F77 / C interface --------------------------------------------------
# ----------------------------------------------------------------------
# You can skip this section if and only if you are not planning to use
# a BLAS library featuring a Fortran 77 interface. Otherwise, it is
# necessary to fill out the F2CDEFS variable with the appropriate
# options. **One and only one** option should be chosen in **each** of
# the 3 following categories:
#
# 1) name space (How C calls a Fortran 77 routine)
#
# -DAdd_ : all lower case and a suffixed underscore (Suns,
# Intel, ...), [default]
# -DNoChange : all lower case (IBM RS6000),
# -DUpCase : all upper case (Cray),
# -DAdd__ : the FORTRAN compiler in use is f2c.
#
# 2) C and Fortran 77 integer mapping
#
# -DF77_INTEGER=int : Fortran 77 INTEGER is a C int, [default]
# -DF77_INTEGER=long : Fortran 77 INTEGER is a C long,
# -DF77_INTEGER=short : Fortran 77 INTEGER is a C short.
#
# 3) Fortran 77 string handling
#
# -DStringSunStyle : The string address is passed at the string loca-
# tion on the stack, and the string length is then
# passed as an F77_INTEGER after all explicit
# stack arguments, [default]
# -DStringStructPtr : The address of a structure is passed by a
# Fortran 77 string, and the structure is of the
# form: struct {char *cp; F77_INTEGER len;},
# -DStringStructVal : A structure is passed by value for each Fortran
# 77 string, and the structure is of the form:
# struct {char *cp; F77_INTEGER len;},
# -DStringCrayStyle : Special option for Cray machines, which uses
# Cray fcd (fortran character descriptor) for
# interoperation.
#
F2CDEFS = -DAdd__ -DF77_INTEGER=int -DStringSunStyle
#
# ----------------------------------------------------------------------
# - HPL includes / libraries / specifics -------------------------------
# ----------------------------------------------------------------------
#
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) -I$(LAinc) $(MPinc)
HPL_LIBS = $(HPLlib) $(LAlib) $(MPlib)
#
# - Compile time options -----------------------------------------------
#
# -DHPL_COPY_L force the copy of the panel L before bcast;
# -DHPL_CALL_CBLAS call the cblas interface;
# -DHPL_CALL_VSIPL call the vsip library;
# -DHPL_DETAILED_TIMING enable detailed timers;
#
# By default HPL will:
# *) not copy L before broadcast,
# *) call the BLAS Fortran 77 interface,
# *) not display detailed timing information.
#
HPL_OPTS = -DHPL_CALL_CBLAS
#
# ----------------------------------------------------------------------
#
HPL_DEFS = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
#
# ----------------------------------------------------------------------
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------
#
CC = mpicc
CCNOOPT = $(HPL_DEFS)
CCFLAGS = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops -W -Wall -pthread -lm
#
# On some platforms, it is necessary to use the Fortran linker to find
# the Fortran internals used in the BLAS library.
#
LINKER = mpif77
LINKFLAGS = $(CCFLAGS) $(OMP_DEFS)
#
ARCHIVER = ar
ARFLAGS = r
RANLIB = echo
#
# ----------------------------------------------------------------------
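Before building, it can help to confirm that the library and header paths named in Make.Linux_la64 actually exist on the system. The following is a minimal sketch; check_paths is a hypothetical helper, and the paths in the commented example are simply the values of MPdir and LAdir from the configuration above, which may differ on your installation.

```shell
#!/bin/sh
# check_paths PATH...
# Print one status line per path and return non-zero if any path is missing.
# NOTE: check_paths is an illustrative helper, not part of HPL.
check_paths() {
    missing=0
    for p in "$@"; do
        if [ -e "$p" ]; then
            echo "ok       $p"
        else
            echo "MISSING  $p"
            missing=1
        fi
    done
    return "$missing"
}

# Example call with the values from the configuration above:
# check_paths /usr/lib64/mpich/lib/libmpi.so \
#             /opt/openblas/include \
#             /opt/openblas/lib/libopenblas.a
```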
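Step 3: Build the HPL executable
Run the following command to build.
make arch=Linux_la64
After the build finishes, the executable xhpl and the test configuration file HPL.dat are generated in the hpl-2.2/bin/Linux_la64 directory.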
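4. Run the HPL test
Step 1: Edit the HPL.dat configuration file
Getting the best performance on a given hardware platform requires a suitable HPL.dat. In the reference configuration below, the more important lines are annotated.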
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
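171500 Ns # Set Ns according to the memory size: N^2 = 0.9 * memory_size(Bytes) / 8
1 # of NBs
250 NBs # Matrix block size; depends on hardware performance; a multiple of cacheline/8 is recommended
0 PMAP process mapping (0=Row-,1=Column-major)
2 # of process grids (P x Q), i.e. how many P x Q combinations to test; here there are 2, namely 1x4 and 2x2. To avoid the performance loss caused by threads communicating across NUMA nodes, run one process per NUMA node, i.e. P x Q = number of NUMA nodes. Recommended values: 3A5000: 1x1; single-socket 3C5000L: 2x2; dual-socket 3C5000L: 2x4; single-socket 3C5000: 1x1; dual-socket 3C5000: 1x2; quad-socket 3C5000: 2x2; single-socket 3D5000: 1x2; dual-socket 3D5000: 2x2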
1 2 Ps
4 2 Qs
16.0 threshold
1 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
2 4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
0 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
0 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
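The Ns formula and the P x Q recommendations annotated above can be sketched as two small shell helpers. This is an illustration under stated assumptions, not part of HPL: the names hpl_problem_size and hpl_grid are hypothetical, rounding N down to a multiple of NB is an added convention (the sample value 171500 happens to be a multiple of 250), and hpl_grid simply picks the most nearly square grid with P <= Q, which reproduces the grids recommended above for 1, 2, 4 and 8 processes.

```shell
#!/bin/sh
# hpl_problem_size MEM_BYTES NB
# N = floor(sqrt(0.9 * MEM_BYTES / 8)), rounded down to a multiple of NB.
# NOTE: the rounding step is an assumption, not stated in the text above.
hpl_problem_size() {
    awk -v mem="$1" -v nb="$2" 'BEGIN {
        n = int(sqrt(0.9 * mem / 8))   # 8 bytes per double-precision element
        n = int(n / nb) * nb           # round down to a multiple of NB
        printf "%d\n", n
    }'
}

# hpl_grid NP
# Print "P Q" such that P * Q = NP, P <= Q, and P is as large as possible.
hpl_grid() {
    np=$1
    p=1
    i=1
    while [ $((i * i)) -le "$np" ]; do
        if [ $((np % i)) -eq 0 ]; then
            p=$i
        fi
        i=$((i + 1))
    done
    echo "$p $((np / p))"
}

# Example: total memory in bytes can be read from /proc/meminfo on Linux.
# mem_bytes=$(awk '/^MemTotal:/ {printf "%.0f", $2 * 1024}' /proc/meminfo)
# hpl_problem_size "$mem_bytes" 250
# hpl_grid 8
```

The printed values still have to be copied into the Ns, Ps and Qs lines of HPL.dat by hand.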
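Step 2: Set the number of OpenBLAS threads
OPENBLAS_NUM_THREADS specifies the number of threads started by each OpenBLAS process. To avoid the performance loss caused by threads communicating across NUMA nodes, assign one process to each NUMA node and set the thread count to the number of cores per NUMA node. The settings for each hardware platform are as follows.
3A5000 and 3C5000L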
export OPENBLAS_NUM_THREADS=4
3C5000 and 3D5000
export OPENBLAS_NUM_THREADS=16
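Step 3: Set the number of huge pages available to the system
Set the number of huge pages to the total number of threads plus the number of NUMA nodes (each OpenBLAS process allocates one extra buffer to support user-defined threading models). Run the command as the root user so that the setting does not fail. The settings for each hardware platform are as follows:
3A5000
echo 5 > /proc/sys/vm/nr_hugepages
Single-socket 3C5000L
echo 20 > /proc/sys/vm/nr_hugepages
Dual-socket 3C5000L
echo 40 > /proc/sys/vm/nr_hugepages
Single-socket 3C5000
echo 17 > /proc/sys/vm/nr_hugepages
Dual-socket 3C5000
echo 34 > /proc/sys/vm/nr_hugepages
Quad-socket 3C5000
echo 68 > /proc/sys/vm/nr_hugepages
Single-socket 3D5000
echo 34 > /proc/sys/vm/nr_hugepages
Dual-socket 3D5000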
echo 68 > /proc/sys/vm/nr_hugepages
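The per-platform values above all follow from the rule "total threads plus NUMA nodes", with one process per NUMA node. A minimal sketch of that arithmetic is shown below; the helper name hugepages_needed is hypothetical, and the write to /proc is left commented out because it requires root.

```shell
#!/bin/sh
# hugepages_needed NUMA_NODES THREADS_PER_NODE
# total threads (nodes * threads per node) + one extra page per NUMA node.
# NOTE: hugepages_needed is an illustrative helper, not a system command.
hugepages_needed() {
    nodes=$1
    threads_per_node=$2
    echo $((nodes * threads_per_node + nodes))
}

# Example for a single-socket 3C5000L (4 NUMA nodes, 4 threads each), as root:
# hugepages_needed 4 4 > /proc/sys/vm/nr_hugepages
```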
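Step 4: Run the test
For a single process, run ./xhpl directly; for multiple processes, use mpirun to launch xhpl, with the process count equal to the number of NUMA nodes. Run the test as the root user so that huge page allocation does not fail. The commands for each platform are as follows.
3A5000
cd bin/Linux_la64
./xhpl
Single-socket 3C5000L
cd bin/Linux_la64
mpirun -np 4 ./xhpl
Dual-socket 3C5000L
cd bin/Linux_la64
mpirun -np 8 ./xhpl
Single-socket 3C5000
cd bin/Linux_la64
./xhpl
Dual-socket 3C5000
cd bin/Linux_la64
mpirun -np 2 ./xhpl
Quad-socket 3C5000
cd bin/Linux_la64
mpirun -np 4 ./xhpl
Single-socket 3D5000
cd bin/Linux_la64
mpirun -np 2 ./xhpl
Dual-socket 3D5000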
cd bin/Linux_la64
mpirun -np 4 ./xhpl
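The choice between ./xhpl and mpirun in the list above depends only on the number of NUMA nodes. It can be sketched as a small helper that prints the command to run; hpl_launch_cmd is a hypothetical name, and the function only prints the command rather than executing it.

```shell
#!/bin/sh
# hpl_launch_cmd NUMA_NODES
# Print the launch command: plain ./xhpl for one process,
# mpirun with one process per NUMA node otherwise.
# NOTE: hpl_launch_cmd is an illustrative helper, not part of HPL or mpich.
hpl_launch_cmd() {
    if [ "$1" -le 1 ]; then
        echo "./xhpl"
    else
        echo "mpirun -np $1 ./xhpl"
    fi
}

# Example: print the command for a dual-socket 3C5000L (8 NUMA nodes).
# hpl_launch_cmd 8
```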