Embecosm Pine64 Cluster
Currently Embecosm has 4 Pine64 boards and has ordered 12 more.
|Single Board Computer ||Pine64||https://www.pine64.com/product|
|USB power (hub)||Anear 6 port USB hub||https://www.amazon.co.uk/dp/B016EY61XU/ref=pe_385721_144310291_TE_dp_1|
|Micro SD Cards ||SanDisk 16GB micro SD Card||https://www.amazon.co.uk/dp/B012VKUSIA/ref=pe_385721_144310291_TE_dp_2|
It's been suggested that we use a hard drive instead of individual SD cards.
Currently each board has its own SD card, but they all share a drive on one of our servers, which is used for running MPI programs.
At the moment we have connected all 4 with spacers and printed some red feet and a top.
It's been suggested that we build a case that looks like a Cray-1.
To run an MPI program such as this hello world, we put its .c file in the pines directory.
Then we need a hostfile to tell Open MPI which nodes to run on. For example, my .mpihostfile:
# Hostfile for OpenMPI
# The following slave nodes are single processor machines:
pine1 slots=4
pine2 slots=4
pine3 slots=4
pine4 slots=4
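As a quick sanity check on a hostfile like this one, the following Python sketch adds up the slots to show how many processes it allows. The function name `total_slots` is made up for illustration, and counting a host with no `slots=` entry as 1 is a simplifying assumption of the sketch, not a statement about Open MPI's behaviour.

```python
def total_slots(hostfile_text):
    """Sum the slots= values over the host lines of a hostfile."""
    total = 0
    for line in hostfile_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding spaces
        if not line:
            continue  # skip blank and comment-only lines
        slots = 1  # assumption of this sketch: no slots= entry counts as 1
        for field in line.split()[1:]:
            if field.startswith("slots="):
                slots = int(field.split("=", 1)[1])
        total += slots
    return total


HOSTFILE = """\
# Hostfile for OpenMPI
pine1 slots=4
pine2 slots=4
pine3 slots=4
pine4 slots=4
"""

print(total_slots(HOSTFILE))
```

For the four-board hostfile this gives 16, which matches the "out of 16 processors" in the hello world output.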
Now we compile the C code:
ubuntu@pine1:~/pines$ mpicc hello.c
Then to run it:
ubuntu@pine1:~/pines$ mpirun --hostfile .mpihostfile ./a.out
You should then get:
Hello world from processor pine2, rank 4 out of 16 processors
Hello world from processor pine1, rank 0 out of 16 processors
Hello world from processor pine2, rank 5 out of 16 processors
Hello world from processor pine1, rank 1 out of 16 processors
Hello world from processor pine2, rank 6 out of 16 processors
Hello world from processor pine1, rank 2 out of 16 processors
Hello world from processor pine2, rank 7 out of 16 processors
Hello world from processor pine1, rank 3 out of 16 processors
Hello world from processor pine3, rank 8 out of 16 processors
Hello world from processor pine3, rank 9 out of 16 processors
Hello world from processor pine3, rank 10 out of 16 processors
Hello world from processor pine3, rank 11 out of 16 processors
Hello world from processor pine4, rank 12 out of 16 processors
Hello world from processor pine4, rank 13 out of 16 processors
Hello world from processor pine4, rank 14 out of 16 processors
Hello world from processor pine4, rank 15 out of 16 processors
After logging in to a pine:
ubuntu@pine1:~$ cd pines/hpl-2.2/bin/pine/
will take you to the correct directory.
Inside it you should find the executable "xhpl" and its input file, "HPL.dat". To tune the .dat file I used this website. Once you have the right HPL.dat file and .mpihostfile, you can run the benchmark using:
ubuntu@pine1:~$ mpirun --hostfile .mpihostfile ./xhpl
Linpack on 4 boards
We have run Linpack on our cluster. With all 4 nodes running their maximum of four cores each, we can reach 5.7 GFlops. This would have placed us on the Top500 until 1997.
But why stop there? We also ran the Linpack benchmark for every combination of node count and cores per node.
I then plotted GFlops against the number of cores:
We can see that the improvement is almost perfectly linear, though unfortunately well short of the ideal, modelled in red. Extrapolating to 64 cores, for the future 16 Pine64s, gives 22.5 GFlops, a loss of 6.5 GFlops from the ideal.
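To put the extrapolation in proportion, here is a small Python sketch that recovers the ideal figure from the two numbers quoted above and expresses the extrapolated result as a fraction of it. The function name is hypothetical and the only inputs are the 22.5 GFlops and 6.5 GFlops from the text.

```python
def ideal_and_fraction(extrapolated_gflops, loss_gflops):
    """Return (ideal GFlops, extrapolated result as a fraction of the ideal)."""
    ideal = extrapolated_gflops + loss_gflops
    return ideal, extrapolated_gflops / ideal


ideal, fraction = ideal_and_fraction(22.5, 6.5)
print(f"ideal = {ideal:.1f} GFlops, extrapolated = {fraction:.1%} of ideal")
```

So the ideal line implies 29 GFlops at 64 cores, and the extrapolation reaches roughly three quarters of that.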
From this we can see that the best configuration for 6 cores is 3 nodes running 2 cores each rather than 2 nodes running 3. Things get odder when we look at a total of 4 cores. The same pattern holds between 4 nodes with 1 core each and 1 node with 4 cores, but 2 nodes with 2 cores each turns out to be the optimum configuration for a total of 4 cores.
From the previous data we can determine the GFlops per core in use, as shown above. Going from 1 board with 1 core to 4 boards with 4 cores each, you lose 27.8% of each core's GFlops. We can also see that you lose the least GFlops per core by increasing only the number of boards.
After analysing the data we can see that by far the worst change for GFlops per core is going from 1 core to 2 cores on a single board. We can also see that running 4 cores as 2 cores on each of 2 boards is just as efficient as running 3 boards with 1 core each.
Then plotting the points:
From this I predict that the GFlops per core for 64 cores will be 0.31, which differs from the predicted total divided by 64 (22.5/64 ≈ 0.35) by about 0.04 GFlops.
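The comparison between the two predictions is simple arithmetic, sketched here in Python. It uses only the 22.5 GFlops total and the 0.31 GFlops per core quoted in the text; the variable names are mine.

```python
CORES = 64
predicted_total_gflops = 22.5      # extrapolated from the GFlops plot
predicted_per_core_gflops = 0.31   # extrapolated from the GFlops-per-core plot

# Per-core figure implied by the total, and the gap between the two predictions.
per_core_from_total = predicted_total_gflops / CORES
gap = per_core_from_total - predicted_per_core_gflops

print(f"{per_core_from_total:.3f} GFlops per core from the total, gap {gap:.3f}")
```

The two extrapolations disagree by roughly 0.04 GFlops per core, which over 64 cores is a difference of a couple of GFlops in the total.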
To measure the power we have a power-saving meter socket connected to all the pines, which I read at three random times during each iteration and averaged. I had forgotten to measure the power of the switch as well, so I measured it on its own for each iteration. The switch's power usage barely changes: it idles at 4.7W, occasionally peaking at 5.4W when lots of cores are in use. I therefore decided to simply add 4.7W to the power for each board configuration. With more boards and cores we will probably see the switch's power usage increase, so next time it will need to be included in the reading.
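The calculation described above can be sketched in Python as follows. The 4.7W switch figure comes from the text; the function name and the sample readings are hypothetical and are not measurements from the cluster.

```python
SWITCH_IDLE_W = 4.7  # switch idle power quoted in the text


def mflops_per_watt(gflops, readings_w, switch_w=SWITCH_IDLE_W):
    """Average the meter readings, add the switch, and return MFlops per Watt."""
    mean_w = sum(readings_w) / len(readings_w)
    return gflops * 1000.0 / (mean_w + switch_w)


# Hypothetical run: 1.5 GFlops with three meter readings taken during the run.
example = mflops_per_watt(1.5, [5.2, 5.3, 5.4])
print(f"{example:.1f} MFlops/W")
```

Because the switch is added as a constant, it weighs most heavily on the small configurations, where the boards themselves draw little power.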
For 1 board, increasing the number of cores in use to two increases its efficiency by 70%.
Coming back to the 4 core conundrum, where we only use a total of 4 cores across the boards: although 1 board with 4 cores is the worst combination for raw Flops, it is by far the best combination for MFlops/Watt, showing that the overhead of running 3 extra boards is huge. Going to 4 boards with 1 core each would increase power usage by 39%, and going to 2 boards with 2 cores each by only 14%, an increase of 0.0054p per hour or 0.0019p per hour respectively. That would be a loss of up to 47.30p a year for all of 0.091 extra GFlops.
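The yearly figure follows from the hourly one if we assume the cluster runs flat out for every hour of a 365-day year, which is an assumption of this sketch. In Python:

```python
HOURS_PER_YEAR = 24 * 365  # assumes the cluster runs continuously all year


def yearly_cost_pence(extra_pence_per_hour):
    """Scale an hourly running cost in pence up to a full year."""
    return extra_pence_per_hour * HOURS_PER_YEAR


for label, pence_per_hour in [("4 boards x 1 core", 0.0054),
                              ("2 boards x 2 cores", 0.0019)]:
    print(f"{label}: {yearly_cost_pence(pence_per_hour):.2f}p per year")
```

The 0.0054p per hour case reproduces the 47.30p a year quoted above.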
After plotting the graph we can see its natural-log-like tendencies; in fact y=88ln(x+1) almost directly overlays the data. This suggests that MFlops per Watt will not carry on changing linearly as the GFlops per core has been, and will instead level off. Extrapolating, with 64 cores we would get 367 MFlops per Watt.
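The fitted curve is easy to evaluate directly. This Python sketch uses the y=88ln(x+1) model from the text, with a function name of my choosing, and prints the value at a few core counts to show how each doubling adds less in relative terms.

```python
import math


def mflops_per_watt_model(cores):
    """Evaluate the fitted curve y = 88 * ln(x + 1) from the text."""
    return 88.0 * math.log(cores + 1)


for cores in (16, 32, 64):
    print(cores, round(mflops_per_watt_model(cores), 1))
```

At 64 cores the model gives about 367 MFlops per Watt, the figure quoted above. A logarithm never reaches a true maximum, but each doubling of cores adds roughly the same fixed amount (88 ln 2, about 61 MFlops per Watt), so the curve flattens quickly.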
Linpack on 16 boards
After receiving a dozen more boards and building them into two 'trees' of eight pines, I reran Linpack. We had expected to get 22.5 GFlops, but unfortunately we only got 20 GFlops. This also took 3 hours instead of the usual hour and a bit. So, after a very unsuccessful weekend of benchmarking, it was decided to reduce the memory given to each node from 1750MB to 50MB. This successfully sped up the benchmarking, as every iteration now took only a couple of minutes.
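To get a rough feel for why cutting the memory speeds things up so much, here is a Python sketch based on a common rule of thumb: HPL factorises an N by N matrix of 8-byte doubles, and the work grows roughly with N cubed. The sketch assumes the whole per-node figure goes to the matrix and that MB means MiB. Neither is stated in the text, so the problem sizes actually used in HPL.dat may well have been different.

```python
import math

BYTES_PER_DOUBLE = 8


def hpl_problem_size(nodes, mem_per_node_mib):
    """Largest N such that an N x N matrix of doubles fits in the given memory."""
    total_bytes = nodes * mem_per_node_mib * 1024 * 1024
    return math.isqrt(total_bytes // BYTES_PER_DOUBLE)


def relative_work(n_big, n_small):
    """Ratio of leading-order HPL work, which scales as N cubed."""
    return (n_big / n_small) ** 3


n_before = hpl_problem_size(16, 1750)
n_after = hpl_problem_size(16, 50)
print(n_before, n_after, round(relative_work(n_before, n_after)))
```

Under these assumptions the smaller memory setting shrinks N by a factor of about six, and the amount of arithmetic by a factor in the hundreds. Smaller problems usually also report lower GFlops, so results at the two memory settings should not be compared directly.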
I reran some of the programs from pine16 after the first run produced awful results; however, some still did not improve, so I left the original results in. I'm very sceptical of the later half of each group's iterations. On the other hand, we would expect the rate of performance increase to fall as the amount of communication needed grows. The runs were ordered by number of boards, so 14 boards using 3 cores each was followed by 14 boards using 2 cores each; this might also explain why each line follows a similar path.
For almost all runs of up to 10 cores the single core per board line is above the rest, showing that it's more performance efficient to use as many boards as possible. This trend carries on after 10 cores, with 2 cores per board taking over. The pattern continues again with 3 cores per board and with 4.
We can almost start to see how the line is becoming non-linear, but only just.
To measure energy data I sat recording the power at random points throughout each benchmark and averaged the readings. I then plotted the results, which include the switch:
Here the pattern is less visible, though that might be because of noise in the results. We can still see that, within its own section, each cores-per-board line is the most efficient, up until 40 cores.
An odd thing happens at 30 cores in both graphs: all 3 lines almost join.
Comparing the two graphs, we can spot points that raise total GFlops but decrease energy efficiency. For example, from 20 to 24 cores with 4 cores per board the total GFlops increase is consistent, but the energy efficiency over this range decreases substantially. 4 cores per board seems to be a repeat offender in this efficiency versus total mismatch: between 36 and 40 cores the efficiency increases although the total GFlops plummets.
In conclusion, with 16 boards the most energy efficient and performance efficient way of running the benchmark seems to be to use as many boards as you can, up until about 40 cores, where results get a little unreliable.