# Data Collection and Manipulation on the Command Line

Author: Teddy Xinyuan Chen
## Why
NCSU's HPC cluster, Arc, uses Slurm to allocate compute nodes to users, and you only get 6 hours each time. There are fancy GPU nodes like the `a100` and `a6000`, but sometimes those are taken, and you have to run `sinfo` to figure out which available node has the best GPU, which is too much work for someone like me to do frequently.

So I made this to automate the task, and had fun playing with unstructured data, CSV files, a SQLite database, and writing fancy SQL. :)
## Demo
## The data

### `sinfo` output
```text
PARTITION     AVAIL  TIMELIMIT   NODES  STATE  NODELIST
...
a4000         up     1-00:00:00      2  alloc  c[5,35]
a4000         up     1-00:00:00      4  idle   c[4,6-7,36]
a6000         up     8-00:00:00      2  alloc  c[30-31]
a100          up     1-00:00:00      1  alloc  c33
rtx4060ti8g   up     1-00:00:00      5  alloc  c[20-22,26-27]
rtx4060ti8g   up     1-00:00:00      2  idle   c[28-29]
rtx4060ti16g  up     1-00:00:00      1  alloc  c25
```
Columns are delimited by one or more whitespace characters, and the last column contains commas (`,`). In the next section I'll demonstrate how to convert this output into a nice CSV file.
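For example, here is the first data row from the dump above and the CSV row the conversion should produce; the comma-containing `NODELIST` field has to be quoted so CSV parsers don't split it:

```text
a4000         up     1-00:00:00      2  alloc  c[5,35]
a4000,up,1-00:00:00,2,alloc,"c[5,35]"
```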
## The scripts

### Directory structure

```text
.
├── data
│   ├── ncsu-arc.db       # ignore this
│   ├── nvidia-gpus.csv
│   └── sinfo.csv         # ignore this
├── metadata.yml
└── scripts
    ├── get-best-gpu-node-name.sh
    ├── get-sinfo-csv.sh
    └── publish-sinfo-datasette.sh  # ignore this
```
### Collecting `sinfo` output into a CSV file
Play with the `sinfo` nodes table here: https://ncsu-arc-datasette.teddysc.me/ncsu-arc/nodes
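Datasette also exposes every table over a JSON API, so the published data can be queried programmatically too. A minimal sketch, assuming the instance runs with default Datasette settings:

```bash
# fetch the idle rows from the published nodes table as a JSON array
curl -s 'https://ncsu-arc-datasette.teddysc.me/ncsu-arc/nodes.json?STATE=idle&_shape=array' |
    jq -r '.[].PARTITION'
```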
```bash
#!/usr/bin/env bash
# get-sinfo-csv.sh

# sourcing for ncsu_vpn_connect
# source ~/config/system/.alias-vpn
ncsu_vpn_connect() {
    # if this command doesn't fail, then it's connected
    # /opt/cisco/anyconnect/bin/vpn -s state | command grep 'state: Connected'
    # if it's not connected, then connect
    if [[ -n $(/opt/cisco/anyconnect/bin/vpn -s state | command grep 'state: Connected') ]]; then
        echo "Already connected to NCSU VPN"
        return 0
    fi
    # curious about this script?
    # read more here: https://teddysc.me/blog/ncsu-vpn-connect
    /usr/local/bin/python3 ~/config/scripts/ncsu_vpn_connect.py | /opt/cisco/anyconnect/bin/vpn -s connect vpn.ncsu.edu
}
ncsu_vpn_connect >/dev/null 2>&1

# earlier approach: fetch first, then edit in place with massedit
# ssh arc sinfo > sinfo.csv
# massedit -e 're.sub(" +", ",", line)' sinfo.csv -w
# massedit -e 're.sub(",(c\[?.*$)", ",\"\\1\"", line)' sinfo.csv -w

# current approach: transform on the remote side in one pass --
# the first sed collapses runs of spaces into commas, the second
# quotes the NODELIST field (starts with "c", can contain commas)
ssh arc "sinfo | sed -E 's/ +/,/g' | sed -E 's/,(c.*$)/,\"\1\"/'"
```
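If you want to sanity-check the `sed` pipeline without SSHing into the cluster, you can run the same two substitutions against a saved dump; `sinfo.txt` here is a hypothetical local copy of the raw output shown earlier:

```bash
# collapse runs of spaces to commas, then quote the NODELIST field
sed -E 's/ +/,/g' sinfo.txt | sed -E 's/,(c.*$)/,"\1"/'
```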
### Creating `nvidia-gpus.csv` using an LLM

I copied the relevant part of the Arc documentation page as context and prompted GPT-4 with this:
```text
...
convert the info above to a csv file, output in a codeblock, don't output anything else.
fields:
1. 'name', choose from :
"""
...
a4000
a6000
a100
rtx4060ti8g
rtx4060ti16g
"""
2. VRAM: integer
```
And I easily got a nice CSV table of GPU VRAM info.
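The file ends up looking roughly like this (illustrative values based on the standard VRAM specs of these cards; the real file may differ, e.g. the A100 also ships in an 80 GB variant):

```text
name,VRAM
a4000,16
a6000,48
a100,40
rtx4060ti8g,8
rtx4060ti16g,16
```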
### Ingesting `sinfo` and NVIDIA GPU data into a SQLite database

Everything is done in a temp dir. First I get `sinfo.csv` using the script above, then insert the data into two tables and link them together with a foreign key:
```bash
#!/usr/bin/env bash
# get-best-gpu-node-name.sh
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
SCRIPTS_DIR="$(cd "$(dirname "$0")" && pwd)"
DATA_DIR="$(realpath "$SCRIPTS_DIR/../data")"
"${SCRIPTS_DIR}"/get-sinfo-csv.sh > sinfo.csv
ln "${DATA_DIR}"/nvidia-gpus.csv .
# -d to detect numerical cols
sqlite-utils insert ncsu-arc.db nodes sinfo.csv --csv -d >/dev/null 2>&1
sqlite-utils insert ncsu-arc.db nvidia-gpus nvidia-gpus.csv --csv --pk name -d >/dev/null 2>&1
# link each node row to its GPU's spec row
sqlite-utils add-foreign-key ncsu-arc.db nodes PARTITION nvidia-gpus name
# echo "$(realpath ncsu-arc.db)"
# of all partitions that currently have idle nodes, pick the one
# whose GPU has the most VRAM
sqlite-utils query ncsu-arc.db "SELECT DISTINCT
  [nvidia-gpus].name
FROM
  [nvidia-gpus]
  JOIN nodes ON [nvidia-gpus].name = nodes.PARTITION
WHERE
  nodes.STATE = 'idle'
ORDER BY
  [nvidia-gpus].VRAM DESC
LIMIT 1;
" | jq '.[].name' -r
```
CSV output:
```text
PARTITION,AVAIL,TIMELIMIT,NODES,STATE,NODELIST
...
a4000,up,1-00:00:00,3,alloc,"c[4-5,35]"
a4000,up,1-00:00:00,3,idle,"c[6-7,36]"
a6000,up,8-00:00:00,2,alloc,"c[30-31]"
a100,up,1-00:00:00,1,alloc,"c33"
rtx4060ti8g,up,1-00:00:00,5,alloc,"c[20-22,26-27]"
rtx4060ti8g,up,1-00:00:00,2,idle,"c[28-29]"
rtx4060ti16g,up,1-00:00:00,1,alloc,"c25"
```
### SQL on JOINed tables to get the node name with the highest VRAM

See the last part of the script above. You can also try the query below on the Datasette instance linked earlier.
```sql
SELECT DISTINCT
  [nvidia-gpus].name,
  [nvidia-gpus].VRAM
FROM
  [nvidia-gpus]
  JOIN nodes ON [nvidia-gpus].name = nodes.PARTITION
WHERE
  nodes.STATE = 'idle'
ORDER BY
  [nvidia-gpus].VRAM DESC
```
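Against the CSV snapshot above, and assuming the VRAM values sketched earlier, only `a4000` and `rtx4060ti8g` have idle nodes, so the query would return something like:

```text
name,VRAM
a4000,16
rtx4060ti8g,8
```

That matches the `salloc` transcript below, where the job lands on partition `a4000`.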
## Result
```bash
ncsu_arc_salloc_best_gpu() {
    # note: the script returns a *partition* name, which is what salloc -p expects
    local BEST_NODE_NAME
    BEST_NODE_NAME="$(~/testdir/ncsu-arc/scripts/get-best-gpu-node-name.sh)"
    ssh arc -t "salloc -p ${BEST_NODE_NAME}"
}
```
Now a simple `ncsu_arc_salloc_best_gpu` command drops me onto the node with the best available GPU in the cluster. :)
```console
$ ncsu_arc_salloc_best_gpu
salloc: slurm_job_submit: interactive job submitted by user_id:teddysc.me to partition:a4000 limited to 6 hours
salloc: Pending job allocation 45727
salloc: job 45727 queued and waiting for resources
salloc: job 45727 has been allocated resources
salloc: Granted job allocation 45727
salloc: Waiting for resource configuration
salloc: Nodes c4 are ready for job
(base) [teddysc.me@c4 ~]$
```
## Things I learned

- `sed` uses a ridiculous flavor of regex (POSIX basic) by default; use `-E` to enable extended regex and regain some sanity. I've fallen for this so many times. Never again!
- Getting dir paths this way:

  ```bash
  SCRIPTS_DIR="$(cd "$(dirname "$0")" && pwd)"
  DATA_DIR="$(realpath "$SCRIPTS_DIR/../data")"
  ```

- When ingesting CSVs into a SQLite database, don't forget to make numerical columns numerical (`-d` in `sqlite-utils insert` handles the detection); see the sketch after this list for what goes wrong otherwise.
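If `VRAM` had been ingested as TEXT, SQLite would compare values lexicographically, and `ORDER BY VRAM DESC` would rank the 8 GB card above the 48 GB one. A minimal demonstration with the stock `sqlite3` CLI:

```bash
# string comparison puts '8' after '48' (returns 1); numeric comparison is correct (returns 0)
sqlite3 :memory: "SELECT '8' > '48' AS text_cmp, 8 > 48 AS numeric_cmp;"
# prints: 1|0
```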