Data Collection and Manipulation on the Command Line

By Teddy Xinyuan Chen

Why

NCSU's HPC cluster, Arc, uses Slurm to allocate compute nodes to users, and each allocation lasts at most 6 hours.

There are fancy GPU nodes like the a100 and a6000, but sometimes those are taken, and you have to run sinfo to figure out which available node has the best GPU, which is too much work for someone like me to do frequently.

So I automated the task, and had fun playing with unstructured data, CSV files, a SQLite database, and writing some fancy SQL. :)

Demo

The data

sinfo output

Play with this data online

PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
...
a4000           up 1-00:00:00      2  alloc c[5,35]
a4000           up 1-00:00:00      4   idle c[4,6-7,36]
a6000           up 8-00:00:00      2  alloc c[30-31]
a100            up 1-00:00:00      1  alloc c33
rtx4060ti8g     up 1-00:00:00      5  alloc c[20-22,26-27]
rtx4060ti8g     up 1-00:00:00      2   idle c[28-29]
rtx4060ti16g    up 1-00:00:00      1  alloc c25

Columns are delimited by one or more spaces, and the last column (NODELIST) contains commas (,), so a naive space-to-comma substitution would break it.

In the next section I'll demonstrate how to convert this output into a nice CSV file.
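
As a preview, here is the sed pipeline the script in the next section runs, applied to one sample line: collapse runs of spaces into commas, then wrap the NODELIST field in double quotes so its embedded commas survive.

# preview of the conversion used by get-sinfo-csv.sh below
echo 'a4000           up 1-00:00:00      2  alloc c[5,35]' \
    | sed -E 's/ +/,/g' \
    | sed -E 's/,(c.*$)/,"\1"/'
# output: a4000,up,1-00:00:00,2,alloc,"c[5,35]"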

The scripts

Directory structure

.
├── data
│  ├── ncsu-arc.db # ignore this
│  ├── nvidia-gpus.csv
│  └── sinfo.csv # ignore this
├── metadata.yml
└── scripts
   ├── get-best-gpu-node-name.sh
   ├── get-sinfo-csv.sh
   └── publish-sinfo-datasette.sh # ignore this

Collecting sinfo output into a CSV file

Play with the sinfo nodes table here: https://ncsu-arc-datasette.teddysc.me/ncsu-arc/nodes

# get-sinfo-csv.sh
#!/usr/bin/env bash

# sourcing for ncsu_vpn_connect
# source ~/config/system/.alias-vpn
ncsu_vpn_connect() {
    # if this command doesn't fail, then it's connected
    # /opt/cisco/anyconnect/bin/vpn -s state | command grep 'state: Connected'
    # if it's not connected, then connect
    if [[ -n $(/opt/cisco/anyconnect/bin/vpn -s state | command grep 'state: Connected') ]]; then
        echo "Already connected to NCSU VPN"
        return 0
    fi
    # curious about this script?
    # read more here: https://teddysc.me/blog/ncsu-vpn-connect
    /usr/local/bin/python3 ~/config/scripts/ncsu_vpn_connect.py | /opt/cisco/anyconnect/bin/vpn -s connect vpn.ncsu.edu
}
# silence the VPN helper entirely; note the order (>/dev/null 2>&1):
# with the reversed order, its stderr would leak into this script's stdout
ncsu_vpn_connect >/dev/null 2>&1
# ssh arc sinfo > sinfo.csv
# massedit -e 're.sub(" +", ",", line)' sinfo.csv -w
# massedit -e 're.sub(",(c\[?.*$)", ",\"\\1\"", line)' sinfo.csv -w
#

ssh arc "sinfo | sed -E 's/ +/,/g' | sed -E 's/,(c.*$)/,\"\1\"/'"

Creating nvidia-gpus.csv using an LLM

I copied the relevant part of the Arc documentation page as context and prompted GPT-4 with this:

...
convert the info above to a csv file, output in a codeblock, don't output anything else.
fields:
1. 'name', choose from :
"""
...
a4000
a6000
a100
rtx4060ti8g
rtx4060ti16g
"""
2. VRAM: integer

And I easily got a nice CSV table of GPU VRAM info.
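
For illustration, the file has this shape; the header matches the prompt, but the VRAM numbers below are my own guesses based on the card names, not the actual values from the Arc docs:

name,VRAM
a4000,16
a6000,48
a100,40
rtx4060ti8g,8
rtx4060ti16g,16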

Ingesting sinfo and NVIDIA GPU data into a SQLite database

Everything is done in a temp dir.

First I generate sinfo.csv using the script above, insert the data into two tables, and link them together with a foreign key:

# get-best-gpu-node-name.sh
#!/usr/bin/env bash

tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

SCRIPTS_DIR="$(cd "$(dirname "$0")" && pwd)"
DATA_DIR="$(realpath "$SCRIPTS_DIR/../data")"

"${SCRIPTS_DIR}"/get-sinfo-csv.sh > sinfo.csv
ln "${DATA_DIR}"/nvidia-gpus.csv .

# -d to detect numerical cols (otherwise every column is imported as TEXT);
# silence both streams so only the final node name reaches this script's stdout
sqlite-utils insert ncsu-arc.db nodes sinfo.csv --csv -d >/dev/null 2>&1
sqlite-utils insert ncsu-arc.db nvidia-gpus nvidia-gpus.csv --csv --pk name -d >/dev/null 2>&1
# link nodes.PARTITION to nvidia-gpus.name so the JOIN below is backed by a real foreign key
sqlite-utils add-foreign-key ncsu-arc.db nodes PARTITION nvidia-gpus name
# echo "$(realpath ncsu-arc.db)"
sqlite-utils query ncsu-arc.db "SELECT DISTINCT
        [nvidia-gpus].name
FROM
        [nvidia-gpus]
        JOIN nodes ON [nvidia-gpus].name = nodes.PARTITION
WHERE
        nodes.STATE = 'idle'
ORDER BY
        [nvidia-gpus].VRAM DESC
LIMIT 1;
" | jq '.[].name' -r

CSV output:

PARTITION,AVAIL,TIMELIMIT,NODES,STATE,NODELIST
...
a4000,up,1-00:00:00,3,alloc,"c[4-5,35]"
a4000,up,1-00:00:00,3,idle,"c[6-7,36]"
a6000,up,8-00:00:00,2,alloc,"c[30-31]"
a100,up,1-00:00:00,1,alloc,"c33"
rtx4060ti8g,up,1-00:00:00,5,alloc,"c[20-22,26-27]"
rtx4060ti8g,up,1-00:00:00,2,idle,"c[28-29]"
rtx4060ti16g,up,1-00:00:00,1,alloc,"c25"

SQL on JOINed tables to get the node name with the highest VRAM

See the last part of the script above.

Also, you can try the query below here.

SELECT DISTINCT
	[nvidia-gpus].name,
	[nvidia-gpus].VRAM
	-- [nvidia-gpus].name
FROM
	[nvidia-gpus]
	JOIN nodes ON [nvidia-gpus].name = nodes.PARTITION
WHERE
	nodes.STATE = 'idle'
ORDER BY
	[nvidia-gpus].VRAM DESC
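
You can also run it locally against the database built by the script; the -t flag makes sqlite-utils print a table instead of JSON:

sqlite-utils query ncsu-arc.db "SELECT DISTINCT [nvidia-gpus].name, [nvidia-gpus].VRAM FROM [nvidia-gpus] JOIN nodes ON [nvidia-gpus].name = nodes.PARTITION WHERE nodes.STATE = 'idle' ORDER BY [nvidia-gpus].VRAM DESC" -t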

Result

ncsu_arc_salloc_best_gpu() {
    local BEST_NODE_NAME="$(~/testdir/ncsu-arc/scripts/get-best-gpu-node-name.sh)"
    ssh arc -t "salloc -p ${BEST_NODE_NAME}"
}

Now a simple ncsu_arc_salloc_best_gpu command drops me onto the node with the best available GPU in the cluster. :)

$ ncsu_arc_salloc_best_gpu

salloc: slurm_job_submit: interactive job submitted by user_id:teddysc.me to partition:a4000 limited to 6 hours
salloc: Pending job allocation 45727
salloc: job 45727 queued and waiting for resources
salloc: job 45727 has been allocated resources
salloc: Granted job allocation 45727
salloc: Waiting for resource configuration
salloc: Nodes c4 are ready for job
(base) [teddysc.me@c4 ~]$ 

Things I learned

  • sed uses basic regular expressions (BRE) by default, where +, ?, (, and ) are literal unless escaped; use -E to enable extended regex (ERE) to regain some sanity (see the example after this list).
    I've fallen for this so many times. Never again!
  • Getting directory paths relative to the script this way:
    SCRIPTS_DIR="$(cd "$(dirname "$0")" && pwd)"
    DATA_DIR="$(realpath "$SCRIPTS_DIR/../data")"
    
  • When ingesting CSVs into a SQLite database, don't forget to make numerical columns numerical (that's what the -d / --detect-types flag of sqlite-utils insert does; otherwise every column becomes TEXT).
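
The sed footgun from the first bullet, in two lines:

echo 'aaa' | sed 's/a+/X/'      # -> aaa  (BRE: + is a literal plus, nothing matches)
echo 'aaa' | sed -E 's/a+/X/'   # -> X    (ERE: + means "one or more")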

See also

Automated Cisco Anyconnect Connect + Duo HOTP Bypass