# Blog Posts Data of This Site

By Teddy Xinyuan Chen
## Why I made these
On the about page I mentioned that I don't really use real dates when adding or updating blog posts. Now that this site has grown this big, I sometimes can't remember what I recently published, so I decided to fix this by making dashboards with real add & mod times.
I gathered all the metadata from the posts, put it in an SQLite database, and enabled FTS, so now I can search and query everything.
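The FTS setup can be sketched with the stdlib `sqlite3` module (the column names and sample row here are taken from the table shown later; the real database has many more columns and indexes the `blog_posts` table itself):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # the real DB is a file served by Datasette

# A standalone FTS5 table over the searchable text columns
con.execute("CREATE VIRTUAL TABLE blog_posts_fts USING fts5(title, summary)")
con.execute(
    "INSERT INTO blog_posts_fts VALUES (?, ?)",
    ("NTIT: Walking Tours", "I went to the Raleigh True Crime Walking Tour"),
)

# Full-text search: MATCH finds any row containing the token
rows = con.execute(
    "SELECT title FROM blog_posts_fts WHERE blog_posts_fts MATCH 'walking'"
).fetchall()
print(rows)  # [('NTIT: Walking Tours',)]
```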
Explore them here!
## Datasette instances
Each contains 2 tables:

- `blog_posts`
- `blog_posts_no_blob`: with BLOB columns removed
### On the web
You can type in an SQL query and get insights from the data. Some ideas:

- Which posts appear in the most commits?
- Does a higher `num_commits` imply longer content (`num_lines`, etc.)? Yes, generally.
- Which post has the largest difference between `last_modified` and `first_added` times?
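For example, the last idea can be written as a query like this (a sketch against a toy table reusing the real column names, run via stdlib `sqlite3`):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE blog_posts_no_blob (path TEXT, first_added INT, last_modified INT)"
)
con.executemany(
    "INSERT INTO blog_posts_no_blob VALUES (?, ?, ?)",
    [
        ("a.mdx", 1709688398, 1709690032),  # edited ~27 minutes after being added
        ("b.mdx", 1709615507, 1709615507),  # never touched after the first commit
    ],
)

# The post with the largest gap between first add and last modification
row = con.execute(
    """
    SELECT path, last_modified - first_added AS lifespan_seconds
    FROM blog_posts_no_blob
    ORDER BY lifespan_seconds DESC
    LIMIT 1
    """
).fetchone()
print(row)  # ('a.mdx', 1634)
```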
Public instance (unpublished draft blog posts not included):
- https://data.teddysc.me
- Sorted by real last mod time (most recent ones at the top): https://teddysc.me/bpdm | Using SQL | Using view
- By real added time: https://teddysc.me/bpda
Private instance:
- You can request access if you've been a good friend of mine. :)
- https://pdata.teddysc.me
- https://teddysc.me/pbpm
- https://teddysc.me/pbpa
### dclient

```bash
# https://dclient.datasette.io/en/latest/queries.html
pipx install dclient
dclient query https://data.teddysc.me/blog-posts-public 'select * from "blog_posts_no_blob" order by first_added desc limit 2'

# using an alias:
# dclient alias add bp https://blog-posts-datasette.teddysc.me/blog-posts-public
# dclient query bp 'select * from "blog_posts_no_blob" order by first_added desc limit 2'
```
```json
[
  {
    "post_url": "https://teddysc.me/blog/data-manipulation-on-the-command-line",
    "image_url": "https://teddysc.me/static/images/ncsu-arc-table-plus-query-screenshot.jpg",
    "images_url": "[\"https://teddysc.me/static/images/ncsu-arc-table-plus-query-screenshot.jpg\"]",
    "path": "data-manipulation-on-the-command-line.mdx",
    "first_added": 1709688398,
    "last_modified": 1709690032,
    "num_commits": 6,
    "num_lines": 235,
    "num_words": 859,
    "num_chars": 7763,
    "images": "[\"/static/images/ncsu-arc-table-plus-query-screenshot.jpg\"]",
    "title": "Data Collection and Manipulation on the Command Line",
    "date": "2024-02-27",
    "lastmod": "2024-02-27",
    "tags": "[\"command-line\", \"tech\", \"sql\", \"data\", \"sqlite\", \"slurm\", \"hpc\", \"featured\"]",
    "draft": 0,
    "summary": "AKA \"How to grab the best GPU instance in a slurm-managed HPC cluster faster than anyone else\"",
    "layout": null
  },
  {
    "post_url": "https://teddysc.me/blog/ntit/walking-tours",
    "image_url": "https://teddysc.me/static/images/vintage-church-photo-from-true-crime-tour.jpg",
    "images_url": "[\"https://teddysc.me/static/images/vintage-church-photo-from-true-crime-tour.jpg\"]",
    "path": "ntit/walking-tours.mdx",
    "first_added": 1709615507,
    "last_modified": 1709615507,
    "num_commits": 1,
    "num_lines": 27,
    "num_words": 158,
    "num_chars": 1138,
    "images": "[\"/static/images/vintage-church-photo-from-true-crime-tour.jpg\"]",
    "title": "NTIT: Walking Tours",
    "date": "2024-02-15",
    "lastmod": "2024-02-15",
    "tags": "[\"ntit\", \"raleigh\"]",
    "draft": 0,
    "summary": "I went to the Raleigh True Crime Walking Tour ($) & and the history walking tour",
    "layout": null
  }
]
```
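In these records the `first_added` / `last_modified` values are Unix timestamps and `tags` is a JSON-encoded string; decoding one record might look like this (record trimmed to the relevant fields):

```python
import json
from datetime import datetime, timezone

record = {
    "first_added": 1709688398,
    "last_modified": 1709690032,
    "tags": "[\"command-line\", \"tech\"]",
}

# Unix timestamp -> aware datetime; the real add time can differ from the
# frontmatter "date" field, which is the whole point of this dashboard
first_added = datetime.fromtimestamp(record["first_added"], tz=timezone.utc)
tags = json.loads(record["tags"])  # JSON string -> Python list
print(first_added.date().isoformat(), tags)  # 2024-03-06 ['command-line', 'tech']
```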
## JSON API
Access controlled: /private/blog-posts.json
## Random Blog Post API
https://blog-api.teddysc.me/random
Used by: /random/blog-post & Random GPT
Note: These instances currently run on Vercel's free plan, so expect slow cold starts and poor query performance.
## Technical Details
### Getting real add & mod times
Everything is tracked by Git, including the blog posts' `*.mdx` files.

With this command you can see the date of the commit that added a specific file to the repository (`--diff-filter=A` limits the output to the commit where the file was added):

```bash
git log --diff-filter=A --follow --format=%aD -- $file
```
And I wrote this function to do this with GitPython:
```python
# scripts/utils.py
from git import Repo
from datetime import datetime
from typing import Tuple, List, Optional


def git_get_file_history(
    repo_path: str, file_path: str
) -> Tuple[Optional[datetime], List[datetime]]:
    # example return:
    # (
    #     datetime.datetime(2024, 2, 12, 18, 54, 5, tzinfo=<git.objects.util.tzoffset object at 0x10c224b00>),
    #     [
    #         datetime.datetime(2024, 2, 12, 18, 54, 5, tzinfo=<git.objects.util.tzoffset object at 0x10c224b00>),
    #         datetime.datetime(2024, 2, 12, 20, 35, 3, tzinfo=<git.objects.util.tzoffset object at 0x10c3cdd60>),
    #         datetime.datetime(2024, 2, 13, 20, 9, 4, tzinfo=<git.objects.util.tzoffset object at 0x10c3cdbe0>),
    #         datetime.datetime(2024, 2, 13, 20, 49, 48, tzinfo=<git.objects.util.tzoffset object at 0x10c3cd8e0>)
    #     ]
    # )
    repo = Repo(repo_path)
    first_added = None
    commit_datetimes = []
    commits = list(repo.iter_commits(paths=file_path))
    if not commits:
        return None, []
    for commit in commits[::-1]:  # Iterate from the earliest to the most recent commit
        commit_datetime = commit.authored_datetime
        commit_datetimes.append(commit_datetime)
        if not first_added:
            # Check if the file was added in this commit by looking at diffs with the first parent
            diffs = (
                commit.parents[0].diff(commit)
                if commit.parents
                else commit.diff("HEAD")
            )
            for diff in diffs:
                if diff.change_type == "A" and diff.b_path == file_path:
                    first_added = commit_datetime
                    break
    if (not first_added) and commit_datetimes:
        # If the file was not added in any commit, it must have been added before the first commit
        first_added = commit_datetimes[0]
    return first_added, commit_datetimes
```
### Getting metadata in the frontmatter

The `python-frontmatter` module makes this very easy.
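The idea it implements is roughly this (a minimal stdlib sketch, not the real library code; `python-frontmatter` uses a proper YAML parser, so unlike this naive version it also handles lists like `tags`, nested values, and typed dates):

```python
def parse_frontmatter(text: str):
    """Split an .mdx file into (metadata dict, body); naive 'key: value' lines only."""
    if not text.startswith("---\n"):
        return {}, text
    # header is everything between the opening and closing '---' delimiters
    header, _, body = text[4:].partition("\n---\n")
    meta = {}
    for line in header.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")  # split on the FIRST colon only
            meta[key.strip()] = value.strip()
    return meta, body.lstrip("\n")

meta, body = parse_frontmatter(
    "---\ntitle: NTIT: Walking Tours\ndate: 2024-02-15\n---\nI went to..."
)
print(meta)  # {'title': 'NTIT: Walking Tours', 'date': '2024-02-15'}
```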