Blog Posts Data of This Site

Why I made these

On the about page I mentioned that I don't really use real dates when adding or updating blog posts. Now that this site has grown this big, I sometimes can't remember what I recently published, so I decided to fix that by making dashboards with the real add & mod times.

I gathered all the metadata from the posts, put it in an SQLite database, and enabled FTS so everything is searchable.
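
The ingestion step might look something like this with sqlite-utils (an assumption; the post doesn't say which tool was used, but it pairs naturally with Datasette):

import sqlite_utils

# hypothetical shape of the per-post metadata rows gathered from git + frontmatter
rows = [
    {
        "path": "ntit/walking-tours.mdx",
        "title": "NTIT: Walking Tours",
        "summary": "I went to the Raleigh True Crime Walking Tour ...",
        "first_added": 1709615507,
        "num_words": 158,
    },
]
db = sqlite_utils.Database("blog.db")
db["blog_posts"].insert_all(rows, pk="path", replace=True)
# full-text search over the human-readable columns
db["blog_posts"].enable_fts(["title", "summary"], create_triggers=True)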

Explore them here!

Datasette instances

Each instance contains 2 tables:

  • blog_posts
  • blog_posts_no_blob: the same data with the BLOB columns removed (see the sketch below)
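
The no-blob variant can be derived mechanically; a sketch (not necessarily how it's actually built):

import sqlite_utils

db = sqlite_utils.Database("blog.db")
# copy every non-BLOB column into a slimmer table, since the BLOBs make
# rows heavy without helping most queries
cols = [c.name for c in db["blog_posts"].columns if c.type.upper() != "BLOB"]
db.execute(
    "CREATE TABLE IF NOT EXISTS blog_posts_no_blob AS SELECT "
    + ", ".join(cols)
    + " FROM blog_posts"
)
db.conn.commit()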

On the web

You can type in a SQL query and get insights from the data. Some ideas: most-edited posts, most common tags, or the total word count across the site.

Public instance (unpublished draft blog posts not included):

Private instance:

Using dclient

# https://dclient.datasette.io/en/latest/queries.html
pipx install dclient
dclient query https://data.teddysc.me/blog-posts-public 'select * from "blog_posts_no_blob" order by first_added desc limit 2'
# using alias:
# dclient alias add bp https://blog-posts-datasette.teddysc.me/blog-posts-public
# dclient query bp 'select * from "blog_posts_no_blob" order by first_added desc limit 2'
[
  {
    "post_url": "https://teddysc.me/blog/data-manipulation-on-the-command-line",
    "image_url": "https://teddysc.me/static/images/ncsu-arc-table-plus-query-screenshot.jpg",
    "images_url": "[\"https://teddysc.me/static/images/ncsu-arc-table-plus-query-screenshot.jpg\"]",
    "path": "data-manipulation-on-the-command-line.mdx",
    "first_added": 1709688398,
    "last_modified": 1709690032,
    "num_commits": 6,
    "num_lines": 235,
    "num_words": 859,
    "num_chars": 7763,
    "images": "[\"/static/images/ncsu-arc-table-plus-query-screenshot.jpg\"]",
    "title": "Data Collection and Manipulation on the Command Line",
    "date": "2024-02-27",
    "lastmod": "2024-02-27",
    "tags": "[\"command-line\", \"tech\", \"sql\", \"data\", \"sqlite\", \"slurm\", \"hpc\", \"featured\"]",
    "draft": 0,
    "summary": "AKA \"How to grab the best GPU instance in a slurm-managed HPC cluster faster than anyone else\"",
    "layout": null
  },
  {
    "post_url": "https://teddysc.me/blog/ntit/walking-tours",
    "image_url": "https://teddysc.me/static/images/vintage-church-photo-from-true-crime-tour.jpg",
    "images_url": "[\"https://teddysc.me/static/images/vintage-church-photo-from-true-crime-tour.jpg\"]",
    "path": "ntit/walking-tours.mdx",
    "first_added": 1709615507,
    "last_modified": 1709615507,
    "num_commits": 1,
    "num_lines": 27,
    "num_words": 158,
    "num_chars": 1138,
    "images": "[\"/static/images/vintage-church-photo-from-true-crime-tour.jpg\"]",
    "title": "NTIT: Walking Tours",
    "date": "2024-02-15",
    "lastmod": "2024-02-15",
    "tags": "[\"ntit\", \"raleigh\"]",
    "draft": 0,
    "summary": "I went to the Raleigh True Crime Walking Tour ($) & and the history walking tour",
    "layout": null
  }
]

JSON API

/blog-posts-public.json

Access controlled: /private/blog-posts.json
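
Since these are Datasette endpoints, the database .json URL also accepts a ?sql= parameter. A sketch, assuming the data.teddysc.me host from the dclient example above:

import json
import urllib.parse
import urllib.request

# ask the public instance for the most-edited posts;
# _shape=array returns a plain list of row objects
sql = (
    "select title, num_commits from blog_posts_no_blob "
    "order by num_commits desc limit 5"
)
url = "https://data.teddysc.me/blog-posts-public.json?" + urllib.parse.urlencode(
    {"sql": sql, "_shape": "array"}
)
with urllib.request.urlopen(url) as resp:
    for row in json.load(resp):
        print(row["num_commits"], row["title"])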

Random Blog Post API

https://blog-api.teddysc.me/random

Used by: /random/blog-post & Random GPT
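
A quick way to poke at it (the response shape isn't documented here, so this just fetches the endpoint and prints the raw body):

import urllib.request

with urllib.request.urlopen("https://blog-api.teddysc.me/random") as resp:
    print(resp.read().decode())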

Note

These instances currently run on Vercel's free plan, so slow cold starts and poor query performance are expected.

Technical Details

Getting real add & mod times

Everything is tracked by Git, including the blog posts' *.mdx files.

With this command you can see the author date of the commit that added a specific file to the repository (--diff-filter=A keeps only the commit that added it, and --follow tracks the file across renames):

git log --diff-filter=A --follow --format=%aD -- $file

And I wrote this function to do the same with GitPython:

# scripts/utils.py

from git import NULL_TREE, Repo
from datetime import datetime
from typing import Tuple, List, Optional


def git_get_file_history(
    repo_path: str, file_path: str
) -> Tuple[Optional[datetime], List[datetime]]:
    """Return (first_added, commit_datetimes) for file_path.

    first_added is the authored datetime of the commit that added the file;
    commit_datetimes are the authored datetimes of every commit touching it,
    oldest first. Example return:

        (
            datetime.datetime(2024, 2, 12, 18, 54, 5, tzinfo=<tzoffset>),
            [
                datetime.datetime(2024, 2, 12, 18, 54, 5, tzinfo=<tzoffset>),
                datetime.datetime(2024, 2, 12, 20, 35, 3, tzinfo=<tzoffset>),
                datetime.datetime(2024, 2, 13, 20, 9, 4, tzinfo=<tzoffset>),
                datetime.datetime(2024, 2, 13, 20, 49, 48, tzinfo=<tzoffset>),
            ],
        )
    """
    repo = Repo(repo_path)
    first_added = None
    commit_datetimes = []
    # Note: unlike `git log --follow`, iter_commits(paths=...) does not
    # follow the file across renames.
    commits = list(repo.iter_commits(paths=file_path))
    if not commits:
        return None, []
    for commit in commits[::-1]:  # Iterate from the earliest to the most recent commit
        commit_datetime = commit.authored_datetime
        commit_datetimes.append(commit_datetime)
        if not first_added:
            # Check whether the file was added in this commit by diffing
            # against the first parent (or the empty tree for a root commit)
            diffs = (
                commit.parents[0].diff(commit)
                if commit.parents
                else commit.diff(NULL_TREE)
            )
            for diff in diffs:
                if diff.change_type == "A" and diff.b_path == file_path:
                    first_added = commit_datetime
                    break
    if (not first_added) and commit_datetimes:
        # Fallback: treat the earliest commit touching the file as the add
        # (e.g. when the add was part of a rename we didn't follow)
        first_added = commit_datetimes[0]
    return first_added, commit_datetimes
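
A usage sketch (the repo-relative path prefix is a guess):

# the epoch seconds stored in the database can be derived like this
first_added, history = git_get_file_history(".", "data/blog/ntit/walking-tours.mdx")
if first_added is not None:
    print(int(first_added.timestamp()))    # -> first_added, e.g. 1709615507
    print(int(history[-1].timestamp()))    # -> last_modified
    print(len(history))                    # -> num_commits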

Getting metadata in the frontmatter

The python-frontmatter module makes this very easy.
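
A minimal example (illustrative path; the keys are the frontmatter fields visible in the JSON above):

import frontmatter

post = frontmatter.load("data/blog/ntit/walking-tours.mdx")
print(post["title"], post["date"], post.metadata.get("tags"))
print(post.metadata)            # the whole frontmatter dict
body = post.content             # the MDX body without the frontmatter
print(len(body.split()))        # -> num_words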

See also

Posts tagged data