Miskatonic University Press

Using Llama 3 for scripting help with curl and long podcast URLs

llm unix

Someone, I think it was Ed Summers, mentioned a podcast I didn’t know, specifically Episode #202: Why we can’t think beyond capitalism. - Neoliberalism (Mark Fisher, Capitalist Realism). I knew nothing about the show, but there was an RSS feed (as there is for all true podcasts), so I had a look at that. A podcast’s RSS feed is a clear and easy way to see what the show is like, without a lot of cruft. I thought I’d download episode 202 and give it a listen.

This is the URL for episode 202:

https://pdst.fm/e/chrt.fm/track/47G541/pscrb.fm/rss/p/mgln.ai/e/433/claritaspod.com/measure/traffic.megaphone.fm/QCD6489207092.mp3?updated=1717391748

Wow! That looks very long and rather suspicious. The URL has a lot of domain names (such as chrt.fm) in it. What’s going on? The best tool to find out is curl. Here’s what I ran at the command line, with the output:

$ curl --head "https://pdst.fm/e/chrt.fm/track/47G541/pscrb.fm/rss/p/mgln.ai/e/433/claritaspod.com/measure/traffic.megaphone.fm/QCD6489207092.mp3?updated=1717391748"
HTTP/2 302
date: Wed, 10 Jul 2024 23:24:45 GMT
content-type: text/html; charset=utf-8
access-control-allow-origin: *
location: https://chrt.fm/track/47G541/pscrb.fm/rss/p/mgln.ai/e/433/claritaspod.com/measure/traffic.megaphone.fm/QCD6489207092.mp3?updated=1717391748
via: 1.1 google

The --head option has curl report only the HTTP response headers (mostly metadata) instead of downloading anything. Notice the HTTP 302 response, which is a redirect, and the location line. Together they tell the client (browser or podcast program or whatever) that it should look to this other URL to get what it wants. But if we get that URL …

$ curl --head "https://chrt.fm/track/47G541/pscrb.fm/rss/p/mgln.ai/e/433/claritaspod.com/measure/traffic.megaphone.fm/QCD6489207092.mp3?updated=1717391748"
HTTP/2 302
content-length: 0
location: https://pscrb.fm/rss/p/mgln.ai/e/433/claritaspod.com/measure/traffic.megaphone.fm/QCD6489207092.mp3?updated=1717391748
date: Wed, 10 Jul 2024 23:28:24 GMT
server: uvicorn
set-cookie: _chtbl=a225a87f117649acb2171c1dd6a37ffb; Domain=.chrt.fm; Path=/; SameSite=lax
x-cache: Miss from cloudfront
via: 1.1 4c6036e1a9755ebb992fa03bf694150e.cloudfront.net (CloudFront)
x-amz-cf-pop: YUL62-P2
x-amz-cf-id: qy9MNGbTQ1PvhMtqgaNesEm86ukfgOpUfy7CJ1FmX8XKInKY9bxv3g==

Another location header, and it still has a lot of things in it that look like domain names. It’s going to take a while to get to the bottom of this.
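
The names embedded in the path can be listed without touching the network. Here is a minimal sketch; the function name list_embedded_hosts and the rule for what counts as a host name are assumptions made for this illustration, and the pattern is only a heuristic.

```bash
# List the path segments of a URL that look like host names.
# Assumption: a "host name" is lowercase labels separated by dots,
# ending in a label of at least two letters. This is a heuristic only.
list_embedded_hosts() {
  local url="${1%%[?]*}"   # drop the query string, if there is one
  printf '%s\n' "$url" \
    | tr '/' '\n' \
    | LC_ALL=C grep -E '^[a-z0-9-]+(\.[a-z0-9-]+)*\.[a-z]{2,}$'
}
```

Run on the episode URL, this should print pdst.fm, chrt.fm, pscrb.fm, mgln.ai, claritaspod.com and traffic.megaphone.fm, one per line. The file name is skipped because it contains capital letters and ends in mp3, which is not all letters.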

I wrote myself a little function (for bash, the shell I use), so I could supply a URL and get the value of the location header back.

$ function get_location() {  curl --silent --head "$1" | grep -i '^location:' | sed 's/.* //' | tr -d '\r'; }
$ get_location "https://pdst.fm/e/chrt.fm/track/47G541/pscrb.fm/rss/p/mgln.ai/e/433/claritaspod.com/measure/traffic.megaphone.fm/QCD6489207092.mp3?updated=1717391748"
https://chrt.fm/track/47G541/pscrb.fm/rss/p/mgln.ai/e/433/claritaspod.com/measure/traffic.megaphone.fm/QCD6489207092.mp3?updated=1717391748

Nice!
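
The header parsing can also be split off from the network call so it can be tried on canned input. This is a sketch under some assumptions: extract_location is a hypothetical helper name, the sample headers are made up, and it expects a space after the colon, as in the curl output shown earlier.

```bash
# Read HTTP response headers on stdin and print the value of the first
# Location header. Matching is case-insensitive, and carriage returns
# (HTTP header lines end in CR LF) are stripped first.
extract_location() {
  tr -d '\r' | awk 'tolower($1) == "location:" { print $2; exit }'
}

# Offline demonstration with made-up headers (example.org is a placeholder).
printf 'HTTP/2 302\r\nContent-Location: /ignored\r\nLocation: https://example.org/next\r\n\r\n' \
  | extract_location
```

With a live URL it would be used as curl --silent --head "$url" | extract_location. Comparing the whole field name means a header such as Content-Location is not picked up by mistake.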

Next I wanted to write a short script that would work through all these URLs until it got to the bottom. I spent five or ten minutes on this but ran into some kind of problem with checking if a string was empty or something like that. I thought, “Wait a minute. Simon Willison would use an LLM to do it for him!” So that’s what I tried, with DuckDuckGo’s Llama 3.

I told it:

Write a bash shell function that uses curl to find if a given URL has a location header, and if so, return that location URL.

It gave me:

get_location_header() {
  local url="$1"
  local location=$(curl -s -I -o /dev/null -w "%{http_code} %{redirect_url}" "$url")
  local code=$(echo "$location" | cut -d' ' -f1)
  local redirect_url=$(echo "$location" | cut -d' ' -f2-)

  if [ "$code" -eq 301 -o "$code" -eq 302 ]; then
    echo "$redirect_url"
  else
    echo "No location header found"
  fi
}
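
In that function, the -w option makes curl print the status code and the redirect_url variable once the request is done, and the cut calls split the two apart. Only 301 and 302 are treated as redirects. A possible variation, sketched here with a hypothetical helper called redirect_target, also accepts 303, 307 and 308, and reports "no redirect" through its exit status instead of a message.

```bash
# Take the "CODE URL" string that curl's -w option printed and echo the
# URL if CODE is a redirect. Returns 1 otherwise, so callers can test
# the exit status. redirect_target is an illustrative name only.
redirect_target() {
  local result="$1"
  local code="${result%% *}"     # text before the first space
  local target="${result#* }"    # text after the first space
  case "$code" in
    301|302|303|307|308)
      # An empty target, or no space at all, means there is nothing to follow.
      [ -n "$target" ] && [ "$target" != "$result" ] || return 1
      printf '%s\n' "$target"
      ;;
    *)
      return 1
      ;;
  esac
}
```

It would be called as redirect_target "$(curl -s -I -o /dev/null -w "%{http_code} %{redirect_url}" "$url")".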

I said:

It is possible that running that function on a URL returns a URL that itself returns a location header. Use that function to write a recursive shell script that works through multiple levels of location headers until there are no more.

It gave me:

follow_redirects() {
  local url="$1"
  local max_redirects=10  # prevent infinite loops

  while [ $max_redirects -gt 0 ]; do
    local location=$(get_location_header "$url")
    if [ "$location" != "No location header found" ]; then
      url="$location"
      max_redirects=$((max_redirects - 1))
    else
      echo "Final URL: $url"
      return
    fi
  done

  echo "Error: too many redirects"
}

I tried that and it worked, but not how I wanted. I said:

That works, but make it show each URL as it recurses.

It gave me:

follow_redirects() {
  local url="$1"
  local max_redirects=10  # prevent infinite loops

  while [ $max_redirects -gt 0 ]; do
    echo "Checking URL: $url"
    local location=$(get_location_header "$url")
    if [ "$location" != "No location header found" ]; then
      url="$location"
      max_redirects=$((max_redirects - 1))
    else
      echo "Final URL: $url"
      return
    fi
  done

  echo "Error: too many redirects"
}

I said:

That works nicely.

I'm glad to hear that! The `follow_redirects` function should be helpful for tracing the chain of redirects for a given URL. If you have any more questions or need further assistance, feel free to ask!
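
The loop in follow_redirects decides when to stop by comparing against the string "No location header found". A variation, sketched below, takes the name of the lookup function as an argument and relies on its exit status, which also lets the loop be tried without any network access. The names follow_chain and fake_lookup and the example URLs are hypothetical.

```bash
# Follow a chain of redirects. $1 is the name of a lookup function that
# prints the next URL and returns 0, or returns non-zero when there is
# no redirect. $2 is the starting URL, $3 an optional hop limit.
follow_chain() {
  local lookup="$1" url="$2" hops="${3:-10}" next
  while [ "$hops" -gt 0 ]; do
    printf 'Checking URL: %s\n' "$url"
    if next="$("$lookup" "$url")"; then
      url="$next"
      hops=$((hops - 1))
    else
      printf 'Final URL: %s\n' "$url"
      return 0
    fi
  done
  printf 'Error: too many redirects\n' >&2
  return 1
}

# A stand-in lookup so the loop can be exercised offline.
fake_lookup() {
  case "$1" in
    https://a.example/b.example/file.mp3) printf '%s\n' 'https://b.example/file.mp3' ;;
    *) return 1 ;;
  esac
}

follow_chain fake_lookup "https://a.example/b.example/file.mp3"
```

The demonstration prints two "Checking URL" lines and then "Final URL: https://b.example/file.mp3". Declaring next before assigning to it matters here, because local next=$(…) on one line would hide the lookup's exit status.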

Here’s how to run it all in a bash shell:

$ function get_location_header() {
  local url="$1"
  local location=$(curl -s -I -o /dev/null -w "%{http_code} %{redirect_url}" "$url")
  local code=$(echo "$location" | cut -d' ' -f1)
  local redirect_url=$(echo "$location" | cut -d' ' -f2-)

  if [ "$code" -eq 301 -o "$code" -eq 302 ]; then
    echo "$redirect_url"
  else
    echo "No location header found"
  fi
}
$ follow_redirects() {
  local url="$1"
  local max_redirects=10  # prevent infinite loops

  while [ $max_redirects -gt 0 ]; do
    echo "Checking URL: $url"
    local location=$(get_location_header "$url")
    if [ "$location" != "No location header found" ]; then
      url="$location"
      max_redirects=$((max_redirects - 1))
    else
      echo "Final URL: $url"
      return
    fi
  done

  echo "Error: too many redirects"
}
$ follow_redirects "https://pdst.fm/e/chrt.fm/track/47G541/pscrb.fm/rss/p/mgln.ai/e/433/claritaspod.com/measure/traffic.megaphone.fm/QCD6489207092.mp3?updated=1717391748"
Checking URL: https://pdst.fm/e/chrt.fm/track/47G541/pscrb.fm/rss/p/mgln.ai/e/433/claritaspod.com/measure/traffic.megaphone.fm/QCD6489207092.mp3?updated=1717391748
Checking URL: https://chrt.fm/track/47G541/pscrb.fm/rss/p/mgln.ai/e/433/claritaspod.com/measure/traffic.megaphone.fm/QCD6489207092.mp3?updated=1717391748
Checking URL: https://pscrb.fm/rss/p/mgln.ai/e/433/claritaspod.com/measure/traffic.megaphone.fm/QCD6489207092.mp3?updated=1717391748
Checking URL: https://mgln.ai/e/433/claritaspod.com/measure/traffic.megaphone.fm/QCD6489207092.mp3?updated=1717391748
Checking URL: https://claritaspod.com/measure/traffic.megaphone.fm/QCD6489207092.mp3?updated=1717391748
Checking URL: https://traffic.megaphone.fm/QCD6489207092.mp3?updated=1717391748
Checking URL: https://dcs.megaphone.fm/QCD6489207092.mp3?key=8cd61d704fe268b21da14bc81f17f7ba&request_event_id=9706319f-431f-465d-a85d-d55053cbb184&timetoken=1720741181_4922C6F958D58D4C0673A8DE2476B183
Final URL: https://dcs.megaphone.fm/QCD6489207092.mp3?key=8cd61d704fe268b21da14bc81f17f7ba&request_event_id=9706319f-431f-465d-a85d-d55053cbb184&timetoken=1720741181_4922C6F958D58D4C0673A8DE2476B183

I put function in front of the first definition, but bash does not actually need it: the name() { … } form that Llama produced can be pasted into the shell as is, which is how I defined follow_redirects. Also, it’s using short options for curl (such as -I) where I prefer long ones (such as --head) in anything scripted, because they help readability. I could have told it to change that.

It works very nicely, and what’s more, the functions are good: better than anything I could hack together in a few minutes. I’ve never used local in a shell script in my life, but it seems like good practice, since it keeps a function’s variables from leaking into the rest of the shell session.
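
When only the end of the chain matters, curl can do the following itself. This is a rough sketch, with final_url as a hypothetical helper name; the options are standard curl ones, but it still sends a HEAD request to every host in the chain and does not show the hops along the way.

```bash
# Let curl follow the whole redirect chain and print where it ended up.
# --location follows redirects, --max-redirs caps the number of hops,
# and the url_effective write-out variable is the last URL requested.
final_url() {
  curl --silent --head --location --max-redirs 10 \
       --output /dev/null --write-out '%{url_effective}\n' "$1"
}
```

It would be called as final_url "$url". Dropping the --output and --write-out options and piping the result through grep -i '^location:' should list each hop instead.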

After all that, the final URL is

https://dcs.megaphone.fm/QCD6489207092.mp3?key=8cd61d704fe268b21da14bc81f17f7ba&request_event_id=9706319f-431f-465d-a85d-d55053cbb184&timetoken=1720741181_4922C6F958D58D4C0673A8DE2476B183

If you strip off the query string, with its key and tracking parameters, it’s

https://dcs.megaphone.fm/QCD6489207092.mp3
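
Stripping the query string is a one-line parameter expansion in the shell. A small sketch, with strip_query as a hypothetical name; whether a given server will still serve the file from the bare URL is something to check, not something this guarantees.

```bash
# Print a URL with everything from the first "?" onward removed.
strip_query() {
  printf '%s\n' "${1%%[?]*}"
}

strip_query "https://traffic.megaphone.fm/QCD6489207092.mp3?updated=1717391748"
```

The example prints https://traffic.megaphone.fm/QCD6489207092.mp3. A URL with no query string is printed unchanged.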

Now I can feed that into my audio player and finally listen to episode 202 of Philosophize This! I see it’s thirty-seven minutes long, which is less time than it took me to write this.