Small-Scale Question Sunday for August 4, 2024

Do you have a dumb question that you're kind of embarrassed to ask in the main thread? Is there something you're just not sure about?

This is your opportunity to ask questions. No question too simple or too silly.

Culture war topics are accepted, and proposals for a better intro post are appreciated.


Does anyone know of a good tool or method that could be used to archive all the pages of a given Substack?

I know this is paranoid, but there are some that I'd like to save locally, in case the site is taken down or something.

Best: https://github.com/alexferrari88/sbstck-dl

Can even pass your login cookies for paywalled articles.

Collect all the post links from the Substack's archive page. You can do this by opening the archive page and scrolling to the bottom so that every post loads. Once you have the list, visit each link and save the page with SingleFile or, my favorite, Save Page WE.
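If you'd rather not copy the links out by hand, here is a rough sketch of pulling them from a saved copy of the fully scrolled archive page. It assumes post URLs have a path starting with `/p/`, which is my assumption about Substack's URL layout and not something stated above; `extract_post_links` is just a name I made up, and it only needs the standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in the document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_post_links(html, base_url):
    """Return unique post URLs found in `html`, in first-seen order.

    Assumption: a post is any link whose path starts with "/p/".
    Query strings and fragments are dropped so duplicates collapse.
    """
    collector = _LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        parsed = urlparse(urljoin(base_url, href))
        if parsed.path.startswith("/p/"):
            clean = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
            if clean not in links:
                links.append(clean)
    return links
```

You would read the saved archive page into a string, pass it in together with the blog's base URL, and then open each returned link in the browser to save it with whichever extension you prefer.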

The standard recommendation for website archiving is wget, with its recursive and archival download options enabled. I don't know whether it works on Substack's JavaScript-heavy archive pages, but you can manually scroll down to load the entire archive page, save that page yourself, and then tell wget to download everything linked from the saved copy.
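As a hedged sketch of the "download everything linked from the saved page" step: the wrapper below only assembles a wget command line and hands it to `subprocess`. The flags are ordinary wget options (`--force-html` with `--input-file` makes wget treat the saved page as a list of links, and `--base` resolves relative links against the blog's URL). The function names and the file paths are placeholders of mine, and I haven't run this against Substack.

```python
import subprocess


def build_wget_command(saved_page, base_url, out_dir):
    """Build a wget invocation that fetches every link in a saved HTML page."""
    return [
        "wget",
        "--force-html",                   # treat the input file as HTML
        f"--input-file={saved_page}",     # read links from the saved page
        f"--base={base_url}",             # resolve relative links against this
        "--page-requisites",              # also fetch images, CSS, etc.
        "--convert-links",                # rewrite links for offline viewing
        "--adjust-extension",             # save pages with an .html suffix
        "--wait=1",                       # pause between requests
        f"--directory-prefix={out_dir}",  # where to store the downloads
    ]


def run_archive(saved_page, base_url, out_dir):
    """Run wget and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_wget_command(saved_page, base_url, out_dir), check=True)
```

Note that this fetches every link on the page, not only posts, and that wget will skip images served from another host unless you also add `--span-hosts`.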

Alternatively, you can use SingleFile to download individual pages manually.

I know this is paranoid

It's not. (And the blog owners themselves are more likely to delete a blog than the entire site is to go down.)

You can fetch Substack content via RSS and have a reader automatically store all of it locally or on a server. I'm doing something like this myself, but I'm more focused on Twitter.
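For anyone who wants to skip the reader, a minimal sketch of the same idea using only the standard library: parse the feed, take the full HTML from `content:encoded` when it is present, and write one file per post. The function names are mine, and the feed URL you pass to `fetch_feed` (I believe Substack serves it at `/feed` on the blog's domain) is an assumption you should check.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen

# Namespaced tag that RSS feeds commonly use for the full post body.
CONTENT_TAG = "{http://purl.org/rss/1.0/modules/content/}encoded"


def fetch_feed(feed_url):
    """Download the raw feed bytes."""
    with urlopen(feed_url, timeout=30) as response:
        return response.read()


def parse_feed(xml_data):
    """Return a list of dicts with title, link and html for each feed item."""
    root = ET.fromstring(xml_data)
    items = []
    for item in root.iter("item"):
        body = item.findtext(CONTENT_TAG) or item.findtext("description", default="")
        items.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "html": body,
        })
    return items


def save_items(items, out_dir):
    """Write each item to <out_dir>/<slug>.html, skipping files that exist."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for entry in items:
        slug = urlparse(entry["link"]).path.rstrip("/").split("/")[-1] or "untitled"
        path = out / f"{slug}.html"
        if not path.exists():
            path.write_text(entry["html"], encoding="utf-8")
            written.append(path)
    return written
```

Feeds usually list only the most recent posts, so this is something to run on a schedule going forward, not a way to grab the back catalogue in one go.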

For Substack, one issue is subscriber-only posts, because the public feed only gives you previews, but maybe it's just a matter of passing your credentials in the request headers. I've never bothered trying it, but I can give it a go if you're interested.
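To make the "credentials in the request headers" idea concrete, here is an untested sketch that attaches a session cookie copied from a logged-in browser to a request. The cookie name is deliberately a parameter: I don't know for certain which cookie Substack checks, or whether the feed returns full text when it is present, so treat both as assumptions.

```python
from urllib.request import Request, urlopen


def build_authenticated_request(url, cookie_name, cookie_value):
    """Return a Request carrying one session cookie in its Cookie header.

    cookie_name and cookie_value are whatever your browser's developer
    tools show for the logged-in session (hypothetical inputs here).
    """
    request = Request(url)
    request.add_header("Cookie", f"{cookie_name}={cookie_value}")
    return request


def fetch_with_cookie(url, cookie_name, cookie_value):
    """Fetch the URL with the session cookie attached and return raw bytes."""
    request = build_authenticated_request(url, cookie_name, cookie_value)
    with urlopen(request, timeout=30) as response:
        return response.read()
```

If the response still contains only previews, the site is probably checking something other than that one cookie, and a browser-based approach is the safer bet.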

LLMs are great at writing scripts for this kind of thing. Untested, but see e.g. https://g.co/gemini/share/3ce27fe9318f

Something like Readwise? Any tool that grabs pages and stores them with all of the junk stripped out might work for you.