a simple script that scans an html file and replaces every img tag's src with a base64-encoded copy of the image data, i.e. it embeds the images in the file. (back when i used to play with windows xp installers, adding things such as updates to the disk was known as 'slipstreaming'.)
i wrote it because imgur announced recently (apr 2023) that they would be deleting a bunch of images, and i wanted to keep the pictures in some twine games that i've been playing. i also didn't want to do that by hand (i am also something of a data hoarder ;)
the python script doesn't pretend to understand html. it reads the file into memory, looks for an <img opening tag, looks for a src= attribute following that, tries to determine whether it contains a url, and if it does, downloads the url, base64-encodes the result, and jams it into the src attribute. it also looks for the html-escaped forms &lt;img and src=&quot;, which crop up in twine files a fair bit. it then writes the result to a new file.
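a minimal sketch of that scan-and-replace idea, not the actual script. the names `to_data_uri`, `embed_images` and the `fetch` callable are made up for illustration, and it only handles the plain `<img ... src="...">` form with a quoted url (not the escaped twine variants). passing the downloader in as `fetch` is just a way to try the string handling without touching the network.

```python
import base64


def to_data_uri(data: bytes, mime: str) -> str:
    """Build a data: URI from raw bytes and a MIME type."""
    return "data:" + mime + ";base64," + base64.b64encode(data).decode("ascii")


def embed_images(html: str, fetch) -> str:
    """Replace each remote <img src="..."> url with an embedded data: URI.

    fetch is a hypothetical callable: url -> (bytes, mime_type).
    """
    out = []
    pos = 0
    while True:
        tag = html.find("<img", pos)
        if tag == -1:
            break
        tag_end = html.find(">", tag)
        src = html.find("src=", tag)
        # no src= inside this tag: copy up to the tag opener and move on
        if src == -1 or (tag_end != -1 and src > tag_end):
            out.append(html[pos:tag + 4])
            pos = tag + 4
            continue
        quote = html[src + 4:src + 5]
        # unquoted src values are skipped in this sketch
        if quote not in ('"', "'"):
            out.append(html[pos:src + 4])
            pos = src + 4
            continue
        start = src + 5
        end = html.find(quote, start)
        if end == -1:
            break  # missing closing quote: leave the rest of the file alone
        url = html[start:end].strip()
        out.append(html[pos:start])
        if url.startswith(("http://", "https://")):
            data, mime = fetch(url)
            out.append(to_data_uri(data, mime))
        else:
            out.append(html[start:end])  # local or unknown: keep as-is
        pos = end
    out.append(html[pos:])
    return "".join(out)
```

with a stand-in `fetch` such as `lambda url: (b"abc", "image/png")`, the src in `<img src="https://example.com/a.png">` becomes `data:image/png;base64,YWJj`, while a local src like `pic.png` is left untouched.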
badly formatted html won't stop it, but it will react unpredictably. it tries to strip unnecessary whitespace from the url, but that doesn't always work right. if a quote is missing, it may miss the url entirely, or it may grab half the file and think that is the url. to avoid sending garbage requests, i capped the url length at 40 characters (now 100, and adjustable on the command line). i have no idea what it does with urls that fail to load (it now throws a warning if the server doesn't send a 200) or that don't contain an image. i do not recommend removing the original file until you have thoroughly checked everything. it cannot handle images that are already local files, but it should throw a warning alerting you to them. it cannot handle javascript-loaded images at all unless the js fits standard html; if the js uses img it should throw a warning, but no guarantees.
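the safety checks described above could look roughly like this. again a sketch under assumptions, not the script's real code: `check_src` and `fetch_image` are hypothetical names, the `data:` case and the content-type check are my own additions, and only the 100-character default comes from the text.

```python
import sys
import urllib.request

MAX_URL_LEN = 100  # default cap from the text; assumed to be overridable


def check_src(url, max_len=MAX_URL_LEN):
    """Classify a src value as 'remote', 'embedded', 'local' or 'too_long'."""
    url = url.strip()
    if url.startswith("data:"):
        return "embedded"  # already a data: URI (assumed extra case)
    if len(url) > max_len:
        return "too_long"  # probably half the file, not a real url
    if url.startswith(("http://", "https://")):
        return "remote"
    return "local"


def fetch_image(url, timeout=10):
    """Download url; return (bytes, mime), or None after printing a warning."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if resp.status != 200:
                print(f"warning: {url} returned {resp.status}", file=sys.stderr)
                return None
            mime = resp.headers.get_content_type()
            # assumed extra check: skip anything that is not served as an image
            if not mime.startswith("image/"):
                print(f"warning: {url} is {mime}, not an image", file=sys.stderr)
                return None
            return resp.read(), mime
    except (OSError, ValueError) as exc:
        # URLError, HTTPError and timeouts are all OSError subclasses;
        # ValueError covers malformed urls
        print(f"warning: could not fetch {url}: {exc}", file=sys.stderr)
        return None
```

only srcs that `check_src` reports as 'remote' would be handed to `fetch_image`; 'local' and 'too_long' would get a warning and be left alone, so the original attribute survives in the output file.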