permanent tangent

Today's selection...

Web Scraping

Published: 14th October 2022

I've used PHP before to scrape information from a website, mainly using cURL. However, a few years ago I delved into Python and found Beautiful Soup. It was so much easier to use. My first project using it was combining some webpages and turning them into an ebook. That went fairly well, helped by the site I was accessing being relatively static. Then I wanted to do more...

A dynamically generated site, with one of the worst code layouts I have seen. It was partially minified, but there was so much additional crud in there: elements that weren't needed, div after div, span after span. The initial download of the page was so large, and it just looked messy even after prettifying it.

I found a good starting point for my project in a Python script that was old and needed updating. But using someone else's code has its drawbacks, especially when you need to debug it. I took the time to go through it methodically, adding the necessary bits as I went to get it to actually work. Then, out of the blue, it stopped working.

I went through the code line by line, adding in more code to try to identify and track down the issue. It took way too much time, but at least it helped me to understand the original code better. So that got fixed, everything looked OK, and I tested it and confirmed it worked. Then I started the process of deploying it and automating it to run 3 or 4 times a day. But it didn't work; it took me a while to get the right syntax in cron. And then... it broke again. This time, though, the error wasn't mine! I eventually tracked it down to an issue in the code of the site I'm scraping. The problem was that part of my code ended up in a loop and would not stop, but at least I was logging those errors. So this time it was actually beneficial, as I had to add a check in case it ever happened again.
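For anyone hitting the same kind of runaway loop, here is a minimal sketch of that sort of check. It is not the script's actual code: it assumes the scraper follows "next page" links, and `collect_pages`, `fetch_page` and `MAX_PAGES` are all made-up names for illustration. The idea is simply to stop, and log why, if a page comes round a second time or the run goes past a sensible cap.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scraper")

MAX_PAGES = 50  # assumed cap; pick something above the real page count


def collect_pages(fetch_page, start, max_pages=MAX_PAGES):
    """Follow 'next page' links, stopping on a repeat or after max_pages.

    fetch_page is a hypothetical callable that takes a page identifier and
    returns (items_on_that_page, next_page_identifier_or_None).
    """
    seen = set()
    results = []
    current = start
    while current is not None:
        if current in seen:
            # The site has pointed us back at a page we already visited.
            log.error("Page %r came round again, stopping", current)
            break
        if len(seen) >= max_pages:
            # Safety net in case the site keeps inventing new pages.
            log.error("Hit the cap of %d pages, stopping", max_pages)
            break
        seen.add(current)
        items, current = fetch_page(current)
        results.extend(items)
    return results
```

On the scheduling side, a crontab line along the lines of `0 */6 * * * python3 /path/to/script.py` would run a script four times a day (the path is a placeholder).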

Then I finished it, set it loose, and it continues to work. There are some glitches with the data, which I might fix one day, but as it was coded just for me, I can safely leave that.

All this to find out what was new each day on the streaming services I subscribe to. Maybe one day I'll release the code.
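The "what's new today" part can be sketched roughly like this. Again, this is an illustration under assumptions rather than the real code: it supposes the scraper ends up with a plain list of titles, and `find_new_titles` and the JSON state file are hypothetical. Each run compares today's titles with everything seen on earlier runs, then saves the combined list for next time.

```python
import json
from pathlib import Path


def find_new_titles(todays_titles, state_file):
    """Return titles not seen on earlier runs, then remember them."""
    path = Path(state_file)
    # Load the titles recorded by previous runs, if there were any.
    seen = set(json.loads(path.read_text())) if path.exists() else set()
    new = sorted(set(todays_titles) - seen)
    # Save everything seen so far so the next run can compare against it.
    path.write_text(json.dumps(sorted(seen | set(todays_titles))))
    return new
```

The first run reports every title as new, since there is nothing to compare against yet.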