Written by Sean Behan on Wed Aug 01st 2018

Here is a little Ruby snippet that will download all pictures from a webpage.

Rather than using XPath, we are going to first reduce the source code to everything inside of double quotes. Some websites use JSON within a script tag to lazy-load images, so XPath wouldn't be effective there.

After we get everything that is quoted, we further reduce the results to items that match image extensions: .jpg, .png, and so on. The regex here doesn't check that the extension is at the end of the string, because formats like "myimg.png?t=123" are common.

We then check whether each item is a relative link and, if so, merge the path with the page URL.

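require 'open-uri'

url = 'https://www.telegraph.co.uk/science/2018/07/29/sir-paul-mccartney-misremembers-writing-life-says-harvard-analysing/amp/'

# Pipeline: grab every double-quoted string, keep the ones that mention an
# image extension, drop bare extensions, then resolve relative links.
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs.
images = URI.open(url).read.scan(/"(.*?)"/m)
  .map { |match| match[0].to_s }
  .select { |str| str =~ /\.(jpe?g|png|gif)/i }
  .reject { |str| %w[.jpg .gif .png .jpeg].include?(str) }
  .map { |img| img =~ /\Ahttp/i ? img : URI.join(url, img).to_s }

puts images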
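
The relative-link step leans on URI.join, so here is a quick sketch of how it resolves different kinds of links. The base URL and file names below are made up for illustration; they are not from the page scraped above.

```ruby
require 'uri'

# Hypothetical page URL, used only to show how links resolve against it.
base = 'https://example.com/articles/2018/post/'

# Relative to the current directory of the page.
relative = URI.join(base, 'images/photo.jpg').to_s

# Relative to the site root (leading slash).
root_relative = URI.join(base, '/static/photo.jpg').to_s

# Protocol-relative: inherits the scheme of the base URL.
protocol_relative = URI.join(base, '//cdn.example.com/photo.jpg').to_s

# Already absolute: the base is ignored.
absolute = URI.join(base, 'https://other.example.org/photo.jpg').to_s

puts relative, root_relative, protocol_relative, absolute
```

Since URI.join leaves absolute URLs alone, the explicit http check in the script is mostly a shortcut that skips parsing for links that are already complete.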

This script could use some improvement. For instance, you would probably want to check single quotes as well, and parse the URL so you can check the extension without the query string.
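
Here is a rough sketch of what those two improvements might look like. The method name extract_image_urls and the IMAGE_EXTENSIONS constant are my own placeholders, and it works on an HTML string you have already fetched, so treat it as one possible approach rather than a finished tool.

```ruby
require 'uri'

# Hypothetical helper names; the extension list mirrors the original script.
IMAGE_EXTENSIONS = %w[.jpg .jpeg .png .gif].freeze

# Returns absolute image URLs found inside quoted strings of `html`.
def extract_image_urls(html, base_url)
  # Match "..." or '...'; each scan result is [double, single] with one nil.
  candidates = html.scan(/"([^"]*)"|'([^']*)'/).map { |double, single| double || single }

  urls = candidates.map do |candidate|
    begin
      uri = URI.join(base_url, candidate)
    rescue URI::Error
      next # skip strings that are not usable URLs (for example, ones with spaces)
    end
    # uri.path excludes the query string, so "a.png?t=123" still yields ".png".
    extension = File.extname(uri.path.to_s).downcase
    uri.to_s if IMAGE_EXTENSIONS.include?(extension)
  end

  urls.compact.uniq
end
```

You could call it with something like extract_image_urls(URI.open(url).read, url) after requiring open-uri. Because the extension is read from the parsed path, a quoted string only counts when its path actually ends in an image extension, which is stricter than the regex in the original script.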


Tagged with..
#ruby #regex #images #web scraping #xpath #json
