
How to Extract all Images from a Webpage with Ruby

       

Here is a little Ruby snippet that will download the source of a webpage and list the URLs of all the pictures on it.

Rather than using XPath, we are going to first reduce the source code to everything inside of double quotes. Some websites lazy load images from JSON inside a script tag, where XPath wouldn't be effective.

After we get everything that is quoted, we further reduce the results to items that contain an image extension: .jpg, .jpeg, .png or .gif. The regex deliberately doesn't check that the extension is at the end of the string, because formats like "myimg.png?t=123" are common.

We then check whether each link is relative and, if that's the case, merge its path with the page url.
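
As a quick illustration of that last step, here is a minimal sketch of how URI.join resolves different kinds of links against a page address. The page URL, the image paths and the helper name resolve are all made up for the example.

```ruby
require 'uri'

# Hypothetical page URL, used only for illustration.
page = 'https://example.com/articles/2018/post.html'

# Hypothetical helper: leave absolute links alone, resolve everything else
# against the page URL.
def resolve(page, link)
  link =~ /^http/i ? link : URI.join(page, link).to_s
end

puts resolve(page, 'https://cdn.example.com/a.jpg') # already absolute, left untouched
puts resolve(page, '/img/b.png')                    # root-relative link
puts resolve(page, 'c.gif')                         # relative to the page's directory
```

The root-relative link should come back under https://example.com/img/, and the bare filename under https://example.com/articles/2018/.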

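require 'open-uri'

url = 'https://www.telegraph.co.uk/science/2018/07/29/sir-paul-mccartney-misremembers-writing-life-says-harvard-analysing/amp/'

# URI.open is required on Ruby 3.0+, where Kernel#open no longer fetches URLs.
images = URI.open(url).read
    .scan(/"(.*?)"/m)                            # everything inside double quotes
    .map { |i| i[0].to_s }
    .select { |i| i =~ /\.(jpg|jpeg|png|gif)/i } # dots escaped so they match a literal "."
    .reject { |i| ['.jpg', '.gif', '.png', '.jpeg'].include?(i) } # drop bare extensions
    .map { |img| img =~ /^http/i ? img : URI.join(url, img).to_s } # resolve relative links

puts images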

This script could use some improvement. For instance, you would probably want to check single quotes as well, and parse the url so you can check the extension without the query string.
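
Here is a rough sketch of what those two improvements might look like. It is only one way to do it: the helper name extract_images, the IMAGE_EXTENSIONS constant and the sample HTML are all made up for the example, and it works on a string of HTML so you can try it without hitting the network. It needs Ruby 2.7+ for filter_map.

```ruby
require 'uri'

IMAGE_EXTENSIONS = %w[.jpg .jpeg .png .gif].freeze

# Hypothetical helper: returns absolute image URLs found inside single- or
# double-quoted strings in html, resolved against base_url.
def extract_images(html, base_url)
  html.scan(/"([^"]*)"|'([^']*)'/) # each match is [double_quoted, single_quoted]
      .map { |dq, sq| dq || sq }
      .reject(&:empty?)
      .filter_map do |candidate|
        begin
          uri = URI.join(base_url, candidate)
        rescue URI::Error
          next # not a parsable link (contains spaces, for example), so skip it
        end
        # uri.path excludes the query string, so "a.png?t=123" still ends in ".png"
        uri.to_s if IMAGE_EXTENSIONS.include?(File.extname(uri.path.to_s).downcase)
      end
      .uniq
end

# Made-up sample page for illustration.
html = <<~HTML
  <img src="/img/a.png?t=123">
  <img src='b.JPG'>
  <script>var data = {"lazy": "https://cdn.example.com/c.gif", "type": "text/javascript"};</script>
HTML

puts extract_images(html, 'https://example.com/news/story.html')
```

Checking the extension on the parsed path means it has to be at the end of the path, which is stricter than the regex above. Strings that can't be parsed as a link are skipped, including JSON-escaped urls like "https:\/\/...", so you would need to unescape those first if a site uses them.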

Tagged w/ #ruby #regex #images #web scraping #xpath #json
