Your first Ruby scraper with HTTParty + Nokogiri
The classic Ruby stack is HTTParty to fetch and Nokogiri to parse. Install both with gem install nokogiri httparty, then:
require 'httparty'
require 'nokogiri'
response = HTTParty.get(
  'https://books.toscrape.com/',
  headers: { 'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)' }
)
doc = Nokogiri::HTML(response.body)

# Each book on the listing page is an <article class="product_pod">
doc.css('article.product_pod').each do |book|
  title = book.at_css('h3 a')['title']       # the full title is in the title attribute
  price = book.at_css('.price_color').text
  puts "#{title} | #{price}"
end
doc.css(selector) returns all matching nodes; at_css returns only the first, or nil when nothing matches. Access attributes with node['attr'] and text with .text. Nokogiri also supports XPath via doc.xpath('//...') when you need it. Always pass a realistic User-Agent: the default sent by Ruby's HTTP library identifies your script as a bot.
Following pagination
To crawl multiple pages, read the "next" link and resolve it against the current URL with URI.join:
require 'httparty'
require 'nokogiri'
require 'uri'

url = 'https://books.toscrape.com/'
while url
  doc = Nokogiri::HTML(HTTParty.get(url).body)
  doc.css('article.product_pod h3 a').each { |a| puts a['title'] }
  nxt = doc.at_css('li.next a')                       # nil on the last page
  url = nxt ? URI.join(url, nxt['href']).to_s : nil   # nil ends the loop
end
For sites that need cookies, logins, or form submissions, the Mechanize gem wraps Nokogiri with session and form handling, so you do not manage cookies by hand.
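Whichever client you use, the link resolution in the pagination loop is worth seeing in isolation, because it explains why the loop joins against the current URL rather than the site root. The sketch below uses only the standard library. The next_url helper and the relative hrefs are hypothetical, chosen to resemble what a paginated catalogue might emit.

```ruby
require 'uri'

# Hypothetical helper: resolve a "next" href against the page it was found on.
# Returns nil when there is no href, which is what ends a crawl loop.
def next_url(current_url, href)
  return nil if href.nil? || href.empty?

  URI.join(current_url, href).to_s
end

page1 = 'https://books.toscrape.com/'
page2 = next_url(page1, 'catalogue/page-2.html')  # href relative to the root
page3 = next_url(page2, 'page-3.html')            # href relative to /catalogue/

# Joining against the site root instead of the current page drops the directory
wrong = next_url(page1, 'page-3.html')

puts page2
puts page3
puts wrong
puts next_url(page3, nil).inspect
```

Relative hrefs are resolved against the directory of the page they appear on, so always pass the URL you just fetched as the base.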
Scraping JavaScript-rendered pages
Nokogiri only parses the HTML you give it — it does not run JavaScript. For client-side-rendered pages, drive a browser with Selenium (gem install selenium-webdriver):
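require 'selenium-webdriver'

options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless=new')
driver = Selenium::WebDriver.for(:chrome, options: options)

begin
  driver.get('https://quotes.toscrape.com/js/')

  # Wait until the page's JavaScript has rendered at least one quote
  wait = Selenium::WebDriver::Wait.new(timeout: 10)
  wait.until { driver.find_elements(css: '.quote').any? }

  driver.find_elements(css: '.quote').each do |q|
    text = q.find_element(css: '.text').text
    author = q.find_element(css: '.author').text
    puts "#{author}: #{text}"
  end
ensure
  driver.quit   # close the browser even if scraping raises
end
Watir is a friendlier wrapper around Selenium that reads more like natural Ruby. Either way, a browser is much heavier than a plain HTTP request, and headless automation exposes detectable signals of its own.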
Which Ruby gem should you use?
| Gem | Type | Runs JS? | Best for |
|---|---|---|---|
| Nokogiri | HTML/XML parser | No | Parsing — CSS and XPath, the standard |
| HTTParty | HTTP client | No | Simple, readable requests |
| Faraday | HTTP client | No | Middleware, advanced configuration |
| Mechanize | HTTP + parse + session | No | Logins, forms, cookies |
| Selenium / Watir | Browser automation | Yes | JavaScript-rendered pages |
For most jobs, HTTParty + Nokogiri is all you need; add Mechanize for sessions and Selenium only for JavaScript.
The hard part: handling anti-bot blocking
The gem you choose rarely decides success; anti-bot defenses do. HTTParty sends a TLS fingerprint and headers that Cloudflare, DataDome, and Akamai recognise as non-browser, and headless Selenium leaks automation signals. Nokogiri will parse a 403 or CAPTCHA page without complaint, but the data you want is not in it.
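Because a block page parses just as cleanly as a real one, it helps to check the response before handing it to Nokogiri. The following is a rough heuristic sketch, not a reliable detector. The blocked? helper, the status list, and the text markers are all assumptions, and real challenge pages differ by vendor and change over time.

```ruby
# Assumed status codes that commonly accompany a block or rate limit.
BLOCK_STATUSES = [403, 429, 503].freeze

# Assumed text markers; tune these for the sites you actually scrape.
BLOCK_MARKERS = [/captcha/i, /access denied/i, /verify you are human/i].freeze

# Hypothetical helper: true when a response looks like a block page.
def blocked?(status, body)
  return true if BLOCK_STATUSES.include?(status)

  BLOCK_MARKERS.any? { |marker| body.to_s.match?(marker) }
end

puts blocked?(403, '')                                   # blocked by status
puts blocked?(200, '<html><title>Books</title></html>')  # looks like real content
puts blocked?(200, 'Please complete the CAPTCHA')        # blocked by body marker
```

With HTTParty you would call blocked?(response.code, response.body) and skip or retry the page instead of parsing it, so an empty result set is not mistaken for "no data".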
Handling modern anti-bot stacks means residential proxies and a real browser fingerprint. A scraping API handles all of that server-side, so your Ruby code posts the URL and parses the returned HTML with Nokogiri as usual: