Web Scraping With Ruby

By the Scrappey Research Team

Web scraping with Ruby means fetching a page with an HTTP gem like HTTParty and parsing the HTML with Nokogiri, which supports both CSS selectors and XPath. Nokogiri is the de facto standard parser in the Ruby ecosystem. For JavaScript-rendered pages you drive a real browser with Selenium or Watir, and for full crawlers there are frameworks built on these gems.

HTML parsing	Nokogiri — CSS and XPath, the Ruby standard
HTTP client	HTTParty or Faraday (or built-in net/http)
JavaScript pages	Selenium WebDriver or Watir
Form/session helper	Mechanize — cookies, forms, link-following
Install	gem install nokogiri httparty

Your first Ruby scraper with HTTParty + Nokogiri

The classic Ruby stack is HTTParty to fetch and Nokogiri to parse. Install both with gem install nokogiri httparty, then:

require 'httparty'
require 'nokogiri'

response = HTTParty.get('https://books.toscrape.com/',
  headers: { 'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)' })

doc = Nokogiri::HTML(response.body)

doc.css('article.product_pod').each do |book|
  title = book.at_css('h3 a')['title']
  price = book.at_css('.price_color').text
  puts "#{title} | #{price}"
end

doc.css(selector) returns all matching nodes; at_css returns the first match, or nil if nothing matches. Access attributes with node['attr'] and text with .text. Nokogiri also supports XPath via doc.xpath('//...') when you need it. Always pass a realistic User-Agent: the default one identifies your script as a non-browser client.

Following pagination

To crawl multiple pages, read the "next" link and resolve it against the current URL with URI.join:

require 'httparty'
require 'nokogiri'
require 'uri'

url = 'https://books.toscrape.com/'
while url
  doc = Nokogiri::HTML(HTTParty.get(url).body)

  doc.css('article.product_pod h3 a').each { |a| puts a['title'] }

  nxt = doc.at_css('li.next a')
  url = nxt ? URI.join(url, nxt['href']).to_s : nil   # nil ends the loop
end
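
The reason for URI.join is that "next" links are usually relative, and what they resolve to depends on the page they were found on. The sketch below shows that resolution in isolation using only the standard library. The next_url helper and the example hrefs are assumptions for illustration.

```ruby
require 'uri'

# Resolve an href the way a browser would: relative to the page it was found on.
# Returns nil when there is no href, which is what ends a pagination loop.
def next_url(current_url, href)
  return nil if href.nil? || href.empty?

  URI.join(current_url, href).to_s
end

# A relative href found on the site root.
puts next_url('https://books.toscrape.com/', 'catalogue/page-2.html')
# The same kind of href found one directory deeper.
puts next_url('https://books.toscrape.com/catalogue/page-2.html', 'page-3.html')
# A root-relative href ignores the current directory.
puts next_url('https://books.toscrape.com/catalogue/page-2.html', '/index.html')
```

Concatenating strings instead would produce a wrong URL as soon as the crawl leaves the site root.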

For sites that need cookies, logins, or form submissions, the Mechanize gem wraps Nokogiri with session and form handling, so you do not manage cookies by hand.

Scraping JavaScript-rendered pages

Nokogiri only parses the HTML you give it — it does not run JavaScript. For client-side-rendered pages, drive a browser with Selenium (gem install selenium-webdriver):

require 'selenium-webdriver'

options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless=new')
driver = Selenium::WebDriver.for(:chrome, options: options)

driver.get('https://quotes.toscrape.com/js/')

driver.find_elements(css: '.quote').each do |q|
  text = q.find_element(css: '.text').text
  author = q.find_element(css: '.author').text
  puts "#{author}: #{text}"
end

driver.quit

Watir is a friendlier wrapper around Selenium that reads more like natural Ruby. Either way, a browser is heavier and easier to detect than a plain HTTP request.

Which Ruby gem should you use?

Gem	Type	Runs JS?	Best for
Nokogiri	HTML/XML parser	No	Parsing — CSS and XPath, the standard
HTTParty	HTTP client	No	Simple, readable requests
Faraday	HTTP client	No	Middleware, advanced configuration
Mechanize	HTTP + parse + session	No	Logins, forms, cookies
Selenium / Watir	Browser automation	Yes	JavaScript-rendered pages

For most jobs, HTTParty + Nokogiri is all you need; add Mechanize for sessions and Selenium only for JavaScript.

The hard part: handling anti-bot blocking

The gem you choose rarely decides success; anti-bot defenses do. HTTParty sends a TLS fingerprint and headers that Cloudflare, DataDome, and Akamai recognise as non-browser, and headless Selenium leaks automation signals. When you are blocked, Nokogiri still parses the 403 or CAPTCHA page without complaint, but the data you wanted is not in it.
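
Because a block page is still HTML, it helps to check a response before handing it to a parser. The following is a rough heuristic sketch, not a reliable detector: the status list, the marker strings, and the blocked? helper are all assumptions, and real challenge pages differ by vendor.

```ruby
# Heuristic signals only; tune these for the sites you actually scrape.
BLOCK_STATUSES = [403, 429, 503].freeze
BLOCK_MARKERS  = ['captcha', 'access denied', 'verify you are human'].freeze

# Returns true when a response looks like a block page rather than content.
def blocked?(status, body)
  return true if BLOCK_STATUSES.include?(status)

  text = body.to_s.downcase
  BLOCK_MARKERS.any? { |marker| text.include?(marker) }
end

puts blocked?(403, '')
puts blocked?(200, '<h1>All products</h1>')
puts blocked?(200, '<title>Please complete the CAPTCHA</title>')
```

With HTTParty, response.code and response.body supply the two arguments, so a scraper can skip or retry a blocked page instead of silently extracting nothing.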

Handling modern anti-bot stacks means residential proxies and a real browser fingerprint. A scraping API handles all of that server-side, so your Ruby code posts the URL and parses the returned HTML with Nokogiri as usual:

require 'httparty'
require 'json'
require 'nokogiri'

resp = HTTParty.post(
  'https://api.your-scraping-provider.com/v1?key=YOUR_API_KEY',
  headers: { 'Content-Type' => 'application/json' },
  body: { cmd: 'request.get', url: 'https://example.com/protected' }.to_json
)

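# The endpoint, cmd value, and response keys above and below are placeholders;
# check your provider's documentation for the real ones.
# Fully rendered, unblocked HTML: parse it with Nokogiri as usual.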
html = resp.parsed_response['solution']['response']
doc = Nokogiri::HTML(html)
puts doc.at_css('title')&.text

Frequently asked questions

What is the best gem for web scraping with Ruby?

Nokogiri is the standard HTML/XML parser — it supports both CSS selectors and XPath and is what almost every Ruby scraper uses. Pair it with HTTParty or Faraday to fetch pages, and use Mechanize when you need cookies, logins, or form submissions. For JavaScript-rendered pages, add Selenium or Watir.

Can Ruby scrape JavaScript-rendered pages?

Yes, but not with Nokogiri alone, which only parses static HTML. Drive a real browser with selenium-webdriver or Watir (a friendlier Selenium wrapper) for client-side-rendered content. Alternatively, find the JSON API the page calls in your browser’s Network tab and request it directly with HTTParty.
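
When a page does load its data from a JSON endpoint, no HTML parsing is needed at all. The sketch below parses a made-up payload with Ruby's built-in json library. The payload shape, the SAMPLE_PAYLOAD constant, and the parse_quotes helper are assumptions; a real endpoint will have its own structure.

```ruby
require 'json'

# Hypothetical payload, shaped like what a page's background API might return.
SAMPLE_PAYLOAD = <<~JSON
  {
    "has_next": true,
    "page": 1,
    "quotes": [
      { "author": { "name": "Jane Doe" }, "text": "First quote." },
      { "author": { "name": "John Roe" }, "text": "Second quote." }
    ]
  }
JSON

# Turns the raw JSON string into flat rows (illustrative helper).
def parse_quotes(json)
  data = JSON.parse(json)
  rows = data.fetch('quotes', []).map do |q|
    # dig walks nested keys and returns nil instead of raising when one is missing.
    { author: q.dig('author', 'name'), text: q['text'] }
  end
  { rows: rows, has_next: data['has_next'] == true }
end

result = parse_quotes(SAMPLE_PAYLOAD)
result[:rows].each { |r| puts "#{r[:author]}: #{r[:text]}" }
puts "more pages: #{result[:has_next]}"
```

In a real scraper the string would come from the body of an HTTParty.get call against the endpoint you found in the Network tab.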

Does Nokogiri support XPath?

Yes. Nokogiri supports both CSS selectors (doc.css) and XPath (doc.xpath), so you can use whichever fits. CSS is shorter for most selections; XPath is more powerful when you need to select by text content or navigate to parent and sibling nodes.

Why does my Ruby scraper get blocked, and how do I fix it?

Scrapers get blocked when their headers, request rate, or TLS fingerprint look non-browser. Set a realistic User-Agent, throttle your requests, and rotate residential proxies. Against serious anti-bot vendors you also need a browser-grade TLS fingerprint, which is hard to maintain from HTTParty. Many teams route hard targets through a scraping API that handles proxies, fingerprinting, and challenges server-side.
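
Throttling is the easiest of these to add yourself. Below is a minimal sketch of retrying with exponential backoff, using only core Ruby. The helper names, the default delays, and the injectable sleeper are assumptions for illustration; it does not cover proxies or fingerprinting.

```ruby
# Exponential backoff with a cap: base, 2x base, 4x base, ... up to max_delay.
def backoff_delay(attempt, base: 1.0, max_delay: 30.0)
  [base * (2**attempt), max_delay].min
end

# Runs the block up to max_attempts times, pausing between failed attempts.
# The block should return a truthy value on success and nil/false on failure.
# sleeper is injectable so the pauses can be observed or skipped in tests.
def with_retries(max_attempts: 4, sleeper: ->(seconds) { sleep(seconds) })
  max_attempts.times do |attempt|
    result = yield(attempt)
    return result if result

    sleeper.call(backoff_delay(attempt)) if attempt < max_attempts - 1
  end
  nil
end

# Demonstration with a fake fetch that succeeds on the third try and no real sleeping.
outcome = with_retries(sleeper: ->(s) { puts "waiting #{s}s" }) do |attempt|
  attempt == 2 ? 'page html' : nil
end
puts outcome
```

In a scraper, the block would make the HTTP request and return the body only when the response is usable.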

Last updated: 2026-06-08