Legal Considerations
1. Compliance Requirements
- Check website's robots.txt
- Respect terms of service
- Follow data protection laws
- Obtain necessary permissions
- Implement rate limiting
- Store data securely
- Honor opt-out requests
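As one illustration of the last item, honoring opt-out requests usually means keeping a suppression list and checking every scraped address against it before storage or use. The sketch below is a minimal version under stated assumptions: the `SuppressionList` class, its method names, and the choice of a keyed SHA-256 hash (so the list itself does not hold raw addresses) are all hypothetical, not something the checklist prescribes.

```python
import hashlib
import hmac


class SuppressionList:
    """Hypothetical opt-out registry that stores keyed hashes, not raw emails."""

    def __init__(self, secret_key):
        self._key = secret_key      # bytes; assumed to come from secure config
        self._digests = set()

    def _digest(self, email):
        # Normalise case and whitespace so "Bob@X.com " and "bob@x.com" match
        normalised = email.strip().lower().encode('utf-8')
        return hmac.new(self._key, normalised, hashlib.sha256).hexdigest()

    def opt_out(self, email):
        self._digests.add(self._digest(email))

    def is_suppressed(self, email):
        return self._digest(email) in self._digests

    def remove_suppressed(self, emails):
        # Drop every address whose owner has opted out
        return {e for e in emails if not self.is_suppressed(e)}
```

In use, `remove_suppressed` would run on each batch of scraped addresses before anything is saved.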
2. Implementation Example
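import asyncio
import logging
import re
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

import aiohttp
from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)


class LegalEmailScraper:
    def __init__(self):
        self.visited_urls = set()
        self.email_pattern = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
        self.robots_parser = RobotFileParser()

    async def can_scrape(self, url):
        # Check robots.txt. read() is a blocking call, so run it in a worker
        # thread (asyncio.to_thread needs Python 3.9+)
        robots_url = urljoin(url, '/robots.txt')
        self.robots_parser.set_url(robots_url)
        await asyncio.to_thread(self.robots_parser.read)
        return self.robots_parser.can_fetch('*', url)

    async def scrape_emails(self, url, depth=2):
        # Mark the starting page as visited so links back to it are not re-crawled
        self.visited_urls.add(url)
        emails = set()
        try:
            # Inside the try block so a failed robots.txt fetch is logged, not raised
            if not await self.can_scrape(url):
                return emails
            async with aiohttp.ClientSession() as session:
                async with session.get(url) as response:
                    text = await response.text()
            # Extract emails
            found_emails = self.email_pattern.findall(text)
            emails.update(self.validate_emails(found_emails))
            # Follow links if depth allows
            if depth > 0:
                soup = BeautifulSoup(text, 'lxml')
                for link in soup.find_all('a', href=True):
                    next_url = urljoin(url, link['href'])
                    # Skip mailto:, javascript:, and other non-HTTP links
                    if not next_url.startswith(('http://', 'https://')):
                        continue
                    if next_url not in self.visited_urls:
                        self.visited_urls.add(next_url)
                        sub_emails = await self.scrape_emails(next_url, depth - 1)
                        emails.update(sub_emails)
        except Exception as e:
            logger.error(f'Error scraping {url}: {e}')
        return emails

    def validate_emails(self, emails):
        # Remove common false positives
        return {email for email in emails if self.is_valid_email(email)}

    def is_valid_email(self, email):
        # Matches such as "logo@2x.png" are asset filenames, not addresses
        image_suffixes = ('.png', '.jpg', '.jpeg', '.gif', '.svg', '.webp')
        return not email.lower().endswith(image_suffixes)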
Best Practices
1. Rate Limiting
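class RateLimitedScraper:
    def __init__(self, requests_per_minute=20):
        # RateLimiter is not defined in this snippet; it stands for any async
        # context manager that waits until a request slot is free
        self.rate_limit = RateLimiter(requests_per_minute)

    async def scrape_with_limits(self, url):
        # Entering the limiter blocks until the next request is allowed
        async with self.rate_limit:
            return await self.scrape_page(url)

    async def scrape_page(self, url):
        raise NotImplementedError  # placeholder: put the actual fetch logic here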
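The `RateLimiter` used above is left undefined. One minimal way to fill it in, assuming a simple fixed-interval policy (each request starts at least 60 / requests_per_minute seconds after the previous one), is sketched below with only the standard library. The internals here are an assumption for illustration; a token-bucket limiter or a third-party package would serve equally well.

```python
import asyncio
import time


class RateLimiter:
    """Async context manager spacing entries at least 60/rpm seconds apart."""

    def __init__(self, requests_per_minute):
        self.min_interval = 60.0 / requests_per_minute
        self._lock = asyncio.Lock()
        self._next_slot = 0.0  # monotonic timestamp of the next free slot

    async def __aenter__(self):
        # The lock serialises callers so concurrent tasks queue up in turn
        async with self._lock:
            wait = self._next_slot - time.monotonic()
            if wait > 0:
                await asyncio.sleep(wait)
            self._next_slot = time.monotonic() + self.min_interval
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return False  # never swallow exceptions raised inside the block


async def demo():
    limiter = RateLimiter(requests_per_minute=20)
    for i in range(3):
        async with limiter:
            print(f'request {i} allowed at {time.monotonic():.2f}')

# asyncio.run(demo())  # the three requests are spaced about 3 seconds apart
```

Creating the limiter inside a running coroutine, as `demo` does, avoids event-loop binding issues with `asyncio.Lock` on older Python versions.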
2. Data Protection
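from cryptography.fernet import Fernet


class SecureEmailStorage:
    def __init__(self, encryption_key):
        # Pass in a persistent key (for example from a secrets manager). A key
        # generated per instance with Fernet.generate_key() is lost on restart,
        # which leaves previously stored data undecryptable
        self.cipher_suite = Fernet(encryption_key)

    def store_email(self, email):
        encrypted_email = self.cipher_suite.encrypt(email.encode())
        return self.save_to_database(encrypted_email)

    def save_to_database(self, encrypted_email):
        raise NotImplementedError  # placeholder: persistence layer goes here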
Remember: Always prioritize legal compliance and respect website owners' rights when scraping email addresses.
Extracting emails at scale without getting blocked
Two practical problems show up once you move past a handful of pages. First, addresses are rarely sitting in plain text: sites use Cloudflare email obfuscation (the data-cfemail attribute that decodes client-side), HTML entity encoding, [at]/[dot] munging, or render contact details only after JavaScript runs. A regex over the raw HTML will miss all of these, so you need to decode the Cloudflare scheme, normalise entities, and render JS-heavy pages in a real browser before extracting.
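The decoding steps can be sketched as follows. This is a minimal illustration, not a complete extractor: the function names are hypothetical, the munging patterns cover only bracketed or parenthesised "at" and "dot", and JavaScript-rendered pages still need a browser before this code sees the HTML. The Cloudflare scheme is a hex string whose first byte is an XOR key for the remaining bytes.

```python
import html
import re

EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
CFEMAIL_RE = re.compile(r'data-cfemail="([0-9a-fA-F]+)"')


def decode_cfemail(encoded):
    """Decode a Cloudflare data-cfemail value: first byte is the XOR key."""
    key = int(encoded[:2], 16)
    return ''.join(
        chr(int(encoded[i:i + 2], 16) ^ key)
        for i in range(2, len(encoded), 2)
    )


def demunge(text):
    """Turn '[at]' / '(dot)' style obfuscation back into '@' and '.'."""
    text = re.sub(r'\s*[\[(]\s*at\s*[\])]\s*', '@', text, flags=re.IGNORECASE)
    text = re.sub(r'\s*[\[(]\s*dot\s*[\])]\s*', '.', text, flags=re.IGNORECASE)
    return text


def extract_emails(raw_html):
    # 1. Cloudflare-protected addresses live in an attribute, not in the text
    found = {decode_cfemail(value) for value in CFEMAIL_RE.findall(raw_html)}
    # 2. Resolve HTML entities such as &#64; before running the regex
    # 3. Undo [at]/[dot] munging, then apply the usual pattern
    found.update(EMAIL_RE.findall(demunge(html.unescape(raw_html))))
    return found
```

Calling `extract_emails` on a page's HTML returns a set of addresses that can then go through the same validation step as plain-text matches.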
Second, the contact and team pages that hold these addresses are often among the most heavily protected on a site, sitting behind anti-bot detection and rate limits. Aggressive crawling gets your IP blocked quickly, so pair conservative rate limiting and proxy rotation with the legal and consent practices above. A web scraping API can handle JavaScript rendering and proxy rotation in one call and return HTML that is ready to decode, which keeps a compliant email-collection workflow from stalling the moment it hits a protected page.
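One way to keep rate limiting conservative is to slow down whenever a site signals overload, typically with HTTP 429 or 503 and an optional Retry-After header. The helper below is a sketch under assumptions: the function name, the base of 2 seconds, and the 300-second cap are illustrative choices, and only the numeric form of Retry-After is handled.

```python
import random


def backoff_delay(attempt, retry_after=None, base=2.0, cap=300.0):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Prefer the server's own hint when it is a number of seconds
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # Retry-After can also be an HTTP date; fall back below
    # Exponential growth with jitter so parallel workers do not retry in sync
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(ceiling / 2, ceiling)
```

A caller would sleep for `backoff_delay(attempt, response.headers.get('Retry-After'))` before retrying, and give up after a small fixed number of attempts.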
