
Understanding Robots.txt and Legal Considerations in Web Scraping

Before you write a single line of scraping code, understanding the rules of the road is essential. The robots.txt file serves as the first handshake between your crawler and a website, while a growing body of law shapes what data you can collect, how you store it, and what you do with it. This guide covers the technical syntax of robots.txt, the major legal frameworks that affect web scraping, and the ethical best practices that keep your ecommerce data operations sustainable and compliant.

What Is Robots.txt?

The Robots Exclusion Protocol, commonly known as robots.txt, is a plain-text file placed at the root of a website (e.g., https://example.com/robots.txt) that communicates crawling preferences to automated bots. Originally proposed by Martijn Koster in 1994, the protocol has become the de facto standard for webmaster-to-crawler communication, though it was only formalized as an internet standard (RFC 9309) in 2022.

Advisory, Not Enforced

Robots.txt is a voluntary protocol. There is no technical mechanism that prevents a bot from ignoring it. However, major search engines and reputable scraping services honor robots.txt directives, and courts have increasingly treated ignoring robots.txt as evidence of bad faith.

Universal Adoption

Nearly every major ecommerce platform, from Amazon and Walmart to Shopify storefronts, publishes a robots.txt file. Understanding how to read these files is the first step in any responsible data collection strategy, whether you rely on scraping or APIs.

For ecommerce professionals, robots.txt matters because many product pages, pricing endpoints, and category listings may be explicitly allowed or disallowed. A well-configured scraper checks robots.txt first, respects the directives, and adjusts its crawl plan accordingly. To understand the full technical picture, see DataWeBot's guide on how ecommerce price scrapers work.

Robots.txt Syntax Deep Dive

A robots.txt file consists of one or more groups, each beginning with a User-agent line followed by Allow and Disallow directives. Here is a breakdown of the key directives you will encounter.

User-agent

Specifies which crawler the following rules apply to. An asterisk (*) matches all bots. Specific names like Googlebot or Bingbot target individual crawlers. If no group in the file names your scraper's user-agent, the wildcard rules apply.

Disallow

Tells bots not to access the specified path. Disallow: /admin/ blocks access to all URLs under /admin/. An empty Disallow: means nothing is blocked for that user-agent.

Allow

Overrides a broader Disallow rule for a specific path. For example, you might see Disallow: /products/ followed by Allow: /products/public/, which blocks all product pages except those in the public subdirectory.
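To make the override behavior concrete, here is a minimal sketch of the "most specific match wins" rule described in RFC 9309, where the longest matching path decides and Allow wins a tie. The function name `is_allowed` and the rule format are assumptions made for illustration, and the sketch handles plain path prefixes only (no `*` or `$` wildcards), so treat it as a teaching aid rather than a replacement for a tested parser.

```python
def is_allowed(path, rules):
    """Return True if `path` may be fetched under `rules`.

    `rules` is a list of (directive, prefix) pairs, for example
    ("disallow", "/products/"). Simplification: prefix matching only.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty pattern matches nothing
        is_allow = directive.lower() == "allow"
        # The longer prefix wins; on equal length, prefer Allow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("disallow", "/products/"), ("allow", "/products/public/")]
print(is_allowed("/products/public/widget", rules))   # True
print(is_allowed("/products/private/widget", rules))  # False
print(is_allowed("/about", rules))                    # True
```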

Crawl-delay

A non-standard but widely supported directive that tells bots to wait a specified number of seconds between requests. A value of Crawl-delay: 10 means your scraper should wait at least 10 seconds between page fetches. While Google ignores this directive, many ecommerce sites rely on it to protect their infrastructure.

Sitemap

Points to an XML sitemap, which can be incredibly valuable for ecommerce scraping because it lists all product URLs, categories, and last-modified dates. This lets you build efficient crawl schedules that only revisit pages that have changed.
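As a rough illustration of using a sitemap to revisit only changed pages, the sketch below parses a small inline sitemap and keeps URLs modified on or after the last crawl date. The sample XML, the function name `urls_changed_since`, and the day-level date comparison are assumptions for the example; real sitemaps may be gzipped, split into index files, or omit lastmod entirely.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, inlined so the example runs offline.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/a</loc><lastmod>2024-01-10</lastmod></url>
  <url><loc>https://example.com/products/b</loc><lastmod>2024-03-05</lastmod></url>
  <url><loc>https://example.com/products/c</loc></url>
</urlset>"""


def urls_changed_since(sitemap_xml, last_crawl):
    """Return URLs whose lastmod is on or after `last_crawl` (a date).

    URLs without a lastmod are included, since we cannot tell
    whether they changed.
    """
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        if loc is None:
            continue
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if lastmod is None:
            changed.append(loc.strip())
            continue
        # Compare at day granularity: keep only the YYYY-MM-DD part.
        modified = date.fromisoformat(lastmod.strip()[:10])
        if modified >= last_crawl:
            changed.append(loc.strip())
    return changed


print(urls_changed_since(SAMPLE_SITEMAP, date(2024, 2, 1)))
# ['https://example.com/products/b', 'https://example.com/products/c']
```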

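Pro Tip: Parse robots.txt programmatically using a library like Python's urllib.robotparser rather than reading it manually. A parser applies user-agent groups and rule matching consistently, but libraries differ in how they handle wildcard patterns and path precedence, so check your parser's behavior against the edge cases you care about.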
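A minimal sketch of that approach with the standard library follows. The robots.txt content, the bot name `ExampleBot`, and the helper `load_rules` are made up for the example. The sample lists Allow before Disallow so that parsers using first-match and parsers using longest-match should agree on the result.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Allow: /products/public/
Disallow: /products/
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml
"""


def load_rules(robots_txt):
    """Parse robots.txt text and return a configured parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


BOT = "ExampleBot"  # assumed user-agent name for this sketch
rules = load_rules(SAMPLE_ROBOTS_TXT)

print(rules.can_fetch(BOT, "https://example.com/products/public/widget"))
print(rules.can_fetch(BOT, "https://example.com/products/private/widget"))
print(rules.crawl_delay(BOT))  # seconds to wait between requests, or None
print(rules.site_maps())       # list of sitemap URLs, or None (Python 3.8+)
```

Against a live site, calling `set_url()` with the robots.txt address and then `read()` fetches and parses the file in place of `parse()`.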

CFAA and US Law

The Computer Fraud and Abuse Act (CFAA) is the primary federal statute in the United States that has been applied to web scraping cases. Originally enacted in 1986 to combat computer hacking, its application to scraping has been shaped by several landmark cases.

hiQ Labs v. LinkedIn (2022)

The Ninth Circuit, affirming a preliminary injunction in hiQ's favor, held that scraping publicly available data likely does not violate the CFAA because there is no "unauthorized access" when information is available to anyone with a web browser. This case is widely considered a landmark victory for web scraping, though its scope is limited to publicly accessible data and to the Ninth Circuit's jurisdiction.

Van Buren v. United States (2021)

The Supreme Court narrowed the CFAA's "exceeds authorized access" provision, ruling that it applies only to those who access information they are not entitled to obtain, not those who misuse information they are entitled to access. This decision reduced the risk of CFAA liability for scrapers accessing public pages.

State-Level Laws

Many US states have their own computer fraud statutes that may impose additional restrictions. California, Virginia, and Illinois have particularly active enforcement of data-related laws. Some state laws are broader than the CFAA and may capture scraping activities that federal law permits.

Important: Even after hiQ v. LinkedIn, scraping data behind a login wall, circumventing CAPTCHAs or IP blocks, or ignoring cease-and-desist letters can still create significant legal risk under the CFAA and related doctrines.

GDPR and EU Regulations

The General Data Protection Regulation (GDPR) imposes strict requirements on the collection and processing of personal data belonging to EU residents, regardless of where the scraper is located. For ecommerce scraping, this has several practical implications.

  • Product data is generally safe

    Prices, descriptions, specifications, stock levels, and other product attributes are not personal data and fall outside GDPR scope.

  • Review data requires caution

    Customer reviews that include names, locations, or other identifiers are personal data under GDPR. You need a lawful basis (usually legitimate interest) to collect and process them.

  • Seller information varies

    Business contact details on marketplace listings may be personal data if the seller is a sole proprietor. Corporate seller information is typically outside GDPR scope.

  • Data minimization applies

    Even when you have a lawful basis, GDPR requires that you collect only the data you actually need and retain it only as long as necessary.

  • The EU Database Directive

    Beyond GDPR, the EU grants sui generis database rights that protect the investment made in compiling a database, even if the individual data points are not copyrightable.
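One way to put the data minimization point into practice is a field allow-list applied before anything is stored. The sketch below is illustrative only: the field names and the `minimize` helper are hypothetical, and free-text fields such as review bodies can still contain personal data that an allow-list will not catch.

```python
# Hypothetical allow-list: only the fields the analysis actually needs.
ALLOWED_FIELDS = {"product_id", "rating", "review_date"}


def minimize(record, allowed=ALLOWED_FIELDS):
    """Return a copy of `record` containing only allow-listed fields."""
    return {key: value for key, value in record.items() if key in allowed}


scraped_review = {
    "product_id": "SKU-123",
    "rating": 4,
    "review_date": "2024-03-05",
    "reviewer_name": "Jane D.",     # personal data, dropped
    "reviewer_location": "Berlin",  # personal data, dropped
}

print(minimize(scraped_review))
# {'product_id': 'SKU-123', 'rating': 4, 'review_date': '2024-03-05'}
```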

Working with a managed scraping provider like DataWeBot can simplify GDPR compliance because the provider handles data processing agreements, retention policies, and anonymization as part of their service.

Ethical Scraping Practices

Beyond legal compliance, ethical scraping is about being a good citizen of the web. Responsible scraping practices protect both you and the sites you collect data from, ensuring long-term sustainability of your data operations.

Identify Your Bot

Set a descriptive User-Agent string that includes your company name and a contact URL or email. This lets webmasters reach out if your crawler causes issues, rather than simply blocking you.
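As a small illustration, a descriptive User-Agent can be attached when the request is built. The bot name, info URL, and email address below are placeholders, not real endpoints, and `build_request` is a helper invented for this sketch.

```python
import urllib.request

# Placeholder identity: substitute your own bot name and contact details.
USER_AGENT = "ExampleCorpBot/1.0 (+https://example.com/bot-info; bot-admin@example.com)"


def build_request(url):
    """Create a request that identifies the crawler to the site owner."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


request = build_request("https://example.com/products/widget")
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```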

Respect Disallow Rules

Always honor robots.txt directives, even when they are technically unenforceable. Ignoring them signals bad faith and may be used as evidence against you in legal proceedings.

Scrape During Off-Peak

Schedule intensive crawls during a site's off-peak hours to minimize impact on their infrastructure. For US-based ecommerce sites, this typically means late night to early morning Eastern Time.

Monitor Server Impact

Watch for HTTP 429 (Too Many Requests) and 503 (Service Unavailable) responses. If you receive these, immediately reduce your crawl rate. A well-behaved scraper adapts its speed based on server feedback.
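The adaptive behavior described above could look roughly like the following sketch. The class name `AdaptiveThrottle`, the doubling and 10% recovery factors, and the assumption that Retry-After arrives as a number of seconds are all illustrative choices rather than fixed rules.

```python
class AdaptiveThrottle:
    """Track a per-site delay that grows on 429/503 and recovers slowly."""

    def __init__(self, base_delay=2.0, max_delay=60.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.delay = base_delay

    def record(self, status_code, retry_after=None):
        """Update and return the delay (seconds) after a response.

        `retry_after` is the Retry-After value in seconds, if the
        server sent one (HTTP-date values are not handled here).
        """
        if status_code in (429, 503):
            slowed = self.delay * 2  # back off sharply
            if retry_after is not None:
                slowed = max(slowed, float(retry_after))
            self.delay = min(slowed, self.max_delay)
        elif 200 <= status_code < 400:
            # Recover gradually, never dropping below the base delay.
            self.delay = max(self.base_delay, self.delay * 0.9)
        return self.delay


throttle = AdaptiveThrottle(base_delay=2.0, max_delay=60.0)
for status in (200, 429, 429, 200):
    print(status, throttle.record(status))
```

A crawler would sleep for `throttle.delay` seconds before each request to that site.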

Ethical scraping is not just about avoiding lawsuits. It protects your reputation, ensures data quality (blocked scrapers get incomplete data), and builds sustainable relationships with the sites you depend on for business intelligence.

Rate Limiting and Politeness

Rate limiting is one of the most practical aspects of responsible scraping. Getting it right means you collect data reliably without disrupting the target site. Getting it wrong means your IP gets blocked, your data pipeline breaks, and you may face legal action.

Recommended Rate Limits by Site Type

Large marketplaces (Amazon, eBay): 1-2 requests per second
Mid-size ecommerce sites: 1 request every 2-5 seconds
Small Shopify/WooCommerce stores: 1 request every 5-10 seconds
Sites with Crawl-delay directive: Follow the specified delay
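These recommendations translate naturally into a small lookup, sketched below. The category keys and the `pick_delay` helper are assumptions for illustration. The ranges mirror the list above, with 1-2 requests per second expressed as a 0.5-1.0 second gap.

```python
import random

# Delay ranges in seconds between requests, mirroring the list above.
DELAY_RANGES = {
    "large_marketplace": (0.5, 1.0),  # 1-2 requests per second
    "mid_size": (2.0, 5.0),
    "small_store": (5.0, 10.0),
}


def pick_delay(site_type, crawl_delay=None):
    """Return the number of seconds to wait before the next request.

    A Crawl-delay from robots.txt, when present, takes priority.
    Otherwise a random value inside the site type's range is used.
    """
    if crawl_delay is not None:
        return float(crawl_delay)
    low, high = DELAY_RANGES[site_type]
    return random.uniform(low, high)


print(pick_delay("mid_size"))                    # somewhere in 2.0-5.0
print(pick_delay("small_store", crawl_delay=10)) # 10.0
```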

Beyond basic rate limiting, advanced politeness strategies include exponential backoff when you receive error responses, randomized delays between requests to avoid detection patterns, and session-based throttling that distributes requests across multiple IP addresses to reduce per-IP load on the target server. DataWeBot's smart rate limiting system handles all of these strategies automatically.
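For the retry side of this, a common pattern is exponential backoff with random jitter. The sketch below uses the "full jitter" variant, which is one of several reasonable choices. The function name `backoff_delay` and the default base and cap values are assumptions, not settings taken from any particular tool.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Return a randomized wait (seconds) before retry number `attempt`.

    The upper bound doubles with each attempt (base * 2**attempt) up to
    `cap`, and the actual wait is drawn uniformly between 0 and that
    bound so that retries from many workers do not line up.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)


for attempt in range(5):
    print(attempt, round(backoff_delay(attempt), 2))
```

In a crawler, `attempt` would start at 0 after the first failed request and increase by one for each consecutive error from the same site.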

Caching is another important politeness mechanism. If a product page has not changed since your last visit (check the Last-Modified or ETag headers), there is no need to re-download the full page. Conditional requests send If-Modified-Since (paired with Last-Modified) or If-None-Match (paired with ETag), and the server can answer 304 Not Modified with no body, which reduces bandwidth for both you and the target site.
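A minimal sketch of building those conditional headers from cached validators follows. The helper names and the sample header values are hypothetical, and the code deliberately stays independent of any particular HTTP client.

```python
def conditional_headers(etag=None, last_modified=None):
    """Build request headers from validators saved on the previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def needs_refetch(status_code):
    """A 304 Not Modified response means the cached copy is still valid."""
    return status_code != 304


# Validators as they might have been stored from an earlier response.
cached = {"etag": '"abc123"', "last_modified": "Tue, 05 Mar 2024 10:00:00 GMT"}

print(conditional_headers(cached["etag"], cached["last_modified"]))
print(needs_refetch(304))  # False
print(needs_refetch(200))  # True
```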

Terms of Service Compliance

Most ecommerce websites include terms of service (ToS) that explicitly address automated data collection. While the enforceability of ToS provisions against scrapers remains a contested legal question, understanding and respecting these terms is an important component of a responsible scraping strategy.

Common ToS Provisions

  • Prohibition on automated access or use of bots, spiders, and scrapers
  • Restrictions on reproducing, distributing, or creating derivative works from site content
  • Requirements to use official APIs for data access where available
  • Limits on the volume or frequency of data access
  • Reservation of rights to block, throttle, or take legal action against violators

Practical Approach

The safest approach is to use official APIs when they are available, supplement with scraping only for data points the API does not cover, and always maintain a record of your compliance efforts. If a site sends a cease-and-desist letter, take it seriously and consult legal counsel before continuing to scrape that site.

Many ecommerce data providers, including DataWeBot, handle ToS compliance as part of their service by maintaining relationships with data sources, using authorized access methods where available, and structuring data collection to minimize legal exposure for their clients.

Scrape Responsibly with DataWeBot

DataWeBot handles robots.txt compliance, rate limiting, and legal best practices so you can focus on using ecommerce data to grow your business. Our managed scraping infrastructure respects site policies while delivering the comprehensive product data you need.

Navigating the Legal Landscape of Web Scraping

DataWeBot operates within a legal framework that has evolved significantly through landmark court decisions. The 2022 hiQ Labs v. LinkedIn ruling by the Ninth Circuit established that scraping publicly available data does not violate the Computer Fraud and Abuse Act, providing important legal clarity for businesses that rely on publicly accessible web data. However, DataWeBot recognizes this ruling does not grant blanket permission for all scraping activities. Courts continue to weigh whether data is behind a login wall, whether scraping causes technical harm to the target site, and whether scraped data is used in ways that violate intellectual property rights or contractual agreements like terms of service.

DataWeBot treats robots.txt compliance as a foundational element of its legal risk management strategy. While robots.txt is technically a voluntary protocol — a suggestion rather than a legal mandate — courts have increasingly considered robots.txt compliance as evidence of good faith in scraping disputes. DataWeBot respects crawl-delay directives, identifies its bots with descriptive user-agent strings, avoids request rates that could degrade site performance, and maintains documentation of all compliance efforts. This approach minimizes legal risk while preserving access to the competitive intelligence that drives informed business decisions.

Robots.txt and Web Scraping Legal FAQs

Common questions about robots.txt compliance and the legal landscape of web scraping.

DataWeBot operates within established legal precedent: the US hiQ v. LinkedIn decision established that scraping publicly available data does not violate the CFAA. However, legality depends on multiple factors including the type of data collected, how it is used, whether technical barriers are circumvented, and the jurisdiction. DataWeBot focuses on product pricing and specification data, which is generally lower risk than personal data like reviews with user information.

DataWeBot always honors robots.txt directives because ignoring them carries serious consequences. IP addresses may be blocked, scraping infrastructure may be fingerprinted and banned, and in legal disputes, ignoring robots.txt is cited as evidence of unauthorized access or bad faith. DataWeBot's residential proxy network maintains compliant access, and courts consistently consider robots.txt compliance when evaluating scraping lawsuits.

DataWeBot's ecommerce data extraction is designed to minimize GDPR exposure. Pure product data — prices, descriptions, and stock levels — are not personal data and fall outside GDPR scope. However, if extraction collects reviewer names, seller contact details, or any information that could identify a natural person, DataWeBot applies GDPR-compliant practices including lawful processing basis, data minimization, and data subject rights.

DataWeBot handles sites that actively block scrapers through compliant access methods. The ethical and legal approach is to first check if the site offers an API, then consider reaching out to request data access. DataWeBot explores the trade-offs between scraping and official APIs and maintains compliant access to data sources, handling the technical and legal complexities on behalf of clients.

DataWeBot uses a safe starting rate of one request every 2-5 seconds for mid-size sites, adjusting based on the site's Crawl-delay directive if present. Large marketplaces can often handle 1-2 requests per second, while small stores may require 5-10 second delays. DataWeBot monitors for 429 and 503 responses and automatically reduces crawl rate when encountered.

DataWeBot parses and respects the Robots Exclusion Protocol — a standard that allows website owners to communicate crawling preferences through a plain-text robots.txt file placed at the site root. The file contains directives specifying which paths bots are allowed or disallowed from accessing. DataWeBot honors these directives as a matter of standard practice, consistent with major search engines and reputable crawlers.

DataWeBot's legal compliance framework is grounded in the hiQ v. LinkedIn case (2022) — a landmark Ninth Circuit decision establishing that scraping publicly available data does not violate the Computer Fraud and Abuse Act because there is no unauthorized access when information is visible to any web browser user. DataWeBot operates within this precedent, focusing on publicly accessible information while maintaining compliance across all relevant jurisdictions.

DataWeBot's standard ecommerce data extraction collects product prices, specifications, and stock levels — non-personal data that falls outside the scope of GDPR. DataWeBot avoids capturing customer review content with reviewer names or sole proprietor contact details that would trigger GDPR obligations. DataWeBot's data minimization policy ensures collection of only the product-level data actually needed.

DataWeBot respects the crawl-delay directive — a non-standard but widely supported robots.txt instruction that tells bots to wait a specified number of seconds between requests. DataWeBot honors this directive even though major search engines like Google ignore it, because ecommerce sites rely on it to protect their server infrastructure. Respecting crawl-delay demonstrates good faith and helps DataWeBot maintain long-term access to data sources.

DataWeBot operates exclusively on publicly available data that any visitor can see in a browser — the lower legal risk category. DataWeBot does not circumvent access controls such as login walls, CAPTCHAs, IP blocks, or rate limiters, which would introduce significant legal risk as potential unauthorized access under computer fraud statutes. This distinction between accessing public information and bypassing technical barriers is a critical factor in DataWeBot's compliance framework.

DataWeBot documents compliance efforts and works within terms of service frameworks where possible. Most websites prohibit automated data collection in their ToS, but the enforceability of these provisions remains a contested legal question. Courts have reached different conclusions depending on whether the user agreed via a clickwrap agreement, whether terms were reasonably conspicuous, and the jurisdiction. DataWeBot recommends consulting legal counsel when scraping sites with restrictive terms.

DataWeBot's legal team monitors the distinction between clickwrap and browsewrap agreements in scraping case law. A clickwrap agreement requires the user to actively click 'I agree' before accessing a service, creating a stronger contractual relationship. A browsewrap agreement states that merely using the website constitutes acceptance, without any affirmative action. Courts generally enforce clickwrap agreements more readily, while browsewrap agreements are often found unenforceable against scrapers because bots never had an opportunity to read or agree to the terms.

DataWeBot's EU data operations account for the EU Database Directive, which grants sui generis rights to database creators who have made a substantial investment in obtaining, verifying, or presenting database contents. This protection exists independently of copyright and can apply even when individual data points are not copyrightable. DataWeBot avoids extracting substantial portions of European databases, even when individual prices or product details are factual and non-copyrightable.

DataWeBot uses IP rotation to distribute requests across multiple IP addresses rather than sending all requests from a single source — a technique that avoids triggering rate limits on target sites. DataWeBot uses IP rotation as a politeness mechanism, not to circumvent explicit access restrictions. Using IP rotation to evade access controls after receiving a cease-and-desist letter could be viewed as evidence of intentional evasion in legal proceedings, which DataWeBot avoids.

DataWeBot's CCPA compliance framework accounts for the California Consumer Privacy Act, which gives California residents the right to know what personal information businesses collect about them and to request its deletion. If DataWeBot's extraction collects data that identifies California residents — such as names from product reviews or seller profiles — CCPA obligations apply regardless of where the business is located. DataWeBot is prepared to honor deletion requests and discloses data collection practices in its privacy policy.

DataWeBot takes cease-and-desist letters seriously as a matter of legal policy. A cease-and-desist letter is a formal written notice from a website owner demanding that scraping stop, typically citing terms of service violations, trespass to chattels, or computer fraud statutes. Receiving one does not mean the law has been broken, but DataWeBot immediately pauses extraction from the affected site, consults legal counsel to evaluate the position, and explores alternative data access methods such as official APIs.

DataWeBot's rate limiting practices are designed to avoid trespass to chattels claims — a legal theory alleging intentional interference with another party's personal property causing harm. Website owners have argued that excessive automated requests consume server resources and degrade site performance. DataWeBot maintains crawl rates that cannot measurably impact website performance or availability, which is the threshold courts require plaintiffs to demonstrate for this claim to succeed.