How to Scrape Any Website and Get Clean Markdown

TL;DR: Send any URL to the Crawly scrape API and get back clean Markdown with metadata. It handles JavaScript rendering, anti-bot bypass, and content extraction automatically. One request, one credit ($0.001), structured output every time.

Why Markdown Instead of HTML?

When you scrape a webpage, the raw HTML includes everything: navigation menus, ad scripts, cookie banners, tracking pixels, footer links, and the actual content buried somewhere in between. A typical webpage has 50-200KB of HTML. The useful content is usually 2-5KB.

Markdown strips away the noise and preserves what matters: headings, paragraphs, lists, links, emphasis, and code blocks. Here is why developers prefer it:

LLM-ready: Feed directly into GPT, Claude, or any language model as context. Markdown uses 10-50x fewer tokens than raw HTML
Preserves structure: Headings, lists, links, bold, italic, and code blocks are maintained in a readable format
Compact: A 150KB HTML page becomes a 3KB Markdown file
Human-readable: You can glance at the output and verify the content is correct
Universal: Works with any programming language, tool, or pipeline
Storage-efficient: Store thousands of scraped pages without blowing up your database

How It Works

Send a POST request to /v1/scrape with a URL. Crawly loads the page in a headless browser, waits for JavaScript to render, extracts the main content, converts it to Markdown, and returns the result with metadata.

The entire process takes 2-8 seconds depending on the page complexity. Failed requests are automatically refunded.

Step-by-Step: Scrape Your First Page

1. Get an API Key

2. Make a Request

curl -X POST https://api.crawly.bikal.co/v1/scrape \
  -H "Authorization: Bearer cr_your_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
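The same request can be sketched in Python using only the standard library. The endpoint, headers, and body mirror the curl command above; the function names and the 60-second client timeout are illustrative assumptions, not part of the Crawly API.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://api.crawly.bikal.co/v1/scrape"


def build_scrape_request(target_url: str, api_key: str) -> urllib.request.Request:
    """Build the same POST request as the curl command, without sending it."""
    body = json.dumps({"url": target_url}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def scrape(target_url: str, api_key: str, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body.

    The timeout is a client-side choice (assumption), not an API parameter.
    """
    request = build_scrape_request(target_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `scrape("https://example.com", "cr_your_key")` would send the request; building the request separately makes it easy to inspect headers and body before spending a credit.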

3. Get Your Markdown

The API returns clean Markdown with metadata:

{
  "success": true,
  "url": "https://example.com",
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples in documents...",
  "metadata": {
    "title": "Example Domain",
    "description": "",
    "language": "en",
    "word_count": 83
  }
}
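A response shaped like the one above could be unpacked as follows. This is a minimal sketch: the field names come from the sample response, while the `ScrapeResult` class and `parse_scrape_response` function are hypothetical helpers introduced for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class ScrapeResult:
    """Flat view of the fields shown in the sample response."""
    url: str
    markdown: str
    title: str
    description: str
    language: str
    word_count: int


def parse_scrape_response(raw: str) -> ScrapeResult:
    """Parse a JSON response body and fail loudly if success is not true."""
    payload = json.loads(raw)
    if not payload.get("success"):
        raise ValueError("scrape did not succeed")
    meta = payload.get("metadata", {})
    return ScrapeResult(
        url=payload["url"],
        markdown=payload["markdown"],
        title=meta.get("title", ""),
        description=meta.get("description", ""),
        language=meta.get("language", ""),
        word_count=meta.get("word_count", 0),
    )


# The sample response from above, as a raw string so the \n escapes stay JSON escapes.
SAMPLE_RESPONSE = r'''
{
  "success": true,
  "url": "https://example.com",
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples in documents...",
  "metadata": {
    "title": "Example Domain",
    "description": "",
    "language": "en",
    "word_count": 83
  }
}
'''

result = parse_scrape_response(SAMPLE_RESPONSE)
print(result.title, result.word_count)
```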

What Gets Extracted

Crawly does not just dump the entire page into Markdown. It identifies the main content area and extracts only what matters:

Included	Excluded
Article body text	Navigation menus
Headings (h1-h6)	Ad scripts and tracking
Links with href	Cookie banners
Lists (ordered and unordered)	Footer boilerplate
Bold, italic, code formatting	Sidebar widgets
Tables	Popup overlays
Images (as Markdown links)	Social share buttons

JavaScript Rendering

Over 70% of modern websites use JavaScript frameworks like React, Next.js, Vue, or Angular. These sites return an empty HTML shell on a standard HTTP request. The actual content loads after JavaScript executes in the browser.

Crawly uses headless Chromium to fully render every page before extracting content. This means you get the complete page content regardless of what framework the site uses. There is no extra cost for JavaScript rendering. Every request includes it automatically.

Compared to other scraping APIs that charge 5-25x more for JS rendering, this is a significant cost advantage. On ScrapingBee, a JavaScript-rendered request costs 5 credits instead of 1. On Crawly, it is always 1 credit.

Anti-Bot Bypass

Many websites use anti-bot services like Cloudflare, Akamai, DataDome, and PerimeterX to block automated requests. Standard scrapers get blocked immediately because they use datacenter IPs and detectable browser fingerprints.

Crawly handles this with two techniques:

Residential proxy rotation: Every request is routed through a real consumer IP address. A new IP is used for each request, so rate limits and IP bans do not accumulate.
Browser fingerprinting: The headless browser mimics a real user with proper user-agent strings, viewport sizes, and browser headers. Anti-bot services see a normal browser visit, not a scraper.

If a request fails despite these measures, the credit is automatically refunded.

Metadata

Every response includes structured metadata extracted from the page:

Field	Description	Example
`title`	Page title from the title tag	"Getting Started with Next.js"
`description`	Meta description	"Learn how to build web apps with Next.js..."
`language`	Detected page language	"en"
`word_count`	Word count of extracted content	1,247

The metadata is useful for filtering, categorizing, and validating scraped content. You can check the word count to verify content was extracted, use the language to route content to the right processing pipeline, and use the title/description for indexing.
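As a rough illustration of those checks, the sketch below validates and routes a result using only the metadata fields listed in the table. The helper names, the minimum word threshold, and the pipeline names are assumptions chosen for the example.

```python
def has_content(metadata: dict, min_words: int = 1) -> bool:
    """Treat a scrape as usable only if it extracted at least min_words words."""
    return metadata.get("word_count", 0) >= min_words


def pipeline_for(metadata: dict, pipelines: dict, default: str = "default") -> str:
    """Pick a processing pipeline by detected language, falling back to a default."""
    return pipelines.get(metadata.get("language", ""), default)


# Hypothetical pipeline names keyed by language code.
PIPELINES = {"en": "english_pipeline", "de": "german_pipeline"}

metadata = {"title": "Example Domain", "description": "", "language": "en", "word_count": 83}

if has_content(metadata, min_words=50):
    print("route to:", pipeline_for(metadata, PIPELINES))
else:
    print("skip: too little content extracted")
```

A threshold above zero also catches pages that loaded but produced an empty Markdown string.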

Common Use Cases

RAG Pipelines

Retrieval-Augmented Generation (RAG) requires feeding external content into language models. Markdown is the ideal format because it preserves document structure while minimizing token count. A 150KB HTML page that would consume 40,000 tokens as raw HTML becomes a 3KB Markdown file using roughly 800 tokens.
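The arithmetic behind those figures can be checked in a few lines. The token counts are the ones quoted above; the 128,000-token context window is a hypothetical value used only to show how many pages fit.

```python
# Figures quoted in the paragraph above.
html_tokens = 40_000
markdown_tokens = 800

# Hypothetical context window size, chosen only for illustration.
context_window = 128_000

reduction = html_tokens / markdown_tokens            # 50.0
pages_as_html = context_window // html_tokens        # 3
pages_as_markdown = context_window // markdown_tokens  # 160

print(f"{reduction:.0f}x fewer tokens")
print(f"{pages_as_html} pages fit as HTML, {pages_as_markdown} as Markdown")
```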

Content Monitoring

Track changes on competitor pages, documentation sites, or news sources. Scrape pages on a schedule, store the Markdown, and diff against previous versions to detect changes.
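The diff step could look like the sketch below, which uses Python's standard `difflib` module. Scheduling and storage are left out; `markdown_diff` is a hypothetical helper name, and the two page versions are made-up sample data.

```python
import difflib


def markdown_diff(previous: str, current: str) -> list[str]:
    """Return unified-diff lines between two Markdown snapshots (empty if identical)."""
    return list(
        difflib.unified_diff(
            previous.splitlines(),
            current.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )


# Made-up snapshots of the same page from two scheduled scrapes.
old_snapshot = "# Pricing\n\nPlan: $10/month\n"
new_snapshot = "# Pricing\n\nPlan: $12/month\n"

changes = markdown_diff(old_snapshot, new_snapshot)
if changes:
    print("\n".join(changes))
```

Because the stored Markdown excludes navigation, ads, and other boilerplate, the diff tends to surface content changes rather than layout noise.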

Data Collection

Build datasets from public web content for research, analysis, or training. Markdown output is easy to parse, search, and store in databases.

Documentation Aggregation

Pull documentation from multiple sources into a single knowledge base. The Markdown output preserves headings and structure, making it easy to merge content from different sites.
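One possible way to merge pages is to put each source under its own top-level heading and push the page's headings down one level. This is only a sketch of that idea; `demote_headings` and `merge_documents` are hypothetical helpers, and the approach assumes ATX-style (`#`) headings in the Markdown output.

```python
import re

FENCE = "`" * 3  # code fence marker, built this way to keep the example readable


def demote_headings(markdown: str) -> str:
    """Add one '#' to every heading, leaving fenced code blocks untouched."""
    out = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        elif not in_fence and re.match(r"#{1,5} ", line):
            line = "#" + line
        out.append(line)
    return "\n".join(out)


def merge_documents(docs: list[tuple[str, str]]) -> str:
    """Merge (source title, markdown) pairs into a single Markdown document."""
    sections = [f"# {title}\n\n{demote_headings(body)}" for title, body in docs]
    return "\n\n".join(sections) + "\n"


merged = merge_documents([
    ("Site A", "# Intro\n\ntext"),
    ("Site B", "## Setup\n\nmore"),
])
print(merged)
```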

Using with AI Tools (MCP)

If you use Claude, Cursor, or Windsurf, you do not need to write API calls. Install the Crawly MCP server and ask your AI directly: "Scrape this URL and summarize the key points." The AI handles the API call and returns the Markdown content in your chat.

This is especially powerful for research workflows. Ask the AI to scrape multiple pages and compare their content, extract specific data points, or create summaries from several sources.

Error Handling

Error	Cause	Solution
`SCRAPE_FAILED`	Page could not be loaded or content extracted	Verify the URL loads in your browser. Some sites block all automated access.
`INVALID_URL`	URL format is not valid	Use a full URL with protocol (https://)
`TIMEOUT`	Page took too long to load	The page may be very heavy. Retry or try a simpler URL on the same domain.
`RATE_LIMITED`	Too many requests	Wait and retry. Limit is 60 requests per minute.

All failed requests are automatically refunded. You never pay for a scrape that returns an error.
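The table suggests retrying `TIMEOUT` and `RATE_LIMITED` but fixing the input for the other two errors. A minimal retry sketch along those lines follows. How the API reports the error code in its response is not shown in this article, so the code takes a caller-supplied `scrape_once` function (hypothetical) that returns a `(result, error_code)` pair; the backoff values are also assumptions.

```python
import time

# Error codes from the table above where a retry is the suggested fix.
RETRYABLE = {"TIMEOUT", "RATE_LIMITED"}


def scrape_with_retry(scrape_once, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call scrape_once(url) until it succeeds or a non-retryable error occurs.

    scrape_once is a hypothetical callable returning (result, error_code),
    where error_code is None on success.
    """
    for attempt in range(1, max_attempts + 1):
        result, error = scrape_once(url)
        if error is None:
            return result
        if error not in RETRYABLE or attempt == max_attempts:
            raise RuntimeError(f"scrape failed for {url}: {error}")
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        sleep(base_delay * 2 ** (attempt - 1))
```

Errors such as `INVALID_URL` raise immediately, since retrying the same input would not help.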

Pricing

Each scrape costs 1 credit ($0.001). JavaScript rendering and proxy rotation are included at no extra cost. There is no monthly subscription, and credits never expire. You get 100 free credits on signup.
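A quick way to estimate a bill from these numbers is sketched below. It assumes the 100 signup credits are consumed before paid credits, which this article does not state explicitly; `scrape_cost` is a hypothetical helper.

```python
from decimal import Decimal

PRICE_PER_CREDIT = Decimal("0.001")  # dollars per scrape, from the pricing above
FREE_CREDITS = 100                   # signup credits, from the pricing above


def scrape_cost(pages: int, free_credits: int = FREE_CREDITS) -> Decimal:
    """Dollar cost of scraping `pages` pages, assuming free credits are used first."""
    billable = max(pages - free_credits, 0)
    return billable * PRICE_PER_CREDIT


print(scrape_cost(10_000))  # 9,900 billable scrapes
print(scrape_cost(50))      # fully covered by the free credits
```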

For cost comparisons with other scraping APIs, read our pricing breakdown.

Frequently Asked Questions

What websites can Crawly scrape?

Any publicly accessible webpage. This includes JavaScript-rendered sites (React, Next.js, Vue), Cloudflare-protected pages, and static sites. It cannot access pages behind login walls, paywalls, or private networks.

How long does a scrape take?

Most requests complete in 2-8 seconds. Simple static pages are faster. Complex JavaScript-heavy pages with multiple API calls take longer. The maximum timeout is 30 seconds.

Can I scrape multiple pages at once?

The API processes one URL per request. To scrape multiple pages, send parallel requests. The rate limit is 60 requests per minute, so you can scrape up to 60 pages per minute.
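One way to send parallel requests without exceeding the limit is a small client-side sliding-window limiter in front of a thread pool. This is a sketch under assumptions: `RateLimiter` and `scrape_many` are hypothetical helpers, `scrape_one` is whatever function you use to call the API for a single URL, and the worker count is arbitrary.

```python
import threading
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Allow at most max_calls request starts within any `period` seconds."""

    def __init__(self, max_calls=60, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self._max_calls = max_calls
        self._period = period
        self._clock = clock
        self._sleep = sleep
        self._starts = deque()
        self._lock = threading.Lock()

    def wait(self):
        """Block until another request may start, then record its start time."""
        with self._lock:
            now = self._clock()
            # Forget request starts that have left the window.
            while self._starts and now - self._starts[0] >= self._period:
                self._starts.popleft()
            if len(self._starts) >= self._max_calls:
                # Window is full: sleep until the oldest start expires.
                self._sleep(self._period - (now - self._starts[0]))
                now = self._clock()
                self._starts.popleft()
            self._starts.append(now)


def scrape_many(urls, scrape_one, limiter, workers=8):
    """Scrape URLs in parallel, in input order, pacing starts through the limiter."""
    def task(url):
        limiter.wait()
        return scrape_one(url)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, urls))
```

With the default settings, the first 60 requests start immediately and later ones wait until earlier starts fall out of the 60-second window.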

Is the Markdown output customizable?

The API returns the best extraction of the main content area. You cannot currently customize which elements are included or excluded. If you need specific data from a page, you can post-process the Markdown output.

What happens if the page has no content?

If the page loads but has no extractable content (like a blank page or a page that is entirely an embedded application), the API returns an empty Markdown string. The request is still charged because the page was successfully loaded and rendered.

Ready to start scraping? Get 100 free credits or read the scrape API docs for the full parameter reference.