How to Scrape Any Website and Get Clean Markdown
TL;DR: Send any URL to the Crawly scrape API and get back clean Markdown with metadata. It handles JavaScript rendering, anti-bot bypass, and content extraction automatically. One request, one credit ($0.001), structured output every time.
Why Markdown Instead of HTML?
When you scrape a webpage, the raw HTML includes everything: navigation menus, ad scripts, cookie banners, tracking pixels, footer links, and the actual content buried somewhere in between. A typical webpage has 50-200KB of HTML. The useful content is usually 2-5KB.
Markdown strips away the noise and preserves what matters: headings, paragraphs, lists, links, emphasis, and code blocks. Here is why developers prefer it:
- LLM-ready: Feed directly into GPT, Claude, or any language model as context. Markdown typically uses 10-50x fewer tokens than raw HTML
- Preserves structure: Headings, lists, links, bold, italic, and code blocks are maintained in a readable format
- Compact: A 150KB HTML page becomes a 3KB Markdown file
- Human-readable: You can glance at the output and verify the content is correct
- Universal: Works with any programming language, tool, or pipeline
- Storage-efficient: Store thousands of scraped pages without blowing up your database
How It Works
Send a POST request to /v1/scrape with a URL. Crawly loads the page in a headless browser, waits for JavaScript to render, extracts the main content, converts it to Markdown, and returns the result with metadata.
The entire process takes 2-8 seconds, depending on page complexity. Failed requests are automatically refunded.
Step-by-Step: Scrape Your First Page
1. Get an API Key
Sign up at crawly.bikal.co and create a key from the API Keys page. You get 100 free credits, no credit card required.
2. Make a Request
curl -X POST https://api.crawly.bikal.co/v1/scrape \
  -H "Authorization: Bearer cr_your_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
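The same request can be made from Python using only the standard library. This is a minimal sketch: the endpoint, headers, and body mirror the curl command, while the function names `build_scrape_request` and `scrape` and the 35-second client timeout are assumptions made for illustration.

```python
import json
import urllib.request

# Endpoint taken from the curl example.
API_URL = "https://api.crawly.bikal.co/v1/scrape"


def build_scrape_request(target_url: str, api_key: str) -> urllib.request.Request:
    """Build the same POST request as the curl command, without sending it."""
    body = json.dumps({"url": target_url}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def scrape(target_url: str, api_key: str, timeout: float = 35.0) -> dict:
    """Send the request and decode the JSON body (performs a network call).

    urlopen raises urllib.error.HTTPError on 4xx/5xx status codes, so callers
    may want to catch that. The timeout value is an assumed client-side limit.
    """
    request = build_scrape_request(target_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to inspect or unit-test the request without spending a credit.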
3. Get Your Markdown
The API returns clean Markdown with metadata:
{
  "success": true,
  "url": "https://example.com",
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples in documents...",
  "metadata": {
    "title": "Example Domain",
    "description": "",
    "language": "en",
    "word_count": 83
  }
}
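In client code it can help to turn this JSON into a typed object. The sketch below assumes only the fields shown in the sample response; the `ScrapeResult` class and `parse_scrape_response` function are hypothetical names, not part of any official SDK.

```python
import json
from dataclasses import dataclass


@dataclass
class ScrapeResult:
    url: str
    markdown: str
    title: str
    description: str
    language: str
    word_count: int


def parse_scrape_response(raw: str) -> ScrapeResult:
    """Parse a successful response body into a ScrapeResult."""
    payload = json.loads(raw)
    if not payload.get("success"):
        # The shape of an error body is not shown in this article,
        # so this sketch simply refuses to parse it.
        raise ValueError("scrape did not succeed")
    meta = payload.get("metadata", {})
    return ScrapeResult(
        url=payload["url"],
        markdown=payload["markdown"],
        title=meta.get("title", ""),
        description=meta.get("description", ""),
        language=meta.get("language", ""),
        word_count=int(meta.get("word_count", 0)),
    )
```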
What Gets Extracted
Crawly does not just dump the entire page into Markdown. It identifies the main content area and extracts only what matters:
| Included | Excluded |
|---|---|
| Article body text | Navigation menus |
| Headings (h1-h6) | Ad scripts and tracking |
| Links with href | Cookie banners |
| Lists (ordered and unordered) | Footer boilerplate |
| Bold, italic, code formatting | Sidebar widgets |
| Tables | Popup overlays |
| Images (as markdown links) | Social share buttons |
JavaScript Rendering
Over 70% of modern websites use JavaScript frameworks like React, Next.js, Vue, or Angular. These sites return an empty HTML shell on a standard HTTP request. The actual content loads after JavaScript executes in the browser.
Crawly uses headless Chromium to fully render every page before extracting content. This means you get the complete page content regardless of what framework the site uses. There is no extra cost for JavaScript rendering. Every request includes it automatically.
Compared to other scraping APIs that charge 5-25x more for JS rendering, this is a significant cost advantage. On ScrapingBee, a JavaScript-rendered request costs 5 credits instead of 1. On Crawly, it is always 1 credit.
Anti-Bot Bypass
Many websites use anti-bot services like Cloudflare, Akamai, DataDome, and PerimeterX to block automated requests. Standard scrapers get blocked immediately because they use datacenter IPs and detectable browser fingerprints.
Crawly handles this with two techniques:
- Residential proxy rotation: Every request is routed through a real consumer IP address. A new IP is used for each request, so rate limits and IP bans do not accumulate.
- Browser fingerprinting: The headless browser mimics a real user with proper user-agent strings, viewport sizes, and browser headers. Anti-bot services see a normal browser visit, not a scraper.
If a request fails despite these measures, the credit is automatically refunded.
Metadata
Every response includes structured metadata extracted from the page:
| Field | Description | Example |
|---|---|---|
| title | Page title from the title tag | "Getting Started with Next.js" |
| description | Meta description | "Learn how to build web apps with Next.js..." |
| language | Detected page language | "en" |
| word_count | Word count of extracted content | 1,247 |
The metadata is useful for filtering, categorizing, and validating scraped content. You can check the word count to verify content was extracted, use the language to route content to the right processing pipeline, and use the title/description for indexing.
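As a rough illustration of the checks described above, the following sketch discards pages whose word count is suspiciously low and routes the rest by language. The function name, the 20-word threshold, and the pipeline names are all assumptions chosen for the example.

```python
from typing import Dict


def route_page(
    metadata: dict,
    pipelines: Dict[str, str],
    min_words: int = 20,
    fallback: str = "review",
) -> str:
    """Return the name of the pipeline that should handle a scraped page."""
    if metadata.get("word_count", 0) < min_words:
        # Very little text usually means extraction found no real content.
        return "discard"
    # Unknown or missing languages go to the fallback pipeline.
    return pipelines.get(metadata.get("language", ""), fallback)
```

For example, `route_page({"language": "en", "word_count": 1247}, {"en": "english_index"})` selects the `english_index` pipeline.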
Common Use Cases
RAG Pipelines
Retrieval-Augmented Generation (RAG) requires feeding external content into language models. Markdown is the ideal format because it preserves document structure while minimizing token count. A 150KB HTML page that would consume 40,000 tokens as raw HTML becomes a 3KB Markdown file using roughly 800 tokens.
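Because the Markdown keeps its headings, one simple way to prepare it for retrieval is to split it into one chunk per heading. This is a sketch of that idea, not a Crawly feature: `split_by_heading` is a hypothetical helper, and it only recognizes `#`-style headings outside fenced code blocks.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into chunks, one per heading, keeping the heading text."""
    chunks: List[Dict[str, str]] = []
    heading, lines, in_fence = "", [], False

    def flush() -> None:
        text = "\n".join(lines).strip()
        if heading or text:
            chunks.append({"heading": heading, "text": text})

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            # Toggle fence state so "# comment" lines in code are not headings.
            in_fence = not in_fence
            lines.append(line)
            continue
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Each chunk can then be embedded and stored with its heading as context, which tends to make retrieved passages easier to interpret.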
Content Monitoring
Track changes on competitor pages, documentation sites, or news sources. Scrape pages on a schedule, store the Markdown, and diff against previous versions to detect changes.
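The diff step can be done with Python's standard `difflib` module. The sketch below compares two stored Markdown snapshots; the function names are illustrative, and how the snapshots are stored and scheduled is left out.

```python
import difflib
from typing import List


def markdown_diff(previous: str, current: str) -> List[str]:
    """Return unified-diff lines between two Markdown snapshots."""
    return list(
        difflib.unified_diff(
            previous.splitlines(),
            current.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )


def has_changed(previous: str, current: str) -> bool:
    """True when the two snapshots differ in any line."""
    return bool(markdown_diff(previous, current))
```

Diffing Markdown rather than raw HTML avoids false alarms from changes in scripts, ads, or markup that do not affect the content.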
Data Collection
Build datasets from public web content for research, analysis, or training. Markdown output is easy to parse, search, and store in databases.
Documentation Aggregation
Pull documentation from multiple sources into a single knowledge base. The Markdown output preserves headings and structure, making it easy to merge content from different sites.
Using with AI Tools (MCP)
If you use Claude, Cursor, or Windsurf, you do not need to write API calls. Install the Crawly MCP server and ask your AI directly: "Scrape this URL and summarize the key points." The AI handles the API call and returns the Markdown content in your chat.
This is especially powerful for research workflows. Ask the AI to scrape multiple pages and compare their content, extract specific data points, or create summaries from several sources.
Error Handling
| Error | Cause | Solution |
|---|---|---|
| SCRAPE_FAILED | Page could not be loaded or content extracted | Verify the URL loads in your browser. Some sites block all automated access. |
| INVALID_URL | URL format is not valid | Use a full URL with protocol (https://) |
| TIMEOUT | Page took too long to load | The page may be very heavy. Retry or try a simpler URL on the same domain. |
| RATE_LIMITED | Too many requests | Wait and retry. Limit is 60 requests per minute. |
All failed requests are automatically refunded. You never pay for a scrape that returns an error.
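The "wait and retry" advice for TIMEOUT and RATE_LIMITED can be sketched as a small retry loop with exponential backoff. This article does not show what an error response looks like, so the `"error"` field holding the error code is an assumption, as are the function name, the attempt count, and the delays. `fetch` is any callable that takes a URL and returns the decoded response as a dict.

```python
import time
from typing import Callable

# Error codes from the table above that are worth retrying.
RETRYABLE = {"TIMEOUT", "RATE_LIMITED"}


def scrape_with_retry(
    fetch: Callable[[str], dict],
    url: str,
    max_attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch(url), retrying retryable errors with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    result: dict = {}
    for attempt in range(max_attempts):
        result = fetch(url)
        # Assumed error shape: {"success": false, "error": "<CODE>"}.
        if result.get("success") or result.get("error") not in RETRYABLE:
            return result
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...
    return result
```

Errors such as INVALID_URL are returned immediately, since retrying the same malformed URL cannot succeed.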
Pricing
Each scrape costs 1 credit ($0.001). JavaScript rendering and proxy rotation are included at no extra cost. There is no monthly subscription, and credits never expire. You get 100 free credits on signup.
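For budgeting, the arithmetic is simple enough to sketch in a few lines. The helper below is illustrative only: it assumes every counted page is a successful, charged request and that the 100 signup credits are used first.

```python
from decimal import Decimal

CREDIT_PRICE_USD = Decimal("0.001")  # price per credit stated above
FREE_CREDITS = 100                   # signup credits stated above


def estimated_cost_usd(pages: int, free_credits: int = FREE_CREDITS) -> Decimal:
    """Estimate the cost of scraping `pages` pages at 1 credit each."""
    billable = max(pages - free_credits, 0)
    return billable * CREDIT_PRICE_USD
```

Under these assumptions, 10,000 pages come to 9,900 billable credits, or $9.90.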
For cost comparisons with other scraping APIs, read our pricing breakdown.
Frequently Asked Questions
What websites can Crawly scrape?
Any publicly accessible webpage. This includes JavaScript-rendered sites (React, Next.js, Vue), Cloudflare-protected pages, and static sites. It cannot access pages behind login walls, paywalls, or private networks.
How long does a scrape take?
Most requests complete in 2-8 seconds. Simple static pages are faster. Complex JavaScript-heavy pages with multiple API calls take longer. The maximum timeout is 30 seconds.
Can I scrape multiple pages at once?
The API processes one URL per request. To scrape multiple pages, send requests in parallel. The rate limit is 60 requests per minute, so you can scrape up to 60 pages per minute.
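One way to send parallel requests while staying near the 60-per-minute limit is to space out request starts with a small limiter shared by a thread pool. This is a sketch under assumptions: `MinIntervalLimiter` and `scrape_many` are hypothetical names, `fetch` is any callable that scrapes one URL, and the spacing is approximate because thread scheduling adds jitter.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional


class MinIntervalLimiter:
    """Space out calls so roughly `per_minute` of them start each minute."""

    def __init__(self, per_minute: int = 60, clock=time.monotonic, sleep=time.sleep):
        self._interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot: Optional[float] = None

    def wait(self) -> None:
        # Reserve the next free time slot under the lock, then sleep outside it.
        with self._lock:
            now = self._clock()
            slot = now if self._next_slot is None else max(now, self._next_slot)
            self._next_slot = slot + self._interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)


def scrape_many(
    urls: Iterable[str],
    fetch: Callable[[str], dict],
    limiter: MinIntervalLimiter,
    max_workers: int = 8,
) -> List[dict]:
    """Scrape URLs concurrently; results come back in input order."""

    def task(url: str) -> dict:
        limiter.wait()
        return fetch(url)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, urls))
```

Since each scrape takes a few seconds, a handful of workers is usually enough to keep the limiter, rather than the pool size, as the bottleneck.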
Is the Markdown output customizable?
The API returns the best extraction of the main content area. You cannot currently customize which elements are included or excluded. If you need specific data from a page, you can post-process the Markdown output.
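As an example of post-processing, the sketch below pulls link text and URLs out of the returned Markdown with a regular expression. It is a simplification: `extract_links` is a hypothetical helper, it skips images, and it does not handle URLs that contain parentheses or reference-style links.

```python
import re
from typing import List, Tuple

# Inline links like [text](url) or [text](url "title"); the lookbehind
# skips images, which start with "![".
LINK = re.compile(r'(?<!!)\[([^\]]+)\]\(([^)\s]+)(?:\s+"[^"]*")?\)')


def extract_links(markdown: str) -> List[Tuple[str, str]]:
    """Return (text, url) pairs for inline Markdown links."""
    return [(m.group(1), m.group(2)) for m in LINK.finditer(markdown)]
```

For anything beyond simple extraction, a full Markdown parser is likely a safer choice than regular expressions.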
What happens if the page has no content?
If the page loads but has no extractable content (like a blank page or a page that is entirely an embedded application), the API returns an empty Markdown string. The request is still charged because the page was successfully loaded and rendered.
Ready to start scraping? Get 100 free credits or read the scrape API docs for the full parameter reference.