Casual_blog/content/hacking/HowTo_crawl_website.md
2024-12-12 15:48:15 +03:00

+++
title = 'HowTo crawl website'
date = 2024-11-05
image = 'https://cdn.dribbble.com/users/722835/screenshots/6516126/spider800.gif'
+++

A crawler (or spider) collects all the links that a site contains or references. It isn't dirbusting: a crawler can't find hidden directories that nothing links to.

A crawler makes it easier to find hard-to-reach website functions or interesting links, such as URLs with parameters (example.com/get?promo=code).
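Once you have a list of crawled URLs, the parameterized ones are easy to pull out. Here is a minimal sketch, assuming one URL per line on stdin; the function names `urls_with_params` and `param_names` are made up for this post and are not part of any tool.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; both read one URL per line on stdin.

# Print only the URLs that carry at least one name=value query parameter.
urls_with_params() {
  grep -E '\?[^=&#]+=' || true   # "|| true": no match is not an error here
}

# List the distinct parameter names seen across all URLs.
param_names() {
  grep -oE '[?&][^=&#]+=' | tr -d '?&=' | sort -u
}
```

For example, `urls_with_params < crawl` prints only the links worth a closer look, and `param_names < crawl` gives a quick overview of which parameters the site uses.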

How to crawl

We will use two tools: katana and gau.

Katana

A fast and feature-rich crawler:

  • crawl a site with a single command: katana -u blog.ca.sual.in
  • crawl .js files for additional links (-jc -jsl)
  • use a headless browser (-hl) in case you get blocked
  • etc...
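The options above can be combined in a small wrapper. This is only a sketch: `crawl_js` is a name made up for this post, and the `-silent` flag (print results only) is my addition rather than something from the list above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: crawl a target with katana and also follow links
# found in JavaScript files. Extra katana flags are passed through as-is.
# Usage: crawl_js <target> [extra katana flags...]
crawl_js() {
  target=$1
  shift
  if ! command -v katana >/dev/null 2>&1; then
    echo "katana is not installed" >&2
    return 127
  fi
  # -jc/-jsl: parse .js files for links; -silent: output URLs only
  katana -u "$target" -jc -jsl -silent "$@"
}
```

If the site blocks the plain crawl, `crawl_js blog.ca.sual.in -hl` retries the same crawl through the headless browser.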

Gau

This one doesn't crawl the site from your computer. It pulls URLs that public internet crawlers and archives have already collected:

  • AlienVault's Open Threat Exchange
  • the Wayback Machine
  • Common Crawl
  • and URLScan

You can get crawl data in just about 3 seconds!
This data may be outdated or incomplete, but a site may remove the link to a page without removing the page itself, so it is still worth using.

gau blog.ca.sual.in
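To spot pages that the archives still remember but the live site no longer links to, you can compare the two URL lists. Here is a minimal sketch, assuming each list is saved to a file with one URL per line; `archived_only` and the file names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print URLs that appear in the archived list
# but not in the live crawl.
# Usage: archived_only <live_urls_file> <archived_urls_file>
archived_only() {
  live_sorted=$(mktemp) || return 1
  archive_sorted=$(mktemp) || { rm -f "$live_sorted"; return 1; }
  # comm needs sorted input; sort -u also removes duplicates
  sort -u "$1" > "$live_sorted"
  sort -u "$2" > "$archive_sorted"
  # -13 hides lines unique to file 1 and lines common to both,
  # leaving only the lines unique to the archived list
  comm -13 "$live_sorted" "$archive_sorted"
  rm -f "$live_sorted" "$archive_sorted"
}
```

With katana output saved as `live.txt` and gau output saved as `archive.txt`, `archived_only live.txt archive.txt` lists the candidates that may still exist even though nothing links to them anymore.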

Example

Let's make a small bash script that uses both tools:

gau blog.ca.sual.in >> crawl
katana -u blog.ca.sual.in >> crawl

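# Deduplicate: sort -u keeps one copy of each URL (uniq -u would drop every URL found by both tools)
sort -u crawl > crawl.tmp && mv crawl.tmp crawl # Pro tip: pipe through httpx here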
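The merge step can also be wrapped in a reusable function. This is a sketch under a few assumptions: `merge_urls` and the file names are made up, and each tool's output is saved to its own file first. It uses `sort -u` so that a URL found by both tools is kept once, and it writes to a temporary file so the output file can safely be one of the inputs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: merge URL lists into one sorted, de-duplicated file.
# Usage: merge_urls <output_file> <input_file>...
merge_urls() {
  out=$1
  shift
  tmp=$(mktemp) || return 1
  # Write to a temp file first: redirecting straight into a file that is
  # also being read would truncate it before sort sees its contents.
  sort -u "$@" > "$tmp" && mv "$tmp" "$out"
}
```

For example, after `gau blog.ca.sual.in > gau.txt` and `katana -u blog.ca.sual.in > katana.txt`, running `merge_urls crawl gau.txt katana.txt` produces the combined list.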