+++
title = 'HowTo crawl website'
date = 2024-11-05
image = 'https://cdn.dribbble.com/users/722835/screenshots/6516126/spider800.gif'
+++
A crawler (or spider) collects all the links a site contains and references. It is not dirbusting: a crawler cannot find hidden directories. What it does make easier is discovering hard-to-find site functions and interesting links, such as URL parameters (`example.com/get?promo=code`).
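At its core, a crawler fetches a page, extracts every link from the HTML, and repeats for each new link it finds. A minimal sketch of the extraction step, using a hypothetical saved page instead of a live fetch:

```shell
# A crawler's core step: pull the links out of a page.
# page.html is a made-up sample standing in for a fetched page.
cat > page.html <<'EOF'
<a href="/about">About</a>
<a href="/get?promo=code">Promo</a>
<a href="https://example.com/blog">Blog</a>
EOF

# grep -o prints only the matched parts; sed strips the href="..." wrapper
grep -o 'href="[^"]*"' page.html | sed 's/^href="//; s/"$//'
```

On the sample above this prints `/about`, `/get?promo=code`, and `https://example.com/blog`; a real crawler would then fetch each of those and repeat.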
## How to crawl

We will use two tools: katana and gau.
## Katana

A fast, feature-rich crawler:

- just crawl the site: `katana -u blog.ca.sual.in`
- crawl `.js` files for additional links (`-jc`, `-jsl`)
- use a headless browser in case you get blocked (`-hl`)
- etc.
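Why crawl `.js` files at all? Bundled JavaScript often contains API endpoints that no HTML page links to. Conceptually (a crude sketch, not katana's actual parser), it comes down to matching URL-like strings inside the JS:

```shell
# Hypothetical JS bundle containing an endpoint the HTML never links to.
cat > app.js <<'EOF'
fetch("/api/v1/users").then(r => r.json());
var logo = "https://cdn.example.com/logo.png";
EOF

# Match quoted absolute URLs or root-relative paths inside the JS.
grep -oE '"(https?://[^"]+|/[^"]+)"' app.js | tr -d '"'
```

This surfaces `/api/v1/users`, an endpoint you would never see by following `<a>` tags alone.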
## Gau

This one doesn't crawl the site from your machine; instead it pulls URLs collected by public internet crawlers:

- AlienVault's Open Threat Exchange
- the Wayback Machine
- Common Crawl
- and URLScan

You can get crawl data in just a few seconds! The data may be outdated or incomplete, but a site can remove the link to a page without removing the page itself, so these historical URLs are still worth checking.

```shell
gau blog.ca.sual.in
```
## Example

Let's write a small bash script that runs both tools and deduplicates the results:

```shell
#!/bin/bash
gau blog.ca.sual.in >> crawl
katana -u blog.ca.sual.in >> crawl
sort -u crawl -o crawl # Pro tip: insert httpx here
```
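One shell pitfall worth knowing with scripts like this: a pipeline such as `cat crawl | sort | uniq > crawl` empties `crawl` before `sort` ever reads it, because the shell opens the output redirection first. `sort -o` avoids this; POSIX documents that its output file may be the same as an input file, since sort reads all input before writing. A quick demo with dummy data:

```shell
# Build a file with duplicate lines (dummy data, not real crawl output).
printf '%s\n' b a b c a > urls

# sort -u deduplicates; -o lets the output safely be the input file.
sort -u urls -o urls
cat urls
```

After this, `urls` holds just `a`, `b`, `c` — every line kept once, nothing lost.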