
+++
title = 'HowTo crawl website'
date = 2024-11-05
image = 'https://cdn.dribbble.com/users/722835/screenshots/6516126/spider800.gif'
+++

A crawler (or spider) gets you all the links a site contains and references. It isn't dirbusting: you can't discover hidden directories with a crawler.

With a crawler you can more easily find hard-to-find website functions or interesting links (like URL parameters, e.g. example.com/get?promo=code).
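For instance, once you have a list of crawled URLs, the parameterized ones are easy to pull out. A minimal sketch (the `crawl` file below is a made-up sample, not real crawler output):

```shell
# Hypothetical crawler output; a real list would come from katana/gau
cat > crawl <<'EOF'
https://example.com/about
https://example.com/get?promo=code
https://example.com/search?q=test
EOF

# URLs with query parameters are usually the interesting ones to poke at
grep -F '?' crawl > params
cat params
```

This leaves only the two parameterized URLs in `params`.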

## How to crawl

We will use two tools: katana and gau.

### Katana

A fast and feature-rich crawler:

- you can just crawl a site: `katana -u blog.ca.sual.in`
- crawl `.js` files for additional links (`-jc -jsl`)
- use a headless browser in case you get blocked (`-hl`)
- etc...
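The flags above can be combined in one run. A sketch that only builds the command line so it runs anywhere (the flags are katana's documented ones; the target is the example blog from above):

```shell
target="blog.ca.sual.in"
# basic crawl plus mining .js files for extra endpoints;
# append -hl here if the site blocks plain requests
cmd="katana -u $target -jc -jsl"
echo "Would run: $cmd"
```

In a real run you would drop the `echo` and execute the command directly.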

### Gau

This one doesn't crawl the site from your computer; it uses data from public internet crawlers:

- AlienVault's Open Threat Exchange
- the Wayback Machine
- Common Crawl
- and URLScan

You can get crawl data in just 3 seconds!
This data may be outdated or incomplete, but a site often removes a link reference while leaving the actual page up, so archived URLs can reveal pages you'd never find by crawling. Use it.

```shell
gau blog.ca.sual.in
```

## Example

Let's write a small bash script that uses both tools:

```shell
#!/usr/bin/env bash
gau blog.ca.sual.in >> crawl
katana -u blog.ca.sual.in >> crawl

# sort -u deduplicates (uniq -u would instead DROP every URL found by
# both tools); write to a temp file first, because redirecting into the
# same file you're reading truncates it
sort -u crawl > crawl.tmp && mv crawl.tmp crawl  # Pro tip: insert httpx here
```
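The pro tip can be sketched like this, with httpx (projectdiscovery's HTTP prober) filtering the deduplicated list down to URLs that actually respond. The sample `crawl` file and the fallback branch are just there so the snippet runs even without httpx installed:

```shell
# Stand-in for the combined gau + katana output
cat > crawl <<'EOF'
https://example.com/a
https://example.com/b
https://example.com/a
EOF

# Deduplicate via a temp file so we don't clobber the file mid-read
sort -u crawl > crawl.tmp && mv crawl.tmp crawl

# Probe which URLs are alive; -l reads a list, -silent drops the banner.
# Fallback keeps the snippet runnable when httpx isn't available.
if command -v httpx >/dev/null 2>&1; then
  httpx -l crawl -silent > live
else
  cp crawl live
fi
```

`live` then holds only the URLs worth opening in a browser.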