Casual_blog/content/hacking/HowTo_crawl_website.md

+++
title = 'HowTo crawl website'
date = 2024-11-05
image = 'https://cdn.dribbble.com/users/722835/screenshots/6516126/spider800.gif'
+++

![](https://cdn.dribbble.com/users/722835/screenshots/6516126/spider800.gif)

A crawler (or spider) gets you all the links that a site contains and references. It isn't [dirbusting](/hidden/todo): you can't find hidden directories with a crawler, because nothing links to them.

With a crawler you can more easily find hard-to-find website functions or interesting links (like URLs with parameters, e.g. `example.com/get?promo=code`).
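Once you have a list of crawled URLs, picking out the ones with parameters is a one-liner. A minimal sketch, assuming the URLs arrive one per line on stdin; the function name `urls_with_params` and the sample URLs are made up for illustration:

```shell
# Hypothetical helper: keep only URLs that carry a query parameter
# (something like ?name=value), one copy of each.
urls_with_params() {
    grep -E '\?[^=&#]+=' | sort -u
}

# Sample data standing in for real crawler output
urls_with_params <<'EOF'
https://example.com/
https://example.com/get?promo=code
https://example.com/about
https://example.com/get?promo=code
https://example.com/search?q=test&page=2
EOF
```

With the sample input above, only the `get?promo=code` and `search?q=test&page=2` URLs are printed, each once.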

## How to crawl

We will use two tools: `katana` and `gau`.

### [Katana](https://github.com/projectdiscovery/katana)

A fast and feature-rich crawler:
 - crawl a site with a single command: `katana -u blog.ca.sual.in`
 - crawl .js files for additional links (`-jc -jsl`)
 - use a headless browser (in case you get blocked, `-hl`)
 - etc...
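The flags above can be combined in one run. A rough sketch, assuming katana is installed and on your `PATH`; the function name `crawl_site` and the output file name are made up for illustration, and `-d` (crawl depth) and `-o` (output file) are extra katana options not covered above:

```shell
# Hypothetical helper: crawl one target with JavaScript parsing enabled.
# Usage: crawl_site <url> <output-file>
crawl_site() {
    # -jc  : also crawl endpoints found in JavaScript files
    # -jsl : parse JavaScript with jsluice (uses more memory)
    # -d 3 : limit the crawl depth to 3
    # -o   : write the found URLs to a file
    katana -u "$1" -jc -jsl -d 3 -o "$2"
}

# Example call (commented out so nothing gets crawled by accident):
# crawl_site blog.ca.sual.in katana.txt
```

If you get blocked, adding `-hl` to the command is the first thing to try.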

### [Gau](https://github.com/lc/gau)

This one doesn't crawl the site from your computer. Instead, it uses data from public internet crawlers:
 - AlienVault's Open Threat Exchange
 - the Wayback Machine
 - Common Crawl
 - and URLScan

You can get crawl data in just a few seconds!
This data may be outdated or incomplete, but a site may remove the link to a page without removing the page itself, so it is still worth using.

`gau blog.ca.sual.in`
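Since archived data can include pages the site no longer links to, it is handy to see which URLs only gau knows about. A minimal sketch, assuming each tool's output was saved to its own file; the function name `only_in_first` and the file names `gau.txt` and `katana.txt` are hypothetical:

```shell
# Hypothetical helper: print URLs that appear in the first file
# but not in the second one.
# Usage: only_in_first <file-a> <file-b>
only_in_first() {
    a_sorted=$(mktemp)
    b_sorted=$(mktemp)
    sort -u "$1" > "$a_sorted"
    sort -u "$2" > "$b_sorted"
    # comm needs sorted input; -23 keeps lines unique to the first file
    comm -23 "$a_sorted" "$b_sorted"
    rm -f "$a_sorted" "$b_sorted"
}

# Example (commented out, needs both tools and network access):
# gau blog.ca.sual.in > gau.txt
# katana -u blog.ca.sual.in > katana.txt
# only_in_first gau.txt katana.txt
```

The comparison is an exact string match, so `http://` and `https://` versions of the same page count as different URLs.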

## Example

Let's make a small bash script that uses both tools:
```
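gau blog.ca.sual.in >> crawl
katana -u blog.ca.sual.in >> crawl

# Deduplicate via a temp file: redirecting straight back into `crawl` would
# empty it before it is read, and `uniq -u` would drop URLs found by both tools
sort -u crawl > crawl.tmp && mv crawl.tmp crawl # Pro tip: insert httpx after sort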
```
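
To reuse this for other targets, the same steps can be wrapped in a function that takes the domain as an argument. This is only a sketch, assuming both tools are installed; the name `crawl_all` and its arguments are made up for illustration:

```shell
# Hypothetical wrapper: collect URLs for one domain from both tools.
# Usage: crawl_all <domain> <output-file>
crawl_all() {
    tmp=$(mktemp)
    # Archived URLs first, then a live crawl; if one tool fails,
    # the results of the other one are still kept
    gau "$1" >> "$tmp" || echo "gau failed" >&2
    katana -u "$1" >> "$tmp" || echo "katana failed" >&2
    # Keep one copy of every URL
    sort -u "$tmp" > "$2"
    rm -f "$tmp"
}

# Example call (commented out, needs both tools and network access):
# crawl_all blog.ca.sual.in crawl
```

Collecting into a temporary file first means the output file is only written once, already deduplicated.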