+++
title = 'HowTo crawl website'
date = 2024-11-05
image = 'https://cdn.dribbble.com/users/722835/screenshots/6516126/spider800.gif'
+++

![](https://cdn.dribbble.com/users/722835/screenshots/6516126/spider800.gif)

A crawler (or spider) gets you all the links a site contains and references. It isn't [dirbusting](/hacking/howto_dirb): you can't find hidden directories with a crawler.

With a crawler you can more easily find hard-to-find website functions or interesting links (like URL parameters, e.g. `example.com/get?promo=code`) - a quick way to filter for those is sketched below.
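For example, once you have a list of crawled URLs saved to a file (the file name `crawl` here is just a placeholder; we build such a file in the example at the end), a one-liner like this keeps only the URLs that carry a query string:

```
# keep only URLs that contain a query string, e.g. .../get?promo=code
grep -E '\?.+=' crawl | sort -u > urls-with-params
```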
## How to crawl
We will use two tools: `katana` and `gau`.
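If you don't have them yet, both are Go tools; the install paths below are the ones given in their READMEs at the time of writing, so double-check there if they have moved:

```
go install github.com/projectdiscovery/katana/cmd/katana@latest
go install github.com/lc/gau/v2/cmd/gau@latest
```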
### [Katana](https://github.com/projectdiscovery/katana)
A fast and feature-rich crawler:

- you can simply crawl a site - `katana -u blog.ca.sual.in`
- crawl .js files for additional links (`-jc -jsl`)
- use a headless browser in case you get blocked (`-hl`)
- etc... (a combined run is sketched below)
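As a rough sketch of how those flags combine (the target and output file name are just placeholders; check `katana -h` on your version for the exact flag names):

```
# crawl the site, also parse JavaScript for extra endpoints, use a headless
# browser, and write every discovered URL to a file
katana -u blog.ca.sual.in -jc -jsl -hl -o katana.out
```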
### [Gau](https://github.com/lc/gau)
This one doesn't crawl the site from your machine; it pulls URLs already collected by public internet crawlers:

- AlienVault's Open Threat Exchange
- the Wayback Machine
- Common Crawl
- and URLScan

You can get crawl data in just 3 seconds!

This data may be out of date or incomplete, but sometimes a site removes only the link reference while the page itself is still live, so it's still worth using.

`gau blog.ca.sual.in`
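If memory serves, recent gau versions also let you choose which of those sources to query and whether to include subdomains - treat the flags below as an assumption and confirm with `gau --help`:

```
# assumed flags: --subs (include subdomains), --providers (choose sources)
gau --subs --providers wayback,commoncrawl,otx,urlscan blog.ca.sual.in
```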
## Example
Let's make a small bash script that uses both tools:

```
gau blog.ca.sual.in >> crawl
katana -u blog.ca.sual.in >> crawl

# dedupe in place (sort -o is safe to write back to its input file)
sort -u crawl -o crawl # Pro tip: insert httpx here (see below)
```
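One way that pro tip could look - this assumes ProjectDiscovery's httpx and its flag names at the time of writing, so verify with `httpx -h`:

```
# probe the deduplicated URLs and keep only the ones that actually respond
httpx -l crawl -silent -mc 200,301,302 -o live-urls
```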