From 3420b784bcc61e5fc1db3252e0acb84aee48cc7b Mon Sep 17 00:00:00 2001
From: casual
Date: Tue, 5 Nov 2024 22:44:12 +0300
Subject: [PATCH] add new post

---
 content/hacking/HowTo_crawl_website.md | 46 ++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)
 create mode 100644 content/hacking/HowTo_crawl_website.md

diff --git a/content/hacking/HowTo_crawl_website.md b/content/hacking/HowTo_crawl_website.md
new file mode 100644
index 0000000..98dd2a7
--- /dev/null
+++ b/content/hacking/HowTo_crawl_website.md
@@ -0,0 +1,46 @@
++++
+title = 'HowTo crawl website'
+date = 2024-11-05
+image = 'https://cdn.dribbble.com/users/722835/screenshots/6516126/spider800.gif'
++++
+
+![](https://cdn.dribbble.com/users/722835/screenshots/6516126/spider800.gif)
+
+A crawler (or spider) gets you every link a site contains and references. It isn't [dirbusting](/hidden/todo): a crawler won't find hidden directories.
+
+With a crawler you can more easily find hard-to-reach site functionality or interesting links (like URL parameters: `example.com/get?promo=code`).
+
+## How to crawl
+
+We will use two tools: `katana` and `gau`.
+
+### [Katana](https://github.com/projectdiscovery/katana)
+
+A fast, feature-rich crawler:
+ - just crawl a site: `katana -u blog.ca.sual.in`
+ - crawl `.js` files for additional links (`-jc -jsl`)
+ - use a headless browser in case you get blocked (`-hl`)
+ - etc. (a combined run is sketched at the end of this post)
+
+### [Gau](https://github.com/lc/gau)
+
+This one doesn't crawl the site from your machine; it pulls URLs already collected by public internet crawlers:
+ - AlienVault's Open Threat Exchange
+ - the Wayback Machine
+ - Common Crawl
+ - and URLScan
+
+You can get crawl data in just 3 seconds!
+The data may be outdated or incomplete, but a site can remove the link to a page without removing the page itself, so this data can surface URLs a live crawl would miss.
+
+`gau blog.ca.sual.in`
+
+## Example
+
+Let's make a small bash script that runs both tools and deduplicates the results:
+```
+gau blog.ca.sual.in > crawl        # overwrite results from previous runs
+katana -u blog.ca.sual.in >> crawl # append katana's findings
+
+# sort -u deduplicates; -o lets sort safely write back to its input file
+sort -u crawl -o crawl # Pro tip: insert httpx here
+```
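+
+Following the pro tip: a lot of the collected URLs will be dead (especially archived ones from gau), so it's worth probing them. A sketch using [httpx](https://github.com/projectdiscovery/httpx); filtering on status 200 with `-mc` is my choice, adjust it to the codes you care about:
+
+```
+# keep only URLs that still answer with HTTP 200
+cat crawl | httpx -silent -mc 200 -o alive
+```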
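+
+And the combined katana run mentioned above. A sketch: `-jc`, `-jsl` and `-hl` are the flags from the Katana section; `-d` (crawl depth) and `-o` (output file) are my reading of `katana -h`, so double-check them:
+
+```
+# parse JS, use a headless browser, crawl 3 levels deep, save to a file
+katana -u blog.ca.sual.in -jc -jsl -hl -d 3 -o katana_out
+```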
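+
+gau can also be told which public crawlers to query (assuming its `--providers` flag works as in the README, check `gau --help`):
+
+```
+# only ask the Wayback Machine and Common Crawl
+gau --providers wayback,commoncrawl blog.ca.sual.in
+```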