From 3420b784bcc61e5fc1db3252e0acb84aee48cc7b Mon Sep 17 00:00:00 2001
From: casual
Date: Tue, 5 Nov 2024 22:44:12 +0300
Subject: [PATCH] add new post

---
 content/hacking/HowTo_crawl_website.md | 46 ++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)
 create mode 100644 content/hacking/HowTo_crawl_website.md

diff --git a/content/hacking/HowTo_crawl_website.md b/content/hacking/HowTo_crawl_website.md
new file mode 100644
index 0000000..98dd2a7
--- /dev/null
+++ b/content/hacking/HowTo_crawl_website.md
@@ -0,0 +1,46 @@
++++
+title = 'HowTo crawl website'
+date = 2024-11-05
+image = 'https://cdn.dribbble.com/users/722835/screenshots/6516126/spider800.gif'
++++
+
+![](https://cdn.dribbble.com/users/722835/screenshots/6516126/spider800.gif)
+
+A crawler (or spider) gets you every link a site contains and references. It isn't [dirbusting](/hidden/todo): a crawler won't find hidden directories.
+
+With a crawler you can more easily find hard-to-reach site functionality or interesting links (like URL parameters: `example.com/get?promo=code`).
+
+## How to crawl
+
+We will use two tools: `katana` and `gau`.
+
+### [Katana](https://github.com/projectdiscovery/katana)
+
+A fast, feature-rich crawler:
+ - just crawl a site: `katana -u blog.ca.sual.in`
+ - crawl `.js` files for additional links (`-jc -jsl`)
+ - use a headless browser in case you get blocked (`-hl`)
+ - etc. (a combined run is sketched at the end of this post)
+
+### [Gau](https://github.com/lc/gau)
+
+This one doesn't crawl the site from your machine; it pulls URLs already collected by public internet crawlers:
+ - AlienVault's Open Threat Exchange
+ - the Wayback Machine
+ - Common Crawl
+ - and URLScan
+
+You can get crawl data in just 3 seconds!
+The data may be outdated or incomplete, but a site can remove the link to a page without removing the page itself, so this data can surface URLs a live crawl would miss.
+
+`gau blog.ca.sual.in`
+
+## Example
+
+Let's make a small bash script that runs both tools and deduplicates the results:
+```
+gau blog.ca.sual.in > crawl        # overwrite results from previous runs
+katana -u blog.ca.sual.in >> crawl # append katana's findings
+
+# sort -u deduplicates; -o lets sort safely write back to its input file
+sort -u crawl -o crawl # Pro tip: insert httpx here
+```
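+
+Following the pro tip: a lot of the collected URLs will be dead (especially archived ones from gau), so it's worth probing them. A sketch using [httpx](https://github.com/projectdiscovery/httpx); filtering on status 200 with `-mc` is my choice, adjust it to the codes you care about:
+
+```
+# keep only URLs that still answer with HTTP 200
+cat crawl | httpx -silent -mc 200 -o alive
+```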
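+
+And the combined katana run mentioned above. A sketch: `-jc`, `-jsl` and `-hl` are the flags from the Katana section; `-d` (crawl depth) and `-o` (output file) are my reading of `katana -h`, so double-check them:
+
+```
+# parse JS, use a headless browser, crawl 3 levels deep, save to a file
+katana -u blog.ca.sual.in -jc -jsl -hl -d 3 -o katana_out
+```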
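+
+gau can also be told which public crawlers to query (assuming its `--providers` flag works as in the README, check `gau --help`):
+
+```
+# only ask the Wayback Machine and Common Crawl
+gau --providers wayback,commoncrawl blog.ca.sual.in
+```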