Web Crawler

Year:

November 2023

GitHub:

This project is not open source.

Project Description:

This project involved building a Java-based web crawler that navigates a mock social networking site called Fakebook and extracts information from it. The crawler authenticated via HTTP POST, managed cookies to keep its session alive, and systematically followed links under the /fakebook/ path to collect five unique secret flags. It implemented HTTP/1.1 from scratch and handled response statuses such as redirects (302), forbidden pages (403), and service-unavailable errors (503). By tracking visited URLs, the crawler avoided loops and redundant requests, which kept the traversal efficient and in line with ethical crawling practices and the project's performance goals.
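
Because the project's source is not public, the following is only an illustrative sketch of the traversal logic described above, not the actual implementation. It assumes a breadth-first frontier, an in-memory set of seen paths, and a retry cap for 503 responses. The names CrawlSketch, Response, MAX_ATTEMPTS, and the fetch function are hypothetical; fetch stands in for the real HTTP layer so the loop can be exercised without a network.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class CrawlSketch {
    // Only paths under this prefix are followed (taken from the description).
    static final String SCOPE = "/fakebook/";
    // Assumption: give up on a page after this many 503 responses.
    static final int MAX_ATTEMPTS = 3;

    /** Hypothetical minimal view of a response: status, Location header, links in the body. */
    static final class Response {
        final int status;
        final String location;      // null unless the response is a redirect
        final List<String> links;   // links extracted from the body, possibly empty

        Response(int status, String location, List<String> links) {
            this.status = status;
            this.location = location;
            this.links = links;
        }
    }

    /** Crawls from start and returns every in-scope path that was requested. */
    static Set<String> crawl(String start, Function<String, Response> fetch) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new LinkedHashSet<>();       // every path ever queued
        Map<String, Integer> attempts = new HashMap<>(); // 503 count per path
        enqueue(start, frontier, seen);

        while (!frontier.isEmpty()) {
            String path = frontier.poll();
            Response r = fetch.apply(path);
            switch (r.status) {
                case 200: // success: queue the links found on the page
                    for (String link : r.links) {
                        enqueue(link, frontier, seen);
                    }
                    break;
                case 302: // redirect: follow the Location header if it is in scope
                    if (r.location != null) {
                        enqueue(r.location, frontier, seen);
                    }
                    break;
                case 503: { // temporary failure: retry the same path, up to the cap
                    int tries = attempts.merge(path, 1, Integer::sum);
                    if (tries < MAX_ATTEMPTS) {
                        frontier.add(path);
                    }
                    break;
                }
                default: // 403 and anything else: abandon the path
                    break;
            }
        }
        return seen;
    }

    /** Queues a path only if it is in scope and has not been queued before. */
    private static void enqueue(String path, Deque<String> frontier, Set<String> seen) {
        if (path.startsWith(SCOPE) && seen.add(path)) {
            frontier.add(path);
        }
    }
}
```

Recording a path in the seen set at the moment it is queued, rather than when it is fetched, is what prevents loops: a page that links back to an already queued page never re-enters the frontier. Only the 503 branch bypasses that check, and the retry cap keeps it from spinning forever.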

My Contributions:

I was responsible for the complete implementation of the web crawler, including crafting HTTP requests, parsing responses, managing cookies, and securing connections with TLS. I designed the system to traverse the site efficiently while tracking visited URLs and adapting to each HTTP response code. I also implemented error handling and session management to maintain continuity during the crawl, so that all five flags were located efficiently and securely.
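
To give a sense of what crafting requests and managing cookies by hand involves, here is a minimal sketch under stated assumptions; it is not the project's code. The class HttpSketch and its methods are hypothetical, as are the host, login path, and cookie names used in the usage note. The sketch only builds and parses text; the TLS socket that would carry it is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

class HttpSketch {
    /** Copies name=value pairs from Set-Cookie header lines into the jar, ignoring attributes such as Path. */
    static void storeCookies(List<String> headerLines, Map<String, String> jar) {
        for (String line : headerLines) {
            int colon = line.indexOf(':');
            // Header names are case-insensitive, so compare without regard to case.
            if (colon < 0 || !line.substring(0, colon).trim().equalsIgnoreCase("Set-Cookie")) {
                continue;
            }
            // Everything before the first ';' is the name=value pair.
            String pair = line.substring(colon + 1).split(";", 2)[0].trim();
            int eq = pair.indexOf('=');
            if (eq > 0) {
                jar.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
    }

    /** Joins the jar into the value of a Cookie request header. */
    static String cookieHeader(Map<String, String> jar) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : jar.entrySet()) {
            if (sb.length() > 0) {
                sb.append("; ");
            }
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    /** Builds a raw HTTP/1.1 request; body may be null for requests without one. */
    static String buildRequest(String method, String host, String path,
                               Map<String, String> jar, String body) {
        StringBuilder sb = new StringBuilder();
        sb.append(method).append(' ').append(path).append(" HTTP/1.1\r\n");
        sb.append("Host: ").append(host).append("\r\n"); // Host is mandatory in HTTP/1.1
        sb.append("Connection: close\r\n");              // assumption: one request per connection
        if (!jar.isEmpty()) {
            sb.append("Cookie: ").append(cookieHeader(jar)).append("\r\n");
        }
        if (body != null) {
            sb.append("Content-Type: application/x-www-form-urlencoded\r\n");
            // Content-Length counts bytes, not characters.
            sb.append("Content-Length: ")
              .append(body.getBytes(StandardCharsets.UTF_8).length)
              .append("\r\n");
        }
        sb.append("\r\n"); // blank line ends the header section
        if (body != null) {
            sb.append(body);
        }
        return sb.toString();
    }
}
```

In use, a login might call buildRequest("POST", "example.test", "/login", jar, "username=u&password=p"), pass the response's header lines to storeCookies, and then include the same jar in every later GET so the session carries over. All of those values are placeholders.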