REIbot-crawler

My friend mentioned over dinner recently that a fun project to build would be a bot that crawls all of the pages on REI's website and stores all of the links. Today, I decided to tackle it. Unfortunately, I think the actual tackling will take much longer than expected. I have a working bot; however, I think it may be getting hung up on certain pages. This will clearly be an iterative project that I keep coming back to over time.

The bulk of the code works, and so far I have 2,405 pages from REI. At least some of them contain product codes. I'm not sure whether I have exhausted all of the pages there are to find, but it was fun. In one of the first versions, I realized that I was merely moving along the categories and creating "new" links that were just categories appended to the current category. These ended up looking something like rei.com/c/running-shoes/c/camping-and-hiking/c/... This was clearly not OK. I fixed the code so that it stores each link as its base component, but still visits the full page.
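To make the fix above a bit more concrete, here is a minimal sketch of how a chained category link could be reduced to its base component. This is not the code from the repo; the function name toBaseCategoryLink and the SITE_ORIGIN constant are my own assumptions for illustration, and it only relies on Node's built-in URL class.

```javascript
// Hypothetical helper (not from the repo): collapse a chained category
// link such as /c/running-shoes/c/camping-and-hiking to /c/camping-and-hiking.
const SITE_ORIGIN = 'https://www.rei.com'; // assumed origin for resolving relative links

function toBaseCategoryLink(href) {
  // Resolve relative hrefs against the site root, then inspect only the path.
  const { pathname } = new URL(href, SITE_ORIGIN);
  const segments = pathname.split('/').filter(Boolean);

  // Find the last "c" marker; the segment after it is the category slug.
  const lastC = segments.lastIndexOf('c');
  if (lastC === -1 || lastC === segments.length - 1) {
    // Not a category link (or a malformed one): leave the path untouched.
    return SITE_ORIGIN + pathname;
  }
  return `${SITE_ORIGIN}/c/${segments[lastC + 1]}`;
}

console.log(toBaseCategoryLink('/c/running-shoes/c/camping-and-hiking'));
```

Storing the result of a helper like this, rather than the raw href, would keep the list of links from filling up with ever-longer variations of the same category page.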

I used Node.js for this project, along with cheerio and request. The code was not too challenging to write, except for the issues above. I have a few if checks in there to ensure that I don't re-parse pages. I'm still working out the best way to approach the challenge of getting all of the product links.
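The "don't re-parse pages" idea can be sketched roughly like this. It is a simplified, synchronous illustration rather than the actual bot: the crawl function, the getLinks callback, and the fakeSite object are all hypothetical stand-ins. In the real project, getLinks would presumably be the part that fetches a page with request and pulls the hrefs out with cheerio, and it would be asynchronous.

```javascript
// Sketch: breadth-first crawl that never parses the same URL twice.
// getLinks is a hypothetical callback standing in for "fetch page, extract hrefs".
function crawl(startUrl, getLinks, maxPages = 5000) {
  const visited = new Set(); // every URL that has already been parsed
  const queue = [startUrl];  // URLs still waiting to be parsed

  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift();
    if (visited.has(url)) continue; // already parsed, skip it
    visited.add(url);

    for (const link of getLinks(url)) {
      if (!visited.has(link)) queue.push(link);
    }
  }
  return [...visited];
}

// Tiny in-memory "site" so the sketch runs without any network access.
const fakeSite = {
  '/': ['/c/running-shoes', '/c/camping-and-hiking'],
  '/c/running-shoes': ['/', '/c/camping-and-hiking'],
  '/c/camping-and-hiking': ['/'],
};

const pages = crawl('/', (url) => fakeSite[url] || []);
console.log(pages);
```

A Set keeps the "have I seen this?" check cheap even with a few thousand pages, and the maxPages cap is just a safety net in case the bot wanders somewhere it never comes back from.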

The whole code is up on GitHub right here.
