
Node Crawler Examples

Node Crawler is a great open-source web scraping tool. However, a few questions about how to use it come up again and again. Let's hash them out:

Table of Contents

Use Proxy with Crawler

Most large-scale web scraping tasks require sending a large number of requests to a specific website. Doing this from a single IP address is risky, since the website could block that address temporarily or permanently. Instead, we can route requests through a proxy, which lets us access websites from multiple different IPs. Below is an example of how to use a proxy with Crawler:

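const Crawler = require("crawler");

// Global: every task queued on this instance goes through the proxy
const crawler = new Crawler({
    rateLimit: 1000, // wait at least 1000 ms between requests
    proxy: "http://proxy.example.com"
});

// Per task: only this request goes through the proxy
crawler.queue({
    uri: "http://www.example.com",
    proxy: "http://proxy.example.com"
});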
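To actually spread requests over several IPs, each task can be given a different proxy. The following is a minimal sketch of one way to do that, assuming a round-robin assignment is acceptable; the proxy URLs and the `buildTasks` helper are hypothetical and not part of Crawler:

```javascript
// Hypothetical proxy endpoints; replace them with proxies you actually control.
const proxies = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080"
];

// Build one task object per URI, assigning proxies in round-robin order.
function buildTasks(uris, proxyList) {
    return uris.map((uri, i) => ({
        uri: uri,
        proxy: proxyList[i % proxyList.length]
    }));
}

const tasks = buildTasks(
    [
        "http://www.example.com/a",
        "http://www.example.com/b",
        "http://www.example.com/c",
        "http://www.example.com/d"
    ],
    proxies
);

// The fourth task wraps around to the first proxy again.
console.log(tasks);
```

Each resulting object has the same shape as the per-task options shown above, so it should be possible to pass them to `crawler.queue(...)` one at a time.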

Download Images and Other Files

Some web scraping tasks involve downloading images or other file types, for example grabbing images to train image recognition algorithms. With Crawler, two settings do the trick: set encoding to null so the body is kept as a raw Buffer, and set jQuery to false so the response is not parsed as HTML. Both are set when queuing the task. Below is an example of downloading images with Crawler:

const Crawler = require("crawler");
const fs = require("fs");

const crawler = new Crawler({
    maxConnections: 10,
    // This will be called for each crawled page
    callback: function (error, res, done) {
        if (error) {
            console.log(error);
        } else {
            // res.body is a Buffer because encoding is null;
            // end() writes it and closes the file
            fs.createWriteStream(res.options.filename).end(res.body);
        }
        done();
    }
});

crawler.queue({
    uri: 'http://www.example.com/image.jpg',
    filename: 'myImage.jpg',
    encoding: null,
    jQuery: false
});
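When many files are downloaded, writing each `filename` by hand gets tedious. Here is a small sketch that derives the local file name from the URL path instead. The `filenameFromUri` and `downloadTask` helpers and the `download.bin` fallback are assumptions made for illustration, not part of Crawler:

```javascript
const path = require("path");

// Take the last path segment of the URL as the local file name.
// The query string is ignored; fall back to a default when the path has no file part.
function filenameFromUri(uri, fallback = "download.bin") {
    const name = path.posix.basename(new URL(uri).pathname);
    return name || fallback;
}

// Build a queue task with the binary-download settings described above.
function downloadTask(uri) {
    return {
        uri: uri,
        filename: filenameFromUri(uri),
        encoding: null, // keep the body as a raw Buffer
        jQuery: false   // do not parse the response as HTML
    };
}

console.log(downloadTask("http://www.example.com/images/cat.jpg?size=large"));
```

Note that this sketch does not decode percent-encoded names or guard against two URLs mapping to the same file name; both would need handling in a real crawl.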

Get Full Path Using jQuery Selector

Crawling deeper into a website means following the links embedded in its pages. However, many embedded links are relative and give only part of the path. To obtain the full URL, resolve the link against the URL of the requesting page with URL.resolve(requestUrl, href). Here is an example:

Say you request http://www.google.com/search and the response contains: <a href="/article/174143.html" class="transition" target="_blank">hello world</a>. The following code resolves the partial URL into a full one:

const URL = require('url');

// Inside the crawler callback, where `res` is the response and `$` is res.$
let requestUrl = res.request.uri.href;
let href = $('a.transition').attr('href');

// This gives you 'http://www.google.com/article/174143.html'
console.log(URL.resolve(requestUrl, href));
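The Node.js documentation marks `url.resolve` as legacy and recommends the WHATWG `URL` class instead. As a rough sketch of the same resolution with that API, using the built-in global `URL` (no `require` needed) and a hypothetical `toAbsolute` helper:

```javascript
// Resolve a possibly relative href against the URL of the page it was found on.
function toAbsolute(href, requestUrl) {
    return new URL(href, requestUrl).href;
}

const requestUrl = "http://www.google.com/search";

// A root-relative link replaces the whole path of the request URL.
console.log(toAbsolute("/article/174143.html", requestUrl));

// An absolute link is returned unchanged apart from normalization.
console.log(toAbsolute("http://www.example.com/page.html", requestUrl));
```

Unlike `url.resolve`, the `URL` constructor throws a `TypeError` when the result is not a valid URL, so a `try`/`catch` may be worth adding when the scraped `href` values are untrusted.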