Cheerio Crawler

全栈程序员-站长 • March 19, 2026, 9:10 AM • Uncategorized • 2 views

Before a crawler can fetch any data, it needs the address of the page to crawl:

url: http://www.hubwiz.com/course/bc20ce26f (http://xc.hubwiz.com/course/bc20ce26f)

Crawl target

The title of every chapter on the page, together with the titles of the sections inside each chapter.

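A small crawler

The script first requires http, one of the Node.js core modules, and defines the URL to crawl. It then sends a GET request to that URL through http.get and processes the returned data in the callback function.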
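The request step can be sketched as follows. This is a minimal sketch rather than the author's code: `collectBody` and `fetchHtml` are hypothetical helper names, and only the `data`, `end`, and `error` events are handled.

```javascript
const http = require('http');

// Hypothetical helper: gather the chunks of a readable stream into one string
// and hand the result to `done` once the stream ends.
function collectBody(stream, done) {
  let body = '';
  stream.on('data', (chunk) => {
    body += chunk; // Buffers are converted to strings on concatenation
  });
  stream.on('end', () => done(body));
}

// Hypothetical wrapper around http.get: calls back with (error, html).
function fetchHtml(url, callback) {
  http
    .get(url, (res) => {
      // Decode as UTF-8 so multi-byte characters survive chunk boundaries.
      res.setEncoding('utf8');
      collectBody(res, (html) => callback(null, html));
    })
    .on('error', (err) => callback(err));
}
```

Nothing is requested until `fetchHtml` is called, for example `fetchHtml(url, (err, html) => { ... })` with the course URL. Keeping the chunk collection in its own function means it can be exercised with any event emitter, without touching the network.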

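Scraping the data

cheerio's load method loads the HTML, after which we iterate over the .panel elements with map. Inside map we assemble the data in the shape we want, namely the chapterData object shown in the reference source code. We then iterate over the section li elements and push each section onto the chapterData.section array. Each assembled chapterData is pushed onto data, the empty array we created beforehand. Finally, the result is printed with console.log.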

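Processing the data

In most cases the raw scraped data is not quite what we ultimately want; for example, it may contain empty values or stray whitespace. That is why each call to text is followed by trim, which strips whitespace and line breaks from both ends of the string.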
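A quick illustration of what `trim` does and does not remove. The sample string is made up for the example, and the extra `replace` step is an optional addition, not something the original crawler does.

```javascript
// Made-up sample of what text() might return for an indented element.
const raw = '\n      1.1   Intro to Cheerio \n    ';

// trim removes leading and trailing whitespace, including line breaks.
const trimmed = raw.trim();
console.log(JSON.stringify(trimmed)); // "1.1   Intro to Cheerio"

// trim only touches the ends; inner runs of whitespace need a separate step.
const normalized = trimmed.replace(/\s+/g, ' ');
console.log(JSON.stringify(normalized)); // "1.1 Intro to Cheerio"
```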

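As for empty values: the output contains an entry with no content (a chapter object whose title is empty), and we remove it with the filter method.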
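The filtering step might look like this on its own. The `chapters` array is made-up data in the shape the crawler produces, with one empty entry included on purpose.

```javascript
// Made-up data in the shape produced by the crawler, including one empty entry.
const chapters = [
  { chaptersTitle: 'Chapter 1', section: ['1.1 Intro'] },
  { chaptersTitle: '', section: [] },
  { chaptersTitle: 'Chapter 2', section: ['2.1 Selectors'] },
];

// Keep only the entries whose title is a non-empty string.
const nonEmpty = chapters.filter((chapter) => Boolean(chapter.chaptersTitle));

console.log(nonEmpty.map((chapter) => chapter.chaptersTitle));
// [ 'Chapter 1', 'Chapter 2' ]
```

`filter` returns a new array and leaves `chapters` untouched, so the raw data stays available if it is needed later.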

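Outputting the data

The data array assembled in crawlerChapter is then formatted and printed.

printInfo takes a parameter named data, which is the value that crawlerChapter returns. It first calls filter on data to drop the empty entries, then builds a string for each chapter and its sections and prints it.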
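The output step can be sketched without any network access. `formatChapters` is a hypothetical variant of printInfo that returns the lines instead of printing them, and `sample` is made-up data.

```javascript
// Hypothetical variant of printInfo: returns the formatted lines rather than
// logging them, which makes the formatting easy to check.
function formatChapters(data) {
  const lines = [];
  data
    .filter((item) => Boolean(item.chaptersTitle)) // drop empty entries first
    .forEach((item) => {
      lines.push('【' + item.chaptersTitle + '】');
      item.section.forEach((section) => lines.push(' 【' + section + '】'));
    });
  return lines;
}

// Made-up input in the shape returned by crawlerChapter.
const sample = [
  { chaptersTitle: 'Chapter 1', section: ['1.1 Intro', '1.2 Setup'] },
  { chaptersTitle: '', section: [] },
];

console.log(formatChapters(sample).join('\n'));
```

Separating formatting from printing is a design choice for this sketch only; the reference code prints directly with console.log.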

Reference source code:

var http = require('http');
var cheerio = require('cheerio');

var url = 'http://www.hubwiz.com/course/a032cafddbe/';

// Request the page and accumulate the response body chunk by chunk.
http.get(url, function (res) {
    var html = '';
    // Decode chunks as UTF-8 text so multi-byte characters are not split.
    res.setEncoding('utf8');
    res.on('data', function (data) {
        html += data;
    });
    res.on('end', function () {
        var chapter = crawlerChapter(html);
        printInfo(chapter);
    });
}).on('error', function () {
    console.log('Failed to crawl the page');
});

// Build one { chaptersTitle, section } object per .panel element.
function crawlerChapter(html) {
    var $ = cheerio.load(html);
    var chapters = $('.panel');
    var data = [];
    chapters.map(function () {
        var chapter = $(this);
        var chapterTitle = chapter.find('h4').text().trim();
        var sections = chapter.find('li');
        var chapterData = {
            chaptersTitle: chapterTitle,
            section: []
        };
        sections.map(function () {
            var section = $(this).text().trim();
            chapterData.section.push(section);
        });
        data.push(chapterData);
    });
    return data;
}

// Drop entries with an empty chapter title, then print chapters and sections.
function printInfo(data) {
    data = data.filter(function (obj) {
        return Boolean(obj.chaptersTitle);
    });
    data.forEach(function (item) {
        var chapterTitle = item.chaptersTitle;
        console.log('【' + chapterTitle + '】');
        item.section.forEach(function (section) {
            console.log(' 【' + section + '】');
        });
    });
}

Published by 全栈程序员-站长. Please credit the source when reposting: https://javaforall.net/209468.html
Original link: https://javaforall.net
