Python: Scraping Data from HTML Pages

Software environment

macOS 10.13.1 (17B1003)
Python 2.7.10
VSCode 1.18.1

Summary

This article is a practice demo that uses Beautiful Soup to scrape data from a web page.

Introduction to Beautiful Soup

Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree.

Beautiful Soup official documentation (Chinese)

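Features

Simple: it is a toolkit that parses a document and hands you the data you want to extract.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.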
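
The encoding behaviour described above can be checked with a short sketch. It assumes Python 3 and bs4 (the article itself targets Python 2.7), and the GBK-encoded sample markup is made up for illustration. The encoding is passed explicitly through `from_encoding` so the result does not depend on auto-detection.

```python
from bs4 import BeautifulSoup

# Hypothetical input: a tiny document stored as GBK-encoded bytes.
raw = '<p>你好</p>'.encode('gbk')

# Tell Beautiful Soup the source encoding; internally everything becomes Unicode.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='gbk')
text = soup.p.string            # a Unicode string, regardless of the input encoding

# Encoding a tag back to bytes uses UTF-8 by default.
utf8_bytes = soup.p.encode('utf-8')
```

Here `text` is a Unicode string and `utf8_bytes` holds the same paragraph re-encoded as UTF-8.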

Installing Beautiful Soup

Install pip (if needed): sudo easy_install pip
Install Beautiful Soup: sudo pip install beautifulsoup4

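Example

Decide which data to scrape

This example fetches a list of projects. Open Chrome's developer tools and locate the corresponding element, as shown in the figure below:
Figure: locating the target element in Chrome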

Imports

import sys
import json
import urllib2 as HttpUtils
import urllib as UrlUtils
from bs4 import BeautifulSoup

Fetch the page (paginated)

def gethtml(page):
    'Fetch the HTML of the given page number.'
    url = 'https://box.xxx.com/Project/List'
    values = {
        'category': '',
        'rate': '',
        'range': '',
        'page': page
    }
    data = UrlUtils.urlencode(values)
    # Enable debug logging for HTTP and HTTPS requests
    httphandler = HttpUtils.HTTPHandler(debuglevel=1)
    httpshandler = HttpUtils.HTTPSHandler(debuglevel=1)
    opener = HttpUtils.build_opener(httphandler, httpshandler)
    HttpUtils.install_opener(opener)
    # Send the parameters in the query string, so this is a GET request
    request = HttpUtils.Request(url + '?' + data)
    request.get_method = lambda: 'GET'
    try:
        response = HttpUtils.urlopen(request, timeout=10)
    except HttpUtils.URLError as err:
        # HTTP errors carry a status code; lower-level errors carry a reason
        if hasattr(err, 'code'):
            print err.code
        if hasattr(err, 'reason'):
            print err.reason
        return None
    else:
        print '====== Http request OK ======'
    return response.read().decode('utf-8')

TIPS

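urlopen(url, data, timeout)
- url: the URL to request
- data: the data to send with the request
- timeout: the timeout, in seconds
HttpUtils.build_opener(httphandler, httpshandler)
- Builds an opener from handlers created with debuglevel=1, so the network request log is printed to the debug console, which makes debugging easier
Wrap the request in try/except so that network errors can be caught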
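
For readers on Python 3, where `urllib2` no longer exists, the same request can be sketched with `urllib.parse` and `urllib.request`. This is an assumption-laden port, not the article's code: `build_list_url` and `fetch_html` are hypothetical helper names, and the URL is the placeholder used in the article.

```python
from urllib.error import URLError
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = 'https://box.xxx.com/Project/List'  # placeholder URL from the article


def build_list_url(page):
    """Build the paginated list URL with the same query parameters as gethtml."""
    values = {'category': '', 'rate': '', 'range': '', 'page': page}
    return BASE_URL + '?' + urlencode(values)


def fetch_html(page, timeout=10):
    """Return the decoded page body, or None if the request fails."""
    request = Request(build_list_url(page))  # no data argument, so this is a GET
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.read().decode('utf-8')
    except URLError as err:
        # HTTPError (a URLError subclass) has .code; plain URLError has .reason
        print(getattr(err, 'code', None), getattr(err, 'reason', None))
        return None


list_url = build_list_url(2)
```

Building the URL in its own function keeps the query-string logic testable without touching the network.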

Parse the fetched data

Create the BeautifulSoup object

soup = BeautifulSoup(html, 'html.parser')

Get the object to iterate over

# items is a <listiterator object at 0x10a4b9950>, not a list, but it can still be looped over to visit every child node.
items = soup.find(attrs={'class': 'row'}).children
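
Because `.children` yields the whitespace text nodes between elements as well as the elements themselves, it helps to see what the iterator actually contains. The sketch below assumes Python 3 and bs4; the sample markup is invented and only mimics the `row` container from the article.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: a "row" container with two items separated by newlines.
SAMPLE_HTML = (
    '<div class="row">\n'
    '<div class="item">A</div>\n'
    '<div class="item">B</div>\n'
    '</div>'
)

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
row = soup.find(attrs={'class': 'row'})

all_children = list(row.children)                       # tags and '\n' strings
tag_children = [c for c in all_children if isinstance(c, Tag)]
labels = [c.string for c in tag_children]
```

Filtering with `isinstance(child, Tag)` is one alternative to comparing each child against `'\n'`, and it also skips whitespace nodes that are not exactly a single newline.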

Iterate over the child nodes and extract the required fields

projectList = []
for item in items:
    # Skip the newline text nodes between project elements
    if item == '\n': continue
    # Extract the required fields
    title = item.find(attrs={'class': 'title'}).string.strip()
    projectId = item.find(attrs={'class': 'subtitle'}).string.strip()
    projectType = item.find(attrs={'class': 'invest-item-subtitle'}).span.string
    percent = item.find(attrs={'class': 'percent'})
    state = 'Open'
    if percent is None:  # funding already completed
        percent = '100%'
        state = 'Finished'
        totalAmount = item.find(attrs={'class': 'project-info'}).span.string.strip()
        investedAmount = totalAmount
    else:
        percent = percent.string.strip()
        decimalList = item.find(attrs={'class': 'decimal-wrap'}).find_all(attrs={'class': 'decimal'})
        totalAmount = decimalList[0].string
        investedAmount = decimalList[1].string
    # An explicit status label, when present, overrides the derived state
    investState = item.find(attrs={'class': 'invest-item-type'})
    if investState is not None:
        state = investState.string
    profitSpan = item.find(attrs={'class': 'invest-item-rate'}).find(attrs={'class': 'invest-item-profit'})
    # The rate is split between the span's leading text node and its <em> child
    profit1 = profitSpan.next_element.strip()
    profit2 = profitSpan.em.string.strip()
    profit = profit1 + profit2
    term = item.find(attrs={'class': 'invest-item-maturity'}).find(attrs={'class': 'invest-item-profit'}).string.strip()
    project = {
        'title': title,
        'projectId': projectId,
        'type': projectType,
        'percent': percent,
        'totalAmount': totalAmount,
        'investedAmount': investedAmount,
        'profit': profit,
        'term': term,
        'state': state
    }
    projectList.append(project)
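
The loop above calls `.string.strip()` directly on each `find` result, which raises an `AttributeError` as soon as an element is missing or contains more than one child. One way to guard against that is a small helper, sketched here under the assumption of Python 3 and bs4. `text_of` is a hypothetical name and the sample markup is invented; only the class names come from the article.

```python
from bs4 import BeautifulSoup


def text_of(node, css_class, default=''):
    """Return the stripped text of the first descendant with css_class, or default."""
    found = node.find(attrs={'class': css_class})
    if found is None:
        return default
    return found.get_text(strip=True)


# Hypothetical item: it has a title and subtitle but no "percent" element.
SAMPLE_ITEM = '''
<div class="item">
  <div class="title"> Project A </div>
  <div class="subtitle">P-001</div>
</div>
'''

item = BeautifulSoup(SAMPLE_ITEM, 'html.parser').find(attrs={'class': 'item'})
title = text_of(item, 'title')
project_id = text_of(item, 'subtitle')
percent = text_of(item, 'percent', default='100%')  # missing element falls back
```

With a helper like this, the "funding completed" case could be expressed as a default value instead of a separate `None` check, though whether that suits the real page depends on its markup.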

The parsed output looks like this:

Figure: parsed output
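
The article imports `json` but does not show how the result was printed. One plausible way, sketched here for Python 3, is to dump `projectList` as indented JSON. The record below is invented for illustration and only reuses some of the article's field names.

```python
import json

# Hypothetical record; the keys follow the article's `project` dict.
projectList = [
    {'title': '示例项目', 'projectId': 'P-001', 'percent': '100%', 'state': 'Finished'},
]

# ensure_ascii=False keeps non-ASCII titles readable instead of \uXXXX escapes.
output = json.dumps(projectList, ensure_ascii=False, indent=2)
print(output)
```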

TIPS

Parsing the HTML relies mainly on BeautifulSoup's four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment. See the official Beautiful Soup documentation (Chinese) for details.
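
A quick way to get a feel for these four object types is to parse a one-line document and inspect what each node is. The sketch assumes Python 3 and bs4, and the markup is made up.

```python
from bs4 import BeautifulSoup

# Hypothetical markup containing a tag, a text node, and a comment.
soup = BeautifulSoup('<p class="title">Hello<!-- note --></p>', 'html.parser')
p = soup.p

kinds = [
    type(soup).__name__,           # the document as a whole
    type(p).__name__,              # an element
    type(p.contents[0]).__name__,  # the text inside the element
    type(p.contents[1]).__name__,  # the comment inside the element
]
css_classes = p['class']           # class is multi-valued, so this is a list
```

`Comment` is a subclass of `NavigableString`, so code that treats every string child as visible text may want to check for comments first.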

References:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
http://www.jianshu.com/p/972c95610fdc

Published by: 全栈程序员-用户IM. Please credit the source when reposting: https://javaforall.cn/193774.html Original link: https://javaforall.cn
