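PyQuery is a powerful, flexible HTML parsing library. If you find regular expressions tedious to write and BeautifulSoup's syntax hard to remember, but you already know jQuery's syntax, PyQuery is an excellent choice.
1. Initialization
1.1 Initializing from a string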
html = """ <div> <ul> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0"><a href="link3.html"><span class="blod">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fouth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> """
from pyquery import PyQuery as pq
doc = pq(html)  # doc is an initialized PyQuery object
print(doc('li'))  # takes CSS selectors, so doc('ul .item-0') also works
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0"><a href="link3.html"><span class="blod">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fouth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
1.2 Initializing from a URL
from pyquery import PyQuery as pq
doc = pq(url='http://www.baidu.com')
print(doc('head'))
<head><meta http-equiv="content-type" content="text/html;charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta content="always" name="referrer"/><link rel="stylesheet" type="text/css" href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css"/><title>百度一下，你就知道</title></head>
1.3 Initializing from a file
from pyquery import PyQuery as pq
doc = pq(filename='demo.html')  # a file in the program's directory, or give its full path
print(doc('li'))
2. Basic CSS selectors
html = """ <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0"><a href="link3.html"><span class="blod">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fouth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> """
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('#container .list li'))  # nested selectors are separated by spaces
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0"><a href="link3.html"><span class="blod">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fouth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
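3. Finding elements
3.1 Child elements
Use the children method to find only direct child nodes; the find method returns every matching node among all descendants.
Both return a PyQuery object.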
html = """ <div> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0"><a href="link3.html"><span class="blod">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fouth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> """
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
print(type(items))
print(items)
lis = items.find('li')  # both are PyQuery objects, so call find() on the object
print(type(lis))
print(lis)
liss = items.children('.active')  # keep only the direct children whose class includes active
print(liss)
<class 'pyquery.pyquery.PyQuery'>
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0"><a href="link3.html"><span class="blod">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fouth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0"><a href="link3.html"><span class="blod">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fouth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="item-1 active"><a href="link4.html">fouth item</a></li>
3.2 Parent elements
parent() returns the direct parent node; parents() returns all ancestor nodes.
html = """ <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0"><a href="link3.html"><span class="blod">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fouth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> """
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')  # first select the element (a PyQuery object)
container = items.parent()  # then call parent() on it to get its parent element
# parents = items.parents()  # ancestor nodes
print(type(container))
print(container)
<class 'pyquery.pyquery.PyQuery'>
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0"><a href="link3.html"><span class="blod">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fouth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
3.3 Sibling elements
siblings() returns the sibling nodes.
html = """ <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0"><a href="link3.html"><span class="blod">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fouth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> """
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.list .item-1.active')  # no space between item-1 and active: both classes are on the same element
print(li.siblings())
<li class="item-0"><a href="link3.html"><span class="blod">third item</span></a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0">first item</li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
3.4 Iteration
When a selection matches several nodes, call the items method and loop over the result.
# use .items() with a for loop to iterate over several elements
html = """ <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0"><a href="link3.html"><span class="blod">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fouth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> """
from pyquery import PyQuery as pq
doc = pq(html)
lis = doc('li').items()
print(type(lis))
for li in lis:
    print(li)
<class 'generator'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0"><a href="link3.html"><span class="blod">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fouth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
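4. Extracting information
4.1 Attributes, text, and HTML
If the selection contains several nodes, attr and html return a value for the first node only, so you need to loop with items(). text does not need a loop: it returns the text of all matched nodes as a single string, separated by spaces.
# get an attribute value, the text, and the HTML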
html = """ <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0"><a href="link3.html"><span class="blod">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fouth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> """
from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-1.active a')
print(a)
print(a.attr('href'))  # equivalent to a.attr.<attribute name>
# print(a.attr.href)
print(a.text())  # get the text content
li = doc('.item-1.active')
print(li)
print(li.html())  # get the inner HTML
<a href="link4.html">fouth item</a>
link4.html
fouth item
<li class="item-1 active"><a href="link4.html">fouth item</a></li>
<a href="link4.html">fouth item</a>
5. DOM manipulation
5.1 addClass, removeClass
Add a class to, or remove a class from, the class attribute.
html = """ <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0"><a href="link3.html"><span class="blod">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fouth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> """
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-1.active')
print(li)
print(li.removeClass('active'))
print(li.addClass('active'))
<li class="item-1 active"><a href="link4.html">fouth item</a></li>
<li class="item-1"><a href="link4.html">fouth item</a></li>
<li class="item-1 active"><a href="link4.html">fouth item</a></li>
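5.2 Modifying attr and css
You can set attr, css, text, and html.
attr(name, value): called with two arguments, attr sets the attribute; with one argument, it returns the attribute's value. text and html return the current value when called without arguments and assign a new one when given an argument.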
html = """ <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0"><a href="link3.html"><span class="blod">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fouth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> """
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-1.active')
print(li)
print(li.attr('name', 'link'))  # add an attribute: name="link"
print(li.css('font-size', '14px'))  # add a CSS property: style="font-size: 14px"
# li.text('changed item')  # change the text content
# li.html('<span>changed item</span>')  # change the inner HTML
<li class="item-1 active"><a href="link4.html">fouth item</a></li>
<li class="item-1 active" name="link"><a href="link4.html">fouth item</a></li>
<li class="item-1 active" name="link" style="font-size: 14px"><a href="link4.html">fouth item</a></li>
5.3 The remove() method
With remove you can extract just one part of a tag's text instead of all of it.
# to get only "Hello World!", use remove to drop the p tag
html = """ <div class="wrap"> Hello World! <p>First Cell</p> </div> """
from pyquery import PyQuery as pq
doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())
wrap.find('p').remove()  # find the p element inside wrap and remove it
print(wrap.text())  # with the p node gone, text() returns only Hello World!
Hello World! First Cell
Hello World!
5.4 Other DOM methods
http://pyquery.readthedocs.io/en/latest/api.html
6. Pseudo-class selectors (CSS3)
html = """ <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0"><a href="link3.html"><span class="blod">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fouth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> """
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('li:first-child')  # the first li
print(li)
li = doc('li:last-child')  # the last li
print(li)
li = doc('li:nth-child(2)')  # the 2nd li (nth-child counts from 1)
print(li)
li = doc('li:gt(2)')  # li elements with an index greater than 2 (index counts from 0)
print(li)
li = doc('li:nth-child(2n)')  # even-numbered li elements
# li = doc('li:nth-child(2n+1)')  # odd-numbered li elements
print(li)
li = doc('li:contains(second)')  # li elements containing the text "second"
print(li)
print(li)
<li class="item-0">first item</li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-1 active"><a href="link4.html">fouth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-1 active"><a href="link4.html">fouth item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
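7. Summary
Initialization
- From a string: doc = pq(html)
- From a URL: doc = pq(url='http://www.baidu.com')
- From a file: doc = pq(filename='xxx.html')
Selectors
Selectors include basic CSS selectors and pseudo-class selectors.
CSS selectors:
These work as they do in CSS stylesheets, matching on class, id, and similar attributes.
doc = pq(html)
doc('#container .list li')
Pseudo-class selectors:
li = doc('li:first-child')  # the first li
li = doc('li:last-child')  # the last li
li = doc('li:nth-child(2)')  # the 2nd li (nth-child counts from 1)
li = doc('li:gt(2)')  # li elements with an index greater than 2 (index counts from 0)
li = doc('li:nth-child(2n)')  # even-numbered li elements
# li = doc('li:nth-child(2n+1)')  # odd-numbered li elements
li = doc('li:contains(second)')  # li elements containing the text "second"
Finding elements
# descendant elements (find method)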
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')  # select the tag with class list, then find the li tags inside it
print(items.find('li'))
# parent element and ancestor nodes
print(items.parent())  # parent element
print(items.parents())  # ancestor nodes
# sibling elements
print(items.siblings())
# iteration (items method)
lis = doc('li').items()
for li in lis:
    print(li)
Extracting information
# attributes (attr)
doc = pq(html)
a = doc('#container .list a')
a.attr('href')
# a.attr.href (attribute name)
# text
a.text()
# html
a.html()
DOM manipulation
# addClass, removeClass
doc = pq(html)
li = doc('.item-1.active')
print(li.addClass('active'))  # add the class active
print(li.removeClass('active'))  # remove the class active
# modifying attr and css
li.attr('name', 'link')  # add an attribute: name="link"
li.css('font-size', '14px')  # add a CSS property: style="font-size: 14px"
# remove method
# remove a tag
li.find('p').remove()  # remove the p tags inside the li tag
Published by 全栈程序员-用户IM. When reposting, please credit the source: https://javaforall.cn/144938.html Original link: https://javaforall.cn