XPath Study Notes

Author: YXN-python | Published: 2024-05-15

XPath Syntax

Using XPath

pip install lxml

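Key Points

Path operators: / selects direct children; // selects descendants at any depth.
Indexing starts at 1: [1] selects the first node.
Axes: ancestor, following, and the other axes locate nodes by their relationship to the current node.
Flexible filtering: predicates can combine attribute, position, and text conditions.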

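Node Selection

##### 1. Select all nodes
# //* matches every element node in the document
//*

##### 2. Select by tag name
# //tag selects every node with the given tag
//head  # all <head> nodes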

Hierarchy

Direct children: //div/a matches <a> elements that are direct children of a <div>.
Descendants: //body//a matches <a> elements at any depth below <body>.
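The difference between / and // can be sketched as follows, assuming lxml is installed. The markup in `doc` is a made-up sample, not part of the original notes.

```python
from lxml import etree

# Hypothetical sample markup: one <a> directly under <div>, one nested in <p>.
doc = """
<html><body>
  <div><a href="a.html">A</a><p><a href="b.html">B</a></p></div>
</body></html>
"""
html = etree.HTML(doc)

direct = html.xpath('//div/a/@href')   # only <a> that are children of <div>
nested = html.xpath('//div//a/@href')  # <a> at any depth below <div>
print(direct)
print(nested)
```

Here `direct` holds only `a.html`, while `nested` also picks up the link inside `<p>`.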

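Attribute Selection

##### 1. Filter by attribute
# [@attr="value"] filters nodes by attribute value
//a[@href="image1.html"]  # <a> whose href equals the given value

##### 2. Extract attribute values
# @attr returns the attribute value of a node
//a/@href  # href of every <a>

##### 3. Multi-valued attributes
# contains(@attr, "value") matches nodes whose attribute contains the substring (e.g. multiple classes)
//a[contains(@class, "li")]  # <a> whose class contains "li"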
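Because contains() is a plain substring test, it also matches class names that merely include the text. The sketch below, on a made-up document, compares it with a common whole-token idiom built from concat() and normalize-space(). The idiom is an addition for illustration and does not come from the original notes.

```python
from lxml import etree

# Hypothetical sample markup.
doc = """
<div>
  <a class="li li-first" href="image1.html">one</a>
  <a class="link" href="image2.html">two</a>
  <a class="item" href="image3.html">three</a>
</div>
"""
html = etree.HTML(doc)

exact = html.xpath('//a[@href="image1.html"]/text()')     # filter by attribute value
hrefs = html.xpath('//a/@href')                           # extract attribute values
loose = html.xpath('//a[contains(@class, "li")]/text()')  # substring match, also hits "link"

# Pad the class list with spaces so "li" has to be a whole class name.
strict = html.xpath(
    '//a[contains(concat(" ", normalize-space(@class), " "), " li ")]/text()'
)
print(loose, strict)
```

Whether the loose or the strict form is appropriate depends on how the target page names its classes.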

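Selecting the Parent Node

##### Shorthand
# /.. selects the parent node
a = html.xpath('//a/..')  # direct parent of each <a>

##### Axis syntax
# parent::tag restricts the parent to a given tag
a = html.xpath('//a/parent::div')  # parent of <a>, if it is a <div>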

Filtering by Position

######### 1. Filter by position (indexing starts at 1)
# Select the n-th one
//tr[n]

# The last one
//tr[last()]  

# The first n
//tr[position() <= n]

# The last n
//tr[position() >= last() - n + 1]

# The third from last
//tr[last()-2]

######### 2. Select a range
# From the m-th to the n-th
//tr[position() >= m and position() <= n]

######### 3. Select even or odd positions
# Even positions
//tr[position() mod 2 = 0]

# Odd positions
//tr[position() mod 2 = 1]

######### 4. Exclude positions
# Skip the first n
//tr[position() > n]

# Skip the last n
//tr[position() <= last() - n]

######### 5. Range function (XPath 2.0+)
# With XPath 2.0 or later, subsequence(seq, start, length) selects a range.
# The third argument is a length, so the m-th through n-th items are:
subsequence(//li, m, n - m + 1)
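The position predicates above can be tried on a small generated table. This is a sketch with made-up data; `cells` is a hypothetical helper, and the Python slice at the end stands in for subsequence(), which lxml (XPath 1.0) does not provide.

```python
from lxml import etree

# Hypothetical table with six rows whose cells read 1 to 6.
doc = "<table>" + "".join(f"<tr><td>{i}</td></tr>" for i in range(1, 7)) + "</table>"
html = etree.HTML(doc)

def cells(expr):
    """Return the cell text of every row matched by expr."""
    return html.xpath(expr + "/td/text()")

m, n = 2, 4
second = cells("//tr[2]")
last_two = cells("//tr[position() >= last() - 2 + 1]")
middle = cells(f"//tr[position() >= {m} and position() <= {n}]")
even = cells("//tr[position() mod 2 = 0]")

# XPath 1.0 has no subsequence(); slicing the result list gives the same m..n range.
sliced = [row.findtext("td") for row in html.xpath("//tr")[m - 1:n]]
print(second, last_two, middle, even, sliced)
```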

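Axes

##### 1. Ancestors
# ancestor::tag selects ancestor nodes
//a/ancestor::div  # every <div> ancestor of <a>

##### 2. Descendants
# descendant::tag selects descendant nodes
//a[6]/descendant::*  # every descendant element of the 6th <a>

##### 3. Following nodes
# following::tag selects everything after the node in document order
//a[1]/following::*  # every element after the first <a>

##### 4. Following siblings
# following-sibling::tag selects later siblings
//a[1]/following-sibling::a  # every sibling <a> after the first <a>

##### 5. Attribute axis
# attribute::* returns all attributes
//a[1]/attribute::*  # all attributes of the first <a>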
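A short sketch of the five axes on a made-up document, assuming lxml. The element ids are hypothetical and exist only to make the results easy to read.

```python
from lxml import etree

# Hypothetical sample markup with nested <div> elements.
doc = """
<div id="outer">
  <div id="inner">
    <a id="a1" href="1.html"><b>bold</b></a>
    <a id="a2" href="2.html">two</a>
    <span>tail</span>
  </div>
  <p>after</p>
</div>
"""
html = etree.HTML(doc)
start = '//a[@id="a1"]'

ancestors = html.xpath(start + '/ancestor::div/@id')                    # both enclosing <div>
descendants = [e.tag for e in html.xpath(start + '/descendant::*')]     # elements inside a1
following = [e.tag for e in html.xpath(start + '/following::*')]        # elements after a1 ends
siblings = html.xpath(start + '/following-sibling::a/@id')              # later sibling <a>
attributes = html.xpath(start + '/attribute::*')                        # attribute values of a1
print(ancestors, descendants, following, siblings, attributes)
```

Note that following:: reaches past the enclosing `<div>` to the `<p>`, while following-sibling:: stays at the same level.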

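Wildcards and Unions

##### Wildcards

# *: matches any element node
//div/*  # all direct child elements of <div>

# @*: matches any attribute
//a[@*] # <a> with at least one attribute

# node(): matches a node of any type (element, text, comment, ...)
//div/node()  # all child nodes of <div>, including text

##### Union operator
# | merges the results of several paths
a = html.xpath('//div/a | //span/a')  # <a> under <div> as well as <a> under <span>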
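The following sketch contrasts *, node(), @* and the union operator on a made-up fragment, assuming lxml with its default HTML parser settings (which keep comments).

```python
from lxml import etree

# Hypothetical fragment: text, an element, a comment, and a nested element.
doc = '<div>text<a href="1.html">a</a><!-- note --><span><a>b</a></span></div>'
html = etree.HTML(doc)

children = [e.tag for e in html.xpath('//div/*')]  # child elements only
nodes = html.xpath('//div/node()')                 # text, <a>, comment, <span>
with_attr = html.xpath('//a[@*]/text()')           # <a> carrying any attribute
union = [e.text for e in html.xpath('//div/a | //span/a')]
print(children, len(nodes), with_attr, union)
```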

Combining Conditions

##### 1. Logical and / or: combine conditions with and, or
//a[contains(@class, "li") or @name="items"]
//a[contains(@class, "li") and @name="items"]

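Complex Conditions and Nesting

##### Filtering across levels
a = html.xpath('//div[contains(@class, "container")]//a[text()="详情"]')  
# Inside a <div> whose class contains "container", find the <a> whose text is "详情" ("Details")

##### Functions combined with axes
a = html.xpath('//a[contains(@href, "image")]/ancestor::div[1]')  
# The nearest <div> ancestor of each <a> whose href contains "image"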
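A combined sketch of and/or predicates, cross-level filtering, and ancestor::div[1], using made-up markup with English link text in place of the original sample text.

```python
from lxml import etree

# Hypothetical sample markup with two nested <div> elements.
doc = """
<div class="container">
  <div class="inner">
    <a class="li" name="items" href="image1.html">first</a>
    <a class="li" href="image2.html">second</a>
    <a name="items" href="other.html">third</a>
  </div>
</div>
"""
html = etree.HTML(doc)

either = html.xpath('//a[contains(@class, "li") or @name="items"]/text()')
both = html.xpath('//a[contains(@class, "li") and @name="items"]/text()')
nested = html.xpath('//div[contains(@class, "container")]//a[text()="second"]/@href')

# [1] on the ancestor axis counts outward from the node, so it is the nearest <div>.
nearest = html.xpath('//a[contains(@href, "image")]/ancestor::div[1]/@class')
print(either, both, nested, nearest)
```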

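Dynamic Paths, Fuzzy Matching, and Variables

##### Dynamic attribute names
# Build the attribute name from a variable (uses a host-language feature, here a Python f-string):
attr = "href"
a = html.xpath(f'//a[@{attr}="image1.html"]')

##### Partial attribute matching
# Links whose `href` contains `product_`
a = html.xpath('//a[contains(@href, "product_")]/@href')

# Elements whose `id` starts with `user-`
a = html.xpath('//div[starts-with(@id, "user-")]/text()')

##### Regular expressions (lxml EXSLT extension)
# Use `re:test()`; the EXSLT namespace must be passed in
ns = {"re": "http://exslt.org/regular-expressions"}
a = html.xpath(r'//a[re:test(@href, "^/product/\d+$")]', namespaces=ns)

##### Parameterized queries
# lxml accepts XPath variables as keyword arguments to xpath();
# other libraries may not support this
a = html.xpath('//a[@id=$id]', id="link1")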
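Both the EXSLT regex test and XPath variables can be exercised in lxml as sketched below. The markup is a made-up sample, and the raw string keeps `\d` from being treated as a Python escape sequence.

```python
from lxml import etree

# Hypothetical sample markup.
doc = """
<div>
  <a id="link1" href="/product/42">p42</a>
  <a id="link2" href="/product/abc">bad</a>
  <a id="link3" href="/about">about</a>
</div>
"""
html = etree.HTML(doc)

# EXSLT regular expressions: bind the "re" prefix to its namespace.
ns = {"re": "http://exslt.org/regular-expressions"}
numeric = html.xpath(r'//a[re:test(@href, "^/product/\d+$")]/@id', namespaces=ns)

# XPath variable: the value is passed separately instead of being pasted into the string.
by_id = html.xpath('//a[@id=$id]/@href', id="link1")
print(numeric, by_id)
```

Passing values as variables avoids quoting problems that can arise when user input is formatted directly into the expression.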

Precompiled XPath

# Precompile paths that are reused to avoid re-parsing the expression:
find_links = etree.XPath('//a[@class="link"]')
a = find_links(html)
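A minimal sketch of reusing one compiled expression across several documents. The `pages` list is made-up sample data.

```python
from lxml import etree

# Compile once, then call the compiled object on each parsed document.
find_links = etree.XPath('//a[@class="link"]/@href')

# Hypothetical pages to scan.
pages = [
    '<p><a class="link" href="a.html">a</a></p>',
    '<p><a class="link" href="b.html">b</a><a href="c.html">c</a></p>',
]
found = [find_links(etree.HTML(page)) for page in pages]
print(found)
```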

Example:

from lxml import etree

# doc is assumed to hold the HTML source as a string
html = etree.HTML(doc)
a = html.xpath('//*')  # all element nodes
a = html.xpath('//head')  # the head node; the result is always a list

a = html.xpath('//div/a')  # <a> that are direct children of a <div> (div > a)
a = html.xpath('//body//a')  # <a> at any depth below <body>
a = html.xpath('//body//a[@href="image1.html"]/..')  # parent of the matching <a>
a = html.xpath('//body//a[1]/..')  # parent of each <a> that is the first <a> among its siblings
a = html.xpath('//body//a[1]/parent::*')  # same, using the parent axis
a = html.xpath('//body//a[1]/parent::div')  # same, but only if the parent is a <div>

a = html.xpath('//body//a[@href="image1.html"]')  # attribute match on <a>
a = html.xpath('//body//a[@href="image1.html"]/text()')  # text of that <a>, e.g. ['Name: My image 1 ']
a = html.xpath('//body//a/@href')  # href of every <a>
a = html.xpath('//body//a/@id')  # id of every <a>
a = html.xpath('//body//a[1]/@id')  # indexing starts at 1, not 0
a = html.xpath('//body//a[@class="li"]')  # exact match; misses <a> with several classes, use contains() instead
a = html.xpath('//body//a[@name="items"]')  # <a> with name="items"
a = html.xpath('//body//a[contains(@class,"li")]')  # <a> whose class contains "li"
a = html.xpath('//body//a[contains(@class, "li")]/text()')  # text of those <a>
a = html.xpath('//body//a[contains(@class, "li") or @name="items"]/text()')  # class contains "li" or name="items"
a = html.xpath('//body//a[contains(@class,"li") and @name="items"]/text()')  # class contains "li" and name="items"
a = html.xpath('//a[2]/text()')  # text of each <a> that is the second <a> among its siblings
a = html.xpath('//a[3]/@href')  # href of the third <a>
a = html.xpath('//a[last()]/@href')  # href of the last <a>
a = html.xpath('//a[position()<3]/@href')  # href of the first two <a>
a = html.xpath('//a[last()-2]/@href')  # href of the third <a> from the end
a = html.xpath('//a/ancestor::*')  # all ancestors of every <a>
a = html.xpath('//a/ancestor::div')  # only the <div> ancestors

a = html.xpath('//a[1]/attribute::*')  # all attribute values of the first <a>
a = html.xpath('//a[1]/attribute::href')  # href of the first <a>
a = html.xpath('//a[1]/child::*')  # child elements of the first <a>
a = html.xpath('//a[6]/descendant::*')  # descendant elements of the sixth <a>

a = html.xpath('//a[1]/following::*')  # every element after the first <a>
a = html.xpath('//a[1]/following::*[1]/@href')  # href of the first element after the first <a>

a = html.xpath('//a[1]/following-sibling::*')  # all later siblings of the first <a>
a = html.xpath('//a[1]/following-sibling::a')  # later sibling <a> only
a = html.xpath('//a[1]/following-sibling::*[2]')  # the second later sibling
a = html.xpath('//a[1]/following-sibling::*[2]/@href')  # href of the second later sibling

print(a)

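1. Locating by the element's own attributes or text

//td[text()='Data Import']

//div[contains(@class,'cux-rightArrowIcon-on')]

//a[text()='马上注册']

//input[@type='radio' and @value='1']   # multiple conditions

//span[@name='bruce'][text()='bruce1'][1]   # multiple conditions as chained predicates

 //span[@id='bruce1' or text()='bruce2']   # matches spans satisfying either condition

 //span[text()='bruce1' and text()='bruce2']    # matches only a span that has both text nodes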
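The last two expressions behave quite differently, which the sketch below illustrates on made-up markup. Comparing text() with a string is true when any text node of the element equals it, so the `and` form only matches a span that holds both text nodes (here they are separated by a `<br>`).

```python
from lxml import etree

# Hypothetical sample markup; the third span has two separate text nodes.
doc = """
<div>
  <span id="bruce1">first</span>
  <span>bruce2</span>
  <span>bruce1<br>bruce2</span>
</div>
"""
html = etree.HTML(doc)

either = html.xpath("//span[@id='bruce1' or text()='bruce2']")
both = html.xpath("//span[text()='bruce1' and text()='bruce2']")
print(len(either), len(both))
```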

2. Locating via the parent node

//div[@class='x-grid-col-name x-grid-cell-inner']/div

//div[@id='dynamicGridTestInstanceformclearuxformdiv']/div

//div[@id='test']/input

3. Locating via a child node

//div[div[@id='navigation']]

//div[div[@name='listType']]

//div[p[@name='testname']]

4. Mixed

//div[div[@name='listType']]//img

//td[a//font[contains(text(),'seleleium2从零开始 视屏')]]//input[@type='checkbox']   # checkbox inputs inside a <td> whose link contains a <font> with the given text

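5. Advanced

 //input[@id='123']/following-sibling::input  # later sibling inputs (append [1] for only the next one)

 //input[@id='123']/preceding-sibling::span  # earlier sibling spans (append [1] for only the nearest one)

 //input[starts-with(@id,'123')]  # id starts with the given prefix

 //span[not(contains(text(),'xpath'))]  # spans whose text does not contain "xpath"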
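A sketch of the sibling axes and not(contains()) on a made-up form. It shows that the sibling axes return every matching sibling unless a position predicate such as [1] is appended.

```python
from lxml import etree

# Hypothetical sample markup.
doc = """
<form>
  <span>label xpath</span>
  <span>other</span>
  <input id="123"><input id="456"><input id="789">
</form>
"""
html = etree.HTML(doc)
start = "//input[@id='123']"

later = html.xpath(start + "/following-sibling::input/@id")         # all later inputs
next_one = html.xpath(start + "/following-sibling::input[1]/@id")   # only the next input
earlier = html.xpath(start + "/preceding-sibling::span/text()")     # all earlier spans
nearest = html.xpath(start + "/preceding-sibling::span[1]/text()")  # closest earlier span
without = html.xpath("//span[not(contains(text(), 'xpath'))]/text()")
print(later, next_one, earlier, nearest, without)
```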

6. Indexing

//div/input[2]

//div[@id='position']/span[3]

//div[@id='position']/span[position()=3]

//div[@id='position']/span[position()>3]

//div[@id='position']/span[position()<3]

//div[@id='position']/span[last()]

//div[@id='position']/span[last()-1]
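One point worth checking when indexing: a predicate such as [1] counts among siblings under each parent, not across the whole document. A sketch with two made-up lists; parenthesizing the path applies the index to the combined result instead.

```python
from lxml import etree

# Hypothetical sample markup: two lists with two items each.
doc = "<ul><li>a</li><li>b</li></ul><ul><li>c</li><li>d</li></ul>"
html = etree.HTML(doc)

first_per_parent = html.xpath("//li[1]/text()")      # first <li> of every <ul>
first_overall = html.xpath("(//li)[1]/text()")       # first <li> in the document
last_per_parent = html.xpath("//li[last()]/text()")  # last <li> of every <ul>
last_overall = html.xpath("(//li)[last()]/text()")   # last <li> in the document
print(first_per_parent, first_overall, last_per_parent, last_overall)
```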

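7. Matching with substring

<div data-for="result" id="swfEveryCookieWrap"></div>

//*[substring(@id,4,5)='Every']/@id  # 5 characters of the attribute starting at position 4 (positions are 1-based)

//*[substring(@id,4)='EveryCookieWrap']  # from position 4 to the end of the attribute

//*[substring-before(@id,'C')='swfEvery']/@id  # the part of the attribute before the first 'C'

//*[substring-after(@id,'C')='ookieWrap']/@id  # the part of the attribute after the first 'C'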
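The substring functions can also be evaluated directly to see what each one returns for the sample `<div>` above. A sketch assuming lxml:

```python
from lxml import etree

doc = '<div data-for="result" id="swfEveryCookieWrap"></div>'
html = etree.HTML(doc)

# Evaluating a function at the top level returns a plain string.
middle = html.xpath("substring(//div/@id, 4, 5)")
tail = html.xpath("substring(//div/@id, 4)")
before = html.xpath("substring-before(//div/@id, 'C')")
after = html.xpath("substring-after(//div/@id, 'C')")

# The same function used as a filter.
matched = html.xpath("//*[substring(@id, 4, 5)='Every']/@id")
print(middle, tail, before, after, matched)
```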

8. The * wildcard

//span[@*='bruce']

//*[@name='bruce']

9. Axes

//div[span[text()='+++current node']]/parent::div  # the parent node

//div[span[text()='+++current node']]/ancestor::div  # the ancestor nodes

10. Descendant nodes

//div[span[text()='current note']]/descendant::div/span[text()='123']

//div[span[text()='current note']]//div/span[text()='123'] #  equivalent to the expression above

11. following and preceding

//span[@class="fk fk_cur"]/../following::a   # every <a> after the parent in document order

//span[@class="fk fk_cur"]/../preceding::a[1]  # the nearest <a> before the parent

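Extracting text across multiple tags with XPath

Suppose there are a hundred HTML fragments like the one below and the inner tags vary from one to the next. Which XPath expression extracts the text with the least effort?

Take the following markup: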

<div id="test3">
	我左青龙，
	<span id="tiger">
		右白虎，
        <ul>上朱雀，
            <li>下玄武。</li>
        </ul>
        老牛在当中，
	</span>
	龙头在胸口。
</div>

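Use XPath's string(.):

# selector is assumed to be a Scrapy/parsel Selector
data = selector.xpath('//div[@id="test3"]')
info = data.xpath('string(.)').extract()[0]

This extracts the whole sentence "我左青龙，右白虎，上朱雀，下玄武。老牛在当中，龙头在胸口。", together with the whitespace between the tags, and assigns it to info.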
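The same idea can be sketched with lxml alone, in case no Scrapy selector is at hand. The whitespace clean-up with `split()` is an extra step added here and is not part of the original snippet.

```python
from lxml import etree

doc = """
<div id="test3">
    我左青龙，
    <span id="tiger">
        右白虎，
        <ul>上朱雀，
            <li>下玄武。</li>
        </ul>
        老牛在当中，
    </span>
    龙头在胸口。
</div>
"""
html = etree.HTML(doc)
div = html.xpath('//div[@id="test3"]')[0]

raw = div.xpath("string(.)")   # all descendant text, including indentation and newlines
info = "".join(raw.split())    # drop the whitespace between the fragments
print(info)
```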

Fuzzy matching with contains

Many web frameworks generate element IDs dynamically, so the ID differs each time the same page is loaded, which complicates automated testing.

<div class="eleWrapper" title="请输入用户名">
	<input type="text" class="textfield" name="ID9sLJQnkQyLGLhYShhlJ6gPzHLgvhpKpLzp2Tyh4hyb1b4pnvzxFR!-166749344!1357374592067" id="nt1357374592068"  />
</div>

The fix is to use XPath's partial matching: //input[contains(@id,'nt')]

XML used for testing

<Root>
    <Person ID="1001">
        <Name lang="zh-cn">张城斌</Name>
        <Email xmlns="www.quicklearn.cn">cbcye@live.com</Email>
        <Blog>http://cbcye.cnblogs.com</Blog>
    </Person>
    <Person ID="1002">
        <Name lang="en">Gary Zhang</Name>
        <Email xmlns="www.quicklearn.cn">GaryZhang@cbcye.com</Email>
        <Blog>http://www.quicklearn.cn</Blog>
    </Person>
</Root>

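1. Select all Person nodes whose Blog value contains the string cn

XPath expression:

/Root//Person[contains(Blog,'cn')]

2. Select all Person nodes whose Blog value contains the string cn and whose ID attribute contains 01

XPath expression: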

/Root//Person[contains(Blog,'cn') and contains(@ID,'01')]
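Both queries can be run against the test XML with lxml, as sketched below. The final query on `Email` is an addition: those elements declare a default namespace, so they are only reachable through a prefix, and the prefix `q` is an arbitrary choice.

```python
from lxml import etree

xml = """<Root>
    <Person ID="1001">
        <Name lang="zh-cn">张城斌</Name>
        <Email xmlns="www.quicklearn.cn">cbcye@live.com</Email>
        <Blog>http://cbcye.cnblogs.com</Blog>
    </Person>
    <Person ID="1002">
        <Name lang="en">Gary Zhang</Name>
        <Email xmlns="www.quicklearn.cn">GaryZhang@cbcye.com</Email>
        <Blog>http://www.quicklearn.cn</Blog>
    </Person>
</Root>"""
root = etree.fromstring(xml)

with_cn = root.xpath("/Root//Person[contains(Blog,'cn')]/@ID")
with_cn_and_01 = root.xpath("/Root//Person[contains(Blog,'cn') and contains(@ID,'01')]/@ID")

# Email lives in a namespace, so an unprefixed //Email finds nothing.
plain = root.xpath("//Email")
emails = root.xpath("//q:Email/text()", namespaces={"q": "www.quicklearn.cn"})
print(with_cn, with_cn_and_01, emails)
```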

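Built-in Functions

1. String functions

string()

Converts a value of any type to a string; for a node-set, this is the string value of the first node

string(//div[@id="title"])  // returns the text content of that div

concat()

Concatenates several strings

concat(//span[@class="first"], "-", //span[@class="last"])   // yields something like "John-Doe"

substring()

Extracts a substring

Syntax: substring(string, start, length)

substring("Hello World", 7, 5)  // returns "World"

contains()

Tests whether a string contains a substring

//div[contains(@class, "active")]  // selects divs whose class contains "active"

starts-with()

Tests whether a string starts with a substring

//a[starts-with(@href, "https://")]  // selects links whose href starts with https://

normalize-space()

Strips leading and trailing whitespace and collapses runs of whitespace into a single space

normalize-space(//p)  // cleans up the text of the first <p>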
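In lxml, these string functions can be evaluated as top-level expressions, which return Python strings or booleans. A sketch on made-up markup:

```python
from lxml import etree

# Hypothetical sample markup.
doc = (
    '<div id="title">  Hello   World  </div>'
    '<span class="first">John</span><span class="last">Doe</span>'
)
html = etree.HTML(doc)

title = html.xpath('normalize-space(//div[@id="title"])')
full_name = html.xpath('concat(//span[@class="first"], "-", //span[@class="last"])')
world = html.xpath('substring("Hello World", 7, 5)')
is_https = html.xpath('starts-with("https://example.com", "https://")')
has_oh = html.xpath('contains(string(//span[@class="first"]), "oh")')
print(title, full_name, world, is_https, has_oh)
```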

2. Numeric functions

sum()

Sums the numeric values of a node-set

sum(//item/price)  // total of all price elements

floor(), ceiling(), round()

Round a number to an integer

round(3.6)    // 4
floor(3.6)    // 3
ceiling(3.2)  // 4

3. Boolean functions

not()

Negates a boolean value

//input[not(@disabled)]  // selects inputs that are not disabled

boolean()

Converts a value to a boolean

boolean(//error)  // returns true if an error node exists

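4. Node-set functions

position()

Returns the position of the node within the current node-set (starting at 1)

//li[position() = 1]  // selects the first li element

last()

Returns the position of the last node in the node-set

//li[last()]          // selects the last li element
//li[position() = last() - 1]  // selects the second-to-last li

count()

Counts the nodes in a node-set

count(//div[@class='item']) // number of <div> elements whose class is item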
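The numeric, boolean, and counting functions return Python floats and booleans when evaluated through lxml. A sketch with a made-up price list:

```python
from lxml import etree

# Hypothetical sample data.
xml = (
    "<items>"
    "<item><price>1.5</price></item>"
    "<item><price>2.5</price></item>"
    "<item><price>3</price></item>"
    "</items>"
)
root = etree.fromstring(xml)

total = root.xpath("sum(//item/price)")    # numbers come back as float
how_many = root.xpath("count(//item)")
rounded = root.xpath("round(3.6)")
floored = root.xpath("floor(3.6)")
ceiled = root.xpath("ceiling(3.2)")
has_error = root.xpath("boolean(//error)")  # no <error> element in this sample
print(total, how_many, rounded, floored, ceiled, has_error)
```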

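5. Advanced functions (XPath 2.0+)

Note: some libraries (such as Python's lxml) support only XPath 1.0, so the functions below may be unavailable.

tokenize()

Splits a string on a regular expression

tokenize("2023-10-01", "-")  // returns ("2023", "10", "01")

matches()

Regular expression matching

//div[matches(text(), "\d+")]  // selects divs whose text contains digits

upper-case(), lower-case()

Convert case

upper-case("Hello")  // "HELLO"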
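Where only XPath 1.0 is available, one possible workaround is to do this part of the work in the host language. The sketch below assumes lxml and made-up markup; the Python fallbacks are suggestions rather than something the original notes prescribe.

```python
import re
from lxml import etree

# Hypothetical sample markup.
doc = "<div>order 42</div><div>no digits</div><div>Hello</div>"
html = etree.HTML(doc)

# tokenize("2023-10-01", "-"): split the string in Python instead.
parts = re.split(r"-", "2023-10-01")

# matches(text(), "\d+"): filter the selected elements with re.search.
with_digits = [d.text for d in html.xpath("//div") if re.search(r"\d+", d.text or "")]

# Alternatively, lxml's EXSLT regex extension keeps the test inside the expression.
ns = {"re": "http://exslt.org/regular-expressions"}
same = html.xpath(r'//div[re:test(string(.), "\d+")]/text()', namespaces=ns)

# upper-case("Hello"): call str.upper on the extracted text.
upper = html.xpath("string(//div[3])").upper()
print(parts, with_digits, same, upper)
```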

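6. Practical tips

Combining functions

//div[contains(normalize-space(@class), "main")]  
// normalizes the whitespace in class, then tests whether it contains "main"

Dynamic paths

//book[author = //author[name='J.K. Rowling']]/title  
// titles of books whose author is "J.K. Rowling", selected with a nested path

Caveats

Version compatibility: XPath 1.0 and 2.0 differ considerably in the functions they support, so check what the runtime provides.
Non-ASCII characters: expressions such as contains(text(), "Café") depend on the document and the query using a consistent encoding.
Performance: complex functions can slow queries down; keep expressions as simple as possible.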
