w3lib: A Utility Library for Web Development

Author: YXN-python | Published: 2025-02-24

w3lib is a utility library for web development. It provides functions for working with URLs, HTML, and related web data.

Official documentation: https://w3lib.readthedocs.io/en/latest/

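1. The w3lib.url module

This module handles URLs, including canonicalization, encoding, and decoding.

Common functions:

canonicalize_url: canonicalizes a URL.
url_query_cleaner: cleans the query parameters of a URL.
add_or_replace_parameter: adds a parameter to a URL, or replaces it if it already exists.
safe_url_string: converts a URL into a safely encoded string.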

from w3lib.url import canonicalize_url, url_query_cleaner, add_or_replace_parameter, safe_url_string

# Canonicalize a URL. The URL is converted to a single standard form, which is
# useful when comparing URLs or keeping stored and processed URLs consistent.
url1 = 'http://www.example.com/do?c=3&b=5&b=2&a=50'
url2 = 'http://www.example.com/do?a=50&b=2&b=5&c=3'
print(canonicalize_url(url1))
print(canonicalize_url(url2))
# Both print: http://www.example.com/do?a=50&b=2&b=5&c=3

# Clean the query string: keep only the parameters named in the list, preserving their order
print(url_query_cleaner("https://www.example.com?foo=bar&baz=123&name=test", ["foo", "baz"]))
# Output: https://www.example.com?foo=bar&baz=123

# Add or replace a parameter in a URL
print(add_or_replace_parameter("https://www.example.com?foo=bar", "foo", "new_value"))
# Output: https://www.example.com?foo=new_value

# Convert a URL into a safely encoded string (non-ASCII characters are percent-encoded)
print(safe_url_string("https://www.example.com/测试"))
# Output: https://www.example.com/%E6%B5%8B%E8%AF%95
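
To make the idea behind canonicalization more concrete, the sketch below approximates one part of it, sorting the query parameters, using only the Python standard library. The helpers sort_query and unique_urls are hypothetical names introduced for illustration and are not part of w3lib. The real canonicalize_url does more than this (for example, it also normalizes percent-encoding), so treat this as a rough illustration rather than a replacement.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def sort_query(url: str) -> str:
    """Return url with its query parameters sorted by name, then by value.

    Hypothetical helper: a simplified stand-in for one step of
    w3lib's canonicalize_url, not its actual implementation.
    """
    parts = urlsplit(url)
    # keep_blank_values=True so parameters like "a=" are not silently dropped
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


def unique_urls(urls):
    """Keep the first URL seen for each sorted-query form (hypothetical helper)."""
    seen = set()
    result = []
    for url in urls:
        key = sort_query(url)
        if key not in seen:
            seen.add(key)
            result.append(url)
    return result


urls = [
    "http://www.example.com/do?c=3&b=5&b=2&a=50",
    "http://www.example.com/do?a=50&b=2&b=5&c=3",  # same parameters, different order
    "http://www.example.com/do?a=51",
]
print(sort_query(urls[0]))
print(unique_urls(urls))
```

The first two URLs differ only in parameter order, so they share one sorted form and only the first is kept. This is the typical reason to canonicalize URLs before deduplicating them, for instance in a crawler's "already visited" set.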

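2. The w3lib.html module

This module handles HTML content.

Common functions:

remove_tags: removes HTML tags.
replace_entities: replaces HTML entities (such as &nbsp;) with the characters they represent.
get_base_url: extracts the base URL from an HTML document.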

from w3lib.html import remove_tags, replace_entities, get_base_url

# Remove HTML tags
html_content = "<p>Hello <b>World</b></p>"
print(remove_tags(html_content))  # Output: Hello World

# Replace HTML entities
text = "Price: &nbsp;100 &lt; 200"
print(replace_entities(text))  # Output: Price:  100 < 200 (&nbsp; becomes a non-breaking space, U+00A0)

# Extract the base URL
html_with_base = '<html><head><base href="https://www.example.com"></head></html>'
print(get_base_url(html_with_base))  # Output: https://www.example.com
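
Tag removal and entity replacement are often combined to turn an HTML fragment into plain text, and the order matters. The sketch below shows that pipeline using only the standard library: a regular expression stands in for remove_tags and html.unescape stands in for replace_entities. The names strip_tags and clean_text are hypothetical, and the regular expression is a crude approximation that will not handle every HTML edge case.

```python
import html
import re

# Crude tag matcher: "<", then anything that is not ">", then ">".
_TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(fragment: str) -> str:
    """Remove anything that looks like an HTML tag (hypothetical helper)."""
    return _TAG_RE.sub("", fragment)


def clean_text(fragment: str) -> str:
    """Turn an HTML fragment into plain text with normalized whitespace."""
    # Strip tags first: once "&lt;" is decoded to "<", it could be
    # mistaken for the start of a tag.
    no_tags = strip_tags(fragment)
    decoded = html.unescape(no_tags)
    # Turn non-breaking spaces into ordinary spaces, then collapse runs of whitespace.
    return " ".join(decoded.replace("\u00a0", " ").split())


print(clean_text("<p>Price:&nbsp;100 &lt; <b>200</b></p>"))
```

Decoding entities before stripping tags would turn "&lt;" into a literal "<", which a tag-stripping pass could then misread as markup. The same ordering consideration applies when chaining remove_tags and replace_entities.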
