When writing web crawlers you inevitably run into all kinds of problems. Here I'd like to share one concerning a parsing exception raised by etree.HTML.
1. Problem description
In a typical crawling workflow, you fetch a page's HTML with requests.get(), parse its structure with etree.HTML from the lxml library, and then extract the content you need via XPath.
My crawler code can be abstracted roughly as follows:
res = requests.get(url)
html = etree.HTML(res.text)
contents = html.xpath('//div/xxxx')
Running it produced the following error:
Traceback (most recent call last):
  File "xxxxxxxx.py", line 157, in <module>
    get_website_title_content(url)
  File "xxxxxxxx.py", line 141, in get_website_title_content
    html = etree.HTML(html_text)
  File "src\lxml\etree.pyx", line 3170, in lxml.etree.HTML
  File "src\lxml\parser.pxi", line 1872, in lxml.etree._parseMemoryDocument
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
The key line is: ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
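The error is easy to reproduce without any network access. A minimal sketch (the markup below is made up for illustration; the real page content came from requests):

```python
from lxml import etree

# A str (Unicode) that starts with an encoding declaration,
# just like res.text for many pages.
text = '<?xml version="1.0" encoding="utf-8"?><html><body><div>hi</div></body></html>'

try:
    etree.HTML(text)      # str input + encoding declaration
    raised = False
except ValueError as e:
    raised = True
    print(e)              # "Unicode strings with encoding declaration are not supported..."

# The same markup as bytes parses without complaint.
html = etree.HTML(text.encode('utf-8'))
print(html.xpath('//div/text()'))
```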
2. Solution
Some digging showed that the root cause lies in the difference between res.text and res.content on the requests response object. Looking at how text and content are defined in the requests source (shown below), res.text returns a str (Unicode), while res.content returns bytes.
@property
def content(self):
    """Content of the response, in bytes."""

    if self._content is False:
        # Read the contents.
        if self._content_consumed:
            raise RuntimeError(
                'The content for this response was already consumed')

        if self.status_code == 0 or self.raw is None:
            self._content = None
        else:
            self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''

    self._content_consumed = True
    # don't need to release the connection; that's been handled by urllib3
    # since we exhausted the data.
    return self._content

@property
def text(self):
    """Content of the response, in unicode.

    If Response.encoding is None, encoding will be guessed using
    ``chardet``.

    The encoding of the response content is determined based solely on HTTP
    headers, following RFC 2616 to the letter. If you can take advantage of
    non-HTTP knowledge to make a better guess at the encoding, you should
    set ``r.encoding`` appropriately before accessing this property.
    """

    # Try charset from content-type
    content = None
    encoding = self.encoding

    if not self.content:
        return str('')

    # Fallback to auto-detected encoding.
    if self.encoding is None:
        encoding = self.apparent_encoding

    # Decode unicode from given encoding.
    try:
        content = str(self.content, encoding, errors='replace')
    except (LookupError, TypeError):
        # A LookupError is raised if the encoding was not found which could
        # indicate a misspelling or similar mistake.
        #
        # A TypeError can be raised if encoding is None
        #
        # So we try blindly encoding.
        content = str(self.content, errors='replace')

    return content
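The type difference can be checked without a network call. The sketch below builds a Response object by hand purely for illustration (setting the private _content attribute directly, which real code never needs to do, since the object normally comes back from requests.get()):

```python
import requests

# Hand-built Response, for illustration only: in real code this object
# is returned by requests.get(url).
res = requests.models.Response()
res.status_code = 200
res.encoding = 'utf-8'
res._content = '<?xml version="1.0" encoding="utf-8"?><html></html>'.encode('utf-8')

print(type(res.text))     # decoded Unicode -> str
print(type(res.content))  # raw body       -> bytes
```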
In other words, the error occurs because the etree parser does not accept a Unicode string that carries an encoding declaration.
The fix is therefore simple. The first option is to pass res.content directly:
res = requests.get(url)
html = etree.HTML(res.content)
contents = html.xpath('//div/xxxx')
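Passing bytes also lets the parser detect the page's own declared encoding, which matters for non-UTF-8 pages. A small sketch, with gb2312 chosen purely as an illustrative charset:

```python
from lxml import etree

# Bytes input lets libxml2 read the charset declared in the document
# itself (here via a meta tag), instead of relying on a caller-side decode.
raw = ('<html><head>'
       '<meta http-equiv="Content-Type" content="text/html; charset=gb2312">'
       '</head><body><div>全栈</div></body></html>').encode('gb2312')

html = etree.HTML(raw)
print(html.xpath('//div/text()'))
```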
The second option is to convert the Unicode string to bytes yourself:
res = requests.get(url)
html_text = bytes(bytearray(res.text, encoding='utf-8'))
html = etree.HTML(html_text)
contents = html.xpath('//div/xxxx')
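Note that bytes(bytearray(res.text, encoding='utf-8')) is just a roundabout spelling of res.text.encode('utf-8'); the plain encode call is the more idiomatic choice, and the two produce identical bytes:

```python
s = '<?xml version="1.0" encoding="utf-8"?><html></html>'

# Both conversions yield the same bytes object.
a = bytes(bytearray(s, encoding='utf-8'))
b = s.encode('utf-8')
print(a == b)  # True
```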
Publisher: 全栈程序员-站长. Please credit the source when reposting: https://javaforall.net/221530.html Original link: https://javaforall.net
