Getting Started with lxml.etree

For the complete API reference, see http://lxml.de/api/index.html.

The lxml.etree module is usually imported like this:

from lxml import etree

The Element class

root = etree.Element("root")
print(root.tag)  # root

One way to add a child element is the append method:

root.append( etree.Element("child1") )

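A more efficient way is the SubElement factory function, used as follows:

child2 = etree.SubElement(root, "child2")
child3 = etree.SubElement(root, "child3")

This function creates the element and adds it to its parent in a single step.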
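
As a small sketch of the two approaches side by side, the snippet below builds the same three-child tree once with append and once with SubElement. The helper names build_with_append and build_with_subelement are hypothetical and exist only for this illustration.

```python
from lxml import etree

def build_with_append(names):
    """Create each child first, then attach it with append()."""
    root = etree.Element("root")
    for name in names:
        root.append(etree.Element(name))
    return root

def build_with_subelement(names):
    """Create and attach each child in one call with SubElement()."""
    root = etree.Element("root")
    for name in names:
        etree.SubElement(root, name)
    return root

names = ["child1", "child2", "child3"]
a = build_with_append(names)
b = build_with_subelement(names)

# Both trees serialize to the same bytes.
print(etree.tostring(a) == etree.tostring(b))
```

Either form gives the same tree; SubElement simply saves the separate append call.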

Use the tostring function to see the whole XML document that was just built.

>>> print(etree.tostring(root, pretty_print=True).decode())
<root>
  <child1/>
  <child2/>
  <child3/>
</root>

child = root[0]
print(child.tag)  # child1

print(len(root))  # 3

root.index(root[1])  # lxml.etree only! returns 1

children = list(root)

for child in root:
    print(child.tag)
# child1
# child2
# child3

root.insert(0, etree.Element("child0"))
start = root[:1]
end = root[-1:]

print(start[0].tag)  # child0
print(end[0].tag)  # child3

Before ElementTree 1.3 and lxml 2.0, you could check whether an element has children by testing its truth value:

if root:  # this no longer works!
    print("The root element has children")

This is no longer supported, because people tend to expect that an element is "something" and should therefore evaluate to True whether or not it has children. Use len(element) instead.

print(etree.iselement(root))  # test if it's some kind of Element
# True

if len(root):  # test if it has children
    print("The root element has children")
# The root element has children
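
The distinction between "does this element exist" and "does this element have children" can be sketched as follows. This is an illustrative example only: the tag names leaf and nope are made up, and find() is used simply because it returns None when no matching child exists.

```python
from lxml import etree

root = etree.Element("root")
leaf = etree.SubElement(root, "leaf")

found = root.find("leaf")    # an Element that has no children
missing = root.find("nope")  # None, because there is no such child

# Test for existence with an explicit comparison against None ...
print(found is not None)
print(missing is None)

# ... and test for children with len().
print(len(root))
print(len(found))
```

Keeping the two tests separate avoids relying on the truth value of an element.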

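There is one more difference between lxml elements and native Python lists. Consider this code:

>>> for child in root:
...     print(child.tag)
child0
child1
child2
child3
>>> root[0] = root[-1]  # this moves the element in lxml.etree!
>>> for child in root:
...     print(child.tag)
child3
child1
child2

The assignment moved the last element into the first position, replacing child0.

>>> l = [0, 1, 2, 3]
>>> l[0] = l[-1]
>>> l
[3, 1, 2, 3]

In a native list, by contrast, the assignment only copies a reference to the last object into the first position, so the same object can appear in several places at once. lxml behaves differently: it moves the element rather than copying it.

Note that in the standard-library ElementTree, an element can be placed in several different positions, just like an object in a list. The obvious drawback is that modifying such an element changes it everywhere it is referenced, which may not be what you want.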
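
The move semantics also apply across parents. The following minimal sketch, with made-up tag names src, dst and item, appends an element that already has a parent to a second parent and checks where it ends up.

```python
from lxml import etree

src = etree.Element("src")
dst = etree.Element("dst")
item = etree.SubElement(src, "item")

# Appending an element that already has a parent moves it in lxml.etree.
dst.append(item)

print(len(src))               # src no longer contains item
print(len(dst))               # dst now contains it
print(item.getparent().tag)   # the parent is now dst
```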

In lxml.etree, an Element always has exactly one parent, which can be queried with the getparent() method. The standard-library ElementTree does not support this.

>>> root is root[0].getparent()  # lxml.etree only!
True

>>> from copy import deepcopy
>>> element = etree.Element("neu")
>>> element.append( deepcopy(root[1]) )
>>> print(element[0].tag)
child1
>>> print([ c.tag for c in root ])
['child3', 'child1', 'child2']

The example above shows that copying root[1] with deepcopy does not move it, so it still appears when the children of the original root are printed.

Accessing siblings

>>> root[0] is root[1].getprevious()  # lxml.etree only!
True
>>> root[1] is root[0].getnext()  # lxml.etree only!
True
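
Because getnext() returns None after the last sibling, it can drive a simple loop. The sketch below uses a hypothetical helper, following_tags, and made-up tag names to collect the tags of all siblings that follow an element.

```python
from lxml import etree

root = etree.Element("root")
for name in ("a", "b", "c"):
    etree.SubElement(root, name)

def following_tags(element):
    """Return the tags of all siblings after element, in document order."""
    tags = []
    sibling = element.getnext()
    while sibling is not None:  # getnext() returns None past the last sibling
        tags.append(sibling.tag)
        sibling = sibling.getnext()
    return tags

print(following_tags(root[0]))
print(following_tags(root[2]))
```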

Elements carry attributes like a dict

XML elements support attributes, which can be created directly in the Element factory function.

>>> root = etree.Element("root", interesting="totally")
>>> etree.tostring(root)
b'<root interesting="totally"/>'

Attributes are unordered key-value pairs, so they can be handled conveniently through a dict-like interface.

>>> print(root.get("interesting"))
totally
>>> print(root.get("hello"))
None
>>> root.set("hello", "Huhu")
>>> print(root.get("hello"))
Huhu
>>> etree.tostring(root)
b'<root interesting="totally" hello="Huhu"/>'
>>> sorted(root.keys())
['hello', 'interesting']
>>> for name, value in sorted(root.items()):
...     print('%s = %r' % (name, value))
hello = 'Huhu'
interesting = 'totally'

If you want a real dict-like object, use the attrib property.

>>> attributes = root.attrib
>>> print(attributes["interesting"])
totally
>>> print(attributes.get("no-such-attribute"))
None
>>> attributes["hello"] = "Guten Tag"
>>> print(attributes["hello"])
Guten Tag
>>> print(root.get("hello"))
Guten Tag

>>> d = dict(root.attrib)
>>> sorted(d.items())
[('hello', 'Guten Tag'), ('interesting', 'totally')]
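
The difference between root.attrib and dict(root.attrib) can be sketched as follows: the former is a live view of the element's attributes, the latter an independent copy. The variable names live and snapshot are chosen only for this illustration.

```python
from lxml import etree

root = etree.Element("root", interesting="totally")

live = root.attrib            # live view backed by the element
snapshot = dict(root.attrib)  # independent copy taken right now

root.set("hello", "Huhu")

print("hello" in live)      # the view sees the new attribute
print("hello" in snapshot)  # the copy does not
```

Take a dict copy when the attributes should survive later changes to the element.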

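Elements contain text

Elements can carry text:

>>> root = etree.Element("root")
>>> root.text = "TEXT"
>>> print(root.text)
TEXT
>>> etree.tostring(root)
b'<root>TEXT</root>'

<html><body>Hello<br/>World</body></html>

In this example, the <br/> tag is surrounded by text. To support this, an element has a tail attribute: it holds the text that directly follows the element, up to the next element. Here is an example: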

>>> html = etree.Element("html")
>>> body = etree.SubElement(html, "body")
>>> body.text = "TEXT"

>>> etree.tostring(html)
b'<html><body>TEXT</body></html>'

>>> br = etree.SubElement(body, "br")
>>> etree.tostring(html)
b'<html><body>TEXT<br/></body></html>'

>>> br.tail = "TAIL"
>>> etree.tostring(html)
b'<html><body>TEXT<br/>TAIL</body></html>'

>>> etree.tostring(br)
b'<br/>TAIL'
>>> etree.tostring(br, with_tail=False)  # lxml.etree only!
b'<br/>'

>>> etree.tostring(html, method="text")
b'TEXTTAIL'

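Serializing with method="text" is handy when, for example, you want to read the text out of a Word document without the mass of tags.

Another way to extract text is XPath, which can also return the extracted text pieces as a list.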
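
To make the roles of text and tail concrete, here is a minimal sketch that reassembles the text content by hand and compares it with the method="text" serialization. The helper collect_text is hypothetical, and it assumes the tree contains only elements (no comments or processing instructions).

```python
from lxml import etree

def collect_text(element):
    """Concatenate an element's text, its children's text, and their tails."""
    parts = [element.text or ""]
    for child in element:
        parts.append(collect_text(child))  # text inside the child
        parts.append(child.tail or "")     # text right after the child
    return "".join(parts)

doc = etree.fromstring("<html><body>Hello<br/>World</body></html>")

print(collect_text(doc))
print(etree.tostring(doc, method="text"))
```

In practice the built-in serialization is the simpler choice; the helper only shows where each piece of text is stored.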

>>> print(html.xpath("string()"))  # lxml.etree only!
TEXTTAIL
>>> print(html.xpath("//text()"))  # lxml.etree only!
['TEXT', 'TAIL']

If you need this often, you can wrap it in a function:

>>> build_text_list = etree.XPath("//text()")  # lxml.etree only!
>>> print(build_text_list(html))
['TEXT', 'TAIL']
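
A compiled XPath object can be reused on any number of trees. The sketch below applies one compiled expression to two small documents that were made up for this example.

```python
from lxml import etree

# Compile the expression once ...
build_text_list = etree.XPath("//text()")

docs = [
    etree.fromstring("<p>one<b>two</b>three</p>"),
    etree.fromstring("<p>four</p>"),
]

# ... and apply it to each tree, converting the results to plain strings.
results = [[str(t) for t in build_text_list(doc)] for doc in docs]
print(results)
```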

The strings returned by XPath are somewhat smart: they know where they came from. You can call getparent() on them to find out which element they belong to, just as you would on an element.

>>> texts = build_text_list(html)
>>> print(texts[0])
TEXT
>>> parent = texts[0].getparent()
>>> print(parent.tag)
body
>>> print(texts[1])
TAIL
>>> print(texts[1].getparent().tag)
br

You can also find out whether a string is normal text content or tail text.

>>> print(texts[0].is_text)
True
>>> print(texts[1].is_text)
False
>>> print(texts[1].is_tail)
True
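
Putting getparent(), is_text and is_tail together, the following sketch records where each text result came from. The document string and the name origins are made up for this illustration.

```python
from lxml import etree

doc = etree.fromstring("<body>TEXT<br/>TAIL</body>")

origins = []
for s in doc.xpath("//text()"):
    kind = "text" if s.is_text else "tail"
    # Store the plain string, its kind, and the tag of the owning element.
    origins.append((str(s), kind, s.getparent().tag))

for value, kind, tag in origins:
    print(f"{value!r} is the {kind} of <{tag}>")
```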

While this works for the results of the text() function, lxml will not tell you the origin of a string value that was constructed by the XPath functions string() or concat(). Example code:

#!/usr/bin/env python3
# encoding=UTF-8
from lxml import etree

html = etree.Element("html")
html.text = 'abc'
html.tail = 'xyz'

ch1 = etree.SubElement(html, "child1")
ch1.text = "child1"
print(etree.tostring(html, pretty_print=True).decode())

# A string() result carries no parent information: getparent() returns None
t = html.xpath("string()")  # lxml.etree only!
print(t.getparent())

# text() results do carry parent information
t2 = html.xpath("//text()")  # lxml.etree only!
print(t2)
for a in t2:
    print(a.getparent().tag)

To be continued.

Published by: 全栈程序员-站长. Please credit the source when reposting: https://javaforall.net/226151.html. Original link: https://javaforall.net
