Pandas read XML not working properly for XML with a single tag
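I am using the pandas_read_xml package to read XML files and process them into Pandas DataFrames. In the vast majority of cases the package serves my purpose perfectly. However, when reading a URL whose XML contains a tag only once, the resulting DataFrame is a bit off. Let me illustrate this with the following two examples.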
    # Import package
    import pandas_read_xml as pdx
    from pandas_read_xml import fully_flatten

    # Example 1
    url_1 = 'https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml'
    df_1 = pdx.read_xml(url_1, ['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'])
    df_1 = pdx.fully_flatten(df_1)
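The resulting df_1 has 163 rows and 31 columns, with each row corresponding to a unique security. This is the result I want. However, the output is odd when I read an XML in which the tag "invstOrSec" appears only once.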
    # Example 2 (same root tag list as Example 1)
    url_2 = 'https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml'
    df_2 = pdx.read_xml(url_2, ['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'])
    df_2 = pdx.fully_flatten(df_2)
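The resulting df_2 has 6 rows and 19 columns. I cannot understand why it has 6 rows when it should have just 1. I have observed that this behavior occurs only when the tag "invstOrSec" appears exactly once. Any help would be much appreciated. Please let me know if my question is unclear.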
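Solution
First of all, thank you for the feedback! I wrote pandas-read-xml because pandas did not have a pd.read_xml() implementation. You (and the rest of us) will be glad to know that Pandas read_xml is in the development version and coming soon! (https://pandas.pydata.org/docs/dev/reference/api/pandas.read_xml.html)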
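As for your current problem, it is a consequence of the XML structure (and one I am not fond of). Unlike JSON, which can return a single element inside a list, an XML structure that contains a tag only once is interpreted as a single value rather than a list.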
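The effect can be reproduced without any third-party package. The sketch below is only an illustration: `element_to_dict` is a hypothetical helper that mimics the common XML-to-dict convention (repeated tags become a list, a lone tag becomes a plain value), and the tiny XML snippets are made up. It is not the code used inside pandas_read_xml.

```python
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Hypothetical helper: leaf -> text, repeated child tags -> list,
    a child tag that occurs once -> plain value."""
    children = list(elem)
    if not children:
        return elem.text
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # Second or later occurrence of this tag: promote to a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


# Made-up documents: one with two <row> tags, one with a single <row> tag.
many = "<rows><row><name>A</name><val>1</val></row><row><name>B</name><val>2</val></row></rows>"
one = "<rows><row><name>A</name><val>1</val></row></rows>"

parsed_many = element_to_dict(ET.fromstring(many))
parsed_one = element_to_dict(ET.fromstring(one))

print(type(parsed_many["row"]).__name__)  # list
print(type(parsed_one["row"]).__name__)   # dict
```

With two `row` tags the value is a list of records, while with one it is a bare dict, so any code that expects "a list of rows" sees the fields of the lone record instead.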
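Basically, if there is only one "row" tag, then the "column" tags are now treated as rows... I am not making much sense, am I? Let me explain using your example.
Here is how I would suggest you use it: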
    # Import package
    import pandas_read_xml as pdx
    from pandas_read_xml import fully_flatten

    # Example 1
    url_1 = 'https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml'
    df_1 = pdx.read_xml(url_1, ['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec']).pipe(fully_flatten)

    # Example 2: use the parent tag as the root, then transpose
    url_2 = 'https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml'
    df_2 = pdx.read_xml(url_2, ['edgarSubmission', 'formData', 'invstOrSecs'], transpose=True).pipe(fully_flatten)
    df_2
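What is the difference?
In Example 1, you already expect multiple invstOrSec tags inside the invstOrSecs tag. So passing root_tag_list=['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'] returns a list behind the scenes. The fully_flatten process first explodes that list into rows.
In Example 2, if you use the same root_tag_list, pandas does not read a list. Instead, it reads a dict that corresponds to a single row. In effect, it treats the tags that should be columns as rows. So I would pass the tag one level above as the root tag, then transpose, then fully flatten.
Yes... I know... it is a workaround. But then again, I did not create pandas-read-xml hoping to solve every problem. It was always meant as a stopgap until Pandas itself supports reading XML (which looks like it will arrive soon).
Let me know how it goes!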
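Why the transpose helps can be seen with plain pandas. This is a rough illustration with a made-up record, not the internal logic of pandas_read_xml: a list of records yields one row per record, whereas a lone record expanded field by field ends up with the field names in the index until it is transposed.

```python
import pandas as pd

# Made-up record standing in for one parsed <invstOrSec> element.
record = {"name": "Fund A", "cusip": "999999999", "invCountry": "US"}

# Several records: each dict becomes a row with an integer index.
many = pd.DataFrame([record, dict(record, name="Fund B")])

# A lone record expanded field by field: the field names become the index.
single = pd.Series(record).to_frame()

# Transposing turns the field names back into columns, leaving one row.
fixed = single.T.reset_index(drop=True)

print(many.shape, single.shape, fixed.shape)  # (2, 3) (3, 1) (1, 3)
```

The string index on `single` is the same signal that Example 3 checks for with `0 not in temp.index`.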
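Edit:
As for how to make the XML-to-DataFrame conversion switch depending on whether the XML has a single "row" tag or several, I have the following two options.
With multiple rows, the result is a DataFrame with an integer index (row numbers), whereas with a single row the DataFrame index consists of strings, namely the names that should have been columns. So one strategy is to detect this and re-run accordingly. (You can probably avoid the repeated download with a smarter approach.)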
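    # Import package
    import pandas as pd
    import pandas_read_xml as pdx
    from pandas_read_xml import fully_flatten

    # Example 3
    dfs = []
    url_components = ['1279392/000114554921008161', '1279394/000114554921008162']
    for url_component in url_components:
        url = f'https://www.sec.gov/Archives/edgar/data/{url_component}/primary_doc.xml'
        # Read from the parent tag first to detect the single-row case
        temp = pdx.read_xml(url, ['edgarSubmission', 'formData', 'invstOrSecs'])
        if 0 not in temp.index:
            # String index: a single "row", so read from the parent tag and transpose
            temp = pdx.read_xml(url, ['edgarSubmission', 'formData', 'invstOrSecs'], transpose=True)
        else:
            # Integer index: multiple "rows", so read down to the row tag
            temp = pdx.read_xml(url, ['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'])
        dfs.append(temp)

    df = pd.concat(dfs, ignore_index=True).pipe(fully_flatten)
    df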
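One possible way to avoid the repeated download hinted at above is to normalize the already-parsed structure instead of re-reading the URL. The sketch below swaps in a different technique from Example 3: `ensure_list` is a hypothetical helper, and the nested dicts are made-up stand-ins for what an XML-to-dict parser might return.

```python
def ensure_list(value):
    """Hypothetical helper: always hand back a list of records."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    # A lone record (single "row" tag) gets wrapped in a list.
    return [value]


# Made-up parse results: one document with a single row, one with two rows.
parsed_single = {"invstOrSecs": {"invstOrSec": {"name": "Fund A"}}}
parsed_multi = {"invstOrSecs": {"invstOrSec": [{"name": "Fund B"}, {"name": "Fund C"}]}}

rows = []
for parsed in (parsed_single, parsed_multi):
    rows.extend(ensure_list(parsed["invstOrSecs"]["invstOrSec"]))

print(len(rows))  # 3
```

Because every document now contributes a list of records, the same downstream code handles both cases and each URL only needs to be fetched once.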
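Another option is to use the underlying tools. There is nothing magical behind pandas_read_xml; it uses a package called xmltodict. Read the XML, convert it to dicts, convert those to Pandas, then flatten. The only downside is that because the tag name "invstOrSec" is retained, it becomes a prefix of the column names. You should be able to remove it easily.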
    # Import package
    import pandas as pd
    import pandas_read_xml as pdx
    import xmltodict
    from pandas_read_xml import fully_flatten

    # Example 4
    url_components = ['1279392/000114554921008161', '1279394/000114554921008162']
    xmldicts = []
    for url_component in url_components:
        url = f'https://www.sec.gov/Archives/edgar/data/{url_component}/primary_doc.xml'
        xml = pdx.read_xml_from_url(url)
        # Keep everything under the parent tag; "invstOrSec" stays as a key
        xmldicts.append(xmltodict.parse(xml)['edgarSubmission']['formData']['invstOrSecs'])

    df = pd.DataFrame.from_dict(xmldicts).pipe(fully_flatten)
    df
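Removing the retained prefix could look something like the sketch below. The helper `strip_prefix` is hypothetical, and the `|` separator and the sample column names are assumptions for illustration; check `df.columns` on the real result to see which separator the flattened names actually use.

```python
import pandas as pd


def strip_prefix(df, prefix, sep="|"):
    """Hypothetical helper: drop `prefix + sep` from the start of column names."""
    lead = prefix + sep
    return df.rename(columns=lambda c: c[len(lead):] if c.startswith(lead) else c)


# Made-up frame; the "|" separator is an assumption about the flattened names.
df = pd.DataFrame({
    "invstOrSec|name": ["Fund A"],
    "invstOrSec|cusip": ["999999999"],
    "other": [1],
})

clean = strip_prefix(df, "invstOrSec")
print(list(clean.columns))  # ['name', 'cusip', 'other']
```

Columns that do not start with the prefix are left untouched, so the helper is safe to apply to the whole frame.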
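Hope that helps!
Edit:
So, I have updated the package (now version 0.2.0). pandas_read_xml should now, by default, treat the root tag as rows in the resulting Pandas DataFrame, so there is no need to distinguish between XML that sometimes has a single "row" and sometimes has several.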
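If this turns out to be a problem in other cases, there is a new parameter, root_is_rows, which defaults to True but can be set to False.

Indeed, in the upcoming Pandas 1.3, read_xml will let you migrate parsed nodes into a DataFrame. However, XML can have many dimensions beyond the 2D of rows by columns, so as the documentation notes, this method is best suited for importing shallow XML documents.
Therefore, nested elements are not picked up right away, as shown here with about 20 columns. Note that because of the default namespace in the document, the namespaces argument is needed.
Pandas 1.3+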
    import pandas as pd

    url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
    df = pd.read_xml(url, xpath="//edgar:invstOrSec", namespaces={"edgar": "http://www.sec.gov/edgar/nport"})
    print(df)
    #      name lei title cusip ... fairValLevel securityLending assetCat debtSec
    # 0    Tastemade Inc. NaN Tastemade Inc. 999999999 ... 3.0 NaN None NaN
    # 1    Regatta XV Funding Ltd., Subordinated Note, PR... NaN Regatta XV Funding Ltd., PR... 75888PAC7 ... 2.0 NaN ABS-CBDO NaN
    # 2    Hired, Inc., Series C Preferred Stock NaN Hired, Series C Preferred Stock NaN ... 3.0 NaN EP NaN
    # 3    WESTVIEW CAPITAL PARTNERS II LP NaN WESTVIEW CAPITAL PARTNERS II LP 999999999 ... NaN NaN None NaN
    # 4    VOYAGER CAPITAL FUND III, L.P. NaN VOYAGER CAPITAL FUND III, L.P. 999999999 ... NaN NaN None NaN
    # ..   ... ... ... ... ... ... ... ... ...
    # 158  ARCLIGHT ENERGY PARTNERS FUND V, L.P. NaN ARCLIGHT ENERGY PARTNERS FUND V, L.P. 999999999 ... NaN NaN None NaN
    # 159  ALLOY MERCHANT PARTNERS L.P. NaN ALLOY MERCHANT PARTNERS L.P. 999999999 ... NaN NaN None NaN
    # 160  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... NaN ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... 999999999 ... NaN NaN None NaN
    # 161  ABRY ADVANCED SECURITIES FUND LP NaN ABRY ADVANCED SECURITIES FUND LP 999999999 ... NaN NaN None NaN
    # 162  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... NaN ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... 999999999 ... NaN NaN None NaN
    # [163 rows x 20 columns]

    url = "https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml"
    df = pd.read_xml(url, xpath="//edgar:invstOrSec", namespaces={"edgar": "http://www.sec.gov/edgar/nport"})
    print(df)
    #    name lei title cusip ... invCountry isRestrictedSec fairValLevel securityLending
    # 0  Salient Private Access Master Fund, L.P. NaN Salient Private Access Master Fund, L.P. 999999999 ... US Y NaN NaN
    # [1 rows x 18 columns]
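The need for a namespace mapping can be seen on a tiny document with the standard library alone. The XML below is a made-up fragment that reuses the namespace URI and tag names from the calls above, and the `edgar` prefix is just a label chosen by the caller.

```python
import xml.etree.ElementTree as ET

# Made-up fragment with a default namespace, like the EDGAR documents above.
xml = """<edgarSubmission xmlns="http://www.sec.gov/edgar/nport">
  <formData>
    <invstOrSecs>
      <invstOrSec><name>Fund A</name></invstOrSec>
    </invstOrSecs>
  </formData>
</edgarSubmission>"""

root = ET.fromstring(xml)
ns = {"edgar": "http://www.sec.gov/edgar/nport"}

# Without a prefix the path looks for a tag in no namespace and finds nothing.
unprefixed = root.findall(".//invstOrSec")

# With the prefix mapped to the default namespace URI the node is found.
prefixed = root.findall(".//edgar:invstOrSec", ns)

print(len(unprefixed), len(prefixed))  # 0 1
```

Every element under a default namespace belongs to that namespace, which is why an XPath without a mapped prefix matches nothing.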
Fortunately, read_xml supports XSLT (a special-purpose language designed to transform XML documents) with the default lxml parser. With XSLT, you can flatten the nodes needed for the migration and retrieve 32 columns.

    xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                                           xmlns:edgar="http://www.sec.gov/edgar/nport">
        <xsl:output method="xml" indent="yes" />
        <xsl:strip-space elements="*"/>

        <!-- Identity template: copy every node and attribute as-is -->
        <xsl:template match="@*|node()">
          <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
          </xsl:copy>
        </xsl:template>

        <!-- Place children and grandchildren directly under each invstOrSec -->
        <xsl:template match="edgar:invstOrSec">
          <xsl:copy>
            <xsl:apply-templates select="*|*/*"/>
          </xsl:copy>
        </xsl:template>
    </xsl:stylesheet>
    """

    url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
    df = pd.read_xml(url, xpath="//edgar:invstOrSec", namespaces={"edgar": "http://www.sec.gov/edgar/nport"}, stylesheet=xsl)
    print(df)
    #      name lei title cusip ... annualizedRt isDefault areIntrstPmntsInArrs isPaidKind
    # 0    Tastemade Inc. NaN Tastemade Inc. 999999999 ... NaN None None None
    # 1    Regatta XV Funding Ltd., PR... 75888PAC7 ... 0.0624 N N N
    # 2    Hired, Series C Preferred Stock NaN ... NaN None None None
    # 3    WESTVIEW CAPITAL PARTNERS II LP NaN WESTVIEW CAPITAL PARTNERS II LP 999999999 ... NaN None None None
    # 4    VOYAGER CAPITAL FUND III, L.P. 999999999 ... NaN None None None
    # ..   ... ... ... ... ... ... ... ... ...
    # 158  ARCLIGHT ENERGY PARTNERS FUND V, L.P. 999999999 ... NaN None None None
    # 159  ALLOY MERCHANT PARTNERS L.P. NaN ALLOY MERCHANT PARTNERS L.P. 999999999 ... NaN None None None
    # 160  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... NaN ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... 999999999 ... NaN None None None
    # 161  ABRY ADVANCED SECURITIES FUND LP NaN ABRY ADVANCED SECURITIES FUND LP 999999999 ... NaN None None None
    # 162  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... NaN ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... 999999999 ... NaN None None None
    # [163 rows x 32 columns]
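To get a feel for what selecting children together with grandchildren achieves, here is a standard-library sketch of the same flattening idea. It is not equivalent to the stylesheet (it drops container tags such as `debtSec` instead of keeping them as empty columns), `flatten_row` is a hypothetical helper, and the nesting of `isDefault` and `isPaidKind` under `debtSec` in the made-up fragment is an assumption.

```python
import xml.etree.ElementTree as ET

# Made-up fragment; the nesting under <debtSec> is assumed for illustration.
xml = """<invstOrSecs>
  <invstOrSec>
    <name>Fund A</name>
    <debtSec><isDefault>N</isDefault><isPaidKind>N</isPaidKind></debtSec>
  </invstOrSec>
</invstOrSecs>"""


def flatten_row(row):
    """Hypothetical helper: collect text of children and grandchildren in one dict."""
    flat = {}
    for child in row:
        text = (child.text or "").strip()
        if text:
            flat[child.tag] = text
        # Promote grandchildren so they sit next to the direct children.
        for grandchild in child:
            flat[grandchild.tag] = (grandchild.text or "").strip() or None
    return flat


rows = [flatten_row(r) for r in ET.fromstring(xml).findall("invstOrSec")]
print(rows)  # [{'name': 'Fund A', 'isDefault': 'N', 'isPaidKind': 'N'}]
```

Once the nested values sit at the same level as the direct children, each record maps cleanly onto one DataFrame row.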
Pandas < 1.3
Achieving the same result through the XPath approach takes more steps, in which you have to handle the URL request and the XML parsing yourself to build the DataFrame. Specifically, create a list of dictionaries from the transformed, parsed XML and pass it to the DataFrame constructor. The code below uses the same XSLT and XPath namespace as above.

    import lxml.etree as lx
    import pandas as pd
    import urllib.request as rq

    url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"

    xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                                           xmlns:edgar="http://www.sec.gov/edgar/nport">
        <xsl:output method="xml" indent="yes" />
        <xsl:strip-space elements="*"/>

        <xsl:template match="@*|node()">
          <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
          </xsl:copy>
        </xsl:template>

        <xsl:template match="edgar:invstOrSec">
          <xsl:copy>
            <xsl:apply-templates select="*|*/*"/>
          </xsl:copy>
        </xsl:template>
    </xsl:stylesheet>
    """

    content = rq.urlopen(url)

    # LOAD XML AND XSL
    doc = lx.fromstring(content.read())
    style = lx.fromstring(xsl)

    # INITIALIZE AND TRANSFORM ORIGINAL DOC
    transformer = lx.XSLT(style)
    result = transformer(doc)

    # RUN XPATH PARSING ON FLATTER XML
    # (node.tag looks like "{namespace}name"; keep only the local name)
    data = [
        {node.tag.split('}')[1]: node.text for node in inv.xpath("*")}
        for inv in result.xpath("//edgar:invstOrSec", namespaces={"edgar": "http://www.sec.gov/edgar/nport"})
    ]

    # BIND DATA FOR DATA FRAME
    df = pd.DataFrame(data)
    print(df)
    #      name lei title ... isDefault areIntrstPmntsInArrs isPaidKind
    # 0    Tastemade Inc. N/A Tastemade Inc. ... NaN NaN NaN
    # 1    Regatta XV Funding Ltd., PR... N/A Regatta XV Funding Ltd., PR... ... N N N
    # 2    Hired, Series C Preferred Stock N/A Hired, Series C Preferred Stock ... NaN NaN NaN
    # 3    WESTVIEW CAPITAL PARTNERS II LP N/A WESTVIEW CAPITAL PARTNERS II LP ... NaN NaN NaN
    # 4    VOYAGER CAPITAL FUND III, L.P. N/A VOYAGER CAPITAL FUND III, L.P. ... NaN NaN NaN
    # ..   ... ... ... ... ... ... ...
    # 158  ARCLIGHT ENERGY PARTNERS FUND V, L.P. N/A ARCLIGHT ENERGY PARTNERS FUND V, L.P. ... NaN NaN NaN
    # 159  ALLOY MERCHANT PARTNERS L.P. N/A ALLOY MERCHANT PARTNERS L.P. ... NaN NaN NaN
    # 160  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... N/A ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... ... NaN NaN NaN
    # 161  ABRY ADVANCED SECURITIES FUND LP N/A ABRY ADVANCED SECURITIES FUND LP ... NaN NaN NaN
    # 162  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... N/A ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... ... NaN NaN NaN
    # [163 rows x 32 columns]