Programming Q&A   Published: 2022-06-01   Source site: 大佬教程 (code.js-code.com)
Pandas read XML not working properly for single-tag XML

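I am using the pandas_read_xml package to read XML files and process them into pandas DataFrames. In the vast majority of cases the package does exactly what I need. However, when I read a URL whose XML contains only a single tag, the resulting DataFrame is a bit off. Let me illustrate with the following two examples.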

# Import package
import pandas_read_xml as pdx
from pandas_read_xml import fully_flatten

# Example 1
url_1 = 'https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml'
df_1 = pdx.read_xml(url_1, ['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'])
df_1 = pdx.fully_flatten(df_1)

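The resulting df_1 has 163 rows and 31 columns, with each row corresponding to one unique security. That is the result I want. However, the output is odd when I read an XML file in which the tag "invstOrSec" appears only once.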

# Example 2
url_2 = 'https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml'
df_2 = pdx.read_xml(url_2, ['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'])
df_2 = pdx.fully_flatten(df_2)

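The resulting df_2 has 6 rows and 19 columns. I can't understand why it contains 6 rows when it should have just 1. I have observed this behavior only when the tag "invstOrSec" appears exactly once. Any help would be greatly appreciated. If my question is unclear, please let me know.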

Solution

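First of all, thanks for the feedback! I wrote pandas-read-xml because pandas did not have a pd.read_xml() implementation. You (and the rest of us) will be glad to know that read_xml is in the development version of pandas and is coming soon! (https://pandas.pydata.org/docs/dev/reference/api/pandas.read_xml.html)

As for your current problem, it is a consequence of how XML is structured (and one of the things I dislike about it). Unlike JSON, which can return a single element inside a list, an XML structure with only one occurrence of a tag is interpreted as a single value rather than a list.

Basically, if there is only one "row" tag, the "column" tags are now treated as the rows... I'm not making much sense, am I? Let me explain using your example.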
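To make the list-versus-single-value point concrete, here is a minimal sketch using only the standard library. The element_to_value function is a hypothetical, simplified stand-in for the usual XML-to-dict convention (repeated tags become a list, a tag that occurs once is stored directly); it is not the code that pandas_read_xml or xmltodict actually runs, and the tiny XML strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def element_to_value(elem):
    """Convert an element to its text (leaf) or a dict (has children).

    Hypothetical helper: repeated child tags are collected into a list,
    while a child tag that occurs only once is stored directly.
    """
    children = list(elem)
    if not children:
        return elem.text
    result = {}
    for child in children:
        value = element_to_value(child)
        if child.tag in result:
            # Second or later occurrence: promote the entry to a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


many = ET.fromstring(
    "<invstOrSecs>"
    "<invstOrSec><name>A</name><cusip>1</cusip></invstOrSec>"
    "<invstOrSec><name>B</name><cusip>2</cusip></invstOrSec>"
    "</invstOrSecs>"
)
one = ET.fromstring(
    "<invstOrSecs>"
    "<invstOrSec><name>A</name><cusip>1</cusip></invstOrSec>"
    "</invstOrSecs>"
)

rows_many = element_to_value(many)["invstOrSec"]
rows_one = element_to_value(one)["invstOrSec"]

print(type(rows_many).__name__)  # list
print(type(rows_one).__name__)   # dict
```

With two invstOrSec tags the value is a list of records, but with one tag it is a bare dict of fields, and nothing in the XML itself says that dict was meant to be a one-element list.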

Here is how I would suggest using it:

# Import package
import pandas_read_xml as pdx
from pandas_read_xml import fully_flatten

# Example 1
url_1 = 'https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml'
df_1 = pdx.read_xml(url_1, ['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec']).pipe(fully_flatten)

# Example 2: use the tag one level above as the root, then transpose
url_2 = 'https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml'
df_2 = pdx.read_xml(url_2, ['edgarSubmission', 'formData', 'invstOrSecs'], transpose=True).pipe(fully_flatten)
df_2

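What's the difference?

In Example 1, you already expect multiple invstOrSec tags inside the invstOrSecs tag. So passing root_tag_list=['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'] returns a list behind the scenes. The fully_flatten step first explodes that list into rows.

In Example 2, if you use the same root_tag_list, pandas does not read a list. Instead it reads a dictionary corresponding to a single row, and in effect treats the tags that should be columns as rows. So instead, I would pass the tag one level above as the root tag, then transpose, then fully flatten.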
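The row/column mix-up can be reproduced with plain pandas. This is only a sketch: the two-field records are made up, and building the single-record frame with orient="index" is an assumed stand-in for what happens internally, chosen because it produces the same symptom (field names in the index).

```python
import pandas as pd

# Several <invstOrSec> tags parse to a list of dicts: one row per dict.
records = [{"name": "A", "cusip": "1"}, {"name": "B", "cusip": "2"}]
df_many = pd.DataFrame(records)

# A single <invstOrSec> tag parses to one dict. Orienting it by index
# (assumption, see above) puts the field names in the index, so every
# field becomes a row.
record = {"name": "A", "cusip": "1"}
df_wrong = pd.DataFrame.from_dict(record, orient="index")

# Transposing turns the field names back into columns: a single row.
df_fixed = df_wrong.T.reset_index(drop=True)

print(df_many.shape)   # (2, 2)
print(df_wrong.shape)  # (2, 1)
print(df_fixed.shape)  # (1, 2)
```

This is why the transposed read gives one row for the single-security filing.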

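Yes... I know... it's a workaround. But then again, I didn't create pandas-read-xml hoping to solve every problem. It was always meant as a stopgap until pandas itself supports reading XML (which looks like it is coming soon).

Let me know how it goes!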

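Edit:

As for making the XML-to-DataFrame conversion switch depending on whether the XML has a single "row" tag or several, I have the following two options.

In the multi-row case, the result is a DataFrame with an integer index (row numbers), whereas in the single-row case the DataFrame index consists of strings, the names that should have been columns. So one strategy is to detect this and re-read accordingly. (You can probably avoid the repeated download with a smarter approach.)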

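# Import package
import pandas as pd
import pandas_read_xml as pdx
from pandas_read_xml import fully_flatten

# Example 3

dfs = []
url_components = ['1279392/000114554921008161', '1279394/000114554921008162']

for url_component in url_components:
    url = f'https://www.sec.gov/Archives/edgar/data/{url_component}/primary_doc.xml'
    # Read one level above the "row" tag first to see which shape comes back
    temp = pdx.read_xml(url, ['edgarSubmission', 'formData', 'invstOrSecs'])
    if 0 not in temp.index:
        # Single "row" tag: the index holds field names, so re-read transposed
        temp = pdx.read_xml(url, ['edgarSubmission', 'formData', 'invstOrSecs'], transpose=True)
    else:
        # Multiple "row" tags: read down to the repeated tag as usual
        temp = pdx.read_xml(url, ['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'])
    dfs.append(temp)

df = pd.concat(dfs, ignore_index=True).pipe(fully_flatten)

df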
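The detection step can also be tried offline, without any download. The sketch below wraps the "is 0 in the index?" check in a hypothetical helper, rows_from_parsed, and transposes the frame that is already in memory rather than re-reading the URL. Whether an in-memory transpose matches what transpose=True does inside pandas_read_xml is an assumption; the frames here are small stand-ins built by hand.

```python
import pandas as pd


def rows_from_parsed(df: pd.DataFrame) -> pd.DataFrame:
    """Return df with one record per row (hypothetical helper).

    Assumes a multi-record parse has an integer index starting at 0 and a
    single-record parse has the field names in the index instead.
    """
    if 0 in df.index:
        return df
    return df.T.reset_index(drop=True)


# Stand-ins for a multi-row parse and a single-row parse.
multi = pd.DataFrame([{"name": "A"}, {"name": "B"}])
single = pd.DataFrame.from_dict({"name": "C"}, orient="index")

combined = pd.concat(
    [rows_from_parsed(multi), rows_from_parsed(single)],
    ignore_index=True,
)
print(combined["name"].tolist())  # ['A', 'B', 'C']
```

If the assumption holds for real data, this would be one way to avoid the second download mentioned above.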

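The other option is to use the underlying tools. There is nothing magical behind pandas_read_xml; it uses a package called xmltodict. It reads the XML, converts it to dicts, converts those to pandas, and then flattens. The only downside is that, because the tag name "invstOrSec" is retained, it becomes a prefix of the column names. You should be able to remove it easily.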

# Import package
import pandas as pd
import pandas_read_xml as pdx
import xmltodict
from pandas_read_xml import fully_flatten

# Example 4

url_components = ['1279392/000114554921008161', '1279394/000114554921008162']
xml_dicts = []

for url_component in url_components:
    url = f'https://www.sec.gov/Archives/edgar/data/{url_component}/primary_doc.xml'
    xml = pdx.read_xml_from_url(url)
    # Keep the container tag so single and multiple "row" tags look alike
    xml_dicts.append(xmltodict.parse(xml)['edgarSubmission']['formData']['invstOrSecs'])

df = pd.DataFrame.from_dict(xml_dicts).pipe(fully_flatten)

df
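For the prefix removal mentioned above, something along these lines should work. It is a sketch on a hand-made frame: the "|" separator in PREFIX is an assumption about how the flattened column names are joined, so check df.columns on the real result and adjust PREFIX before using it.

```python
import pandas as pd

# Assumed prefix; inspect df.columns on real data to confirm the separator.
PREFIX = "invstOrSec|"

df = pd.DataFrame(
    {"invstOrSec|name": ["A"], "invstOrSec|cusip": ["1"], "other": [0]}
)

# Drop the prefix only from the columns that actually carry it.
df = df.rename(
    columns=lambda col: col[len(PREFIX):] if col.startswith(PREFIX) else col
)

print(list(df.columns))  # ['name', 'cusip', 'other']
```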

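Hope that helps!

Edit:

So, I have updated the package (now version 0.2.0). pandas_read_xml should now treat the root tag as the rows of the resulting pandas DataFrame by default, so there is no need to distinguish between XML that sometimes has a single "row" and sometimes has several.

If that causes problems in other cases, there is a new argument, root_is_rows, which defaults to True but can be set to False.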

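Indeed, in the upcoming pandas 1.3, read_xml will let you migrate parsed nodes into a DataFrame. However, XML can have many dimensions beyond the two-dimensional rows-by-columns layout, as the documentation notes:

This method is best designed to import shallow XML documents

Consequently, nested elements are not picked up right away, as shown here with roughly 20 columns. Note that because the document declares a default namespace, you need to pass the namespaces argument.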
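The effect of a default namespace is easy to see with the standard library alone. In this sketch the XML is a made-up miniature that only borrows the namespace URI and tag names from the filing, and the "edgar" prefix is an arbitrary label that just has to match the mapping passed to the search.

```python
import xml.etree.ElementTree as ET

# Miniature document with a default namespace (structure is illustrative).
xml_text = (
    '<edgarSubmission xmlns="http://www.sec.gov/edgar/nport">'
    "<invstOrSecs><invstOrSec><name>A</name></invstOrSec></invstOrSecs>"
    "</edgarSubmission>"
)
root = ET.fromstring(xml_text)
ns = {"edgar": "http://www.sec.gov/edgar/nport"}

# An unqualified tag name does not match namespaced elements.
without_ns = root.findall(".//invstOrSec")
# Qualifying the name with a mapped prefix does.
with_ns = root.findall(".//edgar:invstOrSec", ns)

print(len(without_ns), len(with_ns))  # 0 1
```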

Pandas 1.3+

import pandas as pd

url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec", namespaces={"edgar": "http://www.sec.gov/edgar/nport"})

print(df)
#                                                   name  lei                                              title      cusip  ...  fairValLevel  securityLending  assetCat debtSec
# 0                                       Tastemade Inc.  NaN                                     Tastemade Inc.  999999999  ...           3.0              NaN      None     NaN
# 1    Regatta XV Funding Ltd., Subordinated Note, PR...  NaN  Regatta XV Funding Ltd., Subordinated Note, PR...  75888PAC7  ...           2.0              NaN  ABS-CBDO     NaN
# 2                Hired, Inc., Series C Preferred Stock  NaN              Hired, Inc., Series C Preferred Stock        NaN  ...           3.0              NaN        EP     NaN
# 3                      WESTVIEW CAPITAL PARTNERS II LP  NaN                    WESTVIEW CAPITAL PARTNERS II LP  999999999  ...           NaN              NaN      None     NaN
# 4                       VOYAGER CAPITAL FUND III, L.P.  NaN                     VOYAGER CAPITAL FUND III, L.P.  999999999  ...           NaN              NaN      None     NaN
# ..                                                 ...  ...                                                ...        ...  ...           ...              ...       ...     ...
# 158              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  NaN              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  999999999  ...           NaN              NaN      None     NaN
# 159                       ALLOY MERCHANT PARTNERS L.P.  NaN                       ALLOY MERCHANT PARTNERS L.P.  999999999  ...           NaN              NaN      None     NaN
# 160  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  NaN  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  999999999  ...           NaN              NaN      None     NaN
# 161                   ABRY ADVANCED SECURITIES FUND LP  NaN                   ABRY ADVANCED SECURITIES FUND LP  999999999  ...           NaN              NaN      None     NaN
# 162  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  NaN  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  999999999  ...           NaN              NaN      None     NaN

# [163 rows x 20 columns]


url = "https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec", namespaces={"edgar": "http://www.sec.gov/edgar/nport"})

print(df)
#                                        name  lei                                     title      cusip  ...  invCountry  isRestrictedSec fairValLevel securityLending
# 0  Salient Private Access Master Fund, L.P.  NaN  Salient Private Access Master Fund, L.P.  999999999  ...          US                Y          NaN             NaN

# [1 rows x 18 columns]

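Fortunately, read_xml supports XSLT (a special-purpose language designed to transform XML documents) with the default lxml parser. With XSLT you can flatten the nodes you need for the migration and retrieve 32 columns.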

xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                                       xmlns:edgar="http://www.sec.gov/edgar/nport">
    <xsl:output method="xml" indent="yes" />
    <xsl:strip-space elements="*"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="edgar:invstOrSec">
        <xsl:copy>
            <xsl:apply-templates select="*|*/*"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>
"""

url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec", namespaces={"edgar": "http://www.sec.gov/edgar/nport"}, stylesheet=xsl)
print(df)
#                                                   name  lei                                              title      cusip  ...  annualizedRt  isDefault  areIntrstPmntsInArrs  isPaidKind
# 0                                       Tastemade Inc.  NaN                                     Tastemade Inc.  999999999  ...           NaN       None                  None        None
# 1    Regatta XV Funding Ltd., Subordinated Note, PR...  NaN  Regatta XV Funding Ltd., Subordinated Note, PR...  75888PAC7  ...        0.0624          N                     N           N
# 2                Hired, Inc., Series C Preferred Stock  NaN              Hired, Inc., Series C Preferred Stock        NaN  ...           NaN       None                  None        None
# 3                      WESTVIEW CAPITAL PARTNERS II LP  NaN                    WESTVIEW CAPITAL PARTNERS II LP  999999999  ...           NaN       None                  None        None
# 4                       VOYAGER CAPITAL FUND III, L.P.  NaN                     VOYAGER CAPITAL FUND III, L.P.  999999999  ...           NaN       None                  None        None
# ..                                                 ...  ...                                                ...        ...  ...           ...        ...                   ...         ...
# 158              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  NaN              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  999999999  ...           NaN       None                  None        None
# 159                       ALLOY MERCHANT PARTNERS L.P.  NaN                       ALLOY MERCHANT PARTNERS L.P.  999999999  ...           NaN       None                  None        None
# 160  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  NaN  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  999999999  ...           NaN       None                  None        None
# 161                   ABRY ADVANCED SECURITIES FUND LP  NaN                   ABRY ADVANCED SECURITIES FUND LP  999999999  ...           NaN       None                  None        None
# 162  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  NaN  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  999999999  ...           NaN       None                  None        None

# [163 rows x 32 columns]

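Pandas < 1.3

Achieving the same result with the XPath approach takes more steps, in which you must handle the URL request and the XML parsing yourself to build the DataFrame. Specifically, create a list of dictionaries from the transformed, parsed XML and pass it to the DataFrame constructor. The code below uses the same XSLT and XPath namespace as above.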

import lxml.etree as lx
import pandas as pd
import urllib.request as rq

url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"

xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                                       xmlns:edgar="http://www.sec.gov/edgar/nport">
    <xsl:output method="xml" indent="yes" />
    <xsl:strip-space elements="*"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="edgar:invstOrSec">
        <xsl:copy>
            <xsl:apply-templates select="*|*/*"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>
"""

content = rq.urlopen(url)

# LOAD XML AND XSL
doc = lx.fromstring(content.read())
style = lx.fromstring(xsl)

# INITIALIZE AND TRANSFORM ORIGINAL DOC
transformer = lx.XSLT(style)
result = transformer(doc)

# RUN XPATH PARSING ON FLATTER XML
# (node.tag looks like "{namespace}name"; keep only the part after "}")
data = [{node.tag.split('}')[1]: node.text for node in inv.xpath("*")
        } for inv in result.xpath("//edgar:invstOrSec", namespaces={"edgar": "http://www.sec.gov/edgar/nport"})]

# BIND DATA FOR DATA FRAME
df = pd.DataFrame(data)

print(df)
#                                                   name  lei                                              title  ... isDefault areIntrstPmntsInArrs  isPaidKind
# 0                                       Tastemade Inc.  N/A                                     Tastemade Inc.  ...       NaN                  NaN         NaN
# 1    Regatta XV Funding Ltd., Subordinated Note, PR...  N/A  Regatta XV Funding Ltd., Subordinated Note, PR...  ...         N                    N           N
# 2                Hired, Inc., Series C Preferred Stock  N/A              Hired, Inc., Series C Preferred Stock  ...       NaN                  NaN         NaN
# 3                      WESTVIEW CAPITAL PARTNERS II LP  N/A                    WESTVIEW CAPITAL PARTNERS II LP  ...       NaN                  NaN         NaN
# 4                       VOYAGER CAPITAL FUND III, L.P.  N/A                     VOYAGER CAPITAL FUND III, L.P.  ...       NaN                  NaN         NaN
# ..                                                 ...  ...                                                ...  ...       ...                  ...         ...
# 158              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  N/A              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  ...       NaN                  NaN         NaN
# 159                       ALLOY MERCHANT PARTNERS L.P.  N/A                       ALLOY MERCHANT PARTNERS L.P.  ...       NaN                  NaN         NaN
# 160  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  N/A  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  ...       NaN                  NaN         NaN
# 161                   ABRY ADVANCED SECURITIES FUND LP  N/A                   ABRY ADVANCED SECURITIES FUND LP  ...       NaN                  NaN         NaN
# 162  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  N/A  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  ...       NaN                  NaN         NaN

# [163 rows x 32 columns]
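If lxml is not available, the list-of-dictionaries step can be approximated with the standard library. Note that this is a swapped-in technique, not the XSLT approach above: it lifts grandchildren up one level in plain Python and skips the container elements themselves, so the resulting columns may differ slightly. The XML is a made-up miniature, and local_name and flatten are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

NS = {"edgar": "http://www.sec.gov/edgar/nport"}

# Miniature document: one record with a nested element, one without.
xml_text = (
    '<edgarSubmission xmlns="http://www.sec.gov/edgar/nport"><invstOrSecs>'
    "<invstOrSec><name>A</name>"
    "<debtSec><isDefault>N</isDefault></debtSec></invstOrSec>"
    "<invstOrSec><name>B</name></invstOrSec>"
    "</invstOrSecs></edgarSubmission>"
)


def local_name(tag):
    """Strip the '{namespace}' part that ElementTree puts before tag names."""
    return tag.rsplit("}", 1)[-1]


def flatten(record):
    """Map child tags to text, lifting grandchildren up one level."""
    row = {}
    for child in record:
        grandchildren = list(child)
        if grandchildren:
            for node in grandchildren:
                row[local_name(node.tag)] = node.text
        else:
            row[local_name(child.tag)] = child.text
    return row


root = ET.fromstring(xml_text)
data = [flatten(rec) for rec in root.findall(".//edgar:invstOrSec", NS)]
print(data)
```

Passing data to pd.DataFrame then gives one row per record, with NaN wherever a record lacks a field.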
