===============================================================================
Parsing
===============================================================================
Taxonomy Parsing
^^^^^^^^^^^^^^^^
`Arelle `__ is a popular open source library for working
with XBRL data, and it is used for parsing the taxonomy. Arelle can parse a taxonomy
from either a URL to the taxonomy entry point, or a local path to a zipfile containing
the entire taxonomy.

The extractor first takes a parsed :term:`Taxonomy` and constructs a taxonomy
object defined in :mod:`ferc_xbrl_extractor.taxonomy`. Because the data structures
used by Arelle are not well documented, they are immediately translated into
these custom data structures. After creating these structures, the extractor
will generate a `frictionless tabular datapackage `__,
which contains a schema for the new SQLite DB, as well as useful metadata.

Each table in the SQLite DB is derived from a :term:`Link Role` in the Taxonomy.
The extractor traverses the :term:`Concept` tree that is rooted at the Link Role
and finds all Concepts which are leaf nodes of this tree. These
leaf nodes are Concepts in the FERC taxonomy that are expected to have
:term:`Facts ` reported against them, while Concepts higher in the tree
structure exist for defining relationships. The leaf nodes are grouped by their
:term:`Period` type (either duration or instant), and a table schema is
generated for each of these Period types. The following diagram demonstrates this
process.

.. figure:: _static/concept_tree.png
   :width: 600
   :name: concept-tree

   Example Concept tree from FERC Form 1

The Link Role shown here, ``204 – Schedule Electric Plant in Service``, is turned into
two tables in the resulting database, ``electric_plant_in_service_204_instant`` and
``electric_plant_in_service_204_duration``. The Concepts in the box at the bottom
will become the columns of these tables, and they will be split based on their Period
type.
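
The traversal described above can be sketched roughly as follows. This is a
minimal illustration under stated assumptions, not the extractor's actual code:
the ``Concept`` class, its fields, the ``leaf_concepts_by_period`` helper, and the
concept names in the example tree are all hypothetical.

.. code-block:: python

   from __future__ import annotations

   from collections import defaultdict
   from dataclasses import dataclass, field


   @dataclass
   class Concept:
       """Hypothetical stand-in for a node in a Link Role's Concept tree."""

       name: str
       period_type: str  # "instant" or "duration"
       children: list[Concept] = field(default_factory=list)


   def leaf_concepts_by_period(root: Concept) -> dict[str, list[str]]:
       """Collect the names of leaf Concepts under ``root``, grouped by Period type."""
       columns: dict[str, list[str]] = defaultdict(list)
       stack = [root]
       while stack:
           node = stack.pop()
           if node.children:
               # Reverse so that children are visited in their original order.
               stack.extend(reversed(node.children))
           else:
               # Only leaf nodes become columns; inner nodes just define structure.
               columns[node.period_type].append(node.name)
       return dict(columns)


   # Invented tree: one instant leaf and two duration leaves.
   tree = Concept(
       "electric_plant_in_service_abstract",
       "duration",
       [
           Concept("plant_balance", "instant"),
           Concept(
               "plant_changes_abstract",
               "duration",
               [Concept("additions", "duration"), Concept("retirements", "duration")],
           ),
       ],
   )
   columns = leaf_concepts_by_period(tree)

Each key of ``columns`` would correspond to one output table (instant or duration),
and each list to the data columns of that table.
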

There will also be columns added to these tables that correspond to the
:term:`Context`. Duration tables will always have ``start_date``,
``end_date``, and ``entity_id`` columns. Instant tables will only have ``date``
and ``entity_id`` columns. There will also be columns added for any :term:`Axes `
that are defined in the taxonomy. This set of columns created from the Context is
used as the primary key for the table. In the example above, there are no Axes
defined for the Link Role ``204 – Schedule Electric Plant in Service``, so the
primary keys for the two tables will only contain these date and entity columns.
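
As a rough sketch of this rule, the primary key columns could be derived from the
Period type and the list of Axes like this. The function name
``primary_key_columns`` and the axis name ``utility_type_axis`` are hypothetical
and used only for illustration.

.. code-block:: python

   def primary_key_columns(period_type: str, axes: list[str]) -> list[str]:
       """Return the Context-derived columns that form a table's primary key."""
       if period_type == "duration":
           date_columns = ["start_date", "end_date"]
       elif period_type == "instant":
           date_columns = ["date"]
       else:
           raise ValueError(f"Unknown period type: {period_type!r}")
       # One additional column per Axis defined for the Link Role.
       return [*date_columns, "entity_id", *axes]


   # No Axes, as in the Electric Plant in Service example.
   instant_key = primary_key_columns("instant", [])
   duration_key = primary_key_columns("duration", [])
   # A hypothetical Link Role with one Axis.
   key_with_axis = primary_key_columns("duration", ["utility_type_axis"])
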

Filing Parsing
^^^^^^^^^^^^^^
While Arelle is helpful for parsing taxonomies, it has proven too slow for parsing
large sets of :term:`Filings `. So, we've used the
`lxml `__ library to directly parse filings ourselves. This is not
too difficult, as individual XBRL instances have a fairly simple XML structure
that is easy to parse. These files contain a list of contexts and a list of facts,
which we read into custom data structures.
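
To give a feel for how simple this structure is, here is a minimal sketch of
reading contexts and facts from a tiny instance document. It is not the
extractor's implementation: the ``parse_instance`` helper, the dictionary layout,
the ``ferc`` namespace URI, and the sample data are all invented for the example,
and Axes (dimensions) in the contexts are ignored. The fallback import is only
there so the sketch also runs where lxml is not installed.

.. code-block:: python

   try:
       from lxml import etree
   except ImportError:  # The standard library supports the API subset used here.
       import xml.etree.ElementTree as etree

   XBRLI = "http://www.xbrl.org/2003/instance"
   NS = {"xbrli": XBRLI}

   # Tiny invented instance; the "ferc" namespace URI is a placeholder.
   INSTANCE = b"""<?xml version="1.0" encoding="UTF-8"?>
   <xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
               xmlns:ferc="http://example.com/ferc">
     <xbrli:context id="c1">
       <xbrli:entity>
         <xbrli:identifier scheme="http://example.com/ids">E0001</xbrli:identifier>
       </xbrli:entity>
       <xbrli:period>
         <xbrli:instant>2021-12-31</xbrli:instant>
       </xbrli:period>
     </xbrli:context>
     <ferc:PlantBalance contextRef="c1">1000</ferc:PlantBalance>
   </xbrli:xbrl>"""


   def parse_instance(data):
       """Return (contexts keyed by id, list of facts) from raw instance bytes."""
       root = etree.fromstring(data)
       contexts = {}
       for ctx in root.iter(f"{{{XBRLI}}}context"):
           contexts[ctx.get("id")] = {
               "entity_id": ctx.findtext("xbrli:entity/xbrli:identifier", namespaces=NS),
               "instant": ctx.findtext("xbrli:period/xbrli:instant", namespaces=NS),
               "start_date": ctx.findtext("xbrli:period/xbrli:startDate", namespaces=NS),
               "end_date": ctx.findtext("xbrli:period/xbrli:endDate", namespaces=NS),
           }
       facts = []
       for element in root:
           context_ref = element.get("contextRef")
           if context_ref is None:
               continue  # Not a fact (for example, a context element).
           facts.append(
               {
                   # Drop the "{namespace}" prefix to keep only the local name.
                   "name": element.tag.rpartition("}")[2],
                   "context_ref": context_ref,
                   "value": element.text,
               }
           )
       return contexts, facts


   contexts, facts = parse_instance(INSTANCE)
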

The first step in parsing a filing is to read all of its facts and save them in a
dictionary that is indexed by the Axes in each fact's Context. Next, the extractor
loops through the tables created during the taxonomy parsing and finds which facts
should end up in each table. This is done by looking up the facts whose Axes
match those in the table's primary key. Each fact that meets this condition and whose
name matches one of the columns in the table is added to the table. Rows are
then created by grouping all facts with the same primary key.
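
The lookup step can be sketched as follows. This is only an illustration of the
idea, assuming a simplified representation in which each fact is a
``(name, axes, value)`` tuple; the names ``facts_by_axes`` and
``facts_for_table``, and the sample facts, are hypothetical.

.. code-block:: python

   from collections import defaultdict

   # Hypothetical, simplified facts: (name, Axes in the fact's Context, value).
   facts = [
       ("additions", (), 10),
       ("retirements", (), 4),
       ("plant_balance", ("utility_type_axis",), 99),
   ]

   # Index facts by the set of Axes in their Context.
   facts_by_axes = defaultdict(list)
   for name, axes, value in facts:
       # A frozenset is hashable and does not depend on the order of the Axes.
       facts_by_axes[frozenset(axes)].append((name, value))


   def facts_for_table(table_axes, table_columns):
       """Look up facts by the table's Axes, then keep those naming a table column."""
       candidates = facts_by_axes.get(frozenset(table_axes), [])
       return [(name, value) for name, value in candidates if name in table_columns]


   # A table with no Axes and a single data column.
   selected = facts_for_table([], {"additions"})

Indexing by Axes up front means each table needs a single dictionary lookup rather
than a scan over every fact in the filing.
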

Going back to the example above, because there are no Axes defined for the Link Role,
the extractor would look for all Facts whose Context does not have any Axes. Next, it
would filter that list of Facts to only those whose name matches one of the columns
expected for the two output tables. Finally, it would group Facts with identical
Contexts (i.e. they have the exact same dates and entity IDs) into rows, which are
added to the table.
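
The final grouping step might look something like the sketch below for a duration
table. The fact tuples, column names, and values are invented for the example, and
the representation of a Context as an ``(entity_id, start_date, end_date)`` tuple is
an assumption made to keep the code short.

.. code-block:: python

   from collections import defaultdict

   # Hypothetical facts without Axes: (context key, concept name, value),
   # where the context key is (entity_id, start_date, end_date).
   facts = [
       (("E0001", "2021-01-01", "2021-12-31"), "additions", 10),
       (("E0001", "2021-01-01", "2021-12-31"), "retirements", 4),
       (("E0002", "2021-01-01", "2021-12-31"), "additions", 7),
       (("E0001", "2021-01-01", "2021-12-31"), "unrelated_concept", 1),
   ]
   columns = ["additions", "retirements"]
   key_columns = ("entity_id", "start_date", "end_date")

   # Group facts that share an identical Context, keeping only known columns.
   rows = defaultdict(dict)
   for context_key, name, value in facts:
       if name in columns:
           rows[context_key][name] = value

   # Build one row per Context; columns with no reported fact are left as None.
   table = []
   for context_key, values in rows.items():
       row = dict(zip(key_columns, context_key))
       row.update({column: values.get(column) for column in columns})
       table.append(row)

Here the two facts reported by ``E0001`` for the same period end up in one row,
while the fact from ``E0002`` forms a second row with no value for ``retirements``.
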