ferc_xbrl_extractor.datapackage
===============================

.. py:module:: ferc_xbrl_extractor.datapackage

.. autoapi-nested-parse::

   Define structures for creating a datapackage descriptor.


Attributes
----------

.. autoapisummary::

   ferc_xbrl_extractor.datapackage.logger
   ferc_xbrl_extractor.datapackage.ENTITY_ID
   ferc_xbrl_extractor.datapackage.FILING_NAME
   ferc_xbrl_extractor.datapackage.PUBLICATION_TIME
   ferc_xbrl_extractor.datapackage.START_DATE
   ferc_xbrl_extractor.datapackage.END_DATE
   ferc_xbrl_extractor.datapackage.INSTANT_DATE
   ferc_xbrl_extractor.datapackage.DURATION_COLUMNS
   ferc_xbrl_extractor.datapackage.INSTANT_COLUMNS
   ferc_xbrl_extractor.datapackage.FIELD_TO_PANDAS
   ferc_xbrl_extractor.datapackage.CONVERT_DTYPES
   ferc_xbrl_extractor.datapackage.TABLE_NAME_PATTERN
   ferc_xbrl_extractor.datapackage.UPPERCASE_WORD_PATTERN


Classes
-------

.. autoapisummary::

   ferc_xbrl_extractor.datapackage.Field
   ferc_xbrl_extractor.datapackage.Schema
   ferc_xbrl_extractor.datapackage.Dialect
   ferc_xbrl_extractor.datapackage.Resource
   ferc_xbrl_extractor.datapackage.FactTable
   ferc_xbrl_extractor.datapackage.Datapackage


Functions
---------

.. autoapisummary::

   ferc_xbrl_extractor.datapackage._get_fields_from_concepts
   ferc_xbrl_extractor.datapackage._lowercase_words
   ferc_xbrl_extractor.datapackage.clean_table_names
   ferc_xbrl_extractor.datapackage.fuzzy_dedup


Module Contents
---------------

.. py:data:: logger

.. py:class:: Field(/, **data: Any)

   Bases: :py:obj:`pydantic.BaseModel`


   A generic field descriptor, as per Frictionless Data specs.

   See https://specs.frictionlessdata.io/table-schema/#field-descriptors.


   .. py:attribute:: name
      :type:  str


   .. py:attribute:: title
      :type:  str


   .. py:attribute:: type_
      :type:  str
      :value: None


   .. py:attribute:: format_
      :type:  str
      :value: None


   .. py:attribute:: description
      :type:  str


   .. py:method:: from_concept(concept: ferc_xbrl_extractor.taxonomy.Concept) -> Field
      :classmethod:


      Construct a Field from an XBRL Concept.

      :param concept: XBRL Concept used to create a Field.


   .. py:method:: __hash__()

      Implement hash method to allow creating sets of Fields.
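
In Python, a class that defines ``__eq__`` without also defining ``__hash__`` is unhashable, so an explicit ``__hash__`` is what lets instances be collected into a set. A minimal sketch with a plain class follows. ``ToyField`` is a hypothetical stand-in, and hashing on ``name`` and ``type_`` is an assumption, not necessarily what ``Field.__hash__`` does.

```python
class ToyField:
    """Hypothetical stand-in for Field; not the real class."""

    def __init__(self, name: str, type_: str) -> None:
        self.name = name
        self.type_ = type_

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, ToyField):
            return NotImplemented
        return (self.name, self.type_) == (other.name, other.type_)

    def __hash__(self) -> int:
        # Without this method, defining __eq__ alone would make instances unhashable.
        return hash((self.name, self.type_))


fields = {ToyField("entity_id", "string"), ToyField("entity_id", "string")}
unique_count = len(fields)  # equal fields collapse into a single set member
```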


.. py:data:: ENTITY_ID

   Field representing an entity ID (Present in all tables).

.. py:data:: FILING_NAME

   Field representing the filing name (Present in all tables).

.. py:data:: PUBLICATION_TIME

   Field representing the publication time (injected into all tables).

.. py:data:: START_DATE

   Field representing start date (Present in all duration tables).

.. py:data:: END_DATE

   Field representing end date (Present in all duration tables).

.. py:data:: INSTANT_DATE

   Field representing an instant date (Present in all instant tables).

.. py:data:: DURATION_COLUMNS

   Fields common to all duration tables.

.. py:data:: INSTANT_COLUMNS

   Fields common to all instant tables.

.. py:data:: FIELD_TO_PANDAS
   :type:  dict[str, str]

   Pandas data type by schema field type (Data Package ``field.type``).

.. py:data:: CONVERT_DTYPES
   :type:  dict[str, collections.abc.Callable]

   Callables used to convert parsed values, keyed by schema field type (Data Package ``field.type``).
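
A mapping of this shape lets parsing code look up a converter by a field's declared type. The sketch below is illustrative only: ``TOY_CONVERTERS`` and ``convert_value`` are hypothetical names, and the entries are assumptions rather than the actual contents of ``CONVERT_DTYPES``.

```python
from collections.abc import Callable

# Hypothetical mapping from Data Package field type to a converter callable.
TOY_CONVERTERS: dict[str, Callable] = {
    "string": str,
    "integer": int,
    "number": float,
}


def convert_value(raw: str, field_type: str):
    """Convert a raw parsed value using the converter registered for its field type."""
    return TOY_CONVERTERS[field_type](raw)


converted = convert_value("42", "integer")
```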

.. py:data:: TABLE_NAME_PATTERN

   Simple regex pattern used to clean up table names.

.. py:data:: UPPERCASE_WORD_PATTERN

   Regex pattern to find fully uppercase words.

   There are several tables in the FERC taxonomy that contain completely uppercase words,
   which make converting to snakecase difficult.

.. py:function:: _get_fields_from_concepts(concept: ferc_xbrl_extractor.taxonomy.Concept, period_type: str) -> tuple[list[Field], list[Field]]

   Traverse concept tree to get columns and axes that will be used in output table.

   A 'fact table' in XBRL arranges Concepts into a tree where the leaf nodes are
   individual facts that will become columns in the output tables. Axes are used to
   identify context of each fact, and will become a part of the primary key in the
   output table.

   :param concept: The root concept of the tree.
   :param period_type: Period type of current table (only return columns with corresponding
                       period type).

   :returns: A tuple ``(axes, columns)``. ``axes`` is the list of axes in the table
             (these become part of the primary key); ``columns`` is the list of
             fields in the table.
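
The traversal described for ``_get_fields_from_concepts`` can be sketched roughly as below. ``ToyConcept`` and ``collect_fields`` are hypothetical names; the real ``Concept`` class and the real traversal may differ. The sketch assumes that axes are recognised by a name ending in ``Axis`` and that each leaf carries its own period type.

```python
from dataclasses import dataclass, field


@dataclass
class ToyConcept:
    """Hypothetical stand-in for a taxonomy Concept."""

    name: str
    period_type: str = "duration"
    children: list["ToyConcept"] = field(default_factory=list)


def collect_fields(concept: ToyConcept, period_type: str) -> tuple[list[str], list[str]]:
    """Return (axes, columns) found by walking the tree rooted at ``concept``."""
    axes: list[str] = []
    columns: list[str] = []
    for child in concept.children:
        if child.name.endswith("Axis"):
            # Axes identify the context of a fact and join the primary key.
            axes.append(child.name)
        elif not child.children:
            # Leaf nodes are individual facts; keep those with a matching period type.
            if child.period_type == period_type:
                columns.append(child.name)
        else:
            # Interior node: recurse and merge the results.
            sub_axes, sub_columns = collect_fields(child, period_type)
            axes.extend(sub_axes)
            columns.extend(sub_columns)
    return axes, columns


root = ToyConcept(
    "ScheduleAbstract",
    children=[
        ToyConcept("UtilityTypeAxis"),
        ToyConcept(
            "LineItems",
            children=[
                ToyConcept("PlantInService", period_type="instant"),
                ToyConcept("Additions", period_type="duration"),
            ],
        ),
    ],
)
axes, columns = collect_fields(root, "duration")
```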


.. py:function:: _lowercase_words(name: str) -> str

   Convert fully uppercase words so only first letter is uppercase.

   Pattern finds uppercase characters that are immediately preceded by
   an uppercase character. Later when the name is converted to snakecase,
   an underscore would be inserted between each of these characters if this
   conversion is not performed.
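
To see why this matters for snakecase conversion, here is a small sketch. The regex is written from the description above and is an assumption, not necessarily the value of ``UPPERCASE_WORD_PATTERN``. ``lowercase_words`` and ``naive_snakecase`` are hypothetical helpers.

```python
import re

# Hypothetical pattern: a run of uppercase letters preceded by an uppercase letter.
TOY_UPPERCASE_PATTERN = re.compile(r"(?<=[A-Z])[A-Z]+")


def lowercase_words(name: str) -> str:
    """Lowercase every uppercase letter that directly follows another uppercase letter."""
    return TOY_UPPERCASE_PATTERN.sub(lambda match: match.group().lower(), name)


def naive_snakecase(name: str) -> str:
    """Insert an underscore before each uppercase letter (except the first), then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


raw = "ElectricPlantCWIP"
without_fix = naive_snakecase(raw)  # every letter of "CWIP" gets its own underscore
with_fix = naive_snakecase(lowercase_words(raw))  # "CWIP" stays together as one word
```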


.. py:function:: clean_table_names(name: str) -> str | None

   Clean up a raw table name.

   :param name: Unprocessed table name.

   :returns: Cleaned table name, or None if the table name doesn't match the expected
             pattern.
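
The match-or-``None`` behaviour of ``clean_table_names`` can be illustrated with a toy version. Everything below is hypothetical: the pattern is an invented example of a table-name layout, and the real ``TABLE_NAME_PATTERN`` and cleaning steps may differ.

```python
import re
from typing import Optional

# Hypothetical pattern: "<3 digits> - Schedule - <title>"; the real pattern may differ.
TOY_TABLE_NAME_PATTERN = re.compile(r"^\d{3} - Schedule - (.+)$")


def toy_clean_table_name(name: str) -> Optional[str]:
    """Return a snake_case table name, or None when the name does not match the pattern."""
    match = TOY_TABLE_NAME_PATTERN.match(name)
    if match is None:
        return None
    # Split the title into alphanumeric words and join them with underscores.
    words = re.findall(r"[A-Za-z0-9]+", match.group(1))
    return "_".join(word.lower() for word in words)


cleaned = toy_clean_table_name("204 - Schedule - Electric Plant In Service")
rejected = toy_clean_table_name("Cover Page")
```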


.. py:class:: Schema(/, **data: Any)

   Bases: :py:obj:`pydantic.BaseModel`


   A generic table schema, as per Frictionless Data specs.

   See https://specs.frictionlessdata.io/table-schema/.


   .. py:attribute:: fields
      :type:  list[Field]


   .. py:attribute:: primary_key
      :type:  list[str]


   .. py:method:: from_concept_tree(concept: ferc_xbrl_extractor.taxonomy.Concept, period_type: str) -> Schema
      :classmethod:


      Deduce schema from concept tree.

      Traverse Concept tree to get columns that should comprise output table.
      Concepts with names ending in 'Axis' will become a part of the composite
      primary key for each table. Tables with a duration period type will also
      have the columns 'entity_id', 'filing_name', 'start_date', and 'end_date' in
      their primary key, while tables with 'instant' period type will include
      'entity_id', 'filing_name', and 'date'. The remaining columns will come from
      leaf nodes in the concept graph.

      :param concept: Root concept of concept tree.
      :param period_type: Period type of table.
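
The primary key rules described for ``Schema.from_concept_tree`` can be summarised in a short sketch. ``primary_key_for`` is a hypothetical helper; the column names come from the description above, while their order within the key and the placement of axis columns are assumptions.

```python
def primary_key_for(period_type: str, axis_names: list[str]) -> list[str]:
    """Build the composite primary key for a table (column order is an assumption)."""
    if period_type == "duration":
        base = ["entity_id", "filing_name", "start_date", "end_date"]
    elif period_type == "instant":
        base = ["entity_id", "filing_name", "date"]
    else:
        raise ValueError(f"Unknown period type: {period_type!r}")
    # Axis columns are appended so that each context is uniquely identified.
    return base + axis_names


instant_key = primary_key_for("instant", ["utility_type_axis"])
duration_key = primary_key_for("duration", [])
```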


.. py:class:: Dialect(/, **data: Any)

   Bases: :py:obj:`pydantic.BaseModel`


   Dialect used for frictionless SQL resources.


   .. py:attribute:: table
      :type:  str


.. py:class:: Resource(/, **data: Any)

   Bases: :py:obj:`pydantic.BaseModel`


   A generic tabular data resource, as per Frictionless Data specs.

   See https://specs.frictionlessdata.io/data-resource.


   .. py:attribute:: path
      :type:  str


   .. py:attribute:: profile
      :type:  str
      :value: 'tabular-data-resource'


   .. py:attribute:: name
      :type:  str


   .. py:attribute:: dialect
      :type:  Dialect


   .. py:attribute:: title
      :type:  str


   .. py:attribute:: description
      :type:  str


   .. py:attribute:: format_
      :type:  str
      :value: None


   .. py:attribute:: mediatype
      :type:  str
      :value: 'application/vnd.sqlite3'


   .. py:attribute:: schema_
      :type:  Schema
      :value: None


   .. py:method:: from_link_role(fact_table: ferc_xbrl_extractor.taxonomy.LinkRole, period_type: str, db_uri: str) -> Union[Resource, None]
      :classmethod:


      Generate a Resource from a fact table (defined by a LinkRole).

      If the fact table is empty, i.e. there are no data columns, return None.

      :param fact_table: Link role which defines a fact table.
      :param period_type: Period type of table.
      :param db_uri: Path to database required for a Frictionless resource.


   .. py:method:: get_period_type()

      Helper function to get period type from schema.


   .. py:method:: merge_resources(other: Resource, other_version: str) -> Resource

      Merge same resource from multiple taxonomies.

      This method attempts to merge resource definitions from multiple taxonomies
      creating a unified schema for the table in question. It does this by first
      comparing the primary keys of the two tables. If the primary keys aren't
      exactly the same it will raise an error. For the remaining columns, this
      method will check if there are any that are new or missing in ``other``.
      New columns will be added to the table's schema, and missing columns will
      be logged, but remain in the schema.
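
The merge rules described for ``Resource.merge_resources`` can be sketched with plain dictionaries. This is a simplified illustration under stated assumptions: ``merge_toy_schemas`` is a hypothetical helper, schemas are reduced to lists of column names, and ``ValueError`` is an assumed error type (the description only says an error is raised).

```python
import logging

logger = logging.getLogger(__name__)


def merge_toy_schemas(base: dict, other: dict, other_version: str) -> dict:
    """Merge two toy schemas shaped like {"primary_key": [...], "columns": [...]}."""
    if base["primary_key"] != other["primary_key"]:
        raise ValueError(f"Primary keys differ in taxonomy version {other_version}")

    merged_columns = list(base["columns"])
    for column in other["columns"]:
        if column not in merged_columns:
            # New in ``other``: add it to the unified schema.
            merged_columns.append(column)

    missing = [column for column in base["columns"] if column not in other["columns"]]
    if missing:
        # Missing from ``other``: log, but keep the columns in the schema.
        logger.warning("Columns missing from version %s: %s", other_version, missing)

    return {"primary_key": list(base["primary_key"]), "columns": merged_columns}


merged = merge_toy_schemas(
    {"primary_key": ["entity_id"], "columns": ["a", "b"]},
    {"primary_key": ["entity_id"], "columns": ["b", "c"]},
    other_version="2023",
)
```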


.. py:class:: FactTable(schema: Schema, period_type: str)

   Class to handle constructing a dataframe from an XBRL fact table.

   Structure of the dataframe is defined by the XBRL taxonomy. Facts and contexts
   parsed from an individual XBRL filing are then used to populate the dataframe
   with relevant data.


   .. py:attribute:: schema


   .. py:attribute:: columns


   .. py:attribute:: axes


   .. py:attribute:: data_columns


   .. py:attribute:: instant


   .. py:method:: construct_dataframe(instance: ferc_xbrl_extractor.instance.Instance) -> pandas.DataFrame

      Construct dataframe from a parsed XBRL instance.

      :param instance: Parsed XBRL instance used to construct dataframe.


.. py:class:: Datapackage(/, **data: Any)

   Bases: :py:obj:`pydantic.BaseModel`


   A generic Data Package, as per Frictionless Data specs.

   See https://specs.frictionlessdata.io/data-package.


   .. py:attribute:: profile
      :type:  str
      :value: 'tabular-data-package'


   .. py:attribute:: name
      :type:  str


   .. py:attribute:: title
      :type:  str
      :value: 'Ferc1 data extracted from XBRL filings'


   .. py:attribute:: resources
      :type:  list[Resource]


   .. py:method:: from_taxonomies(taxonomies: dict[str, ferc_xbrl_extractor.taxonomy.Taxonomy], db_uri: str, form_number: int = 1) -> Datapackage
      :classmethod:


      Construct a Datapackage from parsed XBRL taxonomies.

      FERC regularly releases new versions of their XBRL taxonomies, meaning
      data from different years conforms to slightly different structures. This
      method will attempt to merge these taxonomy versions into a single unified
      schema defined in a Datapackage descriptor.

      The exact logic for merging taxonomies is as follows. First, the oldest
      available taxonomy is used to construct a baseline datapackage descriptor.
      Next, it will parse subsequent versions and compare the set of tables
      found with the baseline. New tables will be added to the schema, removed
      tables will simply be logged but remain in the schema, and tables present in
      both versions will undergo a deeper column-level comparison. For more info on
      the table comparison, see ``Resource.merge_resources``.

      :param taxonomies: Dictionary of parsed taxonomies to merge into a Datapackage.
      :param db_uri: Path to database required for a Frictionless resource.
      :param form_number: FERC form number used for datapackage name.


   .. py:method:: get_fact_tables(filter_tables: set[str] | None = None) -> dict[str, FactTable]

      Use schemas defined in datapackage resources to construct FactTables.

      :param filter_tables: Optionally specify the set of tables to extract.
                            If None, all possible tables will be extracted.
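
The table-level merge described for ``Datapackage.from_taxonomies`` can be sketched as follows. This is a toy model, not the real implementation: ``merge_toy_taxonomies`` is hypothetical, each taxonomy is reduced to a mapping of table name to column names, and the dictionary keys are assumed to be version strings that sort chronologically.

```python
import logging

logger = logging.getLogger(__name__)


def merge_toy_taxonomies(taxonomies: dict[str, dict[str, list[str]]]) -> dict[str, list[str]]:
    """Merge {version: {table_name: [columns]}} into one {table_name: [columns]} mapping."""
    # Assumption: version keys sort chronologically, so the first one is the oldest.
    versions = sorted(taxonomies)
    unified = {table: list(cols) for table, cols in taxonomies[versions[0]].items()}

    for version in versions[1:]:
        tables = taxonomies[version]
        for table, cols in tables.items():
            if table not in unified:
                # New table: add it to the unified schema.
                unified[table] = list(cols)
            else:
                # Table in both versions: column-level comparison, adding new columns.
                unified[table] += [col for col in cols if col not in unified[table]]
        removed = sorted(set(unified) - set(tables))
        if removed:
            # Removed tables are only logged; they remain in the unified schema.
            logger.warning("Tables not in version %s: %s", version, removed)

    return unified


unified = merge_toy_taxonomies(
    {
        "2022": {"plant": ["a"], "retired": ["x"]},
        "2023": {"plant": ["a", "b"], "fuel": ["c"]},
    }
)
```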


.. py:function:: fuzzy_dedup(df: pandas.DataFrame) -> pandas.DataFrame

   Deduplicate a 1-column dataframe with numbers that are close in value.

   We pick the number with the highest precision, up to a max precision of 6
   digits after the decimal point.

   If the dataframe contains duplicated string values, or any non-numeric values
   at all, a ValueError is raised. More code can be added here to handle specific
   cases if needed.

   :param df: the dataframe to be deduplicated.
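
The idea behind ``fuzzy_dedup`` can be sketched in pure Python on a list of values instead of a dataframe. This is a simplified illustration, not the real implementation: ``pick_most_precise`` and ``decimal_places`` are hypothetical helpers, and the definition of "close in value" used here (agreement at the coarsest precision present) is an assumption.

```python
import math
from decimal import Decimal

MAX_DECIMALS = 6  # maximum precision named in the description above


def decimal_places(value: float) -> int:
    """Count digits after the decimal point, capped at MAX_DECIMALS."""
    exponent = Decimal(str(value)).as_tuple().exponent
    return min(max(-exponent, 0), MAX_DECIMALS)


def pick_most_precise(values: list) -> float:
    """Collapse near-duplicate numbers into the single most precise one."""
    if not values:
        raise ValueError("No values to deduplicate")
    for value in values:
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not math.isfinite(value):
            raise ValueError(f"Cannot deduplicate non-numeric value: {value!r}")

    # Assumption: values count as "close" when they agree at the coarsest precision present.
    coarsest = min(decimal_places(value) for value in values)
    if len({round(value, coarsest) for value in values}) > 1:
        raise ValueError(f"Values are not close enough to deduplicate: {values}")

    # Keep the value reported with the most digits after the decimal point.
    return round(max(values, key=decimal_places), MAX_DECIMALS)


best = pick_most_precise([1.2, 1.23, 1.2346])
```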