ferc_xbrl_extractor.datapackage

Define structures for creating a datapackage descriptor.

Module Contents

Classes

Field

A generic field descriptor, as per Frictionless Data specs.

Schema

A generic table schema, as per Frictionless Data specs.

Dialect

Dialect used for frictionless SQL resources.

Resource

A generic tabular data resource, as per Frictionless Data specs.

FactTable

Class to handle constructing a dataframe from an XBRL fact table.

Datapackage

A generic Data Package, as per Frictionless Data specs.

Functions

_get_fields_from_concepts(→ tuple[list[Field], ...)

Traverse concept tree to get columns and axes that will be used in output table.

_lowercase_words(→ str)

Convert fully uppercase words so only first letter is uppercase.

clean_table_names(→ str | None)

Function to clean table names.

fuzzy_dedup(→ pandas.DataFrame)

Deduplicate a 1-column dataframe with numbers that are close in value.

Attributes

ENTITY_ID

Field representing an entity ID (Present in all tables).

FILING_NAME

Field representing the filing name (Present in all tables).

PUBLICATION_TIME

Field representing the publication time (injected into all tables).

START_DATE

Field representing start date (Present in all duration tables).

END_DATE

Field representing end date (Present in all duration tables).

INSTANT_DATE

Field representing an instant date (Present in all instant tables).

DURATION_COLUMNS

Fields common to all duration tables.

INSTANT_COLUMNS

Fields common to all instant tables.

FIELD_TO_PANDAS

Pandas data type by schema field type (Data Package field.type).

CONVERT_DTYPES

Callables, keyed by schema field type (Data Package field.type), used to convert parsed values.

TABLE_NAME_PATTERN

Simple regex pattern used to clean up table names.

UPPERCASE_WORD_PATTERN

Regex pattern to find fully uppercase words.

class ferc_xbrl_extractor.datapackage.Field(/, **data: Any)[source]

Bases: pydantic.BaseModel

A generic field descriptor, as per Frictionless Data specs.

See https://specs.frictionlessdata.io/table-schema/#field-descriptors.

name: str[source]
title: str[source]
type_: str[source]
format_: str[source]
description: str[source]
classmethod from_concept(concept: ferc_xbrl_extractor.taxonomy.Concept) Field[source]

Construct a Field from an XBRL Concept.

Parameters:

concept – XBRL Concept used to create a Field.

__hash__()[source]

Implement hash method to allow creating sets of Fields.

ferc_xbrl_extractor.datapackage.ENTITY_ID[source]

Field representing an entity ID (Present in all tables).

ferc_xbrl_extractor.datapackage.FILING_NAME[source]

Field representing the filing name (Present in all tables).

ferc_xbrl_extractor.datapackage.PUBLICATION_TIME[source]

Field representing the publication time (injected into all tables).

ferc_xbrl_extractor.datapackage.START_DATE[source]

Field representing start date (Present in all duration tables).

ferc_xbrl_extractor.datapackage.END_DATE[source]

Field representing end date (Present in all duration tables).

ferc_xbrl_extractor.datapackage.INSTANT_DATE[source]

Field representing an instant date (Present in all instant tables).

ferc_xbrl_extractor.datapackage.DURATION_COLUMNS[source]

Fields common to all duration tables.

ferc_xbrl_extractor.datapackage.INSTANT_COLUMNS[source]

Fields common to all instant tables.

ferc_xbrl_extractor.datapackage.FIELD_TO_PANDAS: dict[str, str][source]

Pandas data type by schema field type (Data Package field.type).

ferc_xbrl_extractor.datapackage.CONVERT_DTYPES: dict[str, collections.abc.Callable][source]

Callables, keyed by schema field type (Data Package field.type), used to convert parsed values.
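As a rough illustration of the idea, a converter mapping of this kind could look like the sketch below. The dictionary contents, the `parse_bool` helper, and `convert_value` are hypothetical and are not taken from the module; only the field type names (`string`, `integer`, `number`, `boolean`) come from the Table Schema spec.

```python
from collections.abc import Callable
from typing import Any


def parse_bool(value: str) -> bool:
    """Hypothetical helper: interpret the xsd:boolean lexical forms."""
    return value.strip().lower() in {"true", "1"}


# Hypothetical stand-in for CONVERT_DTYPES: one callable per field type.
CONVERTERS: dict[str, Callable[[str], Any]] = {
    "string": str,
    "integer": int,
    "number": float,
    "boolean": parse_bool,
}


def convert_value(raw: str, field_type: str) -> Any:
    """Convert a raw parsed value using the callable for its field type."""
    # Assumption: unknown field types fall back to plain strings.
    converter = CONVERTERS.get(field_type, str)
    return converter(raw)
```

Keying the callables by field type keeps the conversion step a single dictionary lookup per value.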

ferc_xbrl_extractor.datapackage.TABLE_NAME_PATTERN[source]

Simple regex pattern used to clean up table names.

ferc_xbrl_extractor.datapackage.UPPERCASE_WORD_PATTERN[source]

Regex pattern to find fully uppercase words.

Several tables in the FERC taxonomy contain completely uppercase words, which makes converting their names to snake case difficult.

ferc_xbrl_extractor.datapackage._get_fields_from_concepts(concept: ferc_xbrl_extractor.taxonomy.Concept, period_type: str) tuple[list[Field], list[Field]][source]

Traverse concept tree to get columns and axes that will be used in output table.

A ‘fact table’ in XBRL arranges Concepts into a tree where the leaf nodes are individual facts that will become columns in the output tables. Axes are used to identify the context of each fact, and will become part of the primary key in the output table.

Parameters:
  • concept – The root concept of the tree.

  • period_type – Period type of current table (only return columns with corresponding period type).

Returns:

A tuple of (axes, columns). axes: Axes in table (become part of primary key). columns: List of fields in table.

Return type:

tuple[list[Field], list[Field]]
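The traversal described above can be sketched roughly as follows. This is a simplified illustration, not the module's implementation: `Node` and `collect_fields` are hypothetical stand-ins for the real Concept and Field classes, names are returned as plain strings, and the sketch assumes axes are recognised by an 'Axis' name suffix and are not traversed further.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for an XBRL Concept in a concept tree."""

    name: str
    period_type: str = "duration"
    children: list["Node"] = field(default_factory=list)


def collect_fields(node: Node, period_type: str) -> tuple[list[str], list[str]]:
    """Walk the tree and return (axes, columns) as lists of names."""
    axes: list[str] = []
    columns: list[str] = []
    for child in node.children:
        if child.name.endswith("Axis"):
            # Axes identify context and feed the primary key.
            axes.append(child.name)
        elif not child.children:
            # Leaf nodes are facts; keep only those with a matching period type.
            if child.period_type == period_type:
                columns.append(child.name)
        else:
            # Intermediate nodes only group concepts, so recurse into them.
            sub_axes, sub_columns = collect_fields(child, period_type)
            axes.extend(sub_axes)
            columns.extend(sub_columns)
    return axes, columns
```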

ferc_xbrl_extractor.datapackage._lowercase_words(name: str) str[source]

Convert fully uppercase words so only first letter is uppercase.

The pattern finds uppercase characters that are immediately preceded by an uppercase character. If this conversion were not performed, an underscore would be inserted between each of these characters when the name is later converted to snake case.
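A minimal sketch of this kind of conversion is shown below. The regex here is an assumption written to match the description above (uppercase characters directly preceded by an uppercase character); the actual UPPERCASE_WORD_PATTERN may differ, and `lowercase_words` is a hypothetical name.

```python
import re

# Assumed pattern: a run of uppercase letters directly preceded by an
# uppercase letter. The real UPPERCASE_WORD_PATTERN may be stricter.
UPPERCASE_RUN = re.compile(r"(?<=[A-Z])[A-Z]+")


def lowercase_words(name: str) -> str:
    """Lowercase every uppercase letter that follows another uppercase letter."""
    return UPPERCASE_RUN.sub(lambda match: match.group().lower(), name)
```

With this sketch, `"RETAINED EARNINGS"` becomes `"Retained Earnings"`, while a name such as `"Schedule"` is left unchanged. Note that this simplified pattern also lowercases the first letter of a capitalized word that directly follows an uppercase run, which a production pattern may need to guard against.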

ferc_xbrl_extractor.datapackage.clean_table_names(name: str) str | None[source]

Function to clean table names.

Parameters:

name – Unprocessed table name.

Returns:

Cleaned table name, or None if the table name doesn’t match the expected pattern.

Return type:

str | None

class ferc_xbrl_extractor.datapackage.Schema(/, **data: Any)[source]

Bases: pydantic.BaseModel

A generic table schema, as per Frictionless Data specs.

See https://specs.frictionlessdata.io/table-schema/.

fields: list[Field][source]
primary_key: list[str][source]
classmethod from_concept_tree(concept: ferc_xbrl_extractor.taxonomy.Concept, period_type: str) Schema[source]

Deduce schema from concept tree.

Traverse Concept tree to get columns that should comprise output table. Concepts with names ending in ‘Axis’ will become a part of the composite primary key for each table. Tables with a duration period type will also have the columns ‘entity_id’, ‘filing_name’, ‘start_date’, and ‘end_date’ in their primary key, while tables with ‘instant’ period type will include ‘entity_id’, ‘filing_name’, and ‘date’. The remaining columns will come from leaf nodes in the concept graph.

Parameters:
  • concept – Root concept of concept tree.

  • period_type – Period type of table.
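The primary key rule described above can be illustrated with a small sketch. The `COMMON_KEYS` mapping and `build_primary_key` function are hypothetical, the ordering of key columns is an assumption, and the axis name in the usage example is made up for illustration; only the column names per period type come from the description.

```python
# Hypothetical mapping of period type to the key columns named above.
COMMON_KEYS: dict[str, list[str]] = {
    "duration": ["entity_id", "filing_name", "start_date", "end_date"],
    "instant": ["entity_id", "filing_name", "date"],
}


def build_primary_key(period_type: str, axis_names: list[str]) -> list[str]:
    """Combine the common key columns for a period type with the table's axes."""
    if period_type not in COMMON_KEYS:
        raise ValueError(f"Unknown period type: {period_type!r}")
    # Assumption: common columns come first, followed by the axes.
    return [*COMMON_KEYS[period_type], *axis_names]
```

For example, `build_primary_key("instant", ["utility_type_axis"])` would give `["entity_id", "filing_name", "date", "utility_type_axis"]` under these assumptions.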

class ferc_xbrl_extractor.datapackage.Dialect(/, **data: Any)[source]

Bases: pydantic.BaseModel

Dialect used for frictionless SQL resources.

table: str[source]
class ferc_xbrl_extractor.datapackage.Resource(/, **data: Any)[source]

Bases: pydantic.BaseModel

A generic tabular data resource, as per Frictionless Data specs.

See https://specs.frictionlessdata.io/data-resource.

path: str[source]
profile: str = 'tabular-data-resource'[source]
name: str[source]
dialect: Dialect[source]
title: str[source]
description: str[source]
format_: str[source]
mediatype: str = 'application/vnd.sqlite3'[source]
schema_: Schema[source]

Generate a Resource from a fact table (defined by a LinkRole).

Parameters:
  • fact_table – Link role which defines a fact table.

  • period_type – Period type of table.

  • db_uri – Path to database required for a Frictionless resource.

get_period_type()[source]

Helper function to get period type from schema.

class ferc_xbrl_extractor.datapackage.FactTable(schema: Schema, period_type: str)[source]

Class to handle constructing a dataframe from an XBRL fact table.

Structure of the dataframe is defined by the XBRL taxonomy. Facts and contexts parsed from an individual XBRL filing are then used to populate the dataframe with relevant data.

construct_dataframe(instance: ferc_xbrl_extractor.instance.Instance) pandas.DataFrame[source]

Construct dataframe from a parsed XBRL instance.

Parameters:

instance – Parsed XBRL instance used to construct dataframe.

class ferc_xbrl_extractor.datapackage.Datapackage(/, **data: Any)[source]

Bases: pydantic.BaseModel

A generic Data Package, as per Frictionless Data specs.

See https://specs.frictionlessdata.io/data-package.

profile: str = 'tabular-data-package'[source]
name: str[source]
title: str = 'Ferc1 data extracted from XBRL filings'[source]
resources: list[Resource][source]
classmethod from_taxonomy(taxonomy: ferc_xbrl_extractor.taxonomy.Taxonomy, db_uri: str, form_number: int = 1) Datapackage[source]

Construct a Datapackage from an XBRL Taxonomy.

Parameters:
  • taxonomy – XBRL taxonomy which defines the structure of the database.

  • db_uri – Path to database required for a Frictionless resource.

  • form_number – FERC form number used for datapackage name.

get_fact_tables(filter_tables: set[str] | None = None) dict[str, FactTable][source]

Use schemas defined in datapackage resources to construct FactTables.

Parameters:

filter_tables – Optionally specify the set of tables to extract. If None, all possible tables will be extracted.

ferc_xbrl_extractor.datapackage.fuzzy_dedup(df: pandas.DataFrame) pandas.DataFrame[source]

Deduplicate a 1-column dataframe with numbers that are close in value.

We pick the number with the highest precision, up to a max precision of 6 digits after the decimal point.

If we are passed duplicated str values, or any non-numeric values, we raise a ValueError, though we can add more code here to handle specific cases if we need to.

Parameters:

df – the dataframe to be deduplicated.
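The selection rule can be sketched in plain Python as below. This is not the function's implementation: it works on a list rather than a DataFrame, it does not check that the values are actually close to each other, and the names `MAX_PRECISION`, `decimal_places`, and `pick_most_precise` are hypothetical. It only illustrates picking the value with the most digits after the decimal point, capped at 6, and raising ValueError for strings and other non-numeric input.

```python
from decimal import Decimal, InvalidOperation

# Maximum number of digits after the decimal point that is considered.
MAX_PRECISION = 6


def decimal_places(value: object) -> int:
    """Count digits after the decimal point, capped at MAX_PRECISION."""
    try:
        exponent = Decimal(str(value)).as_tuple().exponent
    except InvalidOperation as err:
        raise ValueError(f"Non-numeric value: {value!r}") from err
    if not isinstance(exponent, int):
        # NaN and infinity have non-integer exponents.
        raise ValueError(f"Non-finite value: {value!r}")
    return min(max(-exponent, 0), MAX_PRECISION)


def pick_most_precise(values: list) -> object:
    """Return the value with the highest precision among near-duplicates."""
    if not values:
        raise ValueError("No values to deduplicate")
    if any(isinstance(value, str) for value in values):
        raise ValueError("String values are not supported")
    # On ties, max() keeps the first value with the highest precision.
    return max(values, key=decimal_places)
```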