ferc_xbrl_extractor.datapackage
===============================

.. py:module:: ferc_xbrl_extractor.datapackage

.. autoapi-nested-parse::

   Define structures for creating a datapackage descriptor.


Attributes
----------

.. autoapisummary::

   ferc_xbrl_extractor.datapackage.logger
   ferc_xbrl_extractor.datapackage.ENTITY_ID
   ferc_xbrl_extractor.datapackage.FILING_NAME
   ferc_xbrl_extractor.datapackage.PUBLICATION_TIME
   ferc_xbrl_extractor.datapackage.START_DATE
   ferc_xbrl_extractor.datapackage.END_DATE
   ferc_xbrl_extractor.datapackage.INSTANT_DATE
   ferc_xbrl_extractor.datapackage.DURATION_COLUMNS
   ferc_xbrl_extractor.datapackage.INSTANT_COLUMNS
   ferc_xbrl_extractor.datapackage.FIELD_TO_PANDAS
   ferc_xbrl_extractor.datapackage.CONVERT_DTYPES
   ferc_xbrl_extractor.datapackage.TABLE_NAME_PATTERN
   ferc_xbrl_extractor.datapackage.UPPERCASE_WORD_PATTERN


Classes
-------

.. autoapisummary::

   ferc_xbrl_extractor.datapackage.Field
   ferc_xbrl_extractor.datapackage.Schema
   ferc_xbrl_extractor.datapackage.Dialect
   ferc_xbrl_extractor.datapackage.Resource
   ferc_xbrl_extractor.datapackage.FactTable
   ferc_xbrl_extractor.datapackage.Datapackage


Functions
---------

.. autoapisummary::

   ferc_xbrl_extractor.datapackage._get_fields_from_concepts
   ferc_xbrl_extractor.datapackage._lowercase_words
   ferc_xbrl_extractor.datapackage.clean_table_names
   ferc_xbrl_extractor.datapackage.fuzzy_dedup


Module Contents
---------------

.. py:data:: logger

.. py:class:: Field(/, **data: Any)

   Bases: :py:obj:`pydantic.BaseModel`


   A generic field descriptor, as per Frictionless Data specs.

   See https://specs.frictionlessdata.io/table-schema/#field-descriptors.


   .. py:attribute:: name
      :type: str


   .. py:attribute:: title
      :type: str


   .. py:attribute:: type_
      :type: str
      :value: None


   .. py:attribute:: format_
      :type: str
      :value: None


   .. py:attribute:: description
      :type: str


   .. py:method:: from_concept(concept: ferc_xbrl_extractor.taxonomy.Concept) -> Field
      :classmethod:


      Construct a Field from an XBRL Concept.

      :param concept: XBRL Concept used to create a Field.


   .. py:method:: __hash__()

      Implement hash method to allow creating sets of Fields.


.. py:data:: ENTITY_ID

   Field representing an entity ID (Present in all tables).

.. py:data:: FILING_NAME

   Field representing the filing name (Present in all tables).

.. py:data:: PUBLICATION_TIME

   Field representing the publication time (injected into all tables).

.. py:data:: START_DATE

   Field representing start date (Present in all duration tables).

.. py:data:: END_DATE

   Field representing end date (Present in all duration tables).

.. py:data:: INSTANT_DATE

   Field representing an instant date (Present in all instant tables).

.. py:data:: DURATION_COLUMNS

   Fields common to all duration tables.

.. py:data:: INSTANT_COLUMNS

   Fields common to all instant tables.

.. py:data:: FIELD_TO_PANDAS
   :type: dict[str, str]

   Pandas data type by schema field type (Data Package ``field.type``).

.. py:data:: CONVERT_DTYPES
   :type: dict[str, collections.abc.Callable]

   Map callables to schema field type to convert parsed values (Data Package ``field.type``).

.. py:data:: TABLE_NAME_PATTERN

   Simple regex pattern used to clean up table names.

.. py:data:: UPPERCASE_WORD_PATTERN

   Regex pattern to find fully uppercase words.

   There are several tables in the FERC taxonomy that contain completely uppercase words, which make converting to snakecase difficult.

.. py:function:: _get_fields_from_concepts(concept: ferc_xbrl_extractor.taxonomy.Concept, period_type: str) -> tuple[list[Field], list[Field]]

   Traverse concept tree to get columns and axes that will be used in output table.

   A 'fact table' in XBRL arranges Concepts into a tree where the leaf nodes are individual facts that will become columns in the output tables. Axes are used to identify the context of each fact, and will become a part of the primary key in the output table.

   :param concept: The root concept of the tree.
   :param period_type: Period type of current table (only return columns with corresponding period type).

   :returns: A tuple ``(axes, columns)``. ``axes`` are the axes in the table (they become part of the primary key); ``columns`` is the list of fields in the table.
   :rtype: tuple[list[Field], list[Field]]


.. py:function:: _lowercase_words(name: str) -> str

   Convert fully uppercase words so only first letter is uppercase.

   Pattern finds uppercase characters that are immediately preceded by an uppercase character. Later, when the name is converted to snakecase, an underscore would be inserted between each of these characters if this conversion is not performed.


.. py:function:: clean_table_names(name: str) -> str | None

   Function to clean table names.

   :param name: Unprocessed table name.

   :returns: Cleaned table name, or None if the table name doesn't match the expected pattern.
   :rtype: str | None


.. py:class:: Schema(/, **data: Any)

   Bases: :py:obj:`pydantic.BaseModel`


   A generic table schema, as per Frictionless Data specs.

   See https://specs.frictionlessdata.io/table-schema/.


   .. py:attribute:: fields
      :type: list[Field]


   .. py:attribute:: primary_key
      :type: list[str]


   .. py:method:: from_concept_tree(concept: ferc_xbrl_extractor.taxonomy.Concept, period_type: str) -> Schema
      :classmethod:


      Deduce schema from concept tree.

      Traverse Concept tree to get columns that should comprise output table. Concepts with names ending in 'Axis' will become a part of the composite primary key for each table. Tables with a duration period type will also have the columns 'entity_id', 'filing_name', 'start_date', and 'end_date' in their primary key, while tables with 'instant' period type will include 'entity_id', 'filing_name', and 'date'. The remaining columns will come from leaf nodes in the concept graph.

      :param concept: Root concept of concept tree.
      :param period_type: Period type of table.


.. py:class:: Dialect(/, **data: Any)

   Bases: :py:obj:`pydantic.BaseModel`


   Dialect used for frictionless SQL resources.


   .. py:attribute:: table
      :type: str


.. py:class:: Resource(/, **data: Any)

   Bases: :py:obj:`pydantic.BaseModel`


   A generic tabular data resource, as per Frictionless Data specs.

   See https://specs.frictionlessdata.io/data-resource.


   .. py:attribute:: path
      :type: str


   .. py:attribute:: profile
      :type: str
      :value: 'tabular-data-resource'


   .. py:attribute:: name
      :type: str


   .. py:attribute:: dialect
      :type: Dialect


   .. py:attribute:: title
      :type: str


   .. py:attribute:: description
      :type: str


   .. py:attribute:: format_
      :type: str
      :value: None


   .. py:attribute:: mediatype
      :type: str
      :value: 'application/vnd.sqlite3'


   .. py:attribute:: schema_
      :type: Schema
      :value: None


   .. py:method:: from_link_role(fact_table: ferc_xbrl_extractor.taxonomy.LinkRole, period_type: str, db_uri: str) -> Union[Resource, None]
      :classmethod:


      Generate a Resource from a fact table (defined by a LinkRole).

      If the fact table is empty, i.e. there are no data columns, return None.

      :param fact_table: Link role which defines a fact table.
      :param period_type: Period type of table.
      :param db_uri: Path to database required for a Frictionless resource.


   .. py:method:: get_period_type()

      Helper function to get period type from schema.


   .. py:method:: merge_resources(other: Resource, other_version: str) -> Resource

      Merge same resource from multiple taxonomies.

      This method attempts to merge resource definitions from multiple taxonomies, creating a unified schema for the table in question. It does this by first comparing the primary keys of the two tables. If the primary keys aren't exactly the same, it will raise an error. For the remaining columns, this method will check if there are any that are new or missing in ``other``. New columns will be added to the table's schema, and missing columns will be logged, but remain in the schema.


.. py:class:: FactTable(schema: Schema, period_type: str)

   Class to handle constructing a dataframe from an XBRL fact table.

   Structure of the dataframe is defined by the XBRL taxonomy.
   Facts and contexts parsed from an individual XBRL filing are then used to populate the dataframe with relevant data.


   .. py:attribute:: schema


   .. py:attribute:: columns


   .. py:attribute:: axes


   .. py:attribute:: data_columns


   .. py:attribute:: instant


   .. py:method:: construct_dataframe(instance: ferc_xbrl_extractor.instance.Instance) -> pandas.DataFrame

      Construct dataframe from a parsed XBRL instance.

      :param instance: Parsed XBRL instance used to construct dataframe.


.. py:class:: Datapackage(/, **data: Any)

   Bases: :py:obj:`pydantic.BaseModel`


   A generic Data Package, as per Frictionless Data specs.

   See https://specs.frictionlessdata.io/data-package.


   .. py:attribute:: profile
      :type: str
      :value: 'tabular-data-package'


   .. py:attribute:: name
      :type: str


   .. py:attribute:: title
      :type: str
      :value: 'Ferc1 data extracted from XBRL filings'


   .. py:attribute:: resources
      :type: list[Resource]


   .. py:method:: from_taxonomies(taxonomies: dict[str, ferc_xbrl_extractor.taxonomy.Taxonomy], db_uri: str, form_number: int = 1) -> Datapackage
      :classmethod:


      Construct a Datapackage from parsed XBRL taxonomies.

      FERC regularly releases new versions of their XBRL taxonomies, meaning data from different years conforms to slightly different structures. This method will attempt to merge these taxonomy versions into a single unified schema defined in a Datapackage descriptor.

      The exact logic for merging taxonomies is as follows. First, the oldest available taxonomy is used to construct a baseline datapackage descriptor. Next, it will parse subsequent versions and compare the set of tables found with the baseline. New tables will be added to the schema, removed tables will simply be logged but remain in the schema, and tables present in both versions will undergo a deeper column-level comparison. For more info on the table comparison, see ``Resource.merge_resources``.

      :param taxonomies: Dictionary of taxonomies to merge into a Datapackage.
      :param db_uri: Path to database required for a Frictionless resource.
      :param form_number: FERC form number used for datapackage name.


   .. py:method:: get_fact_tables(filter_tables: set[str] | None = None) -> dict[str, FactTable]

      Use schemas defined in datapackage resources to construct FactTables.

      :param filter_tables: Optionally specify the set of tables to extract. If None, all possible tables will be extracted.


.. py:function:: fuzzy_dedup(df: pandas.DataFrame) -> pandas.DataFrame

   Deduplicate a 1-column dataframe with numbers that are close in value.

   We pick the number with the highest precision, up to a max precision of 6 digits after the decimal point.

   If we get passed duplicated str values, or non-numeric values at all, we raise a ValueError - though we can add more code here to handle specific cases if we need to.

   :param df: the dataframe to be deduplicated.
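
The selection rule described for ``fuzzy_dedup`` can be illustrated with a minimal sketch. This is not the library's implementation: it works on a plain list instead of a ``pandas.DataFrame``, and the names ``MAX_PRECISION``, ``decimal_places``, and ``pick_most_precise`` are hypothetical, introduced only for this example. It also assumes the inputs are already known to be close in value and does not check that.

.. code-block:: python

   MAX_PRECISION = 6  # maximum digits after the decimal point, per the description above


   def decimal_places(value: float) -> int:
       """Count digits after the decimal point, capped at MAX_PRECISION."""
       # Format to the maximum precision, then drop trailing zeros.
       text = f"{value:.{MAX_PRECISION}f}".rstrip("0")
       return len(text.partition(".")[2])


   def pick_most_precise(values: list[float]) -> float:
       """Return the most precise of several near-equal numbers (hypothetical helper)."""
       if not values:
           raise ValueError("Nothing to deduplicate.")
       for value in values:
           # bool is a subclass of int, so reject it explicitly along with strings.
           if isinstance(value, bool) or not isinstance(value, (int, float)):
               raise ValueError(f"Cannot deduplicate non-numeric value: {value!r}")
       # Assumption: the values are already close; only precision is compared.
       best = max(values, key=decimal_places)
       return round(best, MAX_PRECISION)

Under these assumptions, ``pick_most_precise([1.5, 1.52, 1.523])`` returns ``1.523``, while passing strings such as ``["1.5", "1.5"]`` raises ``ValueError``.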