extended_data.io.files

File Data Type Utilities.

Module Contents

Classes

DataFile

Decoded file or URL data with source metadata and export helpers.

Functions

_resolve_data_file_encoding

Return the normalized encoding used by a DataFile artifact.

_safe_data_file_source

Return a source label safe for metadata and workflow steps.

_data_file_metadata

Return promoted artifact metadata for workflow and downstream handoff.

_github_auth_header_env

Return Git environment config for GitHub token auth without URL credentials.

get_parent_repository

Retrieves the Git repository object for a given path.

get_repository_name

Retrieves the name of the Git repository.

clone_repository_to_temp

Clones a Git repository to a temporary directory for file operations.

get_tld

Retrieves the top-level directory of a Git repository.

match_file_extensions

Matches the file extension of a given path against allowed or denied extensions.

get_encoding_for_file_path

Determines the encoding type based on the file extension.

file_path_depth

Calculates the depth of a given file path (the number of directories in the path).

file_path_rel_to_root

Constructs a relative path to the root directory from the given file path.

resolve_local_path

Resolves a file path relative to a top-level directory.

is_url

Check if a string is a valid and safe URL.

read_file

Reads a file from a local path or URL.

decode_file

Decodes file data based on file extension or explicit suffix.

read_data_file

Read and decode a local file or URL through the Tier 3 data boundary.

write_file

Writes data to a file with automatic format encoding.

delete_file

Deletes a file at the given path.

Data

FilePath

Type alias for file paths that can be represented as strings or os.PathLike objects.

API

extended_data.io.files.FilePath: TypeAlias = None

Type alias for file paths that can be represented as strings or os.PathLike objects.

class extended_data.io.files.DataFile

Decoded file or URL data with source metadata and export helpers.

source: extended_data.containers.ExtendedString = None
data: Any = None
encoding: extended_data.containers.ExtendedString = None
path: pathlib.Path | None = None
metadata: extended_data.containers.ExtendedDict = 'field(...)'
classmethod decode(file_data: str | memoryview | bytes | bytearray, *, file_path: extended_data.io.files.FilePath | None = None, suffix: str | None = None, as_extended: bool = True, metadata: collections.abc.Mapping[str, Any] | None = None) extended_data.io.files.DataFile

Decode in-memory data into a first-class data file artifact.

classmethod read(file_path: extended_data.io.files.FilePath, *, suffix: str | None = None, as_extended: bool = True, charset: str = 'utf-8', errors: str = 'strict', headers: collections.abc.Mapping[str, str] | None = None, tld: pathlib.Path | None = None) extended_data.io.files.DataFile

Read and decode a local file or URL into a first-class data artifact.

as_builtin() Any

Return the artifact data lowered to built-in Python values.

as_extended() Any

Return a detached copy of artifact data promoted to Extended Data containers.

to_export_safe(*, export_to_yaml: bool = False) Any

Return the artifact data converted to export-safe primitive values.

wrap_for_export(allow_encoding: bool | str = True, **format_opts: Any) str

Return the artifact data wrapped as an encoded export string.

workflow(*, as_extended: bool = True) extended_data.workflows.DataWorkflow

Start a DataWorkflow from this artifact’s decoded data.

write(file_path: extended_data.io.files.FilePath | None = None, *, encoding: str | None = None, charset: str = 'utf-8', allow_empty: bool = False, tld: pathlib.Path | None = None) extended_data.io.files.DataFile

Write artifact data and return a new artifact for the output path.

extended_data.io.files._resolve_data_file_encoding(*, file_path: extended_data.io.files.FilePath | None = None, suffix: str | None = None) str

Return the normalized encoding used by a DataFile artifact.

extended_data.io.files._safe_data_file_source(source: str) str

Return a source label safe for metadata and workflow steps.

extended_data.io.files._data_file_metadata(*, source: str, encoding: str, path: pathlib.Path | None, data: Any, extra: collections.abc.Mapping[str, Any] | None = None) extended_data.containers.ExtendedDict

Return promoted artifact metadata for workflow and downstream handoff.

extended_data.io.files._github_auth_header_env(github_token: str) dict[str, str]

Return Git environment config for GitHub token auth without URL credentials.
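One way such an environment mapping could be built is sketched below. It assumes Git's `GIT_CONFIG_COUNT` / `GIT_CONFIG_KEY_n` / `GIT_CONFIG_VALUE_n` variables (Git 2.31 or later) and an HTTP `extraheader` scoped to github.com. The function name `github_auth_env_sketch` and the exact keys are illustrative assumptions, not the module's actual implementation.

```python
import base64


def github_auth_env_sketch(github_token: str) -> dict:
    """Hypothetical stand-in: build env vars that inject an auth header into Git."""
    # Basic credentials with the token as the password; the token never
    # appears in the remote URL, only in this process environment.
    credentials = base64.b64encode(
        f"x-access-token:{github_token}".encode("utf-8")
    ).decode("ascii")
    return {
        "GIT_CONFIG_COUNT": "1",
        # Assumed key: restrict the extra header to github.com requests.
        "GIT_CONFIG_KEY_0": "http.https://github.com/.extraheader",
        "GIT_CONFIG_VALUE_0": f"Authorization: Basic {credentials}",
    }


env = github_auth_env_sketch("example-token")
```

A mapping like this would be merged into the environment of the Git subprocess, so the clone URL itself can stay free of credentials.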

extended_data.io.files.get_parent_repository(file_path: extended_data.io.files.FilePath | None = None, search_parent_directories: bool = True) git.Repo | None

Retrieves the Git repository object for a given path.

Args:
    file_path (FilePath | None): The path to a file or directory within the repository. If None, defaults to the current working directory.
    search_parent_directories (bool): Whether to search parent directories for the Git repository. Defaults to True.

Returns: Repo | None: The Git repository object if found, otherwise None if the path is not a Git repository.

extended_data.io.files.get_repository_name(repo: git.Repo) str | None

Retrieves the name of the Git repository.

Args: repo (Repo): The Git repository object.

Returns: str | None: The name of the repository if found, otherwise None.

extended_data.io.files.clone_repository_to_temp(repo_owner: str, repo_name: str, github_token: str, branch: str | None = None) tuple[pathlib.Path, git.Repo]

Clones a Git repository to a temporary directory for file operations.

Args:
    repo_owner (str): The owner of the GitHub repository.
    repo_name (str): The name of the GitHub repository to clone.
    github_token (str): The GitHub token to access the repository.
    branch (str | None): The branch to clone. If None, the default branch is cloned.

Returns: tuple[Path, Repo]: The path to the cloned repository’s top-level directory and the Repo object.

Raises: EnvironmentError: If errors occur while trying to clone a Git repository.

extended_data.io.files.get_tld(file_path: extended_data.io.files.FilePath | None = None, search_parent_directories: bool = True) pathlib.Path | None

Retrieves the top-level directory of a Git repository.

Args:
    file_path (FilePath | None): The path to a file or directory within the repository. If None, defaults to the current working directory.
    search_parent_directories (bool): Whether to search parent directories for the Git repository. Defaults to True.

Returns: Path | None: The resolved top-level directory of the Git repository if found, otherwise None if the path is not a Git repository.

extended_data.io.files.match_file_extensions(p: extended_data.io.files.FilePath, allowed_extensions: list[str] | None = None, denied_extensions: list[str] | None = None) bool

Matches the file extension of a given path against allowed or denied extensions.

Args:
    p (FilePath): The path of the file to check.
    allowed_extensions (list[str] | None): List of allowed file extensions (without leading dot).
    denied_extensions (list[str] | None): List of denied file extensions (without leading dot).

Returns: bool: True if the file’s extension is allowed and not denied, otherwise False.
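The allow/deny logic described above can be sketched with the standard library alone. The name `match_extensions_sketch` is hypothetical. Two points are assumptions of this sketch rather than documented behavior: a missing or empty allow list is treated as "allow everything", and extensions are compared case-insensitively.

```python
from pathlib import Path
from typing import Optional


def match_extensions_sketch(
    p,
    allowed_extensions: Optional[list] = None,
    denied_extensions: Optional[list] = None,
) -> bool:
    """Hypothetical stand-in for match_file_extensions, not the library code."""
    # Extension without the leading dot; lower-casing is an assumption.
    ext = Path(p).suffix.lstrip(".").lower()
    if denied_extensions and ext in denied_extensions:
        return False
    if allowed_extensions and ext not in allowed_extensions:
        return False
    return True


yaml_allowed = match_extensions_sketch("conf/app.yaml", allowed_extensions=["yaml", "json"])
txt_allowed = match_extensions_sketch("notes.txt", allowed_extensions=["yaml", "json"])
yaml_denied = match_extensions_sketch("conf/app.yaml", denied_extensions=["yaml"])
```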

extended_data.io.files.get_encoding_for_file_path(file_path: extended_data.io.files.FilePath) str

Determines the encoding type based on the file extension.

Args: file_path (FilePath): The path of the file to check.

Returns: str: The encoding type as a string (e.g., “yaml”, “json”, “hcl”, “toml”, or “raw”).
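As an illustration of suffix-based detection, the sketch below maps a handful of extensions to the encoding names listed above and falls back to "raw". The suffix table and the name `encoding_for_path_sketch` are assumptions; the real function may recognize additional or different suffixes.

```python
from pathlib import Path

# Assumed suffix table for illustration only.
SUFFIX_TO_ENCODING = {
    ".yaml": "yaml",
    ".yml": "yaml",
    ".json": "json",
    ".hcl": "hcl",
    ".toml": "toml",
}


def encoding_for_path_sketch(file_path) -> str:
    """Hypothetical stand-in: pick an encoding name from the file suffix."""
    suffix = Path(file_path).suffix.lower()
    # Anything unrecognized is treated as raw text.
    return SUFFIX_TO_ENCODING.get(suffix, "raw")
```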

extended_data.io.files.file_path_depth(file_path: extended_data.io.files.FilePath) int

Calculates the depth of a given file path (the number of directories in the path).

Args: file_path (FilePath): The file path to calculate depth for.

Returns: int: The depth of the file path, excluding the root.
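A rough sketch of a depth calculation follows. It counts every path component except the root anchor, including the final component. Whether the library counts the last component is not stated here, so treat that choice, and the name `path_depth_sketch`, as assumptions.

```python
from pathlib import PurePosixPath


def path_depth_sketch(file_path: str) -> int:
    """Hypothetical stand-in: count path components, excluding the root."""
    p = PurePosixPath(file_path)
    # p.anchor is "/" for absolute paths and "" for relative ones.
    return len([part for part in p.parts if part != p.anchor])
```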

extended_data.io.files.file_path_rel_to_root(file_path: extended_data.io.files.FilePath) str

Constructs a relative path to the root directory from the given file path.

Args: file_path (FilePath): The file path for which to construct the relative path.

Returns: str: A string representing the relative path to the root.

extended_data.io.files.resolve_local_path(file_path: extended_data.io.files.FilePath, tld: pathlib.Path | None = None) pathlib.Path

Resolves a file path relative to a top-level directory.

If the path is absolute, it is resolved and returned without reference to tld. If the path is relative and a tld is provided, it is resolved relative to tld. If the path is relative and no tld is provided, the function attempts to use the Git repository root instead.

Args:
    file_path (FilePath): The path to resolve.
    tld (Path | None): Optional top-level directory for relative paths. If None, attempts to use the Git repository root.

Returns: Path: The resolved absolute path.

Raises: RuntimeError: If the path is relative and no tld is available.
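The resolution rules above can be sketched as follows. The sketch omits the Git repository lookup entirely, so a relative path without a tld goes straight to the error. `resolve_path_sketch` is a hypothetical name, not part of the module.

```python
from pathlib import Path
from typing import Optional


def resolve_path_sketch(file_path, tld: Optional[Path] = None) -> Path:
    """Hypothetical stand-in for resolve_local_path without the Git fallback."""
    path = Path(file_path)
    if path.is_absolute():
        return path.resolve()
    if tld is None:
        # The documented function would try the Git repository root here.
        raise RuntimeError(f"cannot resolve relative path {path} without a tld")
    return (Path(tld) / path).resolve()


base = Path.cwd()
resolved = resolve_path_sketch("conf/app.yaml", tld=base)

try:
    resolve_path_sketch("conf/app.yaml")
    error_message = ""
except RuntimeError as exc:
    error_message = str(exc)
```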

extended_data.io.files.is_url(path: str) bool

Check if a string is a valid and safe URL.

Uses the validators library for robust URL validation, restricted to HTTP/HTTPS schemes only.

Args: path (str): The string to check.

Returns: bool: True if the string is a valid HTTP/HTTPS URL.
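To illustrate the HTTP/HTTPS-only restriction, here is a simplified check built on `urllib.parse` from the standard library. It is not the validators-based check the function is documented to use and is considerably less strict; `is_http_url_sketch` is a hypothetical name.

```python
from urllib.parse import urlparse


def is_http_url_sketch(path: str) -> bool:
    """Hypothetical stand-in: accept only http/https URLs with a host part."""
    try:
        parsed = urlparse(path)
    except ValueError:
        # urlparse can reject malformed input such as bad IPv6 brackets.
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

Local paths and other schemes such as `file://` or `ftp://` are rejected, which mirrors the scheme restriction described above.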

extended_data.io.files.read_file(file_path: extended_data.io.files.FilePath, decode: bool = True, return_path: bool = False, charset: str = 'utf-8', errors: str = 'strict', headers: collections.abc.Mapping[str, str] | None = None, tld: pathlib.Path | None = None) str | bytes | pathlib.Path | None

Reads a file from a local path or URL.

Args:
    file_path (FilePath): The path or URL to read from.
    decode (bool): Whether to decode bytes to string. Defaults to True.
    return_path (bool): If True, returns the resolved Path object instead of contents.
    charset (str): Character encoding for decoding. Defaults to “utf-8”.
    errors (str): Error handling for decoding. Defaults to “strict”.
    headers (Mapping[str, str] | None): HTTP headers for URL requests.
    tld (Path | None): Top-level directory for resolving relative paths.

Returns: str | bytes | Path | None: The file contents (str if decoded, bytes otherwise), the Path object if return_path=True, or None if the file doesn’t exist.

Raises:
    urllib.error.URLError: If the URL cannot be accessed.
    ValueError: If the URL scheme is not allowed (only http/https permitted).
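The local-file half of this contract (str when decoded, bytes otherwise, None for a missing file) can be sketched as below. URL handling, `return_path`, and `tld` resolution are left out, and `read_local_sketch` is a hypothetical name rather than the module's code.

```python
import tempfile
from pathlib import Path


def read_local_sketch(file_path, decode=True, charset="utf-8", errors="strict"):
    """Hypothetical stand-in covering only local reads."""
    path = Path(file_path)
    if not path.is_file():
        # Mirrors the documented "None if the file doesn't exist".
        return None
    raw = path.read_bytes()
    return raw.decode(charset, errors) if decode else raw


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "note.txt"
    sample.write_bytes("caf\u00e9".encode("utf-8"))
    as_text = read_local_sketch(sample)
    as_bytes = read_local_sketch(sample, decode=False)
    missing = read_local_sketch(Path(tmp) / "absent.txt")
```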

extended_data.io.files.decode_file(file_data: str | memoryview | bytes | bytearray, file_path: extended_data.io.files.FilePath | None = None, suffix: str | None = None, *, as_extended: bool = True) Any

Decodes file data based on file extension or explicit suffix.

Supports YAML, JSON, TOML, and HCL2 formats.

Args:
    file_data (str | memoryview | bytes | bytearray): The file contents to decode. This function does not read paths.
    file_path (FilePath | None): Optional file path to infer format from extension.
    suffix (str | None): Explicit format suffix (e.g., “yaml”, “json”, “toml”, “hcl”). Takes precedence over file_path extension.
    as_extended (bool): Wrap decoded values in Tier 2 Extended Data containers.

Returns: Any: The decoded data structure, or the original string if format is unknown.
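The precedence rule (explicit suffix over file extension) and the fall-through for unknown formats can be illustrated with a JSON-only sketch. YAML, TOML, and HCL2 parsing and the `as_extended` promotion are deliberately omitted, UTF-8 is assumed for bytes input, and `decode_sketch` is a hypothetical name.

```python
import json
from pathlib import Path
from typing import Any, Optional


def decode_sketch(file_data, file_path: Optional[str] = None,
                  suffix: Optional[str] = None) -> Any:
    """Hypothetical stand-in: JSON only; other formats pass through as text."""
    # Normalize bytes-like input to text (UTF-8 is an assumption here).
    if isinstance(file_data, memoryview):
        file_data = file_data.tobytes()
    if isinstance(file_data, (bytes, bytearray)):
        file_data = bytes(file_data).decode("utf-8")
    # An explicit suffix wins over the extension of file_path.
    fmt = suffix or (Path(file_path).suffix if file_path else "")
    fmt = fmt.lstrip(".").lower()
    if fmt == "json":
        return json.loads(file_data)
    # Unknown format: return the original string unchanged.
    return file_data
```

Passing `suffix="json"` together with a path such as `payload.txt` still decodes as JSON in this sketch, which is the precedence the Args entry for suffix describes.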

extended_data.io.files.read_data_file(file_path: extended_data.io.files.FilePath, *, suffix: str | None = None, as_extended: bool = True, charset: str = 'utf-8', errors: str = 'strict', headers: collections.abc.Mapping[str, str] | None = None, tld: pathlib.Path | None = None) Any

Read and decode a local file or URL through the Tier 3 data boundary.

This composes read_file and decode_file for the common data-file workflow. Structured files are decoded from their suffix and promoted to Tier 2 containers by default. Missing local files fail loudly.

extended_data.io.files.write_file(file_path: extended_data.io.files.FilePath, data: Any, encoding: str | None = None, charset: str = 'utf-8', allow_empty: bool = False, tld: pathlib.Path | None = None) pathlib.Path | None

Writes data to a file with automatic format encoding.

Args:
    file_path (FilePath): The path to write to.
    data (Any): The data to write. Will be encoded based on file extension or encoding param.
    encoding (str | None): Explicit encoding format (“yaml”, “json”, “toml”, “hcl”, “raw”). If None, inferred from file extension.
    charset (str): Character encoding for the file. Defaults to “utf-8”.
    allow_empty (bool): Whether to allow writing empty data. Defaults to False.
    tld (Path | None): Top-level directory for resolving relative paths.

Returns: Path | None: The path that was written to, or None if data was empty and not allowed.
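A minimal sketch of the write contract follows: the format is inferred from the extension unless given explicitly, and empty data is skipped with a None return unless `allow_empty` is set. Only JSON and raw text are handled. The definition of "empty" (None or zero length), the creation of parent directories, and the name `write_sketch` are all assumptions of this sketch.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Optional


def write_sketch(file_path, data: Any, encoding: Optional[str] = None,
                 charset: str = "utf-8", allow_empty: bool = False) -> Optional[Path]:
    """Hypothetical stand-in for write_file handling JSON and raw text only."""
    is_empty = data is None or (hasattr(data, "__len__") and len(data) == 0)
    if is_empty and not allow_empty:
        return None
    path = Path(file_path)
    # Explicit encoding wins; otherwise fall back to the file extension.
    fmt = encoding or path.suffix.lstrip(".").lower() or "raw"
    text = json.dumps(data, indent=2) if fmt == "json" else str(data)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding=charset)
    return path


with tempfile.TemporaryDirectory() as tmp:
    written = write_sketch(Path(tmp) / "out" / "settings.json", {"debug": True})
    round_trip = json.loads(written.read_text(encoding="utf-8"))
    skipped = write_sketch(Path(tmp) / "empty.json", {})
```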

extended_data.io.files.delete_file(file_path: extended_data.io.files.FilePath, tld: pathlib.Path | None = None, missing_ok: bool = True) bool

Deletes a file at the given path.

Args:
    file_path (FilePath): The path to the file to delete.
    tld (Path | None): Top-level directory for resolving relative paths.
    missing_ok (bool): If True, return False when file doesn’t exist. If False, raise FileNotFoundError when file doesn’t exist. Defaults to True.

Returns: bool: True if the file was deleted, False if it didn’t exist (only when missing_ok=True).

Raises: FileNotFoundError: If the file doesn’t exist and missing_ok=False.
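The `missing_ok` semantics above can be sketched with `pathlib` alone. The sketch leaves out `tld` resolution, and `delete_sketch` is a hypothetical name, not the module's implementation.

```python
import tempfile
from pathlib import Path


def delete_sketch(file_path, missing_ok: bool = True) -> bool:
    """Hypothetical stand-in: True if deleted, False if absent and tolerated."""
    path = Path(file_path)
    if not path.exists():
        if missing_ok:
            return False
        raise FileNotFoundError(path)
    path.unlink()
    return True


raised = False
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "scratch.txt"
    target.write_text("temporary")
    first = delete_sketch(target)    # the file existed and was removed
    second = delete_sketch(target)   # already gone, tolerated by default
    try:
        delete_sketch(target, missing_ok=False)
    except FileNotFoundError:
        raised = True
```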