extended_data.io.files¶
File Data Type Utilities.
Module Contents¶
Classes¶
DataFile | Decoded file or URL data with source metadata and export helpers.
Functions¶
_resolve_data_file_encoding | Return the normalized encoding used by a DataFile artifact.
_safe_data_file_source | Return a source label safe for metadata and workflow steps.
_data_file_metadata | Return promoted artifact metadata for workflow and downstream handoff.
_github_auth_header_env | Return Git environment config for GitHub token auth without URL credentials.
get_parent_repository | Retrieves the Git repository object for a given path.
get_repository_name | Retrieves the name of the Git repository.
clone_repository_to_temp | Clones a Git repository to a temporary directory for file operations.
get_tld | Retrieves the top-level directory of a Git repository.
match_file_extensions | Matches the file extension of a given path against allowed or denied extensions.
get_encoding_for_file_path | Determines the encoding type based on the file extension.
file_path_depth | Calculates the depth of a given file path (the number of directories in the path).
file_path_rel_to_root | Constructs a relative path to the root directory from the given file path.
resolve_local_path | Resolves a file path relative to a top-level directory.
is_url | Check if a string is a valid and safe URL.
read_file | Reads a file from a local path or URL.
decode_file | Decodes file data based on file extension or explicit suffix.
read_data_file | Read and decode a local file or URL through the Tier 3 data boundary.
write_file | Writes data to a file with automatic format encoding.
delete_file | Deletes a file at the given path.
Data¶
FilePath | Type alias for file paths that can be represented as strings or os.PathLike objects.
API¶
- extended_data.io.files.FilePath: TypeAlias = None¶
Type alias for file paths that can be represented as strings or os.PathLike objects.
- class extended_data.io.files.DataFile¶
Decoded file or URL data with source metadata and export helpers.
- source: extended_data.containers.ExtendedString = None¶
- encoding: extended_data.containers.ExtendedString = None¶
- path: pathlib.Path | None = None¶
- metadata: extended_data.containers.ExtendedDict = 'field(...)'¶
- classmethod decode(file_data: str | memoryview | bytes | bytearray, *, file_path: extended_data.io.files.FilePath | None = None, suffix: str | None = None, as_extended: bool = True, metadata: collections.abc.Mapping[str, Any] | None = None) extended_data.io.files.DataFile¶
Decode in-memory data into a first-class data file artifact.
- classmethod read(file_path: extended_data.io.files.FilePath, *, suffix: str | None = None, as_extended: bool = True, charset: str = 'utf-8', errors: str = 'strict', headers: collections.abc.Mapping[str, str] | None = None, tld: pathlib.Path | None = None) extended_data.io.files.DataFile¶
Read and decode a local file or URL into a first-class data artifact.
- to_export_safe(*, export_to_yaml: bool = False) Any¶
Return the artifact data converted to export-safe primitive values.
- wrap_for_export(allow_encoding: bool | str = True, **format_opts: Any) str¶
Return the artifact data wrapped as an encoded export string.
- workflow(*, as_extended: bool = True) extended_data.workflows.DataWorkflow¶
Start a DataWorkflow from this artifact’s decoded data.
- write(file_path: extended_data.io.files.FilePath | None = None, *, encoding: str | None = None, charset: str = 'utf-8', allow_empty: bool = False, tld: pathlib.Path | None = None) extended_data.io.files.DataFile¶
Write artifact data and return a new artifact for the output path.
- extended_data.io.files._resolve_data_file_encoding(*, file_path: extended_data.io.files.FilePath | None = None, suffix: str | None = None) str¶
Return the normalized encoding used by a DataFile artifact.
- extended_data.io.files._safe_data_file_source(source: str) str¶
Return a source label safe for metadata and workflow steps.
- extended_data.io.files._data_file_metadata(*, source: str, encoding: str, path: pathlib.Path | None, data: Any, extra: collections.abc.Mapping[str, Any] | None = None) extended_data.containers.ExtendedDict¶
Return promoted artifact metadata for workflow and downstream handoff.
- extended_data.io.files._github_auth_header_env(github_token: str) dict[str, str]¶
Return Git environment config for GitHub token auth without URL credentials.
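One common way to authenticate Git over HTTPS without embedding a token in the clone URL is to inject an Authorization header through Git's environment-based config. The sketch below illustrates that general technique only; the function name, the chosen config key, and the returned variables are assumptions and may not match what `_github_auth_header_env` actually returns.

```python
import base64


def github_auth_header_env(github_token: str) -> dict[str, str]:
    """Hypothetical helper: build env vars that add an auth header to Git HTTP requests."""
    # GitHub accepts HTTP basic auth with "x-access-token" as the user name.
    credentials = base64.b64encode(f"x-access-token:{github_token}".encode()).decode()
    return {
        # GIT_CONFIG_COUNT / GIT_CONFIG_KEY_n / GIT_CONFIG_VALUE_n set config
        # for a single process without writing files (requires Git 2.31 or newer).
        "GIT_CONFIG_COUNT": "1",
        "GIT_CONFIG_KEY_0": "http.https://github.com/.extraheader",
        "GIT_CONFIG_VALUE_0": f"Authorization: Basic {credentials}",
        # Fail instead of prompting interactively if authentication is rejected.
        "GIT_TERMINAL_PROMPT": "0",
    }


env = github_auth_header_env("example-token")
```

Because the token travels in the process environment rather than the remote URL, it does not end up in `.git/config` or in the remote name printed by Git.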
- extended_data.io.files.get_parent_repository(file_path: extended_data.io.files.FilePath | None = None, search_parent_directories: bool = True) git.Repo | None¶
Retrieves the Git repository object for a given path.
Args: file_path (FilePath | None): The path to a file or directory within the repository. If None, defaults to the current working directory. search_parent_directories (bool): Whether to search parent directories for the Git repository. Defaults to True.
Returns: Repo | None: The Git repository object if found, otherwise None if the path is not a Git repository.
- extended_data.io.files.get_repository_name(repo: git.Repo) str | None¶
Retrieves the name of the Git repository.
Args: repo (Repo): The Git repository object.
Returns: str | None: The name of the repository if found, otherwise None.
- extended_data.io.files.clone_repository_to_temp(repo_owner: str, repo_name: str, github_token: str, branch: str | None = None) tuple[pathlib.Path, git.Repo]¶
Clones a Git repository to a temporary directory for file operations.
Args: repo_owner (str): The owner of the GitHub repository. repo_name (str): The name of the GitHub repository to clone. github_token (str): The GitHub token to access the repository. branch (str | None): The branch to clone. If None, the default branch is cloned.
Returns: tuple[Path, Repo]: The path to the cloned repository’s top-level directory and the Repo object.
Raises: EnvironmentError: If errors occur while trying to clone a Git repository.
- extended_data.io.files.get_tld(file_path: extended_data.io.files.FilePath | None = None, search_parent_directories: bool = True) pathlib.Path | None¶
Retrieves the top-level directory of a Git repository.
Args: file_path (FilePath | None): The path to a file or directory within the repository. If None, defaults to the current working directory. search_parent_directories (bool): Whether to search parent directories for the Git repository. Defaults to True.
Returns: Path | None: The resolved top-level directory of the Git repository if found, otherwise None if the path is not a Git repository.
- extended_data.io.files.match_file_extensions(p: extended_data.io.files.FilePath, allowed_extensions: list[str] | None = None, denied_extensions: list[str] | None = None) bool¶
Matches the file extension of a given path against allowed or denied extensions.
Args: p (FilePath): The path of the file to check. allowed_extensions (list[str] | None): List of allowed file extensions (without leading dot). denied_extensions (list[str] | None): List of denied file extensions (without leading dot).
Returns: bool: True if the file’s extension is allowed and not denied, otherwise False.
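The allow/deny rule described above can be sketched with the standard library alone. This is an illustration, not the library's implementation: the name `match_extensions`, the case-insensitive comparison, and the treatment of `None` as "no restriction" are assumptions.

```python
from pathlib import Path


def match_extensions(path, allowed=None, denied=None) -> bool:
    """Hypothetical sketch of allow/deny matching on a file extension."""
    # Compare on the final suffix without its leading dot (assumed case-insensitive).
    ext = Path(path).suffix.lstrip(".").lower()
    if allowed is not None and ext not in {e.lower() for e in allowed}:
        return False
    if denied is not None and ext in {e.lower() for e in denied}:
        return False
    return True
```

With this rule a path passes only if it survives both filters, so a deny entry overrides an allow entry for the same extension.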
- extended_data.io.files.get_encoding_for_file_path(file_path: extended_data.io.files.FilePath) str¶
Determines the encoding type based on the file extension.
Args: file_path (FilePath): The path of the file to check.
Returns: str: The encoding type as a string (e.g., “yaml”, “json”, “hcl”, “toml”, or “raw”).
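A suffix-to-encoding lookup of this kind can be sketched as a small table with a "raw" fallback. The table below is hypothetical: the set of extensions the library actually recognizes is not stated in this page and may differ.

```python
from pathlib import Path

# Hypothetical suffix table; only the encoding names come from the documentation above.
SUFFIX_TO_ENCODING = {
    ".yaml": "yaml",
    ".yml": "yaml",
    ".json": "json",
    ".hcl": "hcl",
    ".toml": "toml",
}


def encoding_for_path(path) -> str:
    """Return the encoding name for a path, or "raw" for unknown suffixes."""
    return SUFFIX_TO_ENCODING.get(Path(path).suffix.lower(), "raw")
```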
- extended_data.io.files.file_path_depth(file_path: extended_data.io.files.FilePath) int¶
Calculates the depth of a given file path (the number of directories in the path).
Args: file_path (FilePath): The file path to calculate depth for.
Returns: int: The depth of the file path, excluding the root.
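As a rough illustration of a depth calculation, the sketch below counts path components after dropping the root. It assumes every remaining component counts as one level, including the last one; the library may count differently, so treat the numbers as illustrative.

```python
from pathlib import PurePosixPath


def path_depth(path: str) -> int:
    """Hypothetical sketch: number of path components, excluding the root."""
    p = PurePosixPath(path)
    # For absolute paths the first element of .parts is the root "/", which is skipped.
    parts = p.parts[1:] if p.is_absolute() else p.parts
    return len(parts)
```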
- extended_data.io.files.file_path_rel_to_root(file_path: extended_data.io.files.FilePath) str¶
Constructs a relative path to the root directory from the given file path.
Args: file_path (FilePath): The file path for which to construct the relative path.
Returns: str: A string representing the relative path to the root.
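One plausible reading of "relative path to the root" is a chain of `..` segments, one per level of depth. The sketch below follows that reading; the name `rel_to_root`, the separator, and the empty-string result for a zero-depth path are assumptions.

```python
from pathlib import PurePosixPath


def rel_to_root(path: str) -> str:
    """Hypothetical sketch: one ".." segment per path component below the root."""
    p = PurePosixPath(path)
    depth = len(p.parts[1:] if p.is_absolute() else p.parts)
    return "/".join([".."] * depth)
```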
- extended_data.io.files.resolve_local_path(file_path: extended_data.io.files.FilePath, tld: pathlib.Path | None = None) pathlib.Path¶
Resolves a file path relative to a top-level directory.
An absolute path is returned in resolved form. A relative path is resolved against tld when one is provided; otherwise the function attempts to locate the Git repository root and resolves the path against that.
Args: file_path (FilePath): The path to resolve. tld (Path | None): Optional top-level directory for relative paths. If None, attempts to use the Git repository root.
Returns: Path: The resolved absolute path.
Raises: RuntimeError: If the path is relative and no tld is available.
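The three resolution branches can be sketched as follows. This is a simplified stand-in rather than the library's code: `find_repo_root` is a hypothetical callback that replaces the Git lookup and returns `None` by default.

```python
from pathlib import Path


def resolve_path(file_path, tld=None, find_repo_root=lambda: None) -> Path:
    """Hypothetical sketch of absolute / tld-relative / repo-relative resolution."""
    path = Path(file_path)
    if path.is_absolute():
        # Absolute paths ignore tld entirely.
        return path.resolve()
    # Prefer an explicit tld, then fall back to the (stand-in) repository lookup.
    root = tld if tld is not None else find_repo_root()
    if root is None:
        raise RuntimeError(f"Cannot resolve relative path {path!s} without a top-level directory")
    return (Path(root) / path).resolve()
```

Raising instead of silently resolving against the current working directory keeps the result independent of where the process happens to be started.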
- extended_data.io.files.is_url(path: str) bool¶
Check if a string is a valid and safe URL.
Uses the validators library for robust URL validation, restricted to HTTP/HTTPS schemes only.
Args: path (str): The string to check.
Returns: bool: True if the string is a valid HTTP/HTTPS URL.
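The scheme restriction can be illustrated with the standard library. Note that this sketch swaps in `urllib.parse` for the validators library named above, so it is considerably less strict; the name `looks_like_http_url` is hypothetical.

```python
from urllib.parse import urlsplit


def looks_like_http_url(value: str) -> bool:
    """Simplified check: http/https scheme plus a non-empty host."""
    try:
        parts = urlsplit(value)
        host = parts.hostname
    except ValueError:
        # urlsplit rejects some malformed inputs, such as unbalanced IPv6 brackets.
        return False
    return parts.scheme in {"http", "https"} and bool(host)
```

Restricting the scheme matters because the same reader accepts local paths; a `file://` or `ftp://` string is rejected here rather than fetched.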
- extended_data.io.files.read_file(file_path: extended_data.io.files.FilePath, decode: bool = True, return_path: bool = False, charset: str = 'utf-8', errors: str = 'strict', headers: collections.abc.Mapping[str, str] | None = None, tld: pathlib.Path | None = None) str | bytes | pathlib.Path | None¶
Reads a file from a local path or URL.
Args: file_path (FilePath): The path or URL to read from. decode (bool): Whether to decode bytes to string. Defaults to True. return_path (bool): If True, returns the resolved Path object instead of contents. charset (str): Character encoding for decoding. Defaults to “utf-8”. errors (str): Error handling for decoding. Defaults to “strict”. headers (Mapping[str, str] | None): HTTP headers for URL requests. tld (Path | None): Top-level directory for resolving relative paths.
Returns: str | bytes | Path | None: The file contents (str if decoded, bytes otherwise), the Path object if return_path=True, or None if the file doesn’t exist.
Raises: urllib.error.URLError: If the URL cannot be accessed. ValueError: If the URL scheme is not allowed (only http/https permitted).
- extended_data.io.files.decode_file(file_data: str | memoryview | bytes | bytearray, file_path: extended_data.io.files.FilePath | None = None, suffix: str | None = None, *, as_extended: bool = True) Any¶
Decodes file data based on file extension or explicit suffix.
Supports YAML, JSON, TOML, and HCL2 formats.
Args: file_data (str | memoryview | bytes | bytearray): The file contents to decode. This function does not read paths. file_path (FilePath | None): Optional file path to infer format from extension. suffix (str | None): Explicit format suffix (e.g., “yaml”, “json”, “toml”, “hcl”). Takes precedence over file_path extension. as_extended (bool): Wrap decoded values in Tier 2 Extended Data containers.
Returns: Any: The decoded data structure, or the original string if format is unknown.
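The precedence rule (explicit suffix first, then the path's extension, then fall back to the original string) can be sketched with JSON as the only supported format. The function name and the UTF-8 decoding of byte input are assumptions made for the illustration.

```python
import json
from pathlib import Path


def decode_text(file_data, file_path=None, suffix=None):
    """Hypothetical sketch of suffix-driven decoding, limited to JSON."""
    # An explicit suffix wins; otherwise use the extension of file_path, if any.
    if suffix is not None:
        fmt = suffix
    elif file_path is not None:
        fmt = Path(file_path).suffix
    else:
        fmt = ""
    fmt = fmt.lstrip(".").lower()
    if isinstance(file_data, (bytes, bytearray, memoryview)):
        file_data = bytes(file_data).decode("utf-8")
    if fmt == "json":
        return json.loads(file_data)
    # Unknown format: hand the text back unchanged.
    return file_data
```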
- extended_data.io.files.read_data_file(file_path: extended_data.io.files.FilePath, *, suffix: str | None = None, as_extended: bool = True, charset: str = 'utf-8', errors: str = 'strict', headers: collections.abc.Mapping[str, str] | None = None, tld: pathlib.Path | None = None) Any¶
Read and decode a local file or URL through the Tier 3 data boundary.
This composes read_file and decode_file for the common data-file workflow. Structured files are decoded from their suffix and promoted to Tier 2 containers by default. Missing local files fail loudly.
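The read-then-decode composition, including the loud failure on a missing file, can be sketched as below. Both helper names are hypothetical, only JSON is decoded, and `FileNotFoundError` is an assumed exception type; this page does not say which exception the library raises.

```python
import json
import tempfile
from pathlib import Path


def read_text_or_none(path, charset="utf-8"):
    """Lenient reader: returns None when the file does not exist."""
    p = Path(path)
    return p.read_text(encoding=charset) if p.is_file() else None


def read_data(path):
    """Strict composition: read, then decode; a missing file is an error."""
    text = read_text_or_none(path)
    if text is None:
        raise FileNotFoundError(f"Data file not found: {path}")
    return json.loads(text) if Path(path).suffix.lower() == ".json" else text


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "settings.json"
    target.write_text('{"debug": true}', encoding="utf-8")
    loaded = read_data(target)
```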
- extended_data.io.files.write_file(file_path: extended_data.io.files.FilePath, data: Any, encoding: str | None = None, charset: str = 'utf-8', allow_empty: bool = False, tld: pathlib.Path | None = None) pathlib.Path | None¶
Writes data to a file with automatic format encoding.
Args: file_path (FilePath): The path to write to. data (Any): The data to write. Will be encoded based on file extension or encoding param. encoding (str | None): Explicit encoding format (“yaml”, “json”, “toml”, “hcl”, “raw”). If None, inferred from file extension. charset (str): Character encoding for the file. Defaults to “utf-8”. allow_empty (bool): Whether to allow writing empty data. Defaults to False. tld (Path | None): Top-level directory for resolving relative paths.
Returns: Path | None: The path that was written to, or None if the data was empty and allow_empty is False.
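The empty-data guard and the suffix-based encoding can be sketched as follows. The helper is hypothetical and supports only JSON and raw text; the definition of "empty" (None or a zero-length string or container) is an assumption.

```python
import json
import tempfile
from pathlib import Path


def write_data(file_path, data, encoding=None, charset="utf-8", allow_empty=False):
    """Hypothetical sketch: encode by format, skip empty data unless allowed."""
    # Assumption: "empty" means None or a zero-length string/container.
    is_empty = data is None or (hasattr(data, "__len__") and len(data) == 0)
    if is_empty and not allow_empty:
        return None
    path = Path(file_path)
    # An explicit encoding wins over the file extension.
    fmt = encoding or path.suffix.lstrip(".").lower()
    text = json.dumps(data, indent=2) if fmt == "json" else str(data)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding=charset)
    return path


with tempfile.TemporaryDirectory() as tmp:
    written = write_data(Path(tmp) / "out" / "cfg.json", {"a": 1})
    round_trip = json.loads(written.read_text(encoding="utf-8"))
    skipped = write_data(Path(tmp) / "empty.json", {})
    skipped_exists = (Path(tmp) / "empty.json").exists()
```

Returning `None` for skipped writes lets a caller distinguish "nothing to write" from a successful write without catching an exception.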
- extended_data.io.files.delete_file(file_path: extended_data.io.files.FilePath, tld: pathlib.Path | None = None, missing_ok: bool = True) bool¶
Deletes a file at the given path.
Args: file_path (FilePath): The path to the file to delete. tld (Path | None): Top-level directory for resolving relative paths. missing_ok (bool): If True, return False when file doesn’t exist. If False, raise FileNotFoundError when file doesn’t exist. Defaults to True.
Returns: bool: True if the file was deleted, False if it didn’t exist (only when missing_ok=True).
Raises: FileNotFoundError: If the file doesn’t exist and missing_ok=False.
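The `missing_ok` contract maps naturally onto `Path.unlink`. The sketch below is a minimal stand-in with a hypothetical name and omits the tld handling.

```python
import tempfile
from pathlib import Path


def delete_path(file_path, missing_ok=True) -> bool:
    """Return True if a file was removed, False if it was already absent."""
    try:
        Path(file_path).unlink()
    except FileNotFoundError:
        if missing_ok:
            return False
        raise
    return True


with tempfile.TemporaryDirectory() as tmp:
    victim = Path(tmp) / "stale.json"
    victim.write_text("{}", encoding="utf-8")
    first_result = delete_path(victim)    # file existed and was removed
    second_result = delete_path(victim)   # already gone, tolerated
    try:
        delete_path(victim, missing_ok=False)
        strict_raised = False
    except FileNotFoundError:
        strict_raised = True
```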