finmlkit.bar.io module

class finmlkit.bar.io.AddTimeBarH5(h5_path: str, keys: list[str] = None)

Bases: object

Utility class for building and persisting 1-second time bars from trades data stored in HDF5 format.

This class provides a streamlined workflow for converting raw trades data into structured time bars, extending HDF5 stores created by TradesData with standardized OHLCV (Open, High, Low, Close, Volume) bars at 1-second intervals. It serves as a preprocessing component for financial analysis pipelines that require consistent temporal aggregation of high-frequency trading data.

The class operates on HDF5 files with monthly trade partitions (/trades/YYYY-MM) and creates corresponding time bar partitions (/klines/YYYY-MM) with associated metadata (/klines_meta/YYYY-MM). This approach maintains the same organizational structure while adding derived datasets optimized for time-series analysis and modeling.

Workflow and Data Organization:

The class follows this processing pipeline:

Discovery: Identify available monthly trade partitions in the source HDF5 file
Loading: Use TradesData.load_trades_h5() to retrieve preprocessed trades for each month
Aggregation: Apply TimeBarKit to construct 1-second time bars with full OHLCV features
Storage: Persist bars to /klines/ hierarchy with metadata for fast access
Validation: Track processing success/failure for each monthly partition
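
To make the aggregation step concrete, here is a minimal pure-Python sketch of how trades can be bucketed into 1-second OHLCV bars. It is not the TimeBarKit implementation: the trade layout (timestamp in nanoseconds, price, amount) and the name `build_1s_bars` are assumptions made for illustration only.

```python
NS_PER_SECOND = 1_000_000_000


def build_1s_bars(trades):
    """Aggregate (timestamp_ns, price, amount) tuples into 1-second OHLCV bars.

    Trades are assumed to be sorted by timestamp. Seconds without trades
    produce no bar in this sketch.
    """
    bars = {}
    for ts_ns, price, amount in trades:
        second = ts_ns // NS_PER_SECOND  # floor to the containing second
        bar = bars.get(second)
        if bar is None:
            # The first trade of a second sets open, high, low and close.
            bars[second] = {"open": price, "high": price, "low": price,
                            "close": price, "volume": amount, "trades": 1}
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price  # last trade seen so far in this second
            bar["volume"] += amount
            bar["trades"] += 1
    return bars


# Hypothetical sample: three trades in second 1, one trade in second 3.
sample = [
    (1_000_000_000, 100.0, 2.0),
    (1_400_000_000, 101.0, 1.0),
    (1_900_000_000, 99.5, 3.0),
    (3_100_000_000, 100.5, 1.5),
]
bars = build_1s_bars(sample)
```

Second 2 has no trades, so the result contains bars for seconds 1 and 3 only.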

The resulting time bars provide a consistent temporal grid suitable for:

Technical analysis and indicator computation
Machine learning feature engineering
Risk management and portfolio analytics
Fast coarse-grained sampling of high-frequency data, which is the main use case (e.g., computing daily statistics quickly)

Considerations:

Processing is performed sequentially by month to manage memory usage for large datasets
Each month’s bars are stored as separate HDF5 tables for efficient partial loading
Metadata storage enables fast discovery without loading full datasets
Overwrite protection prevents accidental data loss during reprocessing

Important

This class assumes the source HDF5 file follows the structure created by finmlkit.bar.data_model.TradesData. The time bars are built using 1-second intervals, which provides a good balance between temporal resolution and data reduction for most financial analysis applications.

Tip

For very active trading pairs, 1-second bars may still contain significant noise. Consider further aggregation (e.g., 1-minute bars) for certain analysis types or implement alternative bar types (tick, volume, or imbalance bars) using BarBuilderBase subclasses.

Note

This class quickly builds and stores simple aggregated OHLCV bars at a fixed 1-second frequency. If you want more intra-bar features (e.g., directional, size, or footprint features), use finmlkit.bar.kit.TimeBarKit directly to build bars from trades data.

Parameters:

h5_path (str) – Path to the HDF5 file containing trades data. Must be readable and writable.
keys (list[str], optional) – Specific monthly keys to process (e.g., ["2022-01", "2022-05"]). If None, processes all available monthly partitions in the file.

Raises:

KeyError – If specified keys are not found in the source HDF5 file.
FileNotFoundError – If the HDF5 file does not exist.
PermissionError – If the file cannot be accessed for reading or writing.

Examples

Process all months in an HDF5 file:

>>> processor = AddTimeBarH5('trades_2023.h5')
>>> results = processor.process_all(overwrite=False)
>>> success_count = sum(results.values())
>>> print(f"Successfully processed {success_count}/{len(results)} months")

Process specific months with overwrite:

>>> processor = AddTimeBarH5('trades_2023.h5', keys=['2023-03', '2023-04'])
>>> for key in processor.keys:
...     success = processor.process_key(key, overwrite=True)
...     print(f"{key}: {'Success' if success else 'Failed'}")

Batch processing workflow:

>>> from finmlkit.bar.io import AddTimeBarH5
>>> processor = AddTimeBarH5('large_dataset.h5')
>>> results = processor.process_all()
>>>
>>> # Check results and identify any failures
>>> failed_keys = [k for k, success in results.items() if not success]
>>> if failed_keys:
...     print(f"Failed to process: {failed_keys}")

See also

TimeBarReader: For reading and analyzing the time bars generated by this class from HDF5 files.
finmlkit.bar.data_model.TradesData: Creates the source HDF5 files with trades data.
finmlkit.bar.data_model.TradesData.save_h5(): Saves trades data to the HDF5 format that this class extends with time bars.
finmlkit.bar.kit.TimeBarKit: The underlying time bar construction engine.
finmlkit.bar.base.BarBuilderBase: Base class for bar construction strategies.

__init__(h5_path: str, keys: list[str] = None)

Parameters:

h5_path – Path to the trades HDF5 file.
keys – Optional list of keys for which to add time bars (e.g., ["2022-01", "2022-05"]). If None, builds bars for all available months.

_check_keys(keys: list[str])

_list_keys() → list[str]

List all available keys in the HDF5 file.

Returns:

List of keys.

process_all(overwrite: bool = False) → Dict[str, bool]

Process all configured monthly partitions to build and save 1-second time bars.

Iterates through all keys (either specified during initialization or auto-discovered) and processes each month sequentially. Provides comprehensive logging and error handling to ensure robust batch processing of large datasets.

Parameters:

overwrite – Whether to overwrite existing time bar data for all partitions. Default: False.

Returns:

Dictionary mapping partition keys to processing success status (True/False). Keys are in format ‘/trades/YYYY-MM’ and values indicate whether processing completed successfully.

Note

Processing is performed sequentially to manage memory usage. For very large datasets, monitor system resources and consider processing subsets if memory constraints arise. Failed partitions can be reprocessed individually using process_key().
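
The batch behaviour described above, sequential processing with one success flag per partition, follows a common pattern that can be sketched as below. `run_sequentially`, `worker` and `demo_worker` are hypothetical names; the real method's logging and error handling may differ.

```python
import logging
from typing import Callable, Dict, Iterable

logger = logging.getLogger(__name__)


def run_sequentially(keys: Iterable[str],
                     worker: Callable[[str], bool]) -> Dict[str, bool]:
    """Call worker(key) one key at a time and record success per key."""
    results: Dict[str, bool] = {}
    for key in keys:
        try:
            results[key] = bool(worker(key))
        except Exception:
            # A failure in one partition must not abort the whole batch.
            logger.exception("processing failed for %s", key)
            results[key] = False
    return results


def demo_worker(key: str) -> bool:
    """Stand-in for per-month processing; fails for February on purpose."""
    if key.endswith("-02"):
        raise RuntimeError("simulated failure")
    return True


results = run_sequentially(["/trades/2023-01", "/trades/2023-02"], demo_worker)
failed = [k for k, ok in results.items() if not ok]
```

Catching the exception inside the loop is what lets later months run after an earlier one fails, at the cost of having to inspect the returned dictionary afterwards.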

Examples

>>> processor = AddTimeBarH5('annual_trades.h5')
>>> results = processor.process_all(overwrite=False)
>>>
>>> # Analyze results
>>> total_processed = len(results)
>>> successful = sum(results.values())
>>> print(f"Processed {successful}/{total_processed} months successfully")
>>>
>>> # Identify and retry failed months
>>> failed_months = [k for k, success in results.items() if not success]
>>> for month in failed_months:
...     processor.process_key(month, overwrite=True)  # Retry with overwrite

process_key(key: str, overwrite: bool = False) → bool

Process a single monthly partition to build and save 1-second time bars.

Loads trades data for the specified month, constructs time bars using TimeBarKit, and persists the results to the HDF5 file under the /klines/ hierarchy.

Parameters:

key – The trades key to process (format: ‘/trades/YYYY-MM’ or ‘YYYY-MM’).
overwrite – Whether to overwrite existing time bar data for this partition. Default: False.

Returns:

True if processing completed successfully, False if skipped or failed.
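
Because `key` may be given with or without the `/trades/` prefix, some normalization has to happen before lookup. The sketch below shows one way to do it, assuming a hypothetical helper `normalize_trades_key`; the library's internal handling is not shown in this documentation.

```python
import re

# Four-digit year, dash, two-digit month in the range 01..12.
_MONTH = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def normalize_trades_key(key: str) -> str:
    """Return the key in '/trades/YYYY-MM' form, accepting 'YYYY-MM' as well."""
    prefix = "/trades/"
    month = key[len(prefix):] if key.startswith(prefix) else key
    if _MONTH.fullmatch(month) is None:
        raise ValueError(f"expected 'YYYY-MM' or '/trades/YYYY-MM', got {key!r}")
    return f"{prefix}{month}"


short_form = normalize_trades_key("2023-06")
long_form = normalize_trades_key("/trades/2023-06")
```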

Note

Processing time scales with the number of trades in the month. For very active trading pairs, expect several minutes per month on typical hardware. Memory usage peaks during bar construction but is released after each month completes.

Examples

>>> processor = AddTimeBarH5('trades.h5')
>>> success = processor.process_key('/trades/2023-06', overwrite=True)
>>> if success:
...     print("Time bars created successfully")

class finmlkit.bar.io.H5Inspector(filepath: str)

Bases: object

Utility class for inspecting and analyzing HDF5 files containing trades data with comprehensive metadata access.

This class provides a complete toolkit for examining HDF5 stores created by TradesData, enabling users to explore available data, assess data quality, retrieve statistics, and identify potential issues across monthly partitioned trade datasets. It serves as an essential diagnostic tool for large-scale financial data management.

The inspector is designed to work with the HDF5 structure created by TradesData.save_h5(), where trades data is organized into monthly partitions under /trades/YYYY-MM groups, with corresponding metadata stored under /meta/YYYY-MM and integrity information under /integrity/YYYY-MM.

Key capabilities include:

Data Discovery: List all available monthly partitions and their temporal coverage
Metadata Access: Retrieve comprehensive metadata including record counts, timestamp ranges, and integrity flags
Integrity Analysis: Access detailed information about trade ID discontinuities, missing data percentages, and temporal gaps that may indicate data quality issues
Statistical Overview: Compute basic statistics for price and volume distributions across time periods
Gap Detection: Identify temporal discontinuities exceeding specified thresholds using multiprocessing
Integrity Reporting: Generate comprehensive summaries of data quality issues across entire datasets

The class leverages the metadata structure to provide fast operations without loading full datasets into memory, making it suitable for inspecting multi-terabyte trade databases. For gap analysis and integrity checks on large datasets, multiprocessing is employed to parallelize operations across monthly partitions.

Data Integrity Metrics: The inspector can identify several types of data quality issues:

Trade ID Gaps: Missing sequential trade IDs indicating potential data loss
Temporal Discontinuities: Time gaps exceeding normal market hours or trading halts
Missing Data Percentage: Quantitative measure of data completeness based on trade ID sequences
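
As an illustration of the ID-based metrics, the sketch below finds breaks in a sorted trade ID sequence and derives a missing-data percentage from them. The function names are hypothetical, and the formula (missing IDs divided by the number of IDs expected between the first and last ID) is an assumption; this documentation does not state how finmlkit computes `missing_pct`.

```python
def find_id_gaps(trade_ids):
    """Return (start_id, end_id, missing_ids) for each break in a sorted ID list."""
    gaps = []
    for prev_id, next_id in zip(trade_ids, trade_ids[1:]):
        missing = next_id - prev_id - 1
        if missing > 0:
            gaps.append((prev_id, next_id, missing))
    return gaps


def missing_pct(trade_ids):
    """Percentage of IDs absent from the span first_id..last_id (assumed formula)."""
    if not trade_ids:
        return 0.0
    expected = trade_ids[-1] - trade_ids[0] + 1
    return 100.0 * (expected - len(trade_ids)) / expected


ids = [100, 101, 102, 106, 107, 110]
gaps = find_id_gaps(ids)   # two breaks: after 102 and after 107
pct = missing_pct(ids)     # 5 of the 11 expected IDs are absent
```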

Note

This class assumes HDF5 files follow the structure created by TradesData. For files created with different schemas, some methods may not function correctly or may raise KeyError exceptions.

Note

Gap detection with multiprocessing can be memory-intensive for very large datasets. Consider adjusting the processes parameter based on available system resources and dataset sizes.

Parameters:

filepath (str) – Path to the HDF5 file containing trades data. File must be readable and follow the expected monthly partition structure.

Raises:

FileNotFoundError – If the specified HDF5 file does not exist.
PermissionError – If the file cannot be accessed due to permission restrictions.

See also

finmlkit.bar.data_model.TradesData: Creates the HDF5 files that this class inspects.
finmlkit.bar.data_model.TradesData.save_h5(): Method that creates the HDF5 structure.
finmlkit.bar.data_model.TradesData.load_trades_h5(): Complementary loading functionality.

__init__(filepath: str)

Initialize the H5Inspector with the path to the HDF5 file.

Parameters:

filepath – Path to the HDF5 file.

get_integrity_info(key: str) → DataFrame | None

Retrieve detailed data integrity information for a specific monthly partition.

Returns discontinuity details stored during preprocessing, including trade ID gaps, timestamps of missing data periods, and time intervals for each discontinuity.

Parameters:

key – HDF5 key for the target month (e.g., ‘/trades/2023-01’).

Returns:

DataFrame with columns:

'start_id': Trade ID before the gap
'end_id': Trade ID after the gap
'missing_ids': Number of missing trade IDs
'pre_gap_time_str': Timestamp before gap (string format)
'post_gap_time_str': Timestamp after gap (string format)
'time_interval_str': Duration of the gap (string format)

Returns None if no integrity issues were detected.

Raises:

KeyError – If the specified key does not exist in the store.

get_integrity_summary(verbose=True) → Dict[str, Dict]

Generate comprehensive summary of data integrity issues across the entire HDF5 store.

Analyzes all monthly partitions to identify data quality problems, providing both aggregate statistics and detailed discontinuity information where available.

Parameters:

verbose – If True, prints detailed summary to console. Default: True.

Returns:

Dictionary with month keys mapping to integrity information dictionaries. Each value contains:

'metadata': Complete metadata including integrity flags and missing percentages
'discontinuities': DataFrame with detailed gap information (if available)
'key': Original HDF5 key for the partition

Returns None if all data passes integrity checks.

Note

This method scans metadata for all partitions but only loads detailed discontinuity information for months with identified issues, making it efficient for large stores.

get_metadata(key: str) → Dict[str, any]

Retrieve comprehensive metadata for a specific monthly partition.

Returns metadata stored during the save process, including record counts, timestamp ranges, data integrity flags, and missing data percentages.

Parameters:

key – HDF5 key for the target month (e.g., ‘/trades/2023-02’).

Returns:

Dictionary containing metadata fields:

'record_count': Number of trades in the partition
'first_timestamp': Earliest timestamp (nanoseconds since epoch)
'last_timestamp': Latest timestamp (nanoseconds since epoch)
'data_integrity_ok': Boolean flag indicating data quality
'missing_pct': Percentage of missing trades based on ID gaps
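
Since the timestamp fields are integers in nanoseconds, they usually need converting before they are readable. A minimal standard-library sketch follows, assuming the epoch is the Unix epoch in UTC (the documentation only says "since epoch"); `ns_to_datetime` and the sample `meta` values are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ns_to_datetime(ts_ns: int) -> datetime:
    """Convert nanoseconds since the Unix epoch to an aware UTC datetime.

    datetime resolves microseconds only, so sub-microsecond digits are dropped.
    """
    return EPOCH + timedelta(microseconds=ts_ns // 1_000)


# Made-up metadata values covering January 2023.
meta = {
    "first_timestamp": 1_672_531_200_000_000_000,
    "last_timestamp": 1_675_209_599_999_000_000,
}
first = ns_to_datetime(meta["first_timestamp"])
last = ns_to_datetime(meta["last_timestamp"])
span = last - first
```

When full nanosecond precision matters, a nanosecond-aware type such as pandas `Timestamp` is a better fit than `datetime`.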

Raises:

KeyError – If the specified key does not exist in the store.

get_statistics(key: str) → Dict[str, any]

Compute basic statistical measures for a specific monthly partition.

Loads the trade data and calculates summary statistics including record counts, timestamp ranges, and price/volume distributions.

Parameters:

key – HDF5 key for the target month.

Returns:

Dictionary containing statistical measures:

'record_count': Total number of trade records
'first_timestamp': Earliest timestamp in the dataset
'last_timestamp': Latest timestamp in the dataset
'price_range': Tuple of (minimum_price, maximum_price)
'amount_range': Tuple of (minimum_amount, maximum_amount)

Raises:

KeyError – If the specified key does not exist in the store.

Note

This method loads the full dataset into memory and may be slow for large partitions. Consider using get_metadata() for basic counts and ranges when available.

inspect_gaps(max_gap: Timedelta = Timedelta('0 days 00:01:00'), processes: int = 4) → Dict[str, list[tuple[Timestamp, Timedelta]]]

Identify temporal gaps exceeding specified thresholds across all monthly partitions.

Uses multiprocessing to parallelize gap detection across partitions. Gaps are identified by computing time differences between consecutive trades and flagging those exceeding the max_gap threshold.

Parameters:

max_gap – Maximum allowable gap between consecutive timestamps. Default: 1 minute.
processes – Number of worker processes for parallel processing. Default: 4.

Returns:

Dictionary mapping HDF5 group names to lists of gap information tuples: Each tuple contains (gap_timestamp, gap_duration) for gaps exceeding the threshold.
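
The core of the gap check, comparing consecutive timestamps against a threshold, can be sketched for a single partition as follows. This is a serial, standard-library illustration with a hypothetical function name. Whether the library reports the timestamp before or after each gap is not stated here; the sketch reports the timestamp before it.

```python
from datetime import datetime, timedelta


def find_time_gaps(timestamps, max_gap=timedelta(minutes=1)):
    """Return (gap_start, gap_duration) where consecutive stamps exceed max_gap."""
    gaps = []
    for prev_ts, next_ts in zip(timestamps, timestamps[1:]):
        delta = next_ts - prev_ts
        if delta > max_gap:  # strictly exceeding the threshold
            gaps.append((prev_ts, delta))
    return gaps


t0 = datetime(2023, 1, 1)
stamps = [
    t0,
    t0 + timedelta(seconds=10),
    t0 + timedelta(minutes=3, seconds=10),   # 3 minutes after the previous trade
    t0 + timedelta(minutes=3, seconds=40),
]
gaps = find_time_gaps(stamps)
```

Each monthly partition can be checked independently, which is what makes the work easy to distribute over worker processes.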

Raises:

ValueError – If max_gap is not a valid Timedelta or processes < 1.

Note

Gap detection loads full datasets into memory. For very large files, consider processing monthly partitions individually or increasing available system memory.

list_keys() → list[str]

List all available trade keys in the HDF5 file.

Scans the HDF5 store for all groups under the /trades/ hierarchy, returning a sorted list of available monthly partitions.

Returns:

List of trade keys in format ['/trades/YYYY-MM', ...].

Raises:

FileNotFoundError – If the HDF5 file does not exist.
KeyError – If the file exists but has no readable trade groups.

class finmlkit.bar.io.TimeBarReader(h5_path: str)

Bases: object

Reader class for time bar data stored in HDF5 format with advanced resampling capabilities.

This class provides a comprehensive interface for accessing and transforming time bar data created by AddTimeBarH5, enabling efficient querying, filtering, and resampling of high-frequency financial time series. It serves as the primary access layer for time bar analysis workflows, supporting both raw 1-second bars and dynamically resampled timeframes for various analytical purposes.

The reader is designed to work seamlessly with the HDF5 structure created by the time bar processing pipeline, where 1-second bars are stored under /klines/YYYY-MM groups with metadata under /klines_meta/YYYY-MM. This organization enables efficient time-range queries across large datasets without loading unnecessary data into memory.

Core Functionalities:

Time Range Filtering: Efficiently identify and load only the monthly partitions intersecting with requested time ranges, minimizing memory usage and I/O operations.
Flexible Resampling: Transform 1-second bars into arbitrary timeframes (e.g., 5min, 1h, 1d) with mathematically correct aggregation of OHLCV data and volume-weighted recalculation of derived metrics.
Metadata-Driven Discovery: Leverage stored metadata for fast range queries without scanning full datasets, enabling sub-second response times for time range validation.

Performance Optimizations:

The reader employs several strategies for efficient large-scale data access:

Lazy Loading: Only relevant monthly partitions are identified and loaded based on time range intersection
Vectorized Operations: Resampling uses pandas’ optimized groupby operations with pre-computed time groupers
Memory Management: Data is processed in monthly chunks and concatenated only when necessary
Index Optimization: Time filtering leverages datetime indexes for fast range selection

Important

TimeBarReader assumes data integrity and proper temporal ordering within each monthly partition. The input H5 file must be generated with finmlkit.bar.data_model.TradesData and AddTimeBarH5.

Tip

Resampling to very large timeframes (e.g., monthly) from 1-second data can be memory-intensive. For such cases, consider intermediate aggregation steps or processing smaller time ranges iteratively.

Parameters:

h5_path (str) – Path to the HDF5 file containing time bar data. Must be readable and contain data structure created by AddTimeBarH5.

Raises:

FileNotFoundError – If the specified HDF5 file does not exist.
PermissionError – If the file cannot be accessed due to permission restrictions.
KeyError – If the file exists but lacks the expected klines structure.

Examples

Basic time bar reading with range filtering:

>>> from finmlkit.bar.io import TimeBarReader
>>> reader = TimeBarReader('trades_2023.h5')
>>>
>>> # Get all 1-second bars for a specific day
>>> bars_1s = reader.read('2023-01-15', '2023-01-15')
>>> len(bars_1s)  
86400
>>> 
>>> # Get 5-minute bars for a week
>>> bars_5min = reader.read('2023-01-15', '2023-01-21', timeframe='5min')
>>> bars_5min.columns.tolist()  
['open', 'high', 'low', 'close', 'volume', 'trades', 'vwap', 'median_trade_size']

Advanced resampling workflows:

>>> # Get hourly bars with proper VWAP calculation
>>> hourly = reader.read('2023-01-01', '2023-01-31', timeframe='1h')
>>> # Verify VWAP is volume-weighted across the resampled period
>>> print(f"First hourly bar VWAP: {hourly['vwap'].iloc[0]:.2f}")  
First hourly bar VWAP: 16750.25

Data discovery and range validation:

>>> available_months = reader.list_keys()
>>> print(f"Available months: {len(available_months)}")  
Available months: 12
>>>
>>> # Check overall time coverage
>>> start, end = reader._list_time_range()  
>>> print(f"Data spans from {start.date()} to {end.date()}")  
Data spans from 2023-01-01 to 2023-12-31

See also

AddTimeBarH5: Creates the HDF5 time bar files that this reader accesses.
finmlkit.bar.data_model.TradesData: Underlying trades data structure for the bar construction process.
finmlkit.bar.kit.TimeBarKit: Time bar construction engine used by the processing pipeline.
H5Inspector: Complementary utility for HDF5 file inspection and data quality assessment.

__init__(h5_path: str)

Initialize the TimeBarReader with the path to the H5 file.

Parameters:

h5_path – Path to the H5 file containing time bars.

_find_relevant_keys(start_time: Timestamp | None = None, end_time: Timestamp | None = None) → List[str]

Identify HDF5 keys containing data that intersects with the specified time range.

Uses metadata-driven discovery to minimize I/O by identifying only the monthly partitions that contain relevant data, avoiding unnecessary loading of non-intersecting partitions.

Parameters:

start_time – Start boundary for range intersection test. None indicates no lower bound.
end_time – End boundary for range intersection test. None indicates no upper bound.

Returns:

Sorted list of klines keys that intersect with the specified time range.

Note

The intersection logic is inclusive on both ends, ensuring that monthly partitions containing any data within the range are included in the result set.
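
The inclusive intersection test can be sketched as below. The metadata layout (a mapping from key to first and last timestamp) and both function names are assumptions for illustration; the actual method reads its bounds from the stored klines metadata.

```python
from datetime import datetime


def intersects(part_start, part_end, start_time=None, end_time=None):
    """Inclusive overlap test; None means the query is unbounded on that side."""
    if start_time is not None and part_end < start_time:
        return False
    if end_time is not None and part_start > end_time:
        return False
    return True


def find_relevant(meta, start_time=None, end_time=None):
    """meta maps key -> (first_timestamp, last_timestamp); returns sorted matches."""
    return sorted(key for key, (first, last) in meta.items()
                  if intersects(first, last, start_time, end_time))


meta = {
    "/klines/2023-01": (datetime(2023, 1, 1), datetime(2023, 1, 31, 23, 59, 59)),
    "/klines/2023-02": (datetime(2023, 2, 1), datetime(2023, 2, 28, 23, 59, 59)),
    "/klines/2023-03": (datetime(2023, 3, 1), datetime(2023, 3, 31, 23, 59, 59)),
}
# The query starts exactly on January's last bar, so January is still included.
hits = find_relevant(meta, datetime(2023, 1, 31, 23, 59, 59), datetime(2023, 2, 10))
```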

_list_time_range() → Tuple[Timestamp, Timestamp]

Determine the overall temporal coverage available in the HDF5 file by scanning metadata.

Efficiently discovers the time span of available data without loading full datasets, enabling quick validation of data availability for query planning.

Returns:

Tuple of (earliest_timestamp, latest_timestamp) across all monthly partitions.

Raises:

ValueError – If no klines metadata is found in the file.

Note

This method relies on metadata stored during the bar creation process and provides sub-second response times even for very large datasets.

_resample(df: DataFrame, timeframe: str) → DataFrame

Apply mathematically correct resampling aggregation to transform 1-second bars into target timeframe.

This internal method implements the core resampling logic, ensuring that volume-weighted metrics are properly recalculated and that statistical properties are preserved across time aggregation.

Aggregation Rules Applied:

OHLC: Uses first/max/min/last semantics appropriate for price series
Volume & Trades: Simple summation across the resampling period
VWAP: Volume-weighted recalculation maintaining accuracy across aggregation
Median Trade Size: Volume-weighted median computation from per-second medians
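
The OHLC, volume and VWAP rules above can be illustrated on plain dictionaries. The key point is that the VWAP of a merged bar is the volume-weighted mean of the constituent VWAPs, sum(vwap_i * volume_i) / sum(volume_i), not their simple average. `aggregate_bars` is a hypothetical helper, not the library's `_resample`, and it assumes the bars are ordered oldest first with positive total volume.

```python
def aggregate_bars(bars):
    """Combine consecutive bars (dicts, oldest first) into one coarser bar."""
    total_volume = sum(b["volume"] for b in bars)
    return {
        "open": bars[0]["open"],                 # first
        "high": max(b["high"] for b in bars),    # max
        "low": min(b["low"] for b in bars),      # min
        "close": bars[-1]["close"],              # last
        "volume": total_volume,                  # sum
        "trades": sum(b["trades"] for b in bars),
        # Weight each bar's VWAP by its volume; a plain mean would be wrong.
        "vwap": sum(b["vwap"] * b["volume"] for b in bars) / total_volume,
    }


one_second_bars = [
    {"open": 100.0, "high": 101.0, "low": 99.0, "close": 100.5,
     "volume": 1.0, "trades": 2, "vwap": 100.0},
    {"open": 100.5, "high": 103.0, "low": 100.0, "close": 102.0,
     "volume": 3.0, "trades": 5, "vwap": 102.0},
]
merged = aggregate_bars(one_second_bars)
```

Here the merged VWAP is (100.0 * 1.0 + 102.0 * 3.0) / 4.0 = 101.5, whereas an unweighted mean of the two VWAPs would give 101.0.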

Parameters:

df – DataFrame containing 1-second time bars with standard OHLCV columns.
timeframe – Pandas offset string specifying target timeframe (e.g., ‘5min’, ‘1h’, ‘1D’).

Returns:

Resampled DataFrame with aggregated bars at the requested timeframe. Periods with no trading activity are automatically excluded.

Note

The volume-weighted median calculation uses numpy’s searchsorted for efficient percentile computation, making it suitable for high-frequency resampling operations.
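
For readers unfamiliar with the technique, a weighted median can be found by sorting the values, accumulating their weights, and locating the first position where the cumulative weight reaches half of the total. The sketch below uses the standard library's `bisect_left` as a stand-in for `numpy.searchsorted`. Tie handling and interpolation in the actual implementation are not documented here, so treat this as one plausible variant under those assumptions.

```python
from bisect import bisect_left
from itertools import accumulate


def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half of the total weight."""
    pairs = sorted(zip(values, weights))
    cumulative = list(accumulate(w for _, w in pairs))
    half = cumulative[-1] / 2.0
    # bisect_left on the cumulative weights plays the role of
    # numpy.searchsorted(cumulative, half, side="left").
    index = bisect_left(cumulative, half)
    return pairs[index][0]


medians = [0.5, 2.0, 1.0, 4.0]   # hypothetical per-second median trade sizes
volumes = [1.0, 2.0, 5.0, 2.0]   # per-second volumes used as weights
result = weighted_median(medians, volumes)
```

Sorted by value, the cumulative weights are 1, 6, 8, 10, and half of the total is 5, so the search lands on the second entry and the result is 1.0.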

list_keys() → List[str]

List all available time bar keys in the HDF5 file.

Scans the HDF5 store for all klines groups, providing visibility into available monthly partitions for time range planning and data discovery.

Returns:

List of klines keys in format ['/klines/YYYY-MM', ...], sorted chronologically.

Raises:

FileNotFoundError – If the HDF5 file does not exist.
PermissionError – If the file cannot be accessed for reading.

Examples

>>> reader = TimeBarReader('data.h5')
>>> keys = reader.list_keys()
>>> print(f"Found {len(keys)} monthly partitions")
Found 12 monthly partitions

read(start_time=None, end_time=None, timeframe=None) → DataFrame

Read time bars from HDF5 storage with optional time filtering and resampling.

This method provides the primary interface for accessing time bar data, supporting flexible time range specification and dynamic resampling to arbitrary timeframes. The implementation optimizes for both small targeted queries and large-scale data processing workflows.

Time Range Handling:

Inclusive Ranges: Both start_time and end_time are treated as inclusive boundaries
Date Normalization: Date strings without time components are expanded to full day ranges
Boundary Correction: End dates are automatically extended to include the entire final day
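
The date handling rules can be sketched with the standard library as follows. `normalize_range` is a hypothetical helper that only accepts ISO-format strings; the real method also accepts Timestamp and datetime objects and may resolve the end of day at a different precision.

```python
from datetime import date, datetime, time


def normalize_range(start, end):
    """Parse ISO strings and expand date-only strings to the full day."""
    def parse(text, is_end):
        if len(text) == 10:  # 'YYYY-MM-DD' without a time component
            day = date.fromisoformat(text)
            # Start dates snap to midnight, end dates to the last instant of the day.
            return datetime.combine(day, time.max if is_end else time.min)
        return datetime.fromisoformat(text)

    start_dt, end_dt = parse(start, False), parse(end, True)
    if start_dt > end_dt:
        raise ValueError("start_time must not be after end_time")
    return start_dt, end_dt


# The same date on both sides selects the whole day.
start_dt, end_dt = normalize_range("2023-01-15", "2023-01-15")
```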

Resampling Process:

When a timeframe is specified, the method applies mathematically correct aggregation:

Groups 1-second bars by the requested timeframe using vectorized floor operations
Applies OHLCV aggregation rules (first, max, min, last, sum)
Recalculates volume-weighted metrics (VWAP, median trade size) preserving statistical properties
Filters out empty periods to maintain data density
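
The grouping step relies on flooring each timestamp to the start of its bucket. A minimal sketch for fixed-width timeframes, with hypothetical names, is shown below. Calendar-based aliases such as weekly or monthly bins are not covered by this simple modulo approach.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def floor_to(ts, step):
    """Floor an aware datetime to a multiple of step counted from the epoch."""
    return ts - (ts - EPOCH) % step


def group_by_bucket(timestamps, step):
    """Group timestamps by their floored bucket start, in chronological order."""
    groups = defaultdict(list)
    for ts in timestamps:
        groups[floor_to(ts, step)].append(ts)
    return dict(sorted(groups.items()))


utc = timezone.utc
step = timedelta(minutes=5)
stamps = [
    datetime(2023, 1, 15, 10, 1, tzinfo=utc),
    datetime(2023, 1, 15, 10, 4, 30, tzinfo=utc),
    datetime(2023, 1, 15, 10, 12, tzinfo=utc),
]
groups = group_by_bucket(stamps, step)
```

Buckets are created only when a timestamp falls into them, so the empty 10:05 bucket never appears in the result.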

Parameters:

start_time – Start time for filtering (inclusive). Accepts string, Timestamp, or datetime. If None, starts from earliest available data.
end_time – End time for filtering (inclusive). If provided as date-only string, automatically extends to end of day. If None, includes all data through latest available.
timeframe – Target resampling timeframe using pandas offset aliases (e.g., ‘5min’, ‘1h’, ‘1D’, ‘1W’). If None, returns original 1-second bars.

Returns:

DataFrame with datetime index and columns: open, high, low, close, volume, trades, vwap, median_trade_size. Empty DataFrame if no data found in specified range.

Raises:

ValueError – If start_time > end_time or timeframe format is invalid.
KeyError – If required data partitions are missing from the HDF5 file.

Note

For daily or longer timeframes, incomplete final periods (e.g., partial trading days) are automatically excluded to prevent misleading aggregations in analysis workflows.

Examples

Reading specific time ranges:

>>> reader = TimeBarReader('crypto_data.h5')  
>>>
>>> # Get all 1-second bars for Bitcoin on January 15, 2023
>>> btc_1s = reader.read('2023-01-15', '2023-01-15')  
>>> print(f"Retrieved {len(btc_1s):,} 1-second bars")  
Retrieved 86,400 1-second bars
>>>
>>> # Get 5-minute bars for the first week of January
>>> btc_5min = reader.read('2023-01-01', '2023-01-07', timeframe='5min')  
>>> print(f"5-min bars: {len(btc_5min)}")  
5-min bars: 2016

Resampling to various timeframes:

>>> # Hourly bars with volume-weighted VWAP
>>> hourly = reader.read('2023-01-01', '2023-01-31', timeframe='1h')  
>>> print(f"VWAP range: {hourly['vwap'].min():.2f} - {hourly['vwap'].max():.2f}")  
VWAP range: 16420.50 - 17890.75
>>>
>>> # Daily bars for trend analysis
>>> daily = reader.read('2023-01-01', '2023-12-31', timeframe='1D')  
>>> daily_returns = daily['close'].pct_change()  

finmlkit.bar.io._find_gaps(key: str, filepath: str, max_gap: Timedelta) → Tuple[str, list[Tuple[Timestamp, Timedelta]]]

Find gaps in the trades data for a specific key.

Parameters:

key – HDF5 key to inspect.
filepath – Path to the HDF5 file.
max_gap – Maximum allowable gap between consecutive timestamps.

Returns:

Tuple containing the key and a list of tuples, each with (gap timestamp, gap size).