finmlkit.bar.base module

This module contains functions for building candlestick bars and other intra-bar features (e.g., directional features and footprints) from raw trades data, using the outputs of the indexer functions defined in the logic module.

class finmlkit.bar.base.BarBuilderBase(trades: TradesData)[source]

Bases: ABC

Abstract base class for building various types of bars (e.g., time, tick, volume, or information-based bars) from raw trades data. This class serves as a template for subclasses that implement specific bar sampling strategies, enabling the transformation of high-frequency trade data into structured bar features suitable for financial analysis and machine learning.

In financial machine learning, raw trade data (ticks) is often aggregated into bars to reduce noise, capture market dynamics, and create features for modeling. This builder computes standard OHLCV (Open, High, Low, Close, Volume) bars, directional features (e.g., buy/sell volumes), trade size metrics, and footprint data (order flow imbalances at price levels). It is inspired by techniques from Marcos López de Prado’s work on sampling methods to address issues like uneven information arrival rates in high-frequency trading data.

Subclasses must implement the abstract method _comp_bar_close() to define how bar close timestamps and indices are determined (e.g., based on time intervals, tick counts, or volume thresholds). The builder uses these indices to aggregate trades efficiently via Numba and Pandas, ensuring performance for large datasets.
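
As a rough illustration of this template pattern, the sketch below mimics the contract with standalone classes instead of importing finmlkit. The names BarBuilderSketch and TickBarSketch, the constructor arguments, and the convention that each close index points at the last trade of its bar are all assumptions made for the example, not the library's code.

```python
from abc import ABC, abstractmethod
from typing import Tuple

import numpy as np


class BarBuilderSketch(ABC):
    """Hypothetical stand-in for a bar builder base class (not finmlkit's code)."""

    def __init__(self, timestamps: np.ndarray, prices: np.ndarray):
        self.timestamps = timestamps  # int64 nanosecond timestamps
        self.prices = prices
        self._close_ts = None
        self._close_idx = None

    @abstractmethod
    def _comp_bar_close(self) -> Tuple[np.ndarray, np.ndarray]:
        """Return (close timestamps, close indices) for the sampling rule."""

    def _set_bar_close(self) -> None:
        # Compute the close markers once and cache them.
        if self._close_idx is None:
            self._close_ts, self._close_idx = self._comp_bar_close()

    def close_prices(self) -> np.ndarray:
        self._set_bar_close()
        return self.prices[self._close_idx]


class TickBarSketch(BarBuilderSketch):
    """Closes a bar every n_ticks trades (assumed: index of the bar's last trade)."""

    def __init__(self, timestamps, prices, n_ticks: int):
        super().__init__(timestamps, prices)
        self.n_ticks = n_ticks

    def _comp_bar_close(self):
        idx = np.arange(self.n_ticks - 1, len(self.prices), self.n_ticks, dtype=np.int64)
        return self.timestamps[idx], idx


ts = np.arange(6, dtype=np.int64) * 1_000_000_000  # one trade per second
px = np.array([10.0, 10.5, 10.2, 10.4, 10.1, 10.3])
bars = TickBarSketch(ts, px, n_ticks=3)
closes = bars.close_prices()  # close price of each 3-tick bar
```

Only the sampling rule lives in the subclass; the aggregation code in the base class depends solely on the returned indices, which is why one builder can serve time, tick, and volume bars.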

Key functionalities include:

  • build_ohlcv(): Computes OHLCV, VWAP (Volume-Weighted Average Price), trade count, and median trade size.

  • build_directional_features(): Calculates buy/sell splits for ticks, volume, dollar value, spreads, and cumulative metrics, revealing order flow directionality and market pressure.

  • build_trade_size_features(): Analyzes relative trade sizes, 95th percentile sizes, block trade percentages, and Gini coefficients for trade size distribution, useful for detecting large orders or market concentration.

  • build_footprints(): Generates detailed footprint data, discretizing price levels to compute volumes, ticks, imbalances, and metrics like volume profile skew and Gini, aiding in order flow and volume profile analysis.

Parameters:

trades (TradesData) – Object containing raw trades DataFrame with columns ‘timestamp’, ‘price’, and ‘amount’. TradesData ensures the data is preprocessed and ready for bar construction.

Raises:

ValueError – If required columns are missing from trades data or if data is not properly formatted.

See also

finmlkit.bar.kit.TimeBarKit: A concrete subclass for fixed-time interval bars.

finmlkit.bar.kit.TickBarKit: For bars based on tick counts.

finmlkit.bar.kit.VolumeBarKit: For volume-threshold bars.

__init__(trades: TradesData)[source]

Initialize the bar builder with raw trades data.

Parameters:

trades – TradesData object containing raw trades DataFrame with columns ‘timestamp’, ‘price’, and ‘amount’.

abstract _comp_bar_close() → Tuple[ndarray[tuple[int, ...], dtype[int64]], ndarray[tuple[int, ...], dtype[int64]]][source]

Abstract method to generate bar close timestamps and indices.

Returns:

Tuple of close timestamps and their corresponding indices.

_set_bar_close()[source]

Calculate and set the close timestamps and indices if they have not been calculated yet.

property bar_close_indices: ndarray[tuple[int, ...], dtype[int64]] | None

Return the bar close indices in the raw trades data.

Returns:

The bar close indices into the raw trades data, as a NumPy array of int64.

property bar_close_timestamps: ndarray[tuple[int, ...], dtype[int64]] | None

Return the bar close timestamps in the raw trades data.

Returns:

The bar close timestamps in nanoseconds, as a NumPy array of int64.
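
Because the timestamps are plain int64 nanoseconds, they can be reinterpreted as NumPy datetimes without any parsing. The values below are made up for illustration and assume nanoseconds since the Unix epoch.

```python
import numpy as np

# Hypothetical close timestamps: int64 nanoseconds since the Unix epoch.
close_ts = np.array(
    [1_700_000_000_000_000_000, 1_700_000_060_000_000_000], dtype=np.int64
)

# Reinterpret the integers as datetimes with nanosecond resolution.
close_dt = close_ts.astype("datetime64[ns]")

# Elapsed time between consecutive bar closes, in seconds.
gap_seconds = np.diff(close_ts) / 1e9
```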

build_directional_features() → DataFrame[source]

Build the directional features using the generated indices and raw trades data.

Returns:

A dataframe containing the directional features: ticks_buy, ticks_sell, volume_buy, volume_sell, dollars_buy, dollars_sell, max_spread, cum_volumes_min, cum_volumes_max, cum_dollars_min, cum_dollars_max.

build_footprints(price_tick_size=None, imbalance_factor=3.0) → FootprintData[source]

Build the footprint data using the generated indices and raw trades data.

Parameters:
  • price_tick_size – Optional tick size; inferred if None.

  • imbalance_factor – Multiplier for detecting imbalances. Default is 3.0.

Returns:

A FootprintData object containing the footprint data.

build_ohlcv() → DataFrame[source]

Build the bar features using the generated indices and raw trades data.

Returns:

A dataframe containing the OHLCV + VWAP features with datetime index corresponding to the bar open timestamps.

build_trade_size_features(theta: ndarray[tuple[int, ...], dtype[float64]] | None, theta_mult: float = 5.0) → DataFrame[source]

Build the trade size features using the generated indices and raw trades data.

Parameters:
  • theta – Optional typical trade size (e.g., 30 day rolling median trade size).

  • theta_mult – Multiplier for theta to define the block size threshold. Default is 5.0.

Returns:

A dataframe containing the trade size features: mean_size_rel, size_95_rel, pct_block, size_gini.

finmlkit.bar.base.comp_bar_directional_features(prices: ndarray[tuple[int, ...], dtype[float64]], volumes: ndarray[tuple[int, ...], dtype[float64]], bar_close_indices: ndarray[tuple[int, ...], dtype[int64]], trade_sides: ndarray[tuple[int, ...], dtype[int8]]) → tuple[ndarray[tuple[int, ...], dtype[int64]], ndarray[tuple[int, ...], dtype[int64]], ndarray[tuple[int, ...], dtype[float32]], ndarray[tuple[int, ...], dtype[float32]], ndarray[tuple[int, ...], dtype[float32]], ndarray[tuple[int, ...], dtype[float32]], ndarray[tuple[int, ...], dtype[float32]], ndarray[tuple[int, ...], dtype[float32]], ndarray[tuple[int, ...], dtype[int64]], ndarray[tuple[int, ...], dtype[int64]], ndarray[tuple[int, ...], dtype[float32]], ndarray[tuple[int, ...], dtype[float32]], ndarray[tuple[int, ...], dtype[float32]], ndarray[tuple[int, ...], dtype[float32]]][source]

Compute directional bar features such as tick counts, volumes, dollars, spreads, and cumulative flows.

Parameters:
  • prices – Trade prices.

  • volumes – Trade volumes.

  • bar_close_indices – Indices marking the end of each bar.

  • trade_sides – Trade direction (1 for market buy, -1 for market sell).

Returns:

Tuple containing:

  • ticks_buy: Number of buy trades per bar.

  • ticks_sell: Number of sell trades per bar.

  • volume_buy: Volume of buy trades per bar.

  • volume_sell: Volume of sell trades per bar.

  • dollars_buy: Dollar value of buy trades per bar.

  • dollars_sell: Dollar value of sell trades per bar.

  • mean_spread: Mean bid/ask spread within each bar.

  • max_spread: Maximum spread within each bar.

  • cum_ticks_min: Minimum cumulative tick imbalance.

  • cum_ticks_max: Maximum cumulative tick imbalance.

  • cum_volumes_min: Minimum cumulative volume imbalance.

  • cum_volumes_max: Maximum cumulative volume imbalance.

  • cum_dollars_min: Minimum cumulative dollar imbalance.

  • cum_dollars_max: Maximum cumulative dollar imbalance.
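
A minimal sketch of the buy/sell split and the cumulative imbalance range follows. It assumes that each close index is the last trade of its bar, that the first bar starts at trade 0, and that the running imbalance restarts from zero inside every bar. The function name directional_sketch is hypothetical, and only the volume-based outputs are shown.

```python
import numpy as np


def directional_sketch(volumes, sides, close_idx):
    """Per-bar buy/sell volume and the range of the running signed volume.

    Assumptions (not confirmed by the docs): close_idx[k] is the last trade
    of bar k, bars start at trade 0, and the cumulative imbalance is reset
    at the start of every bar.
    """
    starts = np.concatenate(([0], close_idx[:-1] + 1))
    rows = []
    for s, e in zip(starts, close_idx):
        v, d = volumes[s:e + 1], sides[s:e + 1]
        cum = np.cumsum(v * d)  # running buy-minus-sell volume
        rows.append((v[d == 1].sum(), v[d == -1].sum(), cum.min(), cum.max()))
    return np.array(rows)  # columns: volume_buy, volume_sell, cum_min, cum_max


volumes = np.array([1.0, 2.0, 3.0, 1.0])
sides = np.array([1, -1, -1, 1], dtype=np.int8)  # 1 = market buy, -1 = market sell
feats = directional_sketch(volumes, sides, np.array([3]))
```

The tick and dollar variants would follow the same pattern, with the per-trade weight replaced by 1 or by price times volume.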

finmlkit.bar.base.comp_bar_footprints(prices: ndarray[tuple[int, ...], dtype[float64]], amounts: ndarray[tuple[int, ...], dtype[float64]], bar_close_indices: ndarray[tuple[int, ...], dtype[int64]], trade_sides: ndarray[tuple[int, ...], dtype[int8]], price_tick_size: float, bar_lows: ndarray[tuple[int, ...], dtype[float64]], bar_highs: ndarray[tuple[int, ...], dtype[float64]], imbalance_factor: float) → tuple[List[ndarray[tuple[int, ...], dtype[int32]]], List[ndarray[tuple[int, ...], dtype[float32]]], List[ndarray[tuple[int, ...], dtype[float32]]], List[ndarray[tuple[int, ...], dtype[int32]]], List[ndarray[tuple[int, ...], dtype[int32]]], List[ndarray[tuple[int, ...], dtype[bool]]], List[ndarray[tuple[int, ...], dtype[bool]]], ndarray[tuple[int, ...], dtype[uint16]], ndarray[tuple[int, ...], dtype[uint16]], ndarray[tuple[int, ...], dtype[int32]], ndarray[tuple[int, ...], dtype[int16]], ndarray[tuple[int, ...], dtype[float64]], ndarray[tuple[int, ...], dtype[float64]]][source]

Compute the footprint features for each bar, including buy/sell volumes and imbalances per price level. The price levels are calculated in (integer) price tick units to eliminate floating point errors.

Parameters:
  • prices – Trade prices.

  • amounts – Trade amounts.

  • bar_close_indices – Indices marking the end of each bar.

  • trade_sides – The side information of the market order (1 for market buy, -1 for market sell).

  • price_tick_size – Tick size used for price level quantization.

  • bar_lows – Lowest price per bar.

  • bar_highs – Highest price per bar.

  • imbalance_factor – Multiplier threshold for detecting imbalance.

Returns:

Tuple containing:

  • price_levels: List of price level arrays per bar.

  • buy_volumes: List of buy volumes per price level.

  • sell_volumes: List of sell volumes per price level.

  • buy_ticks: List of buy ticks per price level.

  • sell_ticks: List of sell ticks per price level.

  • buy_imbalances: List of boolean arrays indicating buy imbalances.

  • sell_imbalances: List of boolean arrays indicating sell imbalances.

  • buy_imbalances_sum: Total number of buy imbalances per bar.

  • sell_imbalances_sum: Total number of sell imbalances per bar.

  • cot_price_levels: Price level with highest total volume per bar.

  • imb_max_run_signed: Longest signed imbalance run for each bar.

  • vp_skew: Volume profile skew for each bar (positive = buy pressure above VWAP).

  • vp_gini: Volume profile Gini coefficient for each bar.
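
The quantization into integer tick units can be sketched for a single bar as follows. The tick size and trade data are made up, and the accumulation with np.add.at is one possible approach rather than the library's implementation.

```python
import numpy as np

tick = 0.1  # assumed price tick size
prices = np.array([100.0, 100.1, 100.1, 100.3, 100.0])
amounts = np.array([1.0, 2.0, 1.5, 0.5, 3.0])
sides = np.array([1, 1, -1, 1, -1], dtype=np.int8)

# Quantize to integer tick units; rounding absorbs floating point noise
# introduced by the division.
levels = np.round(prices / tick).astype(np.int64)

# One slot per price level between the bar low and the bar high (inclusive).
lo, hi = levels.min(), levels.max()
price_levels = np.arange(lo, hi + 1)
buy_volumes = np.zeros(len(price_levels))
sell_volumes = np.zeros(len(price_levels))

# Unbuffered accumulation, so repeated levels add up correctly.
np.add.at(buy_volumes, levels[sides == 1] - lo, amounts[sides == 1])
np.add.at(sell_volumes, levels[sides == -1] - lo, amounts[sides == -1])
```

Working with integer levels means two trades at the same nominal price always land in the same slot, which a comparison of raw floats would not guarantee.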

finmlkit.bar.base.comp_bar_ohlcv(prices: ndarray[tuple[int, ...], dtype[float64]], volumes: ndarray[tuple[int, ...], dtype[float64]], bar_close_indices: ndarray[tuple[int, ...], dtype[int64]]) → tuple[ndarray[tuple[int, ...], dtype[float64]], ndarray[tuple[int, ...], dtype[float64]], ndarray[tuple[int, ...], dtype[float64]], ndarray[tuple[int, ...], dtype[float64]], ndarray[tuple[int, ...], dtype[float32]], ndarray[tuple[int, ...], dtype[float64]], ndarray[tuple[int, ...], dtype[int64]], ndarray[tuple[int, ...], dtype[float64]]][source]

Build the candlestick bar from raw trades data based on bar close indices.

Parameters:
  • prices – Trade prices.

  • volumes – Trade volumes.

  • bar_close_indices – Indices marking the end of each bar.

Returns:

Tuple containing:

  • open: Opening price of each bar.

  • high: Highest price of each bar.

  • low: Lowest price of each bar.

  • close: Closing price of each bar.

  • volume: Total traded volume in each bar.

  • vwap: Volume-weighted average price of each bar.

  • bar_trades: Number of trades in each bar.

  • bar_median_trade_size: Median trade size in each bar.
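
To make the aggregation concrete, here is a plain NumPy sketch of index-based OHLCV and VWAP computation. It assumes each close index is the position of the last trade in its bar and that the first bar starts at trade 0; the library's actual index convention may differ, and ohlcv_sketch is a hypothetical name.

```python
import numpy as np


def ohlcv_sketch(prices, volumes, close_idx):
    """Aggregate trades into OHLCV + VWAP bars.

    Assumed convention (not confirmed by the docs): close_idx[k] is the
    position of the last trade of bar k, and the first bar starts at trade 0.
    """
    starts = np.concatenate(([0], close_idx[:-1] + 1))
    n = len(close_idx)
    out = {k: np.empty(n) for k in ("open", "high", "low", "close", "volume", "vwap")}
    for b, (s, e) in enumerate(zip(starts, close_idx)):
        p, v = prices[s:e + 1], volumes[s:e + 1]
        out["open"][b], out["close"][b] = p[0], p[-1]
        out["high"][b], out["low"][b] = p.max(), p.min()
        out["volume"][b] = v.sum()
        out["vwap"][b] = (p * v).sum() / v.sum()  # volume-weighted average price
    return out


prices = np.array([100.0, 101.0, 99.0, 102.0, 103.0])
volumes = np.array([1.0, 2.0, 1.0, 3.0, 1.0])
bars = ohlcv_sketch(prices, volumes, np.array([2, 4]))
```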

finmlkit.bar.base.comp_bar_trade_size_features(amounts: ndarray[tuple[int, ...], dtype[float64]], theta: ndarray[tuple[int, ...], dtype[float64]], bar_close_indices: ndarray[tuple[int, ...], dtype[int64]], theta_mult: float) → tuple[ndarray[tuple[int, ...], dtype[float32]], ndarray[tuple[int, ...], dtype[float32]], ndarray[tuple[int, ...], dtype[float32]], ndarray[tuple[int, ...], dtype[float32]]][source]

Compute the trade size distribution features for each bar: the mean and 95th percentile of trade sizes relative to theta, pct_block, and size_gini. Together these indicate whether the bar contains large block prints.

Parameters:
  • amounts – Array of trade amounts (raw trade sizes).

  • theta – The typical trade size (e.g., 30 day rolling median trade size).

  • bar_close_indices – Indices marking the end of each bar.

  • theta_mult – Multiplier for theta to define the block size threshold (e.g., 5 times the median trade size).

Returns:

A tuple containing:

  • mean_size_rel: Mean trade size relative to theta per bar: log1p(mean_size / theta)

  • size_95_rel: 95th percentile of trade sizes per bar relative to theta: log1p(size_95 / theta)

  • pct_block: Share of bar volume traded in block trades, i.e. trades larger than the block size threshold (theta scaled by theta_mult): SUM( size_i [ size_i > theta_mult * theta ] ) / volume

  • size_gini: Gini coefficient of trade sizes per bar.
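
The four features can be sketched for the trades of a single bar as below. The sketch assumes a scalar theta, uses the standard Gini definition (0 when all trades have the same size), and treats a trade as a block when it exceeds theta_mult times theta; the function names are hypothetical and the library may differ in detail.

```python
import numpy as np


def gini(x):
    """Standard Gini coefficient of non-negative values (0 = all equal)."""
    x = np.sort(x)
    n = len(x)
    ranks = np.arange(1, n + 1)
    return (2.0 * (ranks * x).sum()) / (n * x.sum()) - (n + 1) / n


def trade_size_sketch(sizes, theta, theta_mult=5.0):
    """Size features for the trades of one bar, given a scalar theta."""
    block = sizes > theta_mult * theta  # assumed block-trade rule
    return {
        "mean_size_rel": np.log1p(sizes.mean() / theta),
        "size_95_rel": np.log1p(np.percentile(sizes, 95) / theta),
        "pct_block": sizes[block].sum() / sizes.sum(),
        "size_gini": gini(sizes),
    }


sizes = np.array([1.0, 1.0, 1.0, 1.0, 16.0])  # four small trades, one block
feats = trade_size_sketch(sizes, theta=1.0)
```

In this toy bar the single large print carries 80% of the volume, so pct_block is 0.8 and the Gini coefficient is 0.6.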

finmlkit.bar.base.comp_footprint_features(price_levels, buy_volumes, sell_volumes, imbalance_multiplier)[source]

Calculate footprint statistics such as buy/sell imbalances and Commitment of Traders (COT) level.

Parameters:
  • price_levels – Array of int64 tick unit price levels in ascending order.

  • buy_volumes – Array of buy volumes at each price level.

  • sell_volumes – Array of sell volumes at each price level.

  • imbalance_multiplier – Threshold multiplier to detect imbalance.

Returns:

Tuple containing:

  • buy_imbalances: Boolean array where True indicates buy imbalance at the level.

  • sell_imbalances: Boolean array where True indicates sell imbalance at the level.

  • imbalance_max_run_signed: Longest signed imbalance run (number of consecutive imbalanced levels).

  • cot_price_level: Price level with the highest total volume.

  • vp_skew: Volume profile skew relative to VWAP (positive = buy pressure above VWAP).

  • vp_gini: Volume profile Gini coefficient (0 = concentrated, →1 = even distribution).
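
A simplified sketch of the imbalance flags, the COT level, and the signed run length is given below. The imbalance rule used here compares buy and sell volume at the same price level, which is an assumption; footprint tools often compare diagonally adjacent levels instead, and the docs above do not say which rule the library applies. The sign convention (positive for buy runs, negative for sell runs) is likewise assumed.

```python
import numpy as np

price_levels = np.array([1000, 1001, 1002, 1003], dtype=np.int64)
buy_volumes = np.array([1.0, 9.0, 8.0, 0.5])
sell_volumes = np.array([4.0, 2.0, 2.0, 0.4])
imbalance_multiplier = 3.0

# Assumed rule: one side exceeds the other at the same level by the multiplier.
buy_imbalances = buy_volumes > imbalance_multiplier * sell_volumes
sell_imbalances = sell_volumes > imbalance_multiplier * buy_volumes

# Price level carrying the highest total volume.
total = buy_volumes + sell_volumes
cot_price_level = price_levels[np.argmax(total)]


def longest_run(flags):
    """Length of the longest run of consecutive True values."""
    best = run = 0
    for f in flags:
        run = run + 1 if f else 0
        best = max(best, run)
    return best


# Assumed sign convention: positive for buy runs, negative for sell runs.
buy_run, sell_run = longest_run(buy_imbalances), longest_run(sell_imbalances)
imbalance_max_run_signed = buy_run if buy_run >= sell_run else -sell_run
```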