Feature pipelines: Compose, FeatureKit and the computation graph¶
This tutorial shows how to build robust feature pipelines using Compose and FeatureKit, and how to visualize and leverage the underlying dependency graph (ComputationGraph) to compute features in a valid order automatically.
Prerequisites¶
import pandas as pd
import numpy as np
from finmlkit.feature.kit import Feature, FeatureKit, Compose
from finmlkit.feature.transforms import SMA, EWMA
Dataset¶
idx = pd.date_range("2024-01-01", periods=64, freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame({
"close": 100 + rng.normal(0, 1, len(idx)).cumsum(),
}, index=idx)
Chaining transforms with Compose¶
Compose lets you chain single-output transforms into a linear pipeline. The first transform determines the input column, and each subsequent transform consumes the previous output.
t1 = SMA(3, input_col="close")
t2 = EWMA(5, input_col=t1.output_name) # consume SMA output
pipeline = Compose(t1, t2)
# Wrap in a Feature for later use in FeatureKit or math ops
f_pipeline = Feature(pipeline)
result = f_pipeline(df, backend="pd")
print(result.name) # e.g. close_sma3_ewma5
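To make the chaining concrete, here is a plain-pandas sketch of what an SMA(3) followed by an EWMA(5) could compute. It assumes SMA is a simple rolling mean and that the EWMA parameter is a span; the library's actual definitions (span versus alpha, warm-up handling) may differ, so treat this as an illustration rather than a reference implementation.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=64, freq="D")
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, len(idx)).cumsum(), index=idx, name="close")

# Step 1: 3-period simple moving average (the first two values are NaN).
sma3 = close.rolling(window=3).mean()

# Step 2: exponentially weighted mean of the SMA output.
# Assumption: the EWMA parameter is interpreted as a span.
chained = sma3.ewm(span=5).mean().rename("close_sma3_ewma5")

print(chained.tail(3))
```

The second step only ever sees the first step's output, which is the property a linear pipeline relies on.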
Batch execution and caching with FeatureKit¶
FeatureKit runs multiple Feature objects against a DataFrame, incrementally caching results so that dependent features can reuse previously computed columns.
f_sma = Feature(SMA(5, input_col="close"))
f_ewma = Feature(EWMA(10, input_col="close"))
f_ratio = f_sma / f_ewma # depends on both above
kit = FeatureKit([f_ratio, f_sma, f_ewma], retain=["close"]) # intentionally unsorted
# Compute features in topological order inferred from dependencies
out = kit.build(df, backend="pd", order="topo")
print(out.columns)
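The caching idea can be sketched without the library: each feature is computed once, stored under its output name, and reused by anything that depends on it. The registry layout, the `resolve` helper and the `ratio` column name below are hypothetical and exist only for this illustration; they are not FeatureKit's internals.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=64, freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame({"close": 100 + rng.normal(0, 1, len(idx)).cumsum()}, index=idx)

# Hypothetical registry: output name -> (dependencies, function of the cache).
FEATURES = {
    "close_sma5": (["close"], lambda c: c["close"].rolling(5).mean()),
    "close_ewma10": (["close"], lambda c: c["close"].ewm(span=10).mean()),
    "ratio": (["close_sma5", "close_ewma10"],
              lambda c: c["close_sma5"] / c["close_ewma10"]),
}

calls = []  # records every actual computation so reuse is visible


def resolve(name, cache):
    """Compute `name` once, resolving its dependencies first."""
    if name in cache:
        return cache[name]
    deps, func = FEATURES[name]
    for dep in deps:
        resolve(dep, cache)
    calls.append(name)
    cache[name] = func(cache)
    return cache[name]


cache = {"close": df["close"]}  # raw inputs seed the cache
for requested in ["ratio", "close_sma5", "close_ewma10"]:  # intentionally unsorted
    resolve(requested, cache)

out_sketch = pd.DataFrame(cache)
print(calls)
```

Although `ratio` is requested first, its two inputs are computed before it, and the later requests for them hit the cache instead of recomputing.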
Visualizing dependencies with ComputationGraph¶
FeatureKit can build a dependency graph from your features. Input nodes are
prefixed with input: and feature nodes are the output names of your
Feature objects.
g = kit.build_graph()
print(g.visualize())
# Example output (truncated):
# ComputationGraph:
# input:close -> [close_ewma10, close_sma5, div(close_sma5,close_ewma10)]
# close_ewma10 -> [div(close_sma5,close_ewma10)]
# close_sma5 -> [div(close_sma5,close_ewma10)]
# Topological order over features only (input nodes omitted):
print(kit.topological_order())
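For intuition about what a topological order means here, the same dependency structure can be reproduced with the standard library's `graphlib` (Python 3.9+). This is a sketch of the general technique, not of how ComputationGraph is implemented; the `deps` mapping is written by hand from the example output above.

```python
from graphlib import TopologicalSorter

# node -> set of nodes it depends on (its predecessors)
deps = {
    "close_sma5": {"input:close"},
    "close_ewma10": {"input:close"},
    "div(close_sma5,close_ewma10)": {"input:close", "close_sma5", "close_ewma10"},
}

# Invert to "node -> dependents", the direction shown in the visualization.
dependents = {}
for node, parents in deps.items():
    for parent in parents:
        dependents.setdefault(parent, []).append(node)
for parent in sorted(dependents):
    print(f"{parent} -> {sorted(dependents[parent])}")

# Any valid order places every node after all of its predecessors.
# Input nodes are dropped to leave features only.
order = [n for n in TopologicalSorter(deps).static_order()
         if not n.startswith("input:")]
print(order)
```

The relative order of `close_sma5` and `close_ewma10` is not fixed because neither depends on the other; only the ratio is guaranteed to come last.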
Reproducibility: save and load pipeline configurations¶
FeatureKit and Feature support JSON-serializable configurations. You can save a pipeline and reload it later to reproduce the same features.
kit.save_config("featurekit.json")
kit2 = FeatureKit.from_config("featurekit.json")
out2 = kit2.build(df, backend="pd", order="topo")
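Reproducibility depends on the configuration surviving a JSON round trip unchanged, which in turn means every parameter must be a JSON-serializable value. The sketch below demonstrates that property with the standard library only. The schema is hypothetical and is not the file format written by `save_config`.

```python
import json
import os
import tempfile

# Hypothetical schema, for illustration only; not finmlkit's actual format.
config = {
    "retain": ["close"],
    "features": [
        {"name": "close_sma5", "transform": "SMA",
         "params": {"window": 5, "input_col": "close"}},
        {"name": "close_ewma10", "transform": "EWMA",
         "params": {"span": 10, "input_col": "close"}},
    ],
}

path = os.path.join(tempfile.mkdtemp(), "featurekit_sketch.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump(config, fh, indent=2, sort_keys=True)

with open(path, encoding="utf-8") as fh:
    restored = json.load(fh)

# A lossless round trip is what makes the rebuilt features reproducible.
print(restored == config)
```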
Tips¶
Use order="topo" when your feature list isn’t already dependency-sorted.
Compose is intended for single-output transforms. For multi-output steps, create intermediate Features or manage DataFrame columns explicitly.
Use the pandas backend (backend="pd") when developing or debugging; switch to Numba (backend="nb") for performance once things work.
Integrating external libraries (e.g. TA-Lib) with ExternalFunction¶
You can integrate third-party Python libraries into your feature pipelines via
ExternalFunction. This allows you to call external functions (by object or
import path) as transforms while keeping consistent input/output handling and
full serialization support.
Key points:
Accepts a Callable (recommended) or an import path string ("pkg.mod.func").
pass_numpy=True passes NumPy arrays to the external function (useful for TA-Lib).
Supports single or multiple outputs. For multi-output functions, provide output_cols with one name per returned output.
Fully serializable: configurations round-trip via FeatureKit.save_config / FeatureKit.from_config.
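The two ways of naming a function (object or dotted import path) and the effect of passing NumPy arrays can be sketched with `importlib`. The helpers `resolve_callable` and `apply_external` are hypothetical names for this illustration and do not describe how ExternalFunction is implemented; `numpy.cumsum` stands in for a third-party indicator.

```python
import importlib

import numpy as np
import pandas as pd


def resolve_callable(func):
    """Return `func` if it is callable, else import it from a dotted path."""
    if callable(func):
        return func
    module_path, _, attr = func.rpartition(".")
    return getattr(importlib.import_module(module_path), attr)


def apply_external(df, func, input_col, output_col, pass_numpy=True, args=()):
    """Call an external function on one column and re-attach the index."""
    fn = resolve_callable(func)
    data = df[input_col].to_numpy() if pass_numpy else df[input_col]
    result = fn(data, *args)
    return pd.Series(np.asarray(result), index=df.index, name=output_col)


prices = pd.DataFrame({"close": [1.0, 2.0, 4.0]})
by_path = apply_external(prices, "numpy.cumsum", "close", "close_cumsum")
by_object = apply_external(prices, np.cumsum, "close", "close_cumsum")
print(by_path.equals(by_object))
```

An import path is convenient for serialization because a string can be written to JSON directly, whereas a callable has to be mapped back to a name first.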
Example: TA-Lib SMA/RSI using callables¶
import talib
import numpy as np
from finmlkit.feature.kit import Feature, FeatureKit
from finmlkit.feature.transforms import ExternalFunction
# Wrap TA-Lib indicators; pass_numpy=True for ndarray inputs
ext_sma14 = ExternalFunction(talib.SMA, input_cols="close", output_cols="talib_sma14", args=[14], pass_numpy=True)
ext_rsi14 = ExternalFunction(talib.RSI, input_cols="close", output_cols="talib_rsi14", args=[14], pass_numpy=True)
f_sma14 = Feature(ext_sma14)
f_rsi14 = Feature(ext_rsi14)
kit = FeatureKit([f_sma14, f_rsi14], retain=["close"]) # compute both
out = kit.build(df, backend="pd", order="topo")
# Serialize and load back
kit.save_config("featurekit_talib.json")
kit2 = FeatureKit.from_config("featurekit_talib.json")
out2 = kit2.build(df, backend="pd", order="topo")
assert set(out.columns) == set(out2.columns)
Installation notes for TA-Lib¶
TA-Lib may require platform-specific setup. Try:
pip install TA-Lib
If that fails, consider pip install talib-binary (prebuilt wheels).
When using pass_numpy=True, ensure your input columns are numeric and free of mixed types for best compatibility.
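One way to guard against mixed-type columns is to coerce them to a float64 array before they reach the external function. The helper `to_float64_array` below is a hypothetical name, and the sketch assumes that the wrapped function expects float64 input, as TA-Lib's Python wrapper generally does.

```python
import numpy as np
import pandas as pd


def to_float64_array(df, col):
    """Coerce a column to a float64 ndarray; unparseable values become NaN."""
    numeric = pd.to_numeric(df[col], errors="coerce")
    return numeric.to_numpy(dtype="float64")


# A column mixing ints, numeric strings, missing values and junk.
mixed = pd.DataFrame({"close": [100, "101.5", None, "n/a"]})
arr = to_float64_array(mixed, "close")
print(arr.dtype, arr)
```

Coercion hides bad inputs as NaN, so it is worth counting the NaNs afterwards rather than assuming the column was clean.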