Feature pipelines: Compose, FeatureKit and the computation graph
================================================================

This tutorial shows how to build robust feature pipelines using Compose and
FeatureKit, and how to visualize and leverage the underlying dependency graph
(ComputationGraph) to compute features in a valid order automatically.

Prerequisites
-------------

.. code-block:: python

   import pandas as pd
   import numpy as np
   from finmlkit.feature.kit import Feature, FeatureKit, Compose
   from finmlkit.feature.transforms import SMA, EWMA

Dataset
-------

.. code-block:: python

   idx = pd.date_range("2024-01-01", periods=64, freq="D")
   rng = np.random.default_rng(0)
   df = pd.DataFrame({
       "close": 100 + rng.normal(0, 1, len(idx)).cumsum(),
   }, index=idx)

Chaining transforms with Compose
--------------------------------

Compose lets you chain single-output transforms into a linear pipeline. The
first transform determines the input column, and each subsequent transform
consumes the previous output.

.. code-block:: python

   t1 = SMA(3, input_col="close")
   t2 = EWMA(5, input_col=t1.output_name)  # consume SMA output
   pipeline = Compose(t1, t2)

   # Wrap in a Feature for later use in FeatureKit or math ops
   f_pipeline = Feature(pipeline)

   result = f_pipeline(df, backend="pd")
   print(result.name)  # e.g. close_sma3_ewma5

Batch execution and caching with FeatureKit
-------------------------------------------

FeatureKit runs multiple Feature objects against a DataFrame, incrementally
caching results so that dependent features can reuse previously computed
columns.

.. code-block:: python

   f_sma = Feature(SMA(5, input_col="close"))
   f_ewma = Feature(EWMA(10, input_col="close"))
   f_ratio = f_sma / f_ewma  # depends on both above

   kit = FeatureKit([f_ratio, f_sma, f_ewma], retain=["close"])  # intentionally unsorted

   # Compute features in topological order inferred from dependencies
   out = kit.build(df, backend="pd", order="topo")
   print(out.columns)

Visualizing dependencies with ComputationGraph
----------------------------------------------

FeatureKit can build a dependency graph from your features. Input nodes are
prefixed with ``input:`` and feature nodes are the output names of your
Feature objects.

.. code-block:: python

   g = kit.build_graph()
   print(g.visualize())

   # Example output (truncated):
   # ComputationGraph:
   #   input:close -> [close_ewma10, close_sma5, div(close_sma5,close_ewma10)]
   #   close_ewma10 -> [div(close_sma5,close_ewma10)]
   #   close_sma5 -> [div(close_sma5,close_ewma10)]

   # Topological order over features only (input nodes omitted):
   print(kit.topological_order())

Reproducibility: save and load pipeline configurations
------------------------------------------------------

FeatureKit and Feature support JSON-serializable configurations. You can save a
pipeline and reload it later to reproduce the same features.

.. code-block:: python

   kit.save_config("featurekit.json")
   kit2 = FeatureKit.from_config("featurekit.json")
   out2 = kit2.build(df, backend="pd", order="topo")

Tips
----

- Use ``order="topo"`` when your feature list isn’t already dependency-sorted.
- Compose is intended for single-output transforms. For multi-output steps,
  create intermediate Features or manage DataFrame columns explicitly.
- Use the pandas backend (``backend="pd"``) when developing or debugging; switch
  to Numba (``backend="nb"``) for performance once things work.


Integrating external libraries (e.g. TA-Lib) with ExternalFunction
------------------------------------------------------------------

You can integrate third-party Python libraries into your feature pipelines via
``ExternalFunction``. This allows you to call external functions (by object or
import path) as transforms while keeping consistent input/output handling and
full serialization support.

Key points:

- Accepts a Callable (recommended) or an import path string (``"pkg.mod.func"``).
- ``pass_numpy=True`` passes NumPy arrays to the external function (useful for TA-Lib).
- Supports single or multiple outputs. For multi-output functions, provide
  ``output_cols`` with matching length.
- Fully serializable: configurations round-trip via ``FeatureKit.save_config``/
  ``FeatureKit.load_config``.

Example: TA-Lib SMA/RSI using callables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import talib
   import numpy as np
   from finmlkit.feature.kit import Feature, FeatureKit
   from finmlkit.feature.transforms import ExternalFunction

   # Wrap TA-Lib indicators; pass_numpy=True for ndarray inputs
   ext_sma14 = ExternalFunction(talib.SMA, input_cols="close", output_cols="talib_sma14", args=[14], pass_numpy=True)
   ext_rsi14 = ExternalFunction(talib.RSI, input_cols="close", output_cols="talib_rsi14", args=[14], pass_numpy=True)

   f_sma14 = Feature(ext_sma14)
   f_rsi14 = Feature(ext_rsi14)

   kit = FeatureKit([f_sma14, f_rsi14], retain=["close"])  # compute both
   out = kit.build(df, backend="pd", order="topo")

   # Serialize and load back
   kit.save_config("featurekit_talib.json")
   kit2 = FeatureKit.from_config("featurekit_talib.json")
   out2 = kit2.build(df, backend="pd", order="topo")

   assert set(out.columns) == set(out2.columns)

Installation notes for TA-Lib
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- TA-Lib may require platform-specific setup. Try:

  - ``pip install TA-Lib``
  - If that fails, consider ``pip install talib-binary`` (prebuilt wheels).

- When using ``pass_numpy=True``, ensure your input columns are numeric and
  free of mixed types for best compatibility.