marketgoblin
Download, store, and load financial market data — fast, without fuss.
Features
- Multi-dataset fetch & persist
  Download OHLCV, shares-outstanding, or dividends by symbol and date range via a Dataset enum. When save_path is set, data is automatically sliced into monthly Parquet files on disk.
- Lazy by default
  Built on Polars. Every data path returns a pl.LazyFrame; nothing is computed until you call .collect().
- Batch fetching
  fetch_many() uses a ThreadPoolExecutor with a token-bucket rate limiter. Failed symbols are logged and skipped; they never crash the batch.
- Pluggable sources
  Subclass BaseSource, implement one method, register in one line. Ships with YahooSource and CSVSource out of the box.
- Reliable by default
  YahooSource retries transient failures with exponential backoff. Writes are atomic (.tmp rename). JSON sidecars record metadata per slice.
- Predictable storage layout
  Uniform across datasets: {save_path}/{provider}/{dataset}/{SYMBOL}/{SYMBOL}_{YYYY-MM}.pq. OHLCV adjusted + raw share one file, distinguished by the is_adjusted column. Every slice is inspectable and portable with any Parquet reader.
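
The storage layout and the atomic-write pattern described in the feature list can be sketched with the standard library alone. This is an illustration under stated assumptions, not marketgoblin's own code: `slice_path`, `write_atomic`, and `write_slice` are hypothetical helpers, the lowercase `ohlcv` directory name and the uppercasing of the symbol are guesses at how the path segments are spelled, and the sidecar keys are made up for the example.

```python
import json
import os
import tempfile
from pathlib import Path


def slice_path(save_path, provider, dataset, symbol, month):
    """Build {save_path}/{provider}/{dataset}/{SYMBOL}/{SYMBOL}_{YYYY-MM}.pq."""
    symbol = symbol.upper()  # assumption: symbols are stored uppercase
    return Path(save_path) / provider / dataset / symbol / f"{symbol}_{month}.pq"


def write_atomic(path: Path, payload: bytes) -> None:
    """Write to a sibling .tmp file, then rename it over the final path."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(payload)
    # os.replace swaps the file in one step when both paths are on the
    # same filesystem, so readers never see a half-written slice.
    os.replace(tmp, path)


def write_slice(path: Path, payload: bytes, metadata: dict) -> None:
    """Write the data file plus a JSON sidecar with the same stem."""
    write_atomic(path, payload)
    sidecar = path.with_suffix(".json")
    write_atomic(sidecar, json.dumps(metadata, indent=2).encode("utf-8"))


# Demo in a throwaway directory; the payload stands in for real Parquet bytes.
with tempfile.TemporaryDirectory() as root:
    target = slice_path(root, "yahoo", "ohlcv", "aapl", "2024-01")
    write_slice(target, b"placeholder bytes", {"rows": 21, "start": 20240102, "end": 20240131})
    written = sorted(p.name for p in target.parent.iterdir())
    print(written)
```

Writing the sidecar through the same helper means neither file is ever visible in a partial state, although the pair is still not updated as a single transaction.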
Quick start
    import polars as pl
    from marketgoblin import MarketGoblin

    goblin = MarketGoblin(provider="yahoo", save_path="./data")

    # OHLCV is a tidy stacked frame: each day appears twice
    # (is_adjusted=True / False). Filter to pick a variant.
    lf = goblin.fetch("AAPL", "2024-01-01", "2024-03-31", parse_dates=True)
    print(lf.filter(pl.col("is_adjusted")).collect())

    from marketgoblin import MarketGoblin

    goblin = MarketGoblin(provider="yahoo", save_path="./data")

    # Fetch several symbols concurrently; failed symbols are logged and skipped.
    results = goblin.fetch_many(
        ["AAPL", "MSFT", "GOOGL", "NVDA"],
        start="2024-01-01",
        end="2024-03-31",
        max_workers=4,
        requests_per_second=2.0,
    )

    for symbol, lf in results.items():
        print(f"{symbol}: {lf.collect().height} rows")
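
To make the batch behaviour concrete, here is a rough sketch of how a thread pool, a token-bucket rate limiter, and skip-on-failure handling can fit together. It is not the library's implementation: `TokenBucket`, `fetch_all`, and `fake_fetch` are hypothetical names, and only the `max_workers` and `requests_per_second` parameters mirror the call above.

```python
import logging
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

log = logging.getLogger("batch_sketch")


class TokenBucket:
    """Thread-safe token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate: float, capacity: float = 1.0) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self) -> None:
        """Block until one token is available, then consume it."""
        while True:
            with self.lock:
                now = time.monotonic()
                refill = (now - self.updated) * self.rate
                self.tokens = min(self.capacity, self.tokens + refill)
                self.updated = now
                if self.tokens >= 1.0:
                    self.tokens -= 1.0
                    return
                wait = (1.0 - self.tokens) / self.rate
            time.sleep(wait)  # sleep outside the lock so other threads can proceed


def fetch_all(symbols, fetch_one, max_workers=4, requests_per_second=2.0):
    """Run fetch_one(symbol) concurrently; log and skip symbols that raise."""
    bucket = TokenBucket(requests_per_second)

    def task(symbol):
        bucket.acquire()  # every request waits for a token first
        return fetch_one(symbol)

    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(task, s): s for s in symbols}
        for future in as_completed(futures):
            symbol = futures[future]
            try:
                results[symbol] = future.result()
            except Exception as exc:
                log.warning("skipping %s: %s", symbol, exc)
    return results


def fake_fetch(symbol):
    """Stand-in for a real network call; 'BAD' simulates a failing symbol."""
    if symbol == "BAD":
        raise ValueError("no data")
    return f"{symbol} rows"


results = fetch_all(["AAPL", "BAD", "MSFT"], fake_fetch, requests_per_second=50.0)
print(sorted(results))  # ['AAPL', 'MSFT']
```

Catching the exception where each future's result is read is what keeps one bad symbol from aborting the rest of the batch.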
Data conventions
| Property | Value |
|---|---|
| Date on disk | int32 YYYYMMDD (e.g. 20240101) — use parse_dates=True to get pl.Date |
| OHLC columns | float32 |
| Volume column | int64 |
| is_adjusted column | bool (OHLCV only; distinguishes adjusted vs raw rows) |
| Shares column | int64 |
| Dividend column | float32 |
| Parquet path | {save_path}/{provider}/{dataset}/{SYMBOL}/{SYMBOL}_{YYYY-MM}.pq (uniform across datasets) |
| JSON sidecar | Same path, .json — row count, date range, per-dataset stats (OHLCV also records has_adjusted/has_raw and missing trading days) |
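
The integer date encoding in the table is simple to convert by hand when a slice is read with a tool other than marketgoblin. The helpers below are a standard-library sketch with hypothetical names (`to_date`, `to_int`, `month_key`); inside the library, `parse_dates=True` is the documented way to get `pl.Date` values.

```python
from datetime import date


def to_date(yyyymmdd: int) -> date:
    """Decode an int such as 20240101 into a datetime.date."""
    year, rest = divmod(yyyymmdd, 10_000)
    month, day = divmod(rest, 100)
    return date(year, month, day)


def to_int(d: date) -> int:
    """Encode a datetime.date as a YYYYMMDD int."""
    return d.year * 10_000 + d.month * 100 + d.day


def month_key(yyyymmdd: int) -> str:
    """Return the YYYY-MM part used in slice file names."""
    year, month = divmod(yyyymmdd // 100, 100)
    return f"{year:04d}-{month:02d}"


print(to_date(20240101))          # 2024-01-01
print(to_int(date(2024, 3, 31)))  # 20240331
print(month_key(20240315))        # 2024-03
```

Because the encoding sorts the same way as the calendar, range filters on the raw integer column behave like date filters.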
Datasets
| Dataset | Providers | Notes |
|---|---|---|
| Dataset.OHLCV | yahoo, csv | Daily open/high/low/close/volume. Tidy stacked: each day appears twice with is_adjusted=True/False. Adjusted Open/High/Low are derived locally from Adj Close / Close (zero numerical drift vs auto_adjust=True, half the network calls). |
| Dataset.SHARES | yahoo | Shares outstanding: sparse, corporate-action-driven cadence. Deduplicated to one row per day (last value wins). |
| Dataset.DIVIDENDS | yahoo | Cash dividend events, typically quarterly. Filtered to the requested [start, end] range. |
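
Deriving adjusted Open/High/Low from Adj Close / Close amounts to scaling each raw price by a per-day factor. The sketch below uses plain Python so the arithmetic stays visible; the `Bar` dataclass, `adjust`, and `stack` are hypothetical, the prices are made-up numbers, and volume is left untouched because this README does not say whether it is rescaled.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Bar:
    date: int  # YYYYMMDD
    open: float
    high: float
    low: float
    close: float
    volume: int
    is_adjusted: bool = False


def adjust(raw: Bar, adj_close: float) -> Bar:
    """Scale open/high/low by adj_close / close and mark the row as adjusted."""
    factor = adj_close / raw.close
    return replace(
        raw,
        open=raw.open * factor,
        high=raw.high * factor,
        low=raw.low * factor,
        close=adj_close,
        is_adjusted=True,
    )


def stack(raw_bars, adj_closes):
    """Return the tidy stacked form: an adjusted and a raw row for each day."""
    rows = []
    for bar, adj_close in zip(raw_bars, adj_closes):
        rows.append(adjust(bar, adj_close))
        rows.append(bar)
    return rows


raw = Bar(date=20240102, open=100.0, high=110.0, low=90.0, close=100.0, volume=1000)
rows = stack([raw], [50.0])  # factor = 50.0 / 100.0 = 0.5
for row in rows:
    print(row.is_adjusted, row.open, row.high, row.low, row.close)
```

Both variants come from a single raw download plus the Adj Close series, which is consistent with the table's note about needing half the network calls.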