Skill Index

ai-asset-pricing/

panel-data-rules

community[skill]

Rules and gotchas for CRSP, Compustat, and financial panel data: safe lagging/leading with date gap checks, look-ahead bias prevention, CCM linking, book equity construction, and missing value conventions. Auto-apply when working with CRSP, Compustat, or financial panel data.

$/plugin install ai-asset-pricing

details

Financial Panel Data Rules

Apply these rules whenever writing code that manipulates CRSP, Compustat, OptionMetrics, or any financial panel data. These are hard-won lessons — violating them causes silent data errors.

Rule 1: Safe Lagging/Leading (MANDATORY)

Every groupby().shift() and .diff() MUST be followed by a date gap check. Data is rarely contiguous — missing quarters, fiscal year changes, and delistings create gaps. A bare shift(1) will silently pair non-adjacent observations.

After any shift(k), auto-insert this validation:

df["_date_shifted"] = df.groupby(id_col)[date_col].shift(k)
day_gap = (df[date_col] - df["_date_shifted"]).dt.days
df.loc[(day_gap > max_gap) | (day_gap < min_gap), shifted_col] = np.nan
df.drop(columns=["_date_shifted"], inplace=True)

Thresholds by data source (validated against real WRDS data):

Sourceshift(1) max gapshift(1) min gapNotes
CRSP msf/msf_v2 (month-end normalized)31 days28 days2.26M obs: max gap = 31 days exactly (v2: mthcaldt)
CRSP msf/msf_v2 (raw trading dates)33 days28 daysRaw dates vary by last trading day
Compustat fundq100 days80 days99.89% in [90,92]; >100 = missing quarter
Compustat funda380 days350 days<350 = FYE change (correctly flagged)
CRSP dsf/dsf_v25 days0 daysWeekends/holidays (v2: dlycaldt)

For shift(k): max_gap = k * single_max, min_gap = k * single_min.

Infer frequency from data source (table name, column names). Columns ending in q (ibq, atq, seqq) → quarterly. If source unknown, ask the user.

CRSP month-end normalization (do this FIRST, before any gap checks or merges):

df["date"] = df["date"].dt.to_period("M").dt.to_timestamp("M")

This converts raw trading dates to calendar month-end. Required for: (1) gap checks to use the 31-day threshold, (2) matching CRSP to Compustat/JKP data, (3) consistent date joins across datasets.

Rule 2: Accounting Data Timing (Compustat Only)

Does NOT apply to JKP/Global Factor Data — JKP already has point-in-time alignment built in.

  • Annual: Available 4 months after fiscal year end: datadate + DateOffset(months=4)
  • Quarterly earnings (IBQ, Roe, dRoe): Available after RDQ; if RDQ missing, fallback to datadate + DateOffset(months=4). RDQ timing is specific to earnings/income variables.
  • Staleness filter: Ask the user before applying. Some methodologies set earnings variables to NaN if fiscal quarter end is >6 months before panel date.
  • merge_asof staleness: direction='backward' has no max-age limit — it will match data from years ago if nothing newer exists. In the BM battle test, some stocks carried BE data 11+ years old. Consider capping staleness (e.g., 18-24 months for annual signals), but ask the user first.
  • Compustat required filters: Every comp.funda/comp.fundq query MUST include indfmt='INDL' AND datafmt='STD' AND popsrc='D' AND consol='C'. Omitting these creates duplicate gvkey-datadate rows silently.
  • merge_asof pattern:
    # CORRECT — sort by ON key only (hard requirement)
    left = left.sort_values('date').reset_index(drop=True)
    right = right.sort_values('date').reset_index(drop=True)
    pd.merge_asof(left, right, by='gvkey', on='date', direction='backward')
    
    Sorting by [gvkey, date] raises ValueError: left keys must be sorted because dates restart at each new gvkey, breaking global monotonicity.

Rule 3: Panel Identifiers & CCM Linking

ID columns:

ContextPrimary IDGroup-by for shift/diff
CRSP onlypermnopermno
Compustat onlygvkeygvkey
Merged CRSP-Compustatpermnopermno

CCM linking:

  • Join on permno, filter linktype IN ('LC','LU'), filter date >= linkdt AND date <= linkenddt
  • SQL column name: CCM table column is lpermno (double precision), not permno. Always alias: l.lpermno::int AS permno
  • Set null linkenddt to '9999-12-31' sentinel before filtering
  • Pandas gotcha: pd.to_datetime('9999-12-31') overflows nanosecond timestamps. Replace sentinel with '2099-12-31' before converting: ccm["linkenddt"] = ccm["linkenddt"].astype(str).str.replace("9999-12-31", "2099-12-31")
  • Validated: zero duplicate (permno, date) with linktype LC/LU filter

Always assert no duplicates after merge:

assert not df.duplicated(subset=['permno','date']).any()

Multi-class firms: Firms with multiple PERMNOs per GVKEY exist (2-4 share classes). For firm-level ME: groupby(['gvkey','date'])['me'].transform('sum').

Rule 4: Stock Universe Filters — ALWAYS ASK USER

Never apply filters by default. Present options:

  • SHRCD in (10,11) — common stocks
  • EXCHCD in (1,2,3) — NYSE, AMEX, NASDAQ
  • SIC 6000-6999 — financials
  • BE > 0, price/mcap minimums

v2 equivalents (string codes):

  • Common stock: sharetype='NS' AND securitytype='EQTY' AND securitysubtype='COM'
  • US incorporated: usincflg='Y'
  • Major exchanges: primaryexch IN ('N','A','Q')
  • Active: tradingstatusflg='A'

Rule 5: Missing Values (Compustat-Specific)

Fill with 0 (standard convention):

ColumnTableReason
txditc, txditcqfunda/fundqDeferred taxes — missing = zero
xrdfundaR&D — missing = no R&D
drc, drltfundaDeferred revenue — missing = zero
dltt, dlcfundaDebt — missing in ratio = no debt
dvpsxqfundqDividends per share — missing = no dividend
rect, invt, xpp, ap, xaccfundaWorking capital — missing CHANGE = 0

Set to NaN: Ratios with AT ≤ 0 or BEQ ≤ 0 denominators.

Rule 6: Book Equity Hierarchies

Annual BE (Davis-Fama-French 2000):

SHE = SEQ → (CEQ + PSTK) → (AT - LT)
PS  = PSTKRV → PSTKL → PSTK
BE  = SHE + TXDITC(fillna 0) - PS(fillna 0)

Quarterly BEQ:

SHEQ = SEQQ → (CEQQ + PSTKQ) → (ATQ - LTQ)
PSQ  = PSTKRQ → PSTKQ          ← PSTKRQ first!
BEQ  = SHEQ + TXDITCQ(fillna 0) - PSQ(fillna 0)

Q4 supplement: If FQTR == 4 and BEQ missing, use annual BE from matching FYEARQ.

Rule 7: Winsorization — ASK USER FIRST

Do NOT winsorize by default. Confirm: whether to winsorize, which variables, cutoffs (1/99 standard), cross-sectional by month vs pooled.

Standard pattern when confirmed:

def winsorize_by_month(df, col, date_col="date", lo=0.01, hi=0.99):
    def _w(s):
        x = s.dropna()
        if x.empty: return s
        return s.clip(np.nanpercentile(x, lo*100), np.nanpercentile(x, hi*100))
    df[f"{col}_w"] = df.groupby(date_col)[col].transform(_w)

Rule 8: Quarterly Data Gotchas

  • Q4 supplements: CSHOQ/AJEXQ missing in Q4 → use annual CSHO/AJEX
  • Clean surplus imputation: Impute missing BEQ forward up to 4 quarters: BEQ_t = BEQ_{t-j} + sum(IBQ) - sum(DVQ) for j in 1..4
  • Fiscal quarter: Use FQTR column, not calendar month
  • Dividends: DVQ = DVPSXQ * CSHOQ * AJEXQ (units: millions)

technical

github
Alexander-M-Dickerson/ai-asset-pricing
stars
49
license
MIT
contributors
1
last commit
2026-04-19T07:58:01Z
file
.claude/skills/panel-data-rules/SKILL.md

related