Metrics Analysis Research: Architecture and Tooling Strategy¶
Date: 2025-12-05 Status: Active Research Decision Point: Choose between building custom Python analysis vs OpenTelemetry integration
Executive Summary¶
After extensive research across modern Python data tools, OpenTelemetry observability stacks, and existing LLM agent monitoring solutions, a hybrid approach emerges as the optimal strategy:
Recommendation: Start with lightweight Python analysis (DuckDB + Parquet), design for future OpenTelemetry integration
Key Findings:
-
Current metrics extraction is sufficient - Our 60+ metrics from agent transcripts cover token usage, git operations, pre-commit runs, and quality indicators. OpenTelemetry would provide traces and spans but wouldn't fundamentally change the insights available for weekly analysis.
-
DuckDB + Parquet = Sweet Spot - DuckDB excels at analytical queries on Parquet files (the exact use case for weekly metrics analysis), requiring minimal code changes from our current JSONL approach. 5-10x faster than pandas, with SQL familiarity.
-
AI-assisted EDA tools can accelerate insight discovery - Libraries like Sweetviz, D-Tale, and RATH can generate automated reports and identify patterns in our metrics data with 2-3 lines of code, ideal for weekly analysis without building custom dashboards.
-
OpenTelemetry adds complexity for uncertain benefit - While Claude Code supports OTel export, the homelab infrastructure (collector, Prometheus, Grafana, storage) requires significant setup and maintenance. The payoff is stronger for real-time monitoring (not our use case) than periodic analysis.
-
No Claude Code community metrics tools exist yet - Claude Code is relatively new, and the community is focused on plugins, agents, and IDE integrations. We're pioneering the metrics analysis space, but can leverage general LLM observability patterns.
Phased Approach:
- Phase 1 (Complete): Hook-based metrics extraction to JSONL ✅
- Phase 2 (Next 2-4 weeks): Migrate to Parquet storage, build DuckDB query library, integrate automated EDA
- Phase 3 (3-6 months): Add OpenTelemetry export alongside Parquet (dual-tracking), experiment with homelab Grafana
- Phase 4 (6+ months): Evaluate whether OTel provides enough value to replace custom analysis
This approach maximizes learning and utility in the short term while keeping the path open for full observability infrastructure later.
Research Questions¶
1. Modern Python Data Stack¶
- DuckDB vs Polars vs Pandas for metrics analysis
- Parquet vs JSONL for storage
- AI-assisted EDA libraries (automated insight generation)
- Cost/benefit of custom analysis infrastructure
2. OpenTelemetry Integration¶
- What data does OTel provide beyond our current metrics?
- Would OTel replace or complement current metrics extraction?
- Open-source analysis tools: Prometheus, Grafana, Jaeger, etc.
- Infrastructure requirements for homelab OTel setup
- Migration path from current JSONL metrics
3. Existing Solutions¶
- Open-source Claude Code analysis tools
- LLM agent metrics/observability libraries
- Pre-built dashboards or analysis frameworks
- Community best practices
Goals and Constraints¶
Usage Pattern: Weekly or less frequent analysis (not real-time dashboards)
Primary Objectives:
- Identify optimization opportunities (token usage, cache efficiency)
- Understand Claude Code feature effectiveness
- Discover usage patterns and workflow improvements
Decision Criteria:
- Time/effort to implement vs value gained
- Risk of building infrastructure that becomes obsolete
- Ability to leverage existing open-source work
- Support for both simple and sophisticated analysis approaches
Research Area 1: Modern Python Data Libraries¶
DuckDB¶
Overview: SQL-based analytical database optimized for OLAP queries, designed to run embedded within applications. Combines cost-based optimizer with vectorized execution and late materialization for consistent performance.
Strengths:
- Fastest for aggregations: 9.4× faster than pandas for GROUP BY operations (codecentric)
- Minimal memory footprint: Uses least memory with automatic spill-to-disk for datasets larger than RAM
- Handles massive datasets: Can process 50GB datasets on laptop by streaming through memory (motherduck)
- SQL familiarity: No new API to learn, write standard SQL queries
- Direct file querying: Can query Parquet files without loading into memory
- Seamless integration: Works directly with Pandas and Polars dataframes
Weaknesses:
- Slower for filtering operations (22.18s vs Polars 1.89s due to SQL overhead)
- Requires SQL knowledge (though this is a strength for many users)
- Not ideal for complex data transformations requiring chaining operations
Fit for Metrics Analysis:
Excellent fit - Our use case is exactly what DuckDB excels at: analytical queries (aggregations, time-series analysis, trend detection) on structured data. Weekly metrics analysis involves GROUP BY date ranges, calculating averages, identifying outliers - all operations DuckDB is optimized for.
Polars¶
Overview: DataFrame library optimized for high-performance multithreaded computing on single nodes. Built in Rust with Python bindings, focuses on speed and memory efficiency with a lazy execution model.
Strengths:
- Fastest for I/O and joins: 7.7× faster than pandas for CSV loading, 5× faster for joins (codecentric)
- Best for ETL pipelines: Excels at filtering (1.89s vs pandas 9.47s), data transformations, chained operations
- Query optimization: Lazy execution enables projection pushdown and predicate filtering
- Modern API: Cleaner, more consistent API than pandas
- Memory efficient: Substantially better memory usage than pandas
Weaknesses:
- Slightly slower than DuckDB for aggregations (8.7× vs 9.4× faster than pandas)
- Requires learning new API (though cleaner than pandas)
- Less mature ecosystem than pandas (though growing rapidly)
Fit for Metrics Analysis:
Good fit, but secondary to DuckDB - Polars shines for ETL and data transformations. If we were building complex data pipelines with filtering and reshaping, Polars would be ideal. For our use case (primarily analytical queries on structured metrics), DuckDB's SQL interface and aggregation performance are more aligned.
Pandas (Baseline Comparison)¶
Overview: Industry standard DataFrame library, ubiquitous in data science but showing its age in 2025.
Why User Prefers to Avoid:
- Performance: 5-10× slower than DuckDB/Polars for operations >1M rows
- Memory inefficiency: Loads entire dataset into RAM, no streaming capabilities
- API inconsistencies: Mix of functional, object-oriented, and numpy-style operations
- Legacy design: Not designed for modern multi-core CPUs
When Pandas Still Makes Sense:
- Datasets under 1M rows where simplicity > performance
- Heavy ecosystem dependencies (many libraries still expect pandas DataFrames)
- Quick exploratory work where familiarity reduces friction
Our Case: With weekly metrics analysis and potential growth to thousands of commits, we're in the "DuckDB/Polars sweet spot" where performance gains justify the learning curve.
Parquet Storage Format¶
Overview: Columnar storage format designed for efficient data analytics. Open-source, widely supported across big data ecosystems.
Advantages over JSONL:
- Storage efficiency: 2-5× smaller than JSON/CSV due to columnar compression (towardsdatascience)
- Query performance: Column pruning allows reading only needed columns, not entire rows (stackshare)
- Analytics optimization: Columnar format ideal for aggregations, group-by operations
- Type preservation: Stores schema and types, no parsing overhead
- Compression: Per-column compression with high compression ratios
- Ecosystem support: DuckDB, Polars, Spark, Athena all read Parquet natively
Migration Considerations:
- Backward compatibility: Keep JSONL as append-only log, generate Parquet for analysis
- Simple conversion: PyArrow can convert JSONL to Parquet in ~10 lines of code (medium)
- Incremental approach: Convert historical data once, new metrics to both formats during transition
Migration Code Example:
import pyarrow.json as pj
import pyarrow.parquet as pq
# Read JSONL, write Parquet
table = pj.read_json('commit-metrics-2025-12-05.jsonl')
pq.write_table(table, 'commit-metrics-2025-12-05.parquet')
AI-Assisted EDA Libraries¶
Tools Researched:
- Sweetviz - Automated EDA with dataset comparison capabilities (kanaries)
- D-Tale - Interactive GUI for data analysis in browser
- RATH - AI-powered automated discovery of patterns, insights, causal relationships with auto-generated visualizations
- Dora - Generates visual summaries, statistical tests, feature engineering ideas
Capabilities:
- Automated profiling: Generate comprehensive HTML reports with distributions, correlations, missing values
- Pattern discovery: AI-powered identification of trends, anomalies, relationships
- Visual summaries: Interactive charts and graphs without manual coding
- Statistical tests: Automated hypothesis testing and significance analysis
- Comparison mode: Side-by-side analysis of different time periods
Integration Potential for Our Use Case:
High value, low effort - For weekly metrics analysis, tools like Sweetviz can generate comprehensive reports with 2-3 lines of code:
import sweetviz as sv
import duckdb
# Query last week's metrics with DuckDB
last_week = duckdb.sql("SELECT * FROM 'metrics.parquet' WHERE date >= '2025-11-28'").df()
this_week = duckdb.sql("SELECT * FROM 'metrics.parquet' WHERE date >= '2025-12-05'").df()
# Generate comparison report
report = sv.compare([last_week, "Last Week"], [this_week, "This Week"])
report.show_html('weekly_metrics_analysis.html')
This provides automated insights (cache hit rate trends, token usage patterns, pre-commit success rates) without building custom dashboards.
Research Area 2: OpenTelemetry Integration¶
What is OpenTelemetry?¶
Overview: Vendor-neutral, open-source observability framework providing standardized APIs, SDKs, and tools for collecting telemetry data (traces, metrics, logs) from applications. Industry standard supported by CNCF, eliminates vendor lock-in.
Key Components:
- OpenTelemetry SDK: Instruments code to emit telemetry
- OpenTelemetry Collector: Receives, processes, and exports telemetry to backends
- Semantic Conventions: Standardized attribute names and meanings
- Exporters: Plugins for sending data to various backends (Prometheus, Jaeger, Grafana, etc.)
Data Available from OTel¶
Traces:
- Distributed tracing: Track requests across multiple services/agents
- Spans: Individual operations with start time, duration, parent/child relationships
- Context propagation: Follow execution flow through complex systems
- Exemplars: Link traces to metrics (e.g., "this slow query" → trace showing why)
For Claude Code: Traces would show the full execution path of an agent run: tool calls, API requests to Claude, context switches, thinking time. Valuable for debugging complex multi-agent workflows.
Metrics:
- Counter: Cumulative values (total requests, errors)
- Gauge: Point-in-time values (current memory usage, active connections)
- Histogram: Distribution of values (request latency buckets)
- Time-series data: Metrics over time for trending and alerting
For Claude Code: OTel metrics would capture token usage per request, API latency, cache hit rates, tool invocation counts - similar to what we already extract from transcripts.
Logs:
- Structured logging: JSON-formatted logs with consistent fields
- Log correlation: Link logs to traces via trace ID
- Severity levels: ERROR, WARN, INFO, DEBUG with filtering
For Claude Code: Logs would capture agent messages, errors, warnings - again, similar to what we parse from transcripts.
Comparison to Current Metrics:
| Capability | Current Transcript Parsing | OpenTelemetry |
|---|---|---|
| Token metrics | ✅ 13 fields (total, input, output, cache rates) | ✅ Similar data |
| Tool usage | ✅ Tool types, counts, timing | ✅ Similar data |
| Git operations | ✅ Commits, diffs, status checks | ✅ Would need custom instrumentation |
| Pre-commit metrics | ✅ Runs, failures, logsift output | ✅ Would need custom instrumentation |
| Distributed tracing | ❌ Single agent view | ✅ Multi-agent correlation |
| Real-time monitoring | ❌ Post-hoc analysis | ✅ Live dashboards |
| Span-level detail | ❌ Message-level granularity | ✅ Operation-level spans |
| Standard ecosystem | ❌ Custom parsers | ✅ Standard tools (Grafana, etc.) |
Bottom line: OTel provides richer tracing and real-time capabilities, but for weekly analysis of single-agent runs, our current metrics capture the essential data. OTel's value proposition is stronger for complex multi-agent systems and real-time debugging.
Open-Source Analysis Tools¶
Prometheus¶
Overview: Time-series database designed for monitoring metrics. Pull-based architecture scrapes metrics from HTTP endpoints using PromQL query language.
Strengths:
- Flexible query language (PromQL) for data analysis
- Built-in alerting based on metric thresholds
- Efficient time-series storage with compression
- De facto standard for metrics in Kubernetes/cloud-native ecosystems
- Integrates seamlessly with Grafana for visualization
Fit for Our Use Case:
Moderate fit - Prometheus excels at real-time metrics scraping and alerting (e.g., "alert if cache hit rate < 80%"). For weekly analysis, we don't need pull-based scraping or real-time alerts. We could push metrics to Prometheus, but it adds infrastructure overhead for uncertain benefit.
Grafana¶
Overview: Open-source platform for visualizing time-series data from multiple sources (Prometheus, Loki, Jaeger, etc.). Provides pre-built dashboards, alerting, and correlation across metrics/logs/traces.
Strengths:
- Unified visualization: Single pane of glass for metrics, logs, traces
- Pre-built dashboards: Community templates for common use cases
- AI Observability: Native support for LLM monitoring with OpenLIT integration (grafana blog)
- Cost tracking: Built-in cost analysis for LLM usage
- Self-hosted option: Can run Grafana locally or in homelab
Fit for Our Use Case:
Good fit for long-term - If we set up a homelab OTel stack, Grafana provides polished dashboards for visualizing trends. Grafana's LLM observability features (token usage, cost tracking, performance monitoring) align with our metrics. However, setup requires Prometheus/Loki/Jaeger backends first, making it a Phase 3+ initiative.
Jaeger¶
Overview: Distributed tracing system for microservices. Visualizes traces, spans, and service dependencies to identify performance bottlenecks.
Strengths:
- Tailored for distributed tracing and understanding request flows
- Web UI for visualizing traces and drilling into latency sources
- Pluggable storage backends (Cassandra, Elasticsearch, memory)
- Root cause analysis by tracing execution across services
Fit for Our Use Case:
Low priority - Jaeger shines for debugging complex multi-service architectures. For single commit-agent runs, traces provide limited value over our current message-level transcript parsing. Becomes more relevant if we build multi-agent orchestration systems.
Other Tools¶
OpenLLMetry/Traceloop (traceloop):
- 2-line setup:
pip install traceloop-sdk+Traceloop.init(app_name="...") - LLM-specific instrumentation: Automatically captures prompts, responses, token usage
- Supports 20+ providers: OpenAI, Anthropic, Bedrock, Ollama, etc.
- Export destinations: Traceloop (SaaS), Dynatrace, SigNoz, OTel Collector
Fit: High potential for Phase 3 - OpenLLMetry provides turnkey LLM observability without custom instrumentation. Could run alongside our transcript parsing to compare data fidelity.
OpenLIT (openlit.io):
- One-line integration: Automatic tracing for LLM apps
- Pre-built Grafana dashboards: Import ready-made visualizations
- Cost optimization: Track spending across models
- Covers LLM stack: LLMs, vector DBs (Pinecone, Chroma), and GPUs
Fit: Similar to OpenLLMetry - turnkey solution for Phase 3 experimentation.
Homelab OTel Server Architecture¶
Infrastructure Requirements:
Typical self-hosted stack (based on heinrichhartmann.com and wildcat.io):
Services (Docker Compose):
- OpenTelemetry Collector (Central hub, accepts gRPC/HTTP on ports 4317/4318)
- Prometheus (Metrics storage, ~500MB RAM for small deployments)
- Loki (Log aggregation, ~300MB RAM)
- Jaeger (Trace storage, ~400MB RAM + database)
- Grafana (Visualization frontend, ~200MB RAM)
- Storage backend (optional): Cassandra, Elasticsearch, or ClickHouse for scale
Total Resources: ~1.5-2GB RAM, 10-20GB disk for 30 days retention
Setup Complexity:
- Initial setup: 4-8 hours to configure collector pipelines, storage backends, Grafana datasources
- Network configuration: Expose ports, configure exporters, set up authentication
- Data pipelines: Map OTel signals to backend storage formats
- Dashboard creation: Build custom visualizations for Claude Code metrics
Maintenance Overhead:
- Updates: Keep 5+ services updated (security patches, feature releases)
- Scaling: Monitor disk usage, tune retention policies, manage data growth
- Debugging: Troubleshoot collector pipelines, storage issues, missing data
- Backup: Ensure metrics/traces backed up if infrastructure fails
Cost-Benefit Analysis:
- Self-hosted cost: Electricity + hardware ~$5-10/month, plus ~5-10 hours/month maintenance (grafana blog)
- SaaS cost (Grafana Cloud): ~$50-100/month for moderate usage, zero maintenance
- Our use case: Weekly analysis doesn't justify real-time infrastructure overhead
Recommendation: Phase 3 experiment with lightweight setup (OTel Collector + Grafana + file-based storage) to evaluate value before committing to full stack.
Migration Path¶
Incremental Adoption:
Phase 1 (Completed): Hook-based transcript parsing → JSONL metrics ✅
Phase 2 (Recommended Next): JSONL → Parquet conversion, DuckDB analytics
Phase 3 (Experimental): Dual-track approach
- Keep transcript parsing → Parquet (primary)
- Add Claude Code OTel export → OTel Collector → Prometheus
- Run both systems in parallel for 2-4 weeks
- Compare data fidelity, insight quality, operational overhead
Phase 4 (Decision Point): Evaluate whether OTel provides sufficient incremental value:
- If yes → Migrate to OTel as primary, deprecate transcript parsing
- If no → Continue with Parquet/DuckDB, use OTel only for real-time debugging
Data Compatibility:
Current JSONL metrics are already structured JSON - can be converted to OTel format:
# Pseudo-code for OTel export from existing metrics
from opentelemetry import metrics
meter = metrics.get_meter("claude-code-metrics")
token_counter = meter.create_counter("tokens.total")
cache_gauge = meter.create_gauge("cache.hit_rate")
# Read JSONL metrics
for entry in read_jsonl_metrics():
token_counter.add(entry["tokens"]["total_tokens"])
cache_gauge.set(entry["tokens"]["cache_hit_rate"])
This allows backfilling historical data into OTel backends.
Effort Estimate:
- Parquet migration: 2-4 hours (write conversion script, test with historical data)
- DuckDB query library: 8-12 hours (build common queries, test with real metrics)
- EDA integration: 2-4 hours (experiment with Sweetviz/D-Tale, create weekly report script)
- OTel setup (experimental): 16-24 hours (homelab infrastructure, Claude Code integration, Grafana dashboards)
Total Phase 2 effort: ~12-20 hours over 2-3 weeks Total Phase 3 effort: ~20-30 hours over 4-6 weeks
Research Area 3: Existing Solutions¶
Claude Code Specific Tools¶
Findings: Claude Code is relatively new (launched 2024), and the community is still in early stages. No dedicated metrics/observability tools found.
What Exists:
- Official Documentation: Claude Code supports OpenTelemetry metrics and events (docs.claude.com)
- Analytics Admin API: Provides daily aggregated usage metrics (sessions, LOC added/removed, commits, PRs) (docs.claude.com)
- Grafana Integration: Anthropic integration for Grafana Cloud provides real-time cost/performance dashboards (grafana blog)
- Community Projects: Focus on plugins, slash commands, and IDE extensions, not metrics analysis (awesome-claude-code)
Gap Analysis:
- ✅ Infrastructure: OTel export supported by Claude Code
- ✅ High-level metrics: Admin API provides session/commit counts
- ❌ Agent-level analysis: No tools for analyzing individual agent performance
- ❌ Transcript mining: No libraries for extracting insights from agent transcripts
- ❌ Cost optimization: No tools for analyzing token efficiency or cache hit rates
Opportunity: We're pioneering agent-level metrics analysis. Our transcript parsing approach fills a gap that doesn't exist in the ecosystem yet.
LLM Agent Observability Libraries¶
Findings: Mature ecosystem exists for general LLM observability, but focused on production API usage rather than development workflows.
Key Libraries:
LangSmith (LangChain)¶
- Purpose: Observability for LangChain applications (langchain.com)
- Capabilities: Tracing, real-time monitoring, alerting, prompt evaluation
- Best Practices: Set up observability from the start, use metadata/tags for filtering (docs.smith.langchain.com)
- Limitations: Pre-built dashboards track success/error rates, not agent-specific workflows
Fit: Not directly applicable - LangSmith is for production LangChain apps, not Claude Code agents.
OpenLLMetry/Traceloop¶
- Purpose: OpenTelemetry extensions for LLM applications (traceloop)
- Capabilities: Automatic tracing of prompts, responses, token usage for 20+ providers
- 2-line setup: Minimal instrumentation overhead
- Export options: SaaS (Traceloop), self-hosted (OTel Collector, Dynatrace, SigNoz)
Fit: Could instrument Claude Code if it uses standard LLM APIs, but Claude Code's agent framework may not integrate cleanly. Worth experimenting in Phase 3.
OpenLIT¶
- Purpose: Monitoring framework for AI stack (LLMs, vector DBs, GPUs) (openlit.io)
- Capabilities: One-line integration, Grafana dashboards, cost tracking
- Pre-built dashboards: Ready-made visualizations for common metrics
Fit: Similar to OpenLLMetry - useful if we adopt OTel export from Claude Code.
Bottom Line: These libraries provide infrastructure for LLM observability but don't solve our specific problem (analyzing Claude Code agent efficiency). We'd still need custom analysis layer on top.
Pre-Built Dashboards¶
GitHub/GitLab Projects:
- OpenLIT Team's Grafana Dashboard: Pre-built dashboard for LLM performance (grafana blog)
- Awesome Claude lists: Curated collections of Claude tools, but focused on development, not analytics (awesome-claude-code)
Commercial Solutions:
- Grafana Cloud: Anthropic integration with pre-built dashboards for usage/cost (grafana.com)
- Datadog: Cloud Cost Management for Claude usage tracking (datadoghq.com)
- Traceloop: LLM observability platform (SaaS) (traceloop.com)
Our Position: Commercial dashboards focus on org-wide usage and costs, not individual developer optimization. We need more granular analysis (cache efficiency, pre-commit success rates, phase execution patterns) which these tools don't provide.
Comparative Analysis¶
Option 1: Custom Python Analysis (DuckDB + Parquet + EDA)¶
Description: Build lightweight Python analysis stack on top of current JSONL metrics. Convert to Parquet, use DuckDB for queries, integrate Sweetviz/D-Tale for automated reports.
Pros:
- ✅ Minimal infrastructure: No services to run, just Python scripts
- ✅ Fast implementation: 12-20 hours total effort, working solution in 2-3 weeks
- ✅ Leverages existing work: Uses current 60+ metrics from transcript parsing
- ✅ Low maintenance: File-based storage, no servers to update
- ✅ Performance gains: 5-10× faster queries with DuckDB vs current ad-hoc parsing
- ✅ Perfect fit for use case: Weekly analysis doesn't need real-time infrastructure
- ✅ Learning value: Hands-on experience with DuckDB, Parquet, modern data stack
- ✅ AI-assisted insights: Sweetviz generates comprehensive reports in 3 lines of code
Cons:
- ❌ No real-time dashboards: Can't see live metrics during agent runs
- ❌ Manual analysis workflow: Run scripts weekly, view HTML reports
- ❌ Limited visualization: HTML reports vs interactive Grafana dashboards
- ❌ Single-machine: Can't share dashboards with team (not relevant for solo dev)
- ❌ Custom solution: Not industry-standard observability stack
Effort Estimate: 12-20 hours
- Parquet conversion script: 2-4 hours
- DuckDB query library: 8-12 hours
- EDA integration: 2-4 hours
Value Delivered:
- Immediate: Weekly insights into token usage, cache efficiency, pre-commit patterns
- Short-term: Identify optimization opportunities (e.g., "cache hit rate drops when X")
- Learning: Modern data stack skills (DuckDB, Parquet) applicable to future projects
Option 2: OpenTelemetry + Grafana (Full Observability Stack)¶
Description: Set up homelab OTel Collector + Prometheus + Loki + Jaeger + Grafana. Export telemetry from Claude Code, build custom dashboards.
Pros:
- ✅ Industry standard: Uses widely-adopted observability tools
- ✅ Real-time monitoring: See metrics as agents run
- ✅ Rich visualizations: Professional dashboards with Grafana
- ✅ Distributed tracing: Track multi-agent workflows with Jaeger
- ✅ Alerting: Get notified when cache hit rate drops, errors spike
- ✅ Scalable: Infrastructure can grow with usage
- ✅ Pre-built integrations: OpenLIT/OpenLLMetry provide turnkey LLM dashboards
Cons:
- ❌ High infrastructure overhead: 5+ services to manage (Collector, Prometheus, Loki, Jaeger, Grafana)
- ❌ Significant setup time: 16-24 hours initial setup, plus learning curve
- ❌ Ongoing maintenance: ~5-10 hours/month updating, debugging, scaling
- ❌ Resource usage: ~1.5-2GB RAM, 10-20GB disk continuous usage
- ❌ Uncertain ROI: Weekly analysis doesn't require real-time infrastructure
- ❌ Complexity: Many moving parts, more surface area for failures
- ❌ Overlapping data: OTel metrics mostly duplicate what we extract from transcripts
Effort Estimate: 16-24 hours initial + 5-10 hours/month maintenance
- Homelab infrastructure: 8-12 hours
- Claude Code OTel integration: 4-6 hours
- Grafana dashboards: 4-6 hours
- Troubleshooting/tuning: Ongoing
Value Delivered:
- Immediate: Real-time dashboards (limited value for weekly analysis)
- Medium-term: Professional observability if building multi-agent systems
- Long-term: Reusable infrastructure for future LLM projects
Option 3: Hybrid Approach (Recommended)¶
Description: Start with lightweight Python analysis (Option 1), add OTel export in Phase 3 for experimentation without replacing existing metrics.
Phased Implementation:
Phase 2 (Weeks 1-3):
- Migrate JSONL → Parquet
- Build DuckDB query library
- Integrate Sweetviz for weekly reports
- Deliverable: Weekly metrics analysis workflow operational
Phase 3 (Months 2-4):
- Set up lightweight OTel stack (Collector + Grafana + file storage)
- Enable Claude Code OTel export
- Run dual-tracking (Parquet + OTel) for 4-6 weeks
- Deliverable: Side-by-side comparison of data fidelity and insights
Phase 4 (Month 5+):
- Evaluate which system provides better insights
- Decision Point: Keep both, consolidate to one, or use OTel only for real-time debugging
Pros:
- ✅ Incremental investment: Pay-as-you-go effort based on value
- ✅ Risk mitigation: Don't bet everything on unproven approach
- ✅ Learn by doing: Gain experience with both modern data analysis and observability
- ✅ Optionality: Keep best parts of each system
- ✅ No wasted work: Parquet analysis useful regardless of OTel decision
Cons:
- ❌ Dual maintenance: Running two systems during Phase 3
- ❌ Delayed full benefits: OTel advantages not realized until Phase 3
- ❌ Potential redundancy: May end up deprecating one system
Effort Estimate:
- Phase 2: 12-20 hours
- Phase 3: 20-30 hours
- Total: 32-50 hours over 4-6 months
Value Delivered:
- Week 3: Working metrics analysis with DuckDB + EDA
- Month 3: Comparative data on OTel vs custom analysis
- Month 5: Informed decision with real-world experience
Option 4: Leverage Existing Solutions (OpenLLMetry + Grafana Cloud)¶
Description: Use turnkey LLM observability tools (OpenLLMetry or OpenLIT) with SaaS backend (Traceloop, Grafana Cloud).
Pros:
- ✅ Fastest setup: 2-4 hours to get basic observability working
- ✅ Zero infrastructure: SaaS handles all backend complexity
- ✅ Pre-built dashboards: Professional visualizations out of the box
- ✅ Proven solution: Battle-tested by LLM companies
- ✅ Low maintenance: No self-hosted services to manage
Cons:
- ❌ Monthly cost: $50-100/month for Grafana Cloud or Traceloop
- ❌ Claude Code integration unknown: May not instrument cleanly
- ❌ Limited customization: Can't analyze pre-commit runs, phase execution, git operations
- ❌ Generic LLM metrics: Not tailored to Claude Code agent workflows
- ❌ Data upload: Sending transcript data to third-party services
Effort Estimate: 2-4 hours setup + $50-100/month
Value Delivered:
- Immediate: Standard LLM metrics (tokens, latency, costs)
- Limited: Missing agent-specific insights (cache efficiency, pre-commit success, phases)
Recommendations¶
Primary Recommendation: Option 3 (Hybrid Approach)
Start with lightweight Python analysis, keep path open for OpenTelemetry experimentation without committing infrastructure resources prematurely.
Short-term (Next 2-3 weeks) - Phase 2¶
Goal: Operational weekly metrics analysis workflow
Tasks:
- Parquet Migration (2-4 hours)
- Write conversion script: JSONL → Parquet
- Convert historical metrics (5 existing JSONL files)
- Update metrics extraction hook to write both formats during transition
-
Validation: Ensure no data loss, schema preservation
-
DuckDB Query Library (8-12 hours)
- Install DuckDB:
pip install duckdb - Create
.claude/lib/analyze_metrics.pywith common queries:- Weekly aggregations (avg tokens, cache hit rate, commits created)
- Trend detection (week-over-week changes)
- Outlier identification (high token usage, failed pre-commits)
- Time-series analysis (metrics over time)
- Document query patterns in docstrings
-
Test against historical data
-
EDA Integration (2-4 hours)
- Install Sweetviz:
pip install sweetviz - Create weekly report script:
.claude/scripts/generate_weekly_report.py - Automate comparison of current week vs previous week
- Generate HTML report with automated insights
- Test report quality with real metrics
Deliverable: Run python .claude/scripts/generate_weekly_report.py → Get comprehensive HTML report analyzing past week's Claude Code usage
Success Criteria:
- ✅ Can query metrics 5-10× faster than JSONL parsing
- ✅ Weekly report highlights key trends (token usage, cache efficiency, pre-commit success rates)
- ✅ Automated insights identify optimization opportunities (e.g., "cache hit rate dropped 15% this week")
Medium-term (Months 2-4) - Phase 3 (Optional)¶
Goal: Evaluate OpenTelemetry for incremental value beyond Parquet analysis
Tasks:
- Lightweight OTel Stack (8-12 hours)
- Set up Docker Compose with OTel Collector + Grafana
- Use file-based storage (avoid Prometheus/Loki/Jaeger initially)
- Minimal configuration to reduce complexity
-
Document setup for reproducibility
-
Claude Code Integration (4-6 hours)
- Enable OTel export in Claude Code (environment variables)
- Configure exporters to send to local collector
- Verify metrics flowing through pipeline
-
Troubleshoot any integration issues
-
Dual-Tracking Period (4-6 weeks runtime, minimal hands-on)
- Run both Parquet and OTel collection simultaneously
- Compare data fidelity (are metrics identical?)
- Evaluate insight quality (does Grafana surface patterns Sweetviz misses?)
-
Measure operational overhead (time spent maintaining OTel)
-
Comparative Analysis (4-6 hours)
- Document findings in learnings
- Decide: Keep OTel, deprecate it, or use hybrid
- Update planning document with decision rationale
Success Criteria:
- ✅ OTel captures same metrics as transcript parsing
- ✅ Grafana dashboards provide actionable insights
- ✅ Maintenance overhead is acceptable (< 2 hours/month)
Decision Point: If OTel doesn't provide 2× better insights for 3× more effort, stick with Parquet analysis.
Long-term (Months 5+) - Continuous Improvement¶
Based on Phase 3 Outcomes:
If OTel provides clear value:
- Migrate fully to OTel + Grafana
- Deprecate custom transcript parsing
- Invest in homelab infrastructure
If Parquet analysis is sufficient:
- Enhance DuckDB query library with advanced analytics
- Add more automated insights (anomaly detection, predictive trends)
- Build custom visualization layer (Plotly dashboards if needed)
Regardless of path:
- Publish learnings to docs/research/ai
- Share metrics extraction library as open-source
- Contribute Claude Code observability patterns to community
Quick Decision Matrix¶
| Factor | Weight | Option 1 (Custom) | Option 2 (OTel) | Option 3 (Hybrid) |
|---|---|---|---|---|
| Time to value | High | ⭐⭐⭐ 2-3 weeks | ⭐ 4-6 weeks | ⭐⭐⭐ 2-3 weeks |
| Effort required | High | ⭐⭐⭐ 12-20h | ⭐ 30-40h | ⭐⭐ 32-50h |
| Maintenance burden | Medium | ⭐⭐⭐ Minimal | ⭐ 5-10h/month | ⭐⭐ Low |
| Fits use case | High | ⭐⭐⭐ Perfect | ⭐⭐ Good | ⭐⭐⭐ Perfect |
| Learning value | Medium | ⭐⭐⭐ High | ⭐⭐ Medium | ⭐⭐⭐ Highest |
| Future-proofing | Low | ⭐⭐ Good | ⭐⭐⭐ Best | ⭐⭐⭐ Best |
| Real-time monitoring | Low | ❌ No | ⭐⭐⭐ Yes | ⭐⭐⭐ Phase 3 |
| Industry standard | Low | ❌ No | ⭐⭐⭐ Yes | ⭐⭐⭐ Both |
| Total Score | - | 18/24 | 14/24 | 20/24 |
Winner: Option 3 (Hybrid) - Best balance of fast value delivery, learning, and future optionality.
Resources and References¶
Modern Python Data Stack¶
DuckDB vs Polars vs Pandas Comparisons:
- DuckDB vs. Polars vs. Pandas: Benchmark & Comparison
- DuckDB vs Pandas vs Polars for Python Developers - MotherDuck
- Pandas vs. Polars vs. DuckDB vs. PySpark: Benchmarking with Real Experiments
- DuckDB vs Polars for Data Engineering
Parquet Format:
- Apache Parquet vs JSON Comparison
- Converting JSONL to Parquet: Technical Guide
- Data Lake - Comparing Performance of Big Data Formats
AI-Assisted EDA Libraries:
- Top 10 Python Libraries for Automated Data Analysis
- Top Python Libraries for Data Science and AI in 2025
- Top Automated EDA Python Packages for Efficient Data Analysis
OpenTelemetry & Observability¶
OpenTelemetry for LLM/AI Agents:
- Introduction to Observability for LLM-based Applications
- AI Agent Observability - Evolving Standards
- OpenLLMetry: Open-source Observability for GenAI/LLM Applications
- TruLens + OpenTelemetry for Agentic World
- OpenLIT - OpenTelemetry-native GenAI Observability
Observability Stack Comparisons:
- Jaeger vs Prometheus Comparison [2025]
- Jaeger vs Prometheus [2025]
- Top 10 Open Source Observability Tools in 2025
Self-Hosted OTel Setup:
- Home-Lab Observability with OpenTelemetry
- Self Hosted Telemetry Solution Based on Open Telemetry
- Homelab Monitoring GitHub - Monitoring tools and Open Telemetry
Cost Analysis:
- Why Companies Choose Grafana Cloud Over Self-Hosted OSS Stacks
- OpenTelemetry vs CloudWatch: Choosing What Fits Your Stack
Claude Code Specific¶
Official Documentation:
Grafana/Datadog Integrations:
- Monitor Claude Usage with Anthropic Integration for Grafana Cloud
- Complete Guide to LLM Observability with OpenTelemetry and Grafana Cloud
- Monitor Claude Usage with Datadog Cloud Cost Management
Community Projects:
LLM Agent Observability¶
LangSmith/LangChain:
- LangSmith - Observability
- Add Observability to LLM Application
- LangChain Observability: Zero to Production in 10 Minutes
Traceloop/OpenLLMetry:
- DIY Observability for LLMs with OpenTelemetry
- What is OpenLLMetry?
- OpenLLMetry: Mastering Observability in LLM Applications
Migration & Best Practices¶
JSONL to Parquet:
OpenTelemetry Migration:
- Getting Started with OpenTelemetry Custom Metrics
- Migrating JVM Application from Prometheus to OpenTelemetry
Decision Log¶
Date: 2025-12-05
Decision: Proceed with Option 3 (Hybrid Approach) - Start with DuckDB + Parquet + EDA, evaluate OpenTelemetry in Phase 3
Rationale:
- Matches use case: Weekly analysis doesn't require real-time infrastructure overhead
- Fast time-to-value: Working solution in 2-3 weeks vs 4-6 weeks for OTel
- Low risk: Can always add OTel later without wasting current work
- Learning maximization: Gain experience with modern data stack AND observability tools
- Resource efficiency: ~15 hours effort vs ~40 hours for full OTel stack
- Current metrics are sufficient: 60+ fields from transcript parsing cover essential insights
- Community gap: No existing Claude Code metrics tools, so custom solution has value
- Experimentation over commitment: Phase 3 evaluation prevents premature infrastructure investment
Next Steps (Phase 2 - Next 2-3 weeks):
- ✅ Create Parquet conversion script (
.claude/scripts/convert_metrics_to_parquet.py) - ✅ Convert historical JSONL metrics to Parquet format
- ✅ Build DuckDB query library (
.claude/lib/analyze_metrics.py) with common analytics - ✅ Integrate Sweetviz for automated weekly reports
- ✅ Test report generation with real metrics data
- ✅ Document usage in
.claude/README.md - ✅ Update planning document with Phase 2 completion notes
- ⏸️ Evaluate Phase 3 (OTel) after 4-6 weeks of Parquet analysis experience
Review Date: 2025-02-01 (after 8 weeks of Phase 2 usage) - Decide whether to proceed with Phase 3 OTel experimentation