Training small models on MCP tool capabilities shouldn’t require running potentially untrusted code.
The Problem
When building training datasets for models that need to understand MCP (Model Context Protocol) capabilities, the traditional approach has serious drawbacks:
- Security Risks: You need to run each MCP server to extract its schema
- Dependency Hell: Every server has its own runtime dependencies
- Environmental Complexity: Different servers require different execution contexts
- Scale Issues: Extracting from hundreds of servers becomes a logistical nightmare
The usual workflow looks like this:
Run MCP Server → Extract schemas at runtime → Deal with security risks & dependencies
This works for one or two servers, but it doesn’t scale. And it definitely doesn’t work when you’re dealing with code you don’t fully trust.
A Better Approach
What if we could extract MCP metadata without executing any code?
That’s exactly what McpExtract does. Instead of runtime extraction, it uses static analysis of .NET assemblies:
Analyze Assembly → Extract Metadata → Clean Training Data
Key Benefits
No Code Execution Required
The biggest win: you never have to run the MCP server. This eliminates entire classes of security concerns and makes the extraction process deterministic and safe.
Clean Metadata Extraction
Static analysis gives you structured, reliable metadata without the noise of runtime variability. The tool understands .NET’s type system and can extract precise information about:
- Tool definitions and signatures
- Parameter types and constraints
- Return types
- Documentation and descriptions
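As a sketch, the metadata for a single extracted tool could be modeled as a small record like this. The field names below are illustrative assumptions, not McpExtract's actual output schema:

```csharp
// Illustrative shape for one extracted tool; the field names are
// assumptions, not McpExtract's actual output format.
public record ToolParameter(string Name, string Type, string? Description, bool IsOptional);

public record ToolDefinition(
    string Name,                              // method name, or a name supplied via attribute
    string DeclaringType,                     // fully qualified type hosting the tool
    string ReturnType,
    string? Description,                      // e.g. from a [Description] attribute
    IReadOnlyList<ToolParameter> Parameters);
```

A flat, serializable shape like this is what makes the output easy to turn into training examples later.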
Dependency-Free Analysis
You don’t need to install dependencies, set up environments, or manage runtime configurations. Just point the tool at an assembly and extract.
Scale Without Pain
Want to analyze 100 MCP servers? No problem. No need to orchestrate 100 different execution environments—just run the analysis tool.
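Batch analysis can be as simple as looping over a folder of compiled servers. Here is a hedged sketch using .NET's `MetadataLoadContext`, which inspects assemblies as data without ever loading them for execution (the `servers/` path and the per-assembly summary are illustrative, not McpExtract's CLI):

```csharp
using System.Reflection;
using System.Runtime.InteropServices;

// A directory of compiled MCP server assemblies; none of them are executed.
string[] candidates = Directory.GetFiles("servers/", "*.dll");

// The resolver needs the runtime's reference assemblies plus the targets.
var resolver = new PathAssemblyResolver(
    Directory.GetFiles(RuntimeEnvironment.GetRuntimeDirectory(), "*.dll")
             .Concat(candidates));

using var mlc = new MetadataLoadContext(resolver);
foreach (string path in candidates)
{
    Assembly asm = mlc.LoadFromAssemblyPath(path);
    Console.WriteLine($"{asm.GetName().Name}: {asm.GetTypes().Length} types");
}
```

Because nothing runs, one process can sweep an arbitrary number of assemblies without any per-server setup.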
Technical Deep Dive
McpExtract leverages .NET’s reflection and metadata APIs to inspect compiled assemblies. It understands MCP-specific attributes and conventions, mapping them to clean training data formats.
The tool specifically looks for:
- Methods decorated with MCP tool attributes
- Parameter metadata including types, defaults, and descriptions
- Return type information
- XML documentation comments, when the companion XML documentation file is available (doc comments are not embedded in the assembly itself)
This approach works because .NET assemblies contain rich metadata that’s perfect for static analysis. The compiled code preserves all the type information and attributes we need.
Real-World Usage
Here’s how I’m using it:
- Training Data Collection: Build comprehensive datasets of MCP capabilities for fine-tuning smaller models
- Discovery: Quickly understand what tools are available across multiple MCP servers
- Validation: Verify that tool definitions match expected patterns without execution
- Documentation: Auto-generate tool catalogs from assembly metadata
Limitations and Trade-offs
Static analysis has boundaries:
- Runtime Behavior: Can’t determine dynamic behavior or side effects
- .NET Only: Currently limited to .NET MCP implementations
- Metadata Dependency: Relies on developers following MCP conventions and including good documentation
For training data extraction, these trade-offs are worth it. We care more about tool signatures and interfaces than runtime behavior.
The Broader Pattern
This tool represents a broader shift in how we approach AI tooling: prefer static analysis over dynamic execution when possible.
As we build more AI-integrated systems, we’ll increasingly need to extract and understand code capabilities programmatically. Static analysis gives us a safer, more scalable way to do that.
Get Started
McpExtract is open source and available on GitHub. If you’re working with MCP in .NET or building training datasets for tool-aware models, give it a try.
I’m particularly interested in hearing from folks who are:
- Building MCP servers in other languages (could we extend this approach?)
- Creating training datasets for smaller models
- Working on tool discovery and cataloging problems
Looking Forward
The Model Context Protocol is still young, but it’s becoming a de facto standard for agent-tool communication. As the ecosystem grows, we’ll need better tooling for discovery, analysis, and understanding.
Static analysis is one piece of that puzzle. What other approaches should we be exploring?
McpExtract grew out of a practical need while building training datasets at scale. If you’re facing similar challenges, I’d love to hear your approach.