Training small models on MCP tool capabilities shouldn’t require running potentially untrusted code.
The Problem
When building training datasets for models that need to understand MCP (Model Context Protocol) capabilities, the traditional approach has serious drawbacks:
- Security Risks: You need to run each MCP server to extract its schema
- Dependency Hell: Every server has its own runtime dependencies
- Environmental Complexity: Different servers require different execution contexts
- Scale Issues: Extracting from hundreds of servers becomes a logistical nightmare
The usual workflow looks like this:
Run MCP Server → Extract schemas at runtime → Deal with security risks & dependencies
This works for one or two servers, but it doesn’t scale. And it definitely doesn’t work when you’re dealing with code you don’t fully trust.
A Better Approach
What if we could extract MCP metadata without executing any code?
That’s exactly what McpExtract does. Instead of runtime extraction, it uses static analysis of .NET assemblies:
Analyze Assembly → Extract Metadata → Clean Training Data
Key Benefits
No Code Execution Required
The biggest win: you never have to run the MCP server. This eliminates entire classes of security concerns and makes the extraction process deterministic and safe.
Clean Metadata Extraction
Static analysis gives you structured, reliable metadata without the noise of runtime variability. The tool understands .NET’s type system and can extract precise information about:
- Tool definitions and signatures
- Parameter types and constraints
- Return types
- Documentation and descriptions
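As a sketch, the metadata for a single extracted tool could be modeled as a small record like this. The field names below are illustrative assumptions, not McpExtract's actual output schema:

```csharp
// Illustrative shape for one extracted tool; the field names are
// assumptions, not McpExtract's actual output format.
public record ToolParameter(string Name, string Type, string? Description, bool IsOptional);

public record ToolDefinition(
    string Name,                              // method name, or a name supplied via attribute
    string DeclaringType,                     // fully qualified type hosting the tool
    string ReturnType,
    string? Description,                      // e.g. from a [Description] attribute
    IReadOnlyList<ToolParameter> Parameters);
```

A flat, serializable shape like this is what makes the output easy to turn into training examples later.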
Dependency-Free Analysis
You don’t need to install dependencies, set up environments, or manage runtime configurations. Just point the tool at an assembly and extract.
Scale Without Pain
Want to analyze 100 MCP servers? No problem. No need to orchestrate 100 different execution environments—just run the analysis tool.
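Batch analysis can be as simple as looping over a folder of compiled servers. Here is a hedged sketch using .NET's `MetadataLoadContext`, which inspects assemblies as data without ever loading them for execution (the `servers/` path and the per-assembly summary are illustrative, not McpExtract's CLI):

```csharp
using System.Reflection;
using System.Runtime.InteropServices;

// A directory of compiled MCP server assemblies; none of them are executed.
string[] candidates = Directory.GetFiles("servers/", "*.dll");

// The resolver needs the runtime's reference assemblies plus the targets.
var resolver = new PathAssemblyResolver(
    Directory.GetFiles(RuntimeEnvironment.GetRuntimeDirectory(), "*.dll")
             .Concat(candidates));

using var mlc = new MetadataLoadContext(resolver);
foreach (string path in candidates)
{
    Assembly asm = mlc.LoadFromAssemblyPath(path);
    Console.WriteLine($"{asm.GetName().Name}: {asm.GetTypes().Length} types");
}
```

Because nothing runs, one process can sweep an arbitrary number of assemblies without any per-server setup.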
Technical Deep Dive
McpExtract leverages .NET’s reflection and metadata APIs to inspect compiled assemblies. It understands MCP-specific attributes and conventions, mapping them to clean training data formats.
The tool specifically looks for:
- Methods decorated with MCP tool attributes
- Parameter metadata including types, defaults, and descriptions
- Return type information
- XML documentation comments, when the companion XML documentation file is available (doc comments are not embedded in the assembly itself)
This approach works because .NET assemblies contain rich metadata that’s perfect for static analysis. The compiled code preserves all the type information and attributes we need.
Real-World Usage
Here’s how I’m using it:
- Training Data Collection: Build comprehensive datasets of MCP capabilities for fine-tuning smaller models
- Discovery: Quickly understand what tools are available across multiple MCP servers
- Validation: Verify that tool definitions match expected patterns without execution
- Documentation: Auto-generate tool catalogs from assembly metadata
Limitations and Trade-offs
Static analysis has boundaries:
- Runtime Behavior: Can’t determine dynamic behavior or side effects
- .NET Only: Currently limited to .NET MCP implementations
- Metadata Dependency: Relies on developers following MCP conventions and including good documentation
For training data extraction, these trade-offs are worth it. We care more about tool signatures and interfaces than runtime behavior.
The Broader Pattern
This tool represents a broader shift in how we approach AI tooling: prefer static analysis over dynamic execution when possible.
As we build more AI-integrated systems, we’ll increasingly need to extract and understand code capabilities programmatically. Static analysis gives us a safer, more scalable way to do that.
Get Started
McpExtract is open source and available on GitHub. If you’re working with MCP in .NET or building training datasets for tool-aware models, give it a try.
I’m particularly interested in hearing from folks who are:
- Building MCP servers in other languages (could we extend this approach?)
- Creating training datasets for smaller models
- Working on tool discovery and cataloging problems
Looking Forward
The Model Context Protocol is still young, but it’s becoming a de facto standard for agent-tool communication. As the ecosystem grows, we’ll need better tooling for discovery, analysis, and understanding.
Static analysis is one piece of that puzzle. What other approaches should we be exploring?
McpExtract grew out of a practical need while building training datasets at scale. If you’re facing similar challenges, I’d love to hear your approach.