2fb1568d97500ffc11ff1496e5829a584752d7ba
Updated project to use Bun runtime, including changes to build and development scripts, plugin additions, and configuration updates in tsconfig.json and package.json. Enhanced README documentation and migrated source files to TypeScript.
code-tokenizer-md
Created to push creative limits.
Process git repository files into markdown with token counting and sensitive data redaction.
Overview
code-tokenizer-md
is a TypeScript/Bun tool that processes git repository files, cleans code, redacts sensitive information, and generates markdown documentation with token counts.
graph TD
Start[Start] -->|Read| Git[Git Files]
Git -->|Clean| TC[TokenCleaner]
TC -->|Redact| Clean[Clean Code]
Clean -->|Generate| MD[Markdown]
MD -->|Count| Results[Token Counts]
style Start fill:#000000,stroke:#FFFFFF,stroke-width:4px,color:#ffffff
style Git fill:#222222,stroke:#FFFFFF,stroke-width:2px,color:#ffffff
style TC fill:#333333,stroke:#FFFFFF,stroke-width:2px,color:#ffffff
style Clean fill:#444444,stroke:#FFFFFF,stroke-width:2px,color:#ffffff
style MD fill:#555555,stroke:#FFFFFF,stroke-width:2px,color:#ffffff
style Results fill:#666666,stroke:#FFFFFF,stroke-width:2px,color:#ffffff
Features
Data Processing
- Reads files from git repository
- Removes comments and unnecessary whitespace
- Redacts sensitive information (API keys, tokens, etc.)
- Counts tokens using llama3-tokenizer
Analysis Types
- Token counting per file
- Total token usage
- File content analysis
- Sensitive data detection
Data Presentation
- Markdown formatted output
- Code block formatting
- Token count summaries
- File organization hierarchy
Requirements
- Node.js (>=14.0.0)
- Bun runtime
- Git repository
Installation
Usage
CLI
npx code-tokenizer-md
Library
npm install code-tokenizer-md
Programmatic Usage
import { MarkdownGenerator } from 'code-tokenizer-md';
const generator = new MarkdownGenerator({
dir: './project',
outputFilePath: './output.md',
});
const result = await generator.createMarkdownDocument();
Project Structure
src/
├── index.ts # Main exports
├── TokenCleaner.ts # Code cleaning and redaction
├── MarkdownGenerator.ts # Markdown generation logic
├── cli.ts # CLI implementation
├── fileExclusions.ts # File exclusion patterns
└── fileTypeExclusions.ts # File type exclusions
Dependencies
{
"dependencies": {
"llama3-tokenizer-js": "^1.0.0",
"micromatch": "^4.0.8"
},
"peerDependencies": {
"node": ">=14.0.0"
},
"devDependencies": {
"@eslint/js": "^9.14.0",
"eslint": "^9.14.0",
"globals": "^15.12.0",
"prettier": "^3.3.3",
"bun": "latest",
"@types/bun": "latest",
"@types/node": "^22.9.1",
"@types/micromatch": "^4.0.9"
}
}
Development
This project uses bun for it's toolchain. You should be able to use whatever you want as a consumer of the library.
Building
npm run build
Testing
npm test
Linting and Formatting
# Lint
npm run lint
# Fix linting issues
npm run lint:fix
# Format code
npm run format
# Fix all (format + lint)
npm run fix
Extending
Adding Custom Patterns
const generator = new MarkdownGenerator({
customPatterns: [{ regex: /TODO:/g, replacement: '' }],
customSecretPatterns: [{ regex: /mySecret/g, replacement: '[REDACTED]' }],
});
Configuration Options
interface MarkdownGeneratorOptions {
dir?: string; // Project directory
outputFilePath?: string; // Output markdown file path
fileTypeExclusions?: Set<string>; // File types to exclude
fileExclusions?: string[]; // File patterns to exclude
customPatterns?: Record<string, any>; // Custom cleaning patterns
customSecretPatterns?: Record<string, any>; // Custom redaction patterns
verbose?: boolean; // Enable verbose logging
}
Contributing
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Open a Pull Request
Contribution Guidelines
- Write TypeScript code following the project's style
- Include appropriate error handling
- Add documentation for new features
- Include tests for new functionality
- Update the README for significant changes
License
MIT © 2024 Geoff Seemueller
Note
This tool requires a git repository to function properly.
Languages
TypeScript
90.5%
JavaScript
8%
CSS
1.5%