2024-10-30 11:59:30 -04:00
commit 17031d8be8
8 changed files with 342 additions and 0 deletions

README.md Normal file

@@ -0,0 +1,136 @@
# code-tokenizer-md
Process git repository files into markdown with token counting and sensitive data redaction.
## Overview
`code-tokenizer-md` is a Node.js tool that processes git repository files, cleans code, redacts sensitive information, and generates markdown documentation with token counts.
```mermaid
graph TD
    Start[Start] -->|Read| Git[Git Files]
    Git -->|Clean| TC[TokenCleaner]
    TC -->|Redact| Clean[Clean Code]
    Clean -->|Generate| MD[Markdown]
    MD -->|Count| Results[Token Counts]
    style Start fill:#000000,stroke:#FFFFFF,stroke-width:4px,color:#ffffff
    style Git fill:#222222,stroke:#FFFFFF,stroke-width:2px,color:#ffffff
    style TC fill:#333333,stroke:#FFFFFF,stroke-width:2px,color:#ffffff
    style Clean fill:#444444,stroke:#FFFFFF,stroke-width:2px,color:#ffffff
    style MD fill:#555555,stroke:#FFFFFF,stroke-width:2px,color:#ffffff
    style Results fill:#666666,stroke:#FFFFFF,stroke-width:2px,color:#ffffff
```
## Features
### Data Processing
- Reads files from a git repository
- Removes comments and unnecessary whitespace
- Redacts sensitive information (API keys, tokens, etc.)
- Counts tokens using `llama3-tokenizer-js`
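
The cleaning and redaction steps listed above can be pictured with a minimal sketch. The helper names `stripComments`, `collapseWhitespace`, and `redactSecrets`, and the regular expressions they use, are assumptions made for illustration; the actual `TokenCleaner` may use different rules.

```javascript
// Hypothetical helpers for illustration only; not the package's real API.

// Remove /* block */ and // line comments. This naive regex approach also
// removes comment-like text inside string literals (for example URLs).
function stripComments(code) {
  return code
    .replace(/\/\*[\s\S]*?\*\//g, '')
    .replace(/\/\/.*$/gm, '');
}

// Trim trailing whitespace and drop lines that end up empty.
function collapseWhitespace(code) {
  return code
    .split('\n')
    .map((line) => line.trimEnd())
    .filter((line) => line.length > 0)
    .join('\n');
}

// Replace quoted values assigned to secret-looking names with a placeholder.
function redactSecrets(code) {
  return code.replace(
    /\b(api[_-]?key|token|secret|password)(\s*[:=]\s*)(['"])[^'"]*\3/gi,
    '$1$2$3[REDACTED]$3'
  );
}

const source = [
  '// load config',
  'const apiKey = "sk-12345";   ',
  '',
  '/* helper */',
  'function add(a, b) { return a + b; }',
].join('\n');

console.log(redactSecrets(collapseWhitespace(stripComments(source))));
```

Running the steps in this order (comments, then whitespace, then redaction) keeps each helper simple, at the cost of treating source code as plain text rather than parsing it.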
### Analysis Types
- Token counting per file
- Total token usage
- File content analysis
- Sensitive data detection
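
Per-file and total token counts could be gathered along the following lines. This is a sketch under stated assumptions: `countTokens` is a hypothetical helper, and `whitespaceEncode` is a stand-in that splits on whitespace so the example runs without dependencies. A real tokenizer's encode function would be passed in its place and will produce different counts.

```javascript
// Stand-in tokenizer (assumption): one token per whitespace-separated chunk.
function whitespaceEncode(text) {
  return text.split(/\s+/).filter(Boolean);
}

// Hypothetical helper: maps { path: content } to per-file counts and a total.
function countTokens(files, encode = whitespaceEncode) {
  const perFile = {};
  let total = 0;
  for (const [path, content] of Object.entries(files)) {
    const count = encode(content).length;
    perFile[path] = count;
    total += count;
  }
  return { perFile, total };
}

const report = countTokens({
  'src/index.js': 'export * from "./cli.js";',
  'src/cli.js': 'console.log("hi");',
});
console.log(report);
```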
### Data Presentation
- Markdown formatted output
- Code block formatting
- Token count summaries
- File organization hierarchy
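
To make the output shape concrete, here is a rough sketch of how one markdown section per file plus a token summary might be assembled. The function names and the exact layout (heading, token line, fenced code block) are assumptions; the real `MarkdownGenerator` output may look different.

```javascript
// Built with repeat() so this example does not contain a literal code fence.
const FENCE = '`'.repeat(3);

// Hypothetical helper: one heading, a token line, and a fenced block per file.
function renderFileSection(path, content, tokenCount) {
  const match = /\.([A-Za-z0-9]+)$/.exec(path);
  const lang = match ? match[1] : ''; // use the file extension as the fence language
  return [`## ${path}`, `Tokens: ${tokenCount}`, FENCE + lang, content, FENCE].join('\n');
}

// Hypothetical helper: joins all sections and appends a total.
function renderDocument(files) {
  const sections = files.map((f) => renderFileSection(f.path, f.content, f.tokenCount));
  const total = files.reduce((sum, f) => sum + f.tokenCount, 0);
  return [...sections, `Total tokens: ${total}`].join('\n\n');
}

console.log(renderDocument([
  { path: 'src/cli.js', content: 'console.log("hi");', tokenCount: 5 },
  { path: 'LICENSE', content: 'MIT', tokenCount: 1 },
]));
```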
## Requirements
- Node.js (>=14.0.0)
- Git repository
- npm or npx
## Installation
```shell
npm install -g code-tokenizer-md
```
## Usage
### Quick Start
```shell
npx code-tokenizer-md
```
### Programmatic Usage
```javascript
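import { MarkdownGenerator } from 'code-tokenizer-md';

// Point the generator at a git repository and choose where to write the output.
const generator = new MarkdownGenerator({
  dir: './project',
  outputFilePath: './output.md'
});

// Top-level await requires an ES module context.
const result = await generator.createMarkdownDocument();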
```
## Project Structure
```
src/
├── index.js              # Main exports
├── TokenCleaner.js       # Code cleaning and redaction
├── MarkdownGenerator.js  # Markdown generation logic
└── cli.js                # CLI implementation
```
## Dependencies
```json
{
  "dependencies": {
    "llama3-tokenizer-js": "^1.0.0"
  },
  "peerDependencies": {
    "node": ">=14.0.0"
  }
}
```
## Extending
### Adding Custom Patterns
```javascript
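import { MarkdownGenerator } from 'code-tokenizer-md';

const generator = new MarkdownGenerator({
  // Additional cleaning rules: each match is replaced with `replacement`.
  customPatterns: [
    { regex: /TODO:/g, replacement: '' }
  ],
  // Additional redaction rules for project-specific secrets.
  customSecretPatterns: [
    { regex: /mySecret/g, replacement: '[REDACTED]' }
  ]
});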
```
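
As an illustration of what a `{ regex, replacement }` list implies, the sketch below applies each pattern in order to a string. `applyPatterns` is a hypothetical helper written for this example, and the order in which the library itself applies custom patterns is an assumption.

```javascript
// Hypothetical helper: run every { regex, replacement } pair over the text in order.
function applyPatterns(text, patterns) {
  return patterns.reduce(
    (acc, { regex, replacement }) => acc.replace(regex, replacement),
    text
  );
}

const customPatterns = [{ regex: /TODO:/g, replacement: '' }];
const customSecretPatterns = [{ regex: /mySecret/g, replacement: '[REDACTED]' }];

// Assumption: cleaning patterns run before secret patterns.
const output = applyPatterns('TODO: rotate mySecret', [
  ...customPatterns,
  ...customSecretPatterns,
]);
console.log(JSON.stringify(output));
```

The `g` flag matters here: without it, `String.prototype.replace` only replaces the first match of each pattern.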
## Contributing
1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Open a Pull Request
### Contribution Guidelines
- Follow Node.js best practices
- Include appropriate error handling
- Add documentation for new features
- Include tests for new functionality (the project does not yet have a test suite)
- Update the README for significant changes
## License
MIT © 2024 Geoff Seemueller
## Note
This tool requires a git repository to function properly.