I have released wp2txt version 2.1. wp2txt is a command-line toolkit for extracting text from Wikipedia dump files. I first wrote it in 2012 and have been maintaining it ever since.
The main competitor in this space is WikiExtractor, a Python tool with a much larger user base. With this release, I aimed to match or exceed WikiExtractor's processing speed while offering capabilities it does not have. Here are the changes and new features in this version.
- SQLite-based caching for parsed data, category hierarchies, and multistream indexes. What used to take minutes to parse on every run now loads in seconds.
- Ractor-based parallel processing on Ruby 4.0+, achieving roughly a 2x speedup with a lower memory footprint than process-based parallelism.
- Template expansion: common Wikipedia templates like dates, unit conversions, and coordinates are now resolved into readable text.
- Category-based extraction: pull all articles belonging to a Wikipedia category, with configurable subcategory depth.
- Incremental downloads: interrupted downloads of large dump files can be resumed from where they left off.
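The Ractor feature above can be illustrated in miniature. This is a toy sketch, not wp2txt's actual code: the per-chunk "parsing" here is just token counting, and the chunks are hard-coded strings rather than slices of a dump file.

```ruby
# Toy sketch of Ractor-based parallelism (illustrative, not wp2txt's code).
# Each Ractor receives one chunk of text and processes it independently,
# so chunks run on separate cores without forking whole processes.
chunks = ["foo bar baz", "qux quux", "one two three four"]

ractors = chunks.map do |chunk|
  Ractor.new(chunk) do |text|
    # Stand-in for real parsing work: count the tokens in this chunk.
    text.split.size
  end
end

# Collect one result per Ractor; order matches the input chunks.
token_counts = ractors.map(&:take)
total = token_counts.sum
```

Because Ractors share no mutable state, each chunk's data is copied in and only the small result travels back, which is where the memory advantage over process forking comes from.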
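Template expansion can be sketched as a handler table keyed by template name. The handler logic below is deliberately simplified and the table is illustrative; real `{{convert}}` and `{{coord}}` handling has many more argument forms.

```ruby
# Simplified sketch of template expansion (handler table is illustrative;
# wp2txt's real handling of these templates is more involved).
HANDLERS = {
  "convert" => ->(args) { "#{args[0]} #{args[1]} (#{args[2]} #{args[3]})" },
  "coord"   => ->(args) { "#{args[0]}\u00B0N #{args[1]}\u00B0E" }
}

def expand_templates(text)
  # Replace each {{name|arg|arg|...}} with its handler's readable text,
  # leaving unknown templates untouched.
  text.gsub(/\{\{([^{}|]+)\|([^{}]*)\}\}/) do
    name = Regexp.last_match(1)
    args = Regexp.last_match(2).split("|")
    handler = HANDLERS[name]
    handler ? handler.call(args) : Regexp.last_match(0)
  end
end
```

For example, `expand_templates("It is {{convert|10|km|6.2|mi}} away.")` yields `"It is 10 km (6.2 mi) away."`, while templates with no registered handler pass through unchanged.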
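Depth-limited subcategory traversal amounts to a breadth-first walk over the category graph. In this hypothetical sketch, the `SUBCATS` hash stands in for the category links that would be parsed from the dump.

```ruby
# Hypothetical sketch of depth-limited subcategory traversal; SUBCATS
# stands in for category links parsed from a Wikipedia dump.
SUBCATS = {
  "Physics"         => ["Mechanics", "Optics"],
  "Mechanics"       => ["Fluid mechanics"],
  "Optics"          => [],
  "Fluid mechanics" => []
}

def categories_within(root, depth)
  # Breadth-first walk from the root, stopping at the configured depth.
  found = [root]
  frontier = [root]
  depth.times do
    frontier = frontier.flat_map { |c| SUBCATS.fetch(c, []) }
    found.concat(frontier)
  end
  found
end
```

With depth 0 only the root category's own articles would be collected; each extra level of depth pulls in one more ring of subcategories.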
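Resumable downloads generally rest on the HTTP `Range` header: if a partial file already exists locally, the client asks the server for only the bytes after its current size. The sketch below shows that header logic with Ruby's standard Net::HTTP; wp2txt's own download code may differ.

```ruby
require "net/http"
require "uri"

# Sketch of HTTP resume via a Range header (illustrative; wp2txt's
# download code may differ). If a partial file exists locally, request
# only the bytes after its current size.
def resume_request(url, local_path)
  offset = File.exist?(local_path) ? File.size(local_path) : 0
  req = Net::HTTP::Get.new(URI(url))
  req["Range"] = "bytes=#{offset}-" if offset.positive?
  req
end
```

A server that honors the header replies with 206 Partial Content, and the newly received bytes are appended to the local file instead of overwriting it.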
If you are interested in working with Wikipedia text data for research or experiments, try `gem install wp2txt`.