User:Schiste/what-now/Wiki Economics

Wiki Economics

Wiki Economics is a project that applies economic indicators to Wikimedia activity data. Every edit is a unit of production, every editor is a worker, every namespace is a sector of the economy. By borrowing well-understood frameworks from economics (GDP, labor statistics, inequality metrics, quality control) the project offers a complementary lens on the health, resilience, and dynamics of wiki communities.

The goal is not to reduce editors to numbers, but to surface structural patterns that are difficult to see in raw activity logs: Is output concentrated in too few hands? Are newcomers being retained? Is quality control keeping pace with content production? These are questions every wiki community asks — economics provides a shared vocabulary to answer them.

Scope

The pipeline is designed to process any Wikimedia wiki. Once the metrics are properly validated, the intention is to cover all projects where the data is available through Wikimedia dumps.

Indicators

The project computes four families of indicators, each inspired by a branch of economics.

Edit Distribution

How evenly are edits distributed among editors? High concentration may signal efficiency (experienced editors are productive) or fragility (the community depends on a few individuals). Four complementary measures capture different facets of this question:

| Indicator | Description |
|---|---|
| Gini coefficient | Measures overall edit inequality on a 0-1 scale. 0 = every editor contributes equally; 1 = one editor does everything. Wikimedia communities typically show high Gini values (0.8-0.95), reflecting the well-known pattern where a small core produces most content. |
| Theil index | An entropy-based inequality measure. Unlike Gini, Theil is decomposable: it can separate within-group and between-group inequality (e.g., across namespaces or user types). |
| Palma ratio | The ratio of edits by the top 10% of editors to edits by the bottom 40%. Focuses on the extremes of the distribution rather than the middle, making it sensitive to changes in the most and least active segments. |
| Fragility (bus factor) | The minimum number of top editors whose combined output accounts for 50% of all edits. A low number means the wiki's output depends on very few people — a bus factor risk. Tracked both as an absolute count and as a percentage of the editor base. |
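
For illustration, here is a minimal Python sketch of the Gini coefficient, Palma ratio, and bus factor computed from a list of per-editor edit counts (toy data; the production pipeline computes these in Rust with Polars):

```python
# Toy sketch of the edit-distribution measures from per-editor edit counts.
# Illustrative only; the pipeline computes these in Rust with Polars.

def gini(counts):
    """Gini coefficient: 0 = perfectly equal, 1 = one editor does everything."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * total) - (n + 1) / n

def palma(counts):
    """Edits by the top 10% of editors divided by edits by the bottom 40%."""
    xs = sorted(counts)
    n = len(xs)
    bottom = sum(xs[: int(n * 0.4)])
    top = sum(xs[int(n * 0.9):])
    return top / bottom if bottom else float("inf")

def bus_factor(counts, share=0.5):
    """Minimum number of top editors whose edits cover `share` of all edits."""
    xs = sorted(counts, reverse=True)
    target = share * sum(xs)
    running = 0
    for k, x in enumerate(xs, start=1):
        running += x
        if running >= target:
            return k
    return len(xs)

edits = [1200, 450, 300, 40, 12, 5, 3, 2, 1, 1]  # per-editor edit counts (toy)
print(gini(edits), palma(edits), bus_factor(edits))
```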

Community

Wikimedia's workforce can be studied like a labor market: people arrive, contribute for a time, and eventually leave. Understanding these flows helps answer whether a community is growing, shrinking, or churning.

| Indicator | Description |
|---|---|
| Active editors | Count of unique contributors who made at least one edit in a given period. The most basic measure of community size. |
| Arrivals | Editors whose first recorded edit falls within the period. Measures the inflow of new contributors. |
| Departures | Editors who were active in the previous period but not in the current one. Measures attrition. |
| Arrival & departure rates | Arrivals (or departures) divided by active editors. Normalizes flows relative to community size, making wikis of different sizes comparable. |
| Cohort retention | For each yearly cohort (editors who made their first edit in year Y), tracks what fraction remain active in subsequent years. Reveals whether a wiki retains its newcomers or loses them quickly. |
| User type breakdown | Splits the workforce into registered editors, anonymous (IP) editors, temporary accounts, and bots. Each group has different behaviors, motivations, and policy implications. |
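
As a sketch of how cohort retention can be derived, the following Python snippet groups editors by the year of their first edit and measures what share of a cohort remains active in later years (toy data and a hypothetical input shape; the real pipeline works on monthly Parquet partitions):

```python
# Toy sketch of yearly cohort retention from (user, year-active) pairs.
from collections import defaultdict

activity = [
    ("alice", 2020), ("alice", 2021), ("alice", 2022),
    ("bob", 2020),
    ("carol", 2021), ("carol", 2022),
]

# Cohort assignment: the year of each editor's first activity.
first_year = {}
for user, year in sorted(activity, key=lambda pair: pair[1]):
    first_year.setdefault(user, year)

cohorts = defaultdict(set)
for user, year in first_year.items():
    cohorts[year].add(user)

active_by_year = defaultdict(set)
for user, year in activity:
    active_by_year[year].add(user)

# Fraction of the 2020 cohort still active in each year.
for year in sorted(active_by_year):
    retained = len(cohorts[2020] & active_by_year[year]) / len(cohorts[2020])
    print(year, f"{retained:.0%}")
```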

Content Production

In national economics, GDP measures the total value of goods and services produced. On Wikimedia, output is the content produced by editors, measured in bytes and edits.

| Indicator | Description |
|---|---|
| Gross output | Total bytes added across all edits. Counts every byte contributed, regardless of whether it survives. |
| Net output | Bytes added minus bytes removed. What actually remains in the encyclopedia — the closest analogue to GDP. |
| Content churn | The gap between gross and net output. High churn means a large share of work is undone by reverts, deletions, or rework. |
| Revert rate | Fraction of edits that are identity-reverted. A proxy for unproductive or contested labor. |
| Productivity | Net bytes per edit. Measures how much lasting content each edit produces on average. |
| Productivity per capita | Net bytes per active editor. The wiki equivalent of GDP per capita. |
| Activity tiers | Editors grouped by monthly edit count (1, 2-4, 5-24, 25-99, 100+). Shows how output and editor counts distribute across levels of engagement. |
| Sectoral output | Breakdown by namespace (article, talk, user, project, template, etc.). Each namespace is an economic sector with different dynamics. |
| User type share | Percentage of total edits contributed by registered editors, anonymous editors, temporary accounts, and bots. Tracks how the composition of the workforce evolves. |
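
A minimal Python sketch of how these production measures relate to each other, using a hypothetical list of per-revision byte diffs:

```python
# Toy sketch of the content-production measures from per-revision byte diffs.
# `revisions` holds (editor, byte_diff, was_reverted) tuples.

revisions = [
    ("alice", 1200, False), ("bob", -300, False),
    ("carol", 50, True),    ("alice", 400, False),
]

gross = sum(d for _, d, _ in revisions if d > 0)       # every byte added
removed = sum(-d for _, d, _ in revisions if d < 0)    # bytes taken away
net = gross - removed                                  # closest analogue to GDP
churn = gross - net                                    # work that did not survive
revert_rate = sum(1 for _, _, r in revisions if r) / len(revisions)
productivity = net / len(revisions)                    # net bytes per edit
editors = {e for e, _, _ in revisions}
per_capita = net / len(editors)                        # net bytes per active editor

print(gross, net, churn, revert_rate, productivity, per_capita)
```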

Patrol

Patrolling is Wikimedia's quality-control mechanism: experienced editors review new edits and pages to catch vandalism, errors, and policy violations. These indicators measure whether the community's "immune system" is keeping up.

| Indicator | Description |
|---|---|
| Patrol volume | Total number of patrol actions per period, split between new-page patrols and diff (edit) patrols. |
| Unique patrollers | Count of distinct editors who performed at least one patrol action. Measures the size of the quality-control workforce. |
| Patrol coverage | Percentage of revisions that were manually patrolled. Low coverage means edits go unreviewed. |
| Adjusted coverage | Includes revisions that were autopatrolled (automatically marked as reviewed for trusted editors). Gives a fuller picture of how much content is effectively reviewed. |
| Patrol latency | Median and 90th-percentile time (in hours) between an edit being created and being patrolled. Measures how quickly the community responds. |
| Patroller concentration | The minimum number of patrollers needed to account for 50% of all patrol actions — a fragility measure for the quality-control function, analogous to the edit distribution bus factor. |

All patrol indicators can be filtered by the author type of the patrolled edit (registered, anonymous, temporary, bot). This enables analysis such as: "Do anonymous edits get patrolled faster than registered ones?"
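
As an illustration, here is a small Python sketch of patrol coverage and median latency from a hypothetical list of revisions and their patrol timestamps (the real numbers come from the patrol pipeline described below):

```python
# Toy sketch of patrol coverage and latency.
from datetime import datetime
from statistics import median

# (revision_id, created_at, patrolled_at or None)
revs = [
    (1, datetime(2024, 1, 1, 10), datetime(2024, 1, 1, 12)),
    (2, datetime(2024, 1, 1, 11), None),                       # never patrolled
    (3, datetime(2024, 1, 2, 9),  datetime(2024, 1, 2, 9, 30)),
]

patrolled = [r for r in revs if r[2] is not None]
coverage = len(patrolled) / len(revs)                 # share of revisions reviewed
latency_hours = [(p - c).total_seconds() / 3600 for _, c, p in patrolled]
# With real data the 90th percentile would be reported alongside the median.
print(f"coverage={coverage:.0%}, median latency={median(latency_hours):.1f}h")
```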

Data sources

The project consumes publicly available Wikimedia data:

  1. MediaWiki History dumps — the primary data source. These TSV files contain one row per revision event with 76 columns covering editor metadata, page state, and revision details. The project filters to revision-creation events and retains 10 analytical columns.
  2. MediaWiki logging dumps — XML dumps of the logging table, used specifically for patrol events (log_type=patrol) and user-rights changes (to determine autopatrol membership).

No private data, CheckUser information, or non-public APIs are used. All inputs are available to anyone from dumps.wikimedia.org.

Technical overview

The pipeline has four stages:

  1. Fetch — downloads dumps from Wikimedia, streaming to disk without buffering entire files in memory. Supports resume for interrupted downloads.
  2. Ingest — converts raw TSV dumps into Parquet files, filtering to revision-creation events and normalizing columns. Produces a compact analytical layer (~10 columns) suitable for fast aggregation.
  3. Compute — reads Parquet partitions one month at a time and produces per-wiki metric files. Most metrics are computed independently per month; only cohort tracking and churn rates require cross-month state.
  4. Merge — concatenates per-wiki outputs into combined Parquet files for cross-wiki analysis.

The compute engine is written in Rust using Polars for dataframe operations. Patrol metrics use a Python pipeline. The interactive dashboard is built with Observable Framework and uses DuckDB compiled to WebAssembly for client-side queries directly on Parquet files.

Filtering dimensions

All indicators support consistent filtering by:

  • Wiki — any processed Wikimedia project
  • User type — registered, anonymous, temporary, bot (classification follows MediaWiki's own flags)
  • Namespace — any MediaWiki namespace present in the data
  • Time range — arbitrary start/end month (YYYY-MM)
  • Granularity — month, quarter, or year aggregation

Why "economics"?

The economic metaphor is not decorative. Each indicator family maps to a real branch of economics:

  • Edit Distribution → inequality economics (Gini, Theil, Palma are standard measures used by the World Bank and OECD)
  • Community → labor economics (arrival/departure rates, cohort analysis, workforce composition)
  • Content Production → national accounting (gross vs. net output, productivity, sectoral breakdown)
  • Patrol → regulatory economics / quality inspection (coverage rates, response times, inspector concentration)

This framing provides two advantages: (1) the metrics are well-defined and peer-reviewed in the economics literature, avoiding ad-hoc definitions; (2) they are immediately legible to anyone familiar with economic reporting, which includes most policy-makers and institutional stakeholders.

Current status

This project is in early development. The pipeline runs, the dashboard is functional, and the core indicators are computed. However:

  • The metric definitions would benefit from community review — are they measuring the right things? Are there edge cases specific to wiki communities that the standard economic definitions miss?
  • The indicator set is not final. Additional metrics may be added, and existing ones may be refined based on feedback.
  • Cross-wiki validation is ongoing. The pipeline supports all Wikimedia wikis, but careful attention is needed to ensure metrics behave sensibly across communities of very different sizes and cultures.

Feedback

This project is explicitly seeking feedback from the Wikimedia community. Areas where input is especially valuable:

  • Metric relevance — Which indicators are most useful for your community? Which are missing?
  • Metric definitions — Do the economic analogies hold? Where do wiki-specific dynamics break the metaphor?
  • Interpretation — What contextual knowledge would help interpret the numbers? (e.g., known bot campaigns, policy changes, content drives that create spikes)
  • Naming — Are the indicator names clear and appropriate for a volunteer community?

Please leave feedback on the talk page.

Stack & Data Sources

Data sources

All data comes from publicly available Wikimedia dumps. No private APIs, CheckUser data, or non-public datasets are used.

MediaWiki History dumps

The primary data source. These are tab-separated files published by the Wikimedia Foundation at [dumps.wikimedia.org/other/mediawiki_history](https://dumps.wikimedia.org/other/mediawiki_history/). Each row represents a revision event and contains 76 columns covering:

  • **Event metadata**: timestamp, type (create/delete/restore), entity (revision/page/user)
  • **Editor state**: user ID, registration date, edit count at event time, bot flag, anonymous flag, temporary account flag, user groups
  • **Page state**: page ID, title, namespace, creation timestamp, whether the page is a redirect
  • **Revision details**: byte length before/after, SHA1, minor edit flag, deleted/suppressed flags, revert information

The project filters these to **revision-creation events only** and retains 10 analytical columns: timestamp, user ID, user text, page namespace, byte diff, minor flag, bot flag, anonymous flag, temporary flag, and revert indicator.

Dumps are partitioned yearly for most wikis and monthly for the largest projects (English Wikipedia, Wikidata, Commons).
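
For illustration, a rough Python/Polars sketch of the column selection that produces the analytical layer. It operates on an already-parsed warehouse partition for brevity, and the column names are approximations of the published schema; the real ingest stage does this inside the Rust pipeline while parsing the TSV:

```python
# Rough sketch of producing the slim analytical layer with Python/Polars.
# Column names approximate the MediaWiki History schema; the real ingest
# stage is part of the Rust pipeline.
import polars as pl

raw = pl.scan_parquet("warehouse/frwiki/year=2024/**/*.parquet")  # wide layer

analytical = (
    raw.filter(
        (pl.col("event_entity") == "revision") & (pl.col("event_type") == "create")
    )
    .select(
        "event_timestamp", "event_user_id", "event_user_text",
        "page_namespace", "revision_text_bytes_diff",
    )
    .collect()
)
analytical.write_parquet("parquet/frwiki/year=2024/year_month=2024-01/part.parquet")
```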

MediaWiki logging dumps

XML dumps of the `logging` table, fetched from `dumps.wikimedia.org/<wiki>/latest/<wiki>-latest-pages-logging.xml.gz`. Used specifically for:

  • **Patrol events** (`log_type=patrol`): records of editors reviewing new pages and edits
  • **User rights changes** (`log_type=rights`): used to reconstruct which editors held autopatrol permissions at any given time

The XML is streamed and parsed on-the-fly without loading the full file into memory.
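
A minimal Python sketch of this streaming approach using the standard library's `iterparse`; element names are illustrative, and the real script also handles the dump's versioned XML namespace and extracts more fields:

```python
# Sketch of streaming the logging XML without holding it in memory.
# Element names are illustrative.
import gzip
import xml.etree.ElementTree as ET

patrol_timestamps = []
with gzip.open("frwiki-latest-pages-logging.xml.gz", "rb") as fh:
    for _, elem in ET.iterparse(fh, events=("end",)):
        if elem.tag.endswith("logitem"):
            if elem.findtext("{*}type") == "patrol":        # log_type=patrol
                patrol_timestamps.append(elem.findtext("{*}timestamp"))
            elem.clear()                                     # free parsed elements
print(len(patrol_timestamps), "patrol events")
```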

MediaWiki API

A single lightweight query to the [MediaWiki siteinfo API](https://www.mediawiki.org/wiki/API:Siteinfo) fetches which user groups grant the `autopatrol` right (typically sysop and bot). This is combined with the rights-change log to build per-editor intervals of autopatrol membership.
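
A sketch of that lookup against the standard Action API (the wiki here is chosen arbitrarily for illustration):

```python
# Sketch of the siteinfo lookup: which user groups carry the autopatrol right.
import requests

resp = requests.get(
    "https://fr.wikipedia.org/w/api.php",   # any wiki's Action API endpoint
    params={"action": "query", "meta": "siteinfo",
            "siprop": "usergroups", "format": "json"},
    timeout=30,
)
usergroups = resp.json()["query"]["usergroups"]
autopatrol_groups = [g["name"] for g in usergroups
                     if "autopatrol" in g.get("rights", [])]
print(autopatrol_groups)   # typically includes sysop and bot
```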

Stack

Rust for the compute engine

The core pipeline is a Rust CLI (`wiki-econ`) that handles fetching, ingesting, computing, and merging. Key dependencies:

Rust crates and roles:

| Crate | Role |
|---|---|
| Polars 0.53 | Dataframe operations — lazy evaluation, CSV/Parquet I/O, aggregations, joins |
| Rayon | Parallel iteration for multi-wiki processing |
| Reqwest | HTTP client for downloading dumps, with retry and resume support |
| bzip2 | Streaming decompression of `.tsv.bz2` dump files |
| Clap | CLI argument parsing (subcommands: fetch, ingest, compute, merge, run, bench) |
| Tracing | Structured logging with stable fields (wiki, metric, rows, bytes, elapsed_ms) |
| Anyhow | Error handling |

The pipeline processes data in four stages:

  1. Fetch: streams dumps from Wikimedia to disk, supports resume on range-capable servers, bounded to 4 concurrent downloads
  2. Ingest: decompresses bz2 into 32 MB in-memory chunks, parses CSV with Polars, writes Parquet partitions directly (no intermediate TSV on disk). Produces two layers: a wider warehouse layer and a slim analytical layer
  3. Compute: reads one monthly Parquet partition at a time, computes metrics per month. Only cohort tracking, churn rates, and funnel state are carried across months. Outputs per-wiki Parquet files
  4. Merge: concatenates per-wiki metric files into combined cross-wiki Parquet files
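
A rough Python/Polars sketch of the compute stage's month-by-month loop (stage 3 above), keeping only last month's active-editor set as cross-month state; paths and column names are illustrative and the production engine is the Rust CLI:

```python
# Rough sketch of the compute loop: one monthly partition at a time, with
# only last month's active-editor set carried across months (for departures).
# Paths and column names are illustrative; the production engine is Rust.
import glob
import polars as pl

previous_active: set[str] = set()
for part in sorted(glob.glob("parquet/frwiki/year=*/year_month=*")):
    month = pl.scan_parquet(f"{part}/*.parquet")

    active = set(
        month.select("event_user_text").unique().collect()["event_user_text"].to_list()
    )
    net_bytes = month.select(pl.col("revision_text_bytes_diff").sum()).collect().item()

    departures = previous_active - active      # the only cross-month state here
    previous_active = active

    print(part, len(active), len(departures), net_bytes)
```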

Python for the patrol pipeline

Two scripts handle patrol-specific data that comes from logging dumps rather than revision history:

  1. `scripts/fetch_patrol.py` — downloads and parses XML logging dumps, extracts patrol events and user rights changes
  2. `scripts/compute_patrol.py` — joins patrol logs with revision data to compute latency, coverage, and concentration metrics. Classifies each patrolled revision by author type (registered/anonymous/temporary/bot) and namespace
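
A simplified Python/Polars sketch of the kind of join and author-type classification performed in step 2 (column names are hypothetical):

```python
# Simplified sketch of joining patrol events to revisions and classifying
# each patrolled edit by author type. Column names are hypothetical.
from datetime import datetime
import polars as pl

revisions = pl.DataFrame({
    "rev_id": [101, 102],
    "created": [datetime(2024, 1, 1, 10), datetime(2024, 1, 1, 11)],
    "is_bot": [False, False],
    "is_anonymous": [True, False],
})
patrols = pl.DataFrame({
    "rev_id": [101],
    "patrolled": [datetime(2024, 1, 1, 12, 30)],
})

joined = revisions.join(patrols, on="rev_id", how="left").with_columns(
    (pl.col("patrolled") - pl.col("created")).dt.total_hours().alias("latency_h"),
    pl.when(pl.col("is_bot")).then(pl.lit("bot"))
      .when(pl.col("is_anonymous")).then(pl.lit("anonymous"))
      .otherwise(pl.lit("registered")).alias("author_type"),
)
print(joined)
```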

Observable Framework for the dashboard

The interactive dashboard is built with [Observable Framework](https://observablehq.com/framework/) (v1.13). Each page is a Markdown file with embedded JavaScript that renders charts using [Observable Plot](https://observablehq.com/plot/).

DuckDB as query layer

DuckDB serves two roles:

  1. Build-time (shell scripts): the `.json.sh` data loaders use the DuckDB CLI to aggregate Parquet files into pre-computed JSON defaults
  2. Client-side (browser): DuckDB-WASM runs SQL queries on Parquet files when users apply non-default filters, enabling interactive exploration without a backend server
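
The flavor of aggregation involved, sketched here through DuckDB's Python API for brevity (the actual build step uses the DuckDB CLI, and the browser runs equivalent SQL through DuckDB-WASM; file and column names are hypothetical):

```python
# Hypothetical aggregation over a merged metrics Parquet file.
import duckdb

rows = duckdb.sql("""
    SELECT year_month,
           SUM(net_bytes)            AS net_output,
           COUNT(DISTINCT user_text) AS active_editors
    FROM 'output/content_production.parquet'
    WHERE wiki = 'frwiki'
    GROUP BY year_month
    ORDER BY year_month
""").fetchall()
print(rows[:3])
```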

Storage layout

```
data/
  raw/<wiki>/               ← downloaded .tsv.bz2 dumps
  warehouse/<wiki>/         ← wide normalized Parquet (for future metrics)
    year=YYYY/
      year_month=YYYY-MM/
  parquet/<wiki>/           ← slim analytical Parquet (compute input)
    year=YYYY/
      year_month=YYYY-MM/
    _markers/               ← ingest completion markers
output/
  <wiki>/                   ← per-wiki metric Parquet files
  *.parquet                 ← merged cross-wiki files
site/
  src/
    *.md                    ← Observable pages
    components/             ← shared JS (filters, charts)
    data/
      *.parquet             ← symlinked or copied from output/
      defaults_*.json.sh    ← build-time data loaders
```

See also
