User:Schiste/what-now/Wiki Economics

Wiki Economics

Wiki Economics is a project that applies economic indicators to Wikimedia activity data. Every edit is a unit of production, every editor is a worker, every namespace is a sector of the economy. By borrowing well-understood frameworks from economics (GDP, labor statistics, inequality metrics, quality control) the project offers a complementary lens on the health, resilience, and dynamics of wiki communities.

The goal is not to reduce editors to numbers, but to surface structural patterns that are difficult to see in raw activity logs: Is output concentrated in too few hands? Are newcomers being retained? Is quality control keeping pace with content production? These are questions every wiki community asks — economics provides a shared vocabulary to answer them.

Scope

The pipeline is designed to process any Wikimedia wiki. Once the metrics are properly validated, the intention is to cover all projects where the data is available through Wikimedia dumps.

Indicators

The project computes four families of indicators, each inspired by a branch of economics.

Edit Distribution

How evenly are edits distributed among editors? High concentration may signal efficiency (experienced editors are productive) or fragility (the community depends on a few individuals). Four complementary measures capture different facets of this question:

| Indicator | Description |
|---|---|
| Gini coefficient | Measures overall edit inequality on a 0-1 scale. 0 = every editor contributes equally; 1 = one editor does everything. Wikimedia communities typically show high Gini values (0.8-0.95), reflecting the well-known pattern where a small core produces most content. |
| Theil index | An entropy-based inequality measure. Unlike Gini, Theil is decomposable: it can separate within-group and between-group inequality (e.g., across namespaces or user types). |
| Palma ratio | The ratio of edits by the top 10% of editors to edits by the bottom 40%. Focuses on the extremes of the distribution rather than the middle, making it sensitive to changes in the most and least active segments. |
| Fragility (bus factor) | The minimum number of top editors whose combined output accounts for 50% of all edits. A low number means the wiki's output depends on very few people — a bus factor risk. Tracked both as an absolute count and as a percentage of the editor base. |
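
For illustration, here is a minimal Python sketch of the Gini coefficient, Palma ratio, and bus factor computed from a list of per-editor edit counts (toy data; the production pipeline computes these in Rust with Polars):

```python
# Toy sketch of the edit-distribution measures from per-editor edit counts.
# Illustrative only; the pipeline computes these in Rust with Polars.

def gini(counts):
    """Gini coefficient: 0 = perfectly equal, 1 = one editor does everything."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * total) - (n + 1) / n

def palma(counts):
    """Edits by the top 10% of editors divided by edits by the bottom 40%."""
    xs = sorted(counts)
    n = len(xs)
    bottom = sum(xs[: int(n * 0.4)])
    top = sum(xs[int(n * 0.9):])
    return top / bottom if bottom else float("inf")

def bus_factor(counts, share=0.5):
    """Minimum number of top editors whose edits cover `share` of all edits."""
    xs = sorted(counts, reverse=True)
    target = share * sum(xs)
    running = 0
    for k, x in enumerate(xs, start=1):
        running += x
        if running >= target:
            return k
    return len(xs)

edits = [1200, 450, 300, 40, 12, 5, 3, 2, 1, 1]  # per-editor edit counts (toy)
print(gini(edits), palma(edits), bus_factor(edits))
```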

Community

Wikimedia's workforce can be studied like a labor market: people arrive, contribute for a time, and eventually leave. Understanding these flows helps answer whether a community is growing, shrinking, or churning.

| Indicator | Description |
|---|---|
| Active editors | Count of unique contributors who made at least one edit in a given period. The most basic measure of community size. |
| Arrivals | Editors whose first recorded edit falls within the period. Measures the inflow of new contributors. |
| Departures | Editors who were active in the previous period but not in the current one. Measures attrition. |
| Arrival & departure rates | Arrivals (or departures) divided by active editors. Normalizes flows relative to community size, making wikis of different sizes comparable. |
| Cohort retention | For each yearly cohort (editors who made their first edit in year Y), tracks what fraction remain active in subsequent years. Reveals whether a wiki retains its newcomers or loses them quickly. |
| User type breakdown | Splits the workforce into registered editors, anonymous (IP) editors, temporary accounts, and bots. Each group has different behaviors, motivations, and policy implications. |
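
As a sketch of how cohort retention can be derived, the following Python snippet groups editors by the year of their first edit and measures what share of a cohort remains active in later years (toy data and a hypothetical input shape; the real pipeline works on monthly Parquet partitions):

```python
# Toy sketch of yearly cohort retention from (user, year-active) pairs.
from collections import defaultdict

activity = [
    ("alice", 2020), ("alice", 2021), ("alice", 2022),
    ("bob", 2020),
    ("carol", 2021), ("carol", 2022),
]

# Cohort assignment: the year of each editor's first activity.
first_year = {}
for user, year in sorted(activity, key=lambda pair: pair[1]):
    first_year.setdefault(user, year)

cohorts = defaultdict(set)
for user, year in first_year.items():
    cohorts[year].add(user)

active_by_year = defaultdict(set)
for user, year in activity:
    active_by_year[year].add(user)

# Fraction of the 2020 cohort still active in each year.
for year in sorted(active_by_year):
    retained = len(cohorts[2020] & active_by_year[year]) / len(cohorts[2020])
    print(year, f"{retained:.0%}")
```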

Content Production

In national economics, GDP measures the total value of goods and services produced. On Wikimedia, output is the content produced by editors, measured in bytes and edits.

| Indicator | Description |
|---|---|
| Gross output | Total bytes added across all edits. Counts every byte contributed, regardless of whether it survives. |
| Net output | Bytes added minus bytes removed. What actually remains in the encyclopedia — the closest analogue to GDP. |
| Content churn | The gap between gross and net output. High churn means a large share of work is undone by reverts, deletions, or rework. |
| Revert rate | Fraction of edits that are identity-reverted. A proxy for unproductive or contested labor. |
| Productivity | Net bytes per edit. Measures how much lasting content each edit produces on average. |
| Productivity per capita | Net bytes per active editor. The wiki equivalent of GDP per capita. |
| Activity tiers | Editors grouped by monthly edit count (1, 2-4, 5-24, 25-99, 100+). Shows how output and editor counts distribute across levels of engagement. |
| Sectoral output | Breakdown by namespace (article, talk, user, project, template, etc.). Each namespace is an economic sector with different dynamics. |
| User type share | Percentage of total edits contributed by registered editors, anonymous editors, temporary accounts, and bots. Tracks how the composition of the workforce evolves. |
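
A minimal Python sketch of how these production measures relate to each other, using a hypothetical list of per-revision byte diffs:

```python
# Toy sketch of the content-production measures from per-revision byte diffs.
# `revisions` holds (editor, byte_diff, was_reverted) tuples.

revisions = [
    ("alice", 1200, False), ("bob", -300, False),
    ("carol", 50, True),    ("alice", 400, False),
]

gross = sum(d for _, d, _ in revisions if d > 0)       # every byte added
removed = sum(-d for _, d, _ in revisions if d < 0)    # bytes taken away
net = gross - removed                                  # closest analogue to GDP
churn = gross - net                                    # work that did not survive
revert_rate = sum(1 for _, _, r in revisions if r) / len(revisions)
productivity = net / len(revisions)                    # net bytes per edit
editors = {e for e, _, _ in revisions}
per_capita = net / len(editors)                        # net bytes per active editor

print(gross, net, churn, revert_rate, productivity, per_capita)
```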

Patrol

Patrolling is Wikimedia's quality-control mechanism: experienced editors review new edits and pages to catch vandalism, errors, and policy violations. These indicators measure whether the community's "immune system" is keeping up.

| Indicator | Description |
|---|---|
| Patrol volume | Total number of patrol actions per period, split between new-page patrols and diff (edit) patrols. |
| Unique patrollers | Count of distinct editors who performed at least one patrol action. Measures the size of the quality-control workforce. |
| Patrol coverage | Percentage of revisions that were manually patrolled. Low coverage means edits go unreviewed. |
| Adjusted coverage | Includes revisions that were autopatrolled (automatically marked as reviewed for trusted editors). Gives a fuller picture of how much content is effectively reviewed. |
| Patrol latency | Median and 90th-percentile time (in hours) between an edit being created and being patrolled. Measures how quickly the community responds. |
| Patroller concentration | The minimum number of patrollers needed to account for 50% of all patrol actions — a fragility measure for the quality-control function, analogous to the edit distribution bus factor. |

All patrol indicators can be filtered by the author type of the patrolled edit (registered, anonymous, temporary, bot). This enables analysis such as: "Do anonymous edits get patrolled faster than registered ones?"
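
As an illustration, here is a small Python sketch of patrol coverage and median latency from a hypothetical list of revisions and their patrol timestamps (the real numbers come from the patrol pipeline described below):

```python
# Toy sketch of patrol coverage and latency.
from datetime import datetime
from statistics import median

# (revision_id, created_at, patrolled_at or None)
revs = [
    (1, datetime(2024, 1, 1, 10), datetime(2024, 1, 1, 12)),
    (2, datetime(2024, 1, 1, 11), None),                       # never patrolled
    (3, datetime(2024, 1, 2, 9),  datetime(2024, 1, 2, 9, 30)),
]

patrolled = [r for r in revs if r[2] is not None]
coverage = len(patrolled) / len(revs)                 # share of revisions reviewed
latency_hours = [(p - c).total_seconds() / 3600 for _, c, p in patrolled]
# With real data the 90th percentile would be reported alongside the median.
print(f"coverage={coverage:.0%}, median latency={median(latency_hours):.1f}h")
```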

Data sources

The project consumes publicly available Wikimedia data:

  1. MediaWiki History dumps — the primary data source. These TSV files contain one row per revision event with 76 columns covering editor metadata, page state, and revision details. The project filters to revision-creation events and retains 10 analytical columns.
  2. MediaWiki logging dumps — XML dumps of the logging table, used specifically for patrol events (log_type=patrol) and user-rights changes (to determine autopatrol membership).

No private data, CheckUser information, or non-public APIs are used. All inputs are available to anyone from dumps.wikimedia.org.

Technical overview

The pipeline has four stages:

  1. Fetch — downloads dumps from Wikimedia, streaming to disk without buffering entire files in memory. Supports resume for interrupted downloads.
  2. Ingest — converts raw TSV dumps into Parquet files, filtering to revision-creation events and normalizing columns. Produces a compact analytical layer (~10 columns) suitable for fast aggregation.
  3. Compute — reads Parquet partitions one month at a time and produces per-wiki metric files. Most metrics are computed independently per month; only cohort tracking and churn rates require cross-month state.
  4. Merge — concatenates per-wiki outputs into combined Parquet files for cross-wiki analysis.

The compute engine is written in Rust using Polars for dataframe operations. Patrol metrics use a Python pipeline. The interactive dashboard is built with Observable Framework and uses DuckDB compiled to WebAssembly for client-side queries directly on Parquet files.

Filtering dimensions

All indicators support consistent filtering by:

  • Wiki — any processed Wikimedia project
  • User type — registered, anonymous, temporary, bot (classification follows MediaWiki's own flags)
  • Namespace — any MediaWiki namespace present in the data
  • Time range — arbitrary start/end month (YYYY-MM)
  • Granularity — month, quarter, or year aggregation

Why "economics"?

The economic metaphor is not decorative. Each indicator family maps to a real branch of economics:

  • Edit Distribution → inequality economics (Gini, Theil, Palma are standard measures used by the World Bank and OECD)
  • Community → labor economics (arrival/departure rates, cohort analysis, workforce composition)
  • Content Production → national accounting (gross vs. net output, productivity, sectoral breakdown)
  • Patrol → regulatory economics / quality inspection (coverage rates, response times, inspector concentration)

This framing provides two advantages: (1) the metrics are well-defined and peer-reviewed in the economics literature, avoiding ad-hoc definitions; (2) they are immediately legible to anyone familiar with economic reporting, which includes most policy-makers and institutional stakeholders.

Current status

This project is in early development. The pipeline runs, the dashboard is functional, and the core indicators are computed. However:

  • The metric definitions would benefit from community review — are they measuring the right things? Are there edge cases specific to wiki communities that the standard economic definitions miss?
  • The indicator set is not final. Additional metrics may be added, and existing ones may be refined based on feedback.
  • Cross-wiki validation is ongoing. The pipeline supports all Wikimedia wikis, but careful attention is needed to ensure metrics behave sensibly across communities of very different sizes and cultures.

Feedback

This project is explicitly seeking feedback from the Wikimedia community. Areas where input is especially valuable:

  • Metric relevance — Which indicators are most useful for your community? Which are missing?
  • Metric definitions — Do the economic analogies hold? Where do wiki-specific dynamics break the metaphor?
  • Interpretation — What contextual knowledge would help interpret the numbers? (e.g., known bot campaigns, policy changes, content drives that create spikes)
  • Naming — Are the indicator names clear and appropriate for a volunteer community?

Please leave feedback on the talk page.

Stack & Data Sources

Data sources

All data comes from publicly available Wikimedia dumps. No private APIs, CheckUser data, or non-public datasets are used.

MediaWiki History dumps

The primary data source. These are tab-separated files published by the Wikimedia Foundation at [dumps.wikimedia.org/other/mediawiki_history](https://dumps.wikimedia.org/other/mediawiki_history/). Each row represents a revision event and contains 76 columns covering:

  • **Event metadata**: timestamp, type (create/delete/restore), entity (revision/page/user)
  • **Editor state**: user ID, registration date, edit count at event time, bot flag, anonymous flag, temporary account flag, user groups
  • **Page state**: page ID, title, namespace, creation timestamp, whether the page is a redirect
  • **Revision details**: byte length before/after, SHA1, minor edit flag, deleted/suppressed flags, revert information

The project filters these to **revision-creation events only** and retains 10 analytical columns: timestamp, user ID, user text, page namespace, byte diff, minor flag, bot flag, anonymous flag, temporary flag, and revert indicator.

Dumps are partitioned yearly for most wikis and monthly for the largest projects (English Wikipedia, Wikidata, Commons).
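
For illustration, a rough Python/Polars sketch of the column selection that produces the analytical layer. It operates on an already-parsed warehouse partition for brevity, and the column names are approximations of the published schema; the real ingest stage does this inside the Rust pipeline while parsing the TSV:

```python
# Rough sketch of producing the slim analytical layer with Python/Polars.
# Column names approximate the MediaWiki History schema; the real ingest
# stage is part of the Rust pipeline.
import polars as pl

raw = pl.scan_parquet("warehouse/frwiki/year=2024/**/*.parquet")  # wide layer

analytical = (
    raw.filter(
        (pl.col("event_entity") == "revision") & (pl.col("event_type") == "create")
    )
    .select(
        "event_timestamp", "event_user_id", "event_user_text",
        "page_namespace", "revision_text_bytes_diff",
    )
    .collect()
)
analytical.write_parquet("parquet/frwiki/year=2024/year_month=2024-01/part.parquet")
```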

MediaWiki logging dumps

XML dumps of the `logging` table, fetched from `dumps.wikimedia.org/<wiki>/latest/<wiki>-latest-pages-logging.xml.gz`. Used specifically for:

  • **Patrol events** (`log_type=patrol`): records of editors reviewing new pages and edits
  • **User rights changes** (`log_type=rights`): used to reconstruct which editors held autopatrol permissions at any given time

The XML is streamed and parsed on-the-fly without loading the full file into memory.
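
A minimal Python sketch of this streaming approach using the standard library's `iterparse`; element names are illustrative, and the real script also handles the dump's versioned XML namespace and extracts more fields:

```python
# Sketch of streaming the logging XML without holding it in memory.
# Element names are illustrative.
import gzip
import xml.etree.ElementTree as ET

patrol_timestamps = []
with gzip.open("frwiki-latest-pages-logging.xml.gz", "rb") as fh:
    for _, elem in ET.iterparse(fh, events=("end",)):
        if elem.tag.endswith("logitem"):
            if elem.findtext("{*}type") == "patrol":        # log_type=patrol
                patrol_timestamps.append(elem.findtext("{*}timestamp"))
            elem.clear()                                     # free parsed elements
print(len(patrol_timestamps), "patrol events")
```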

MediaWiki API

A single lightweight query to the [MediaWiki siteinfo API](https://www.mediawiki.org/wiki/API:Siteinfo) fetches which user groups grant the `autopatrol` right (typically sysop and bot). This is combined with the rights-change log to build per-editor intervals of autopatrol membership.
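
A sketch of that lookup against the standard Action API (the wiki here is chosen arbitrarily for illustration):

```python
# Sketch of the siteinfo lookup: which user groups carry the autopatrol right.
import requests

resp = requests.get(
    "https://fr.wikipedia.org/w/api.php",   # any wiki's Action API endpoint
    params={"action": "query", "meta": "siteinfo",
            "siprop": "usergroups", "format": "json"},
    timeout=30,
)
usergroups = resp.json()["query"]["usergroups"]
autopatrol_groups = [g["name"] for g in usergroups
                     if "autopatrol" in g.get("rights", [])]
print(autopatrol_groups)   # typically includes sysop and bot
```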

Stack

Rust for the compute engine

The core pipeline is a Rust CLI (`wiki-econ`) that handles fetching, ingesting, computing, and merging. Key dependencies:

Rust crates and roles:

| Crate | Role |
|---|---|
| Polars 0.53 | Dataframe operations — lazy evaluation, CSV/Parquet I/O, aggregations, joins |
| Rayon | Parallel iteration for multi-wiki processing |
| Reqwest | HTTP client for downloading dumps, with retry and resume support |
| bzip2 | Streaming decompression of `.tsv.bz2` dump files |
| Clap | CLI argument parsing (subcommands: fetch, ingest, compute, merge, run, bench) |
| Tracing | Structured logging with stable fields (wiki, metric, rows, bytes, elapsed_ms) |
| Anyhow | Error handling |

The pipeline processes data in four stages:

  1. Fetch: streams dumps from Wikimedia to disk, supports resume on range-capable servers, bounded to 4 concurrent downloads
  2. Ingest: decompresses bz2 into 32 MB in-memory chunks, parses CSV with Polars, writes Parquet partitions directly (no intermediate TSV on disk). Produces two layers: a wider warehouse layer and a slim analytical layer
  3. Compute: reads one monthly Parquet partition at a time, computes metrics per month. Only cohort tracking, churn rates, and funnel state are carried across months. Outputs per-wiki Parquet files
  4. Merge: concatenates per-wiki metric files into combined cross-wiki Parquet files
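
A rough Python/Polars sketch of the compute stage's month-by-month loop (stage 3 above), keeping only last month's active-editor set as cross-month state; paths and column names are illustrative and the production engine is the Rust CLI:

```python
# Rough sketch of the compute loop: one monthly partition at a time, with
# only last month's active-editor set carried across months (for departures).
# Paths and column names are illustrative; the production engine is Rust.
import glob
import polars as pl

previous_active: set[str] = set()
for part in sorted(glob.glob("parquet/frwiki/year=*/year_month=*")):
    month = pl.scan_parquet(f"{part}/*.parquet")

    active = set(
        month.select("event_user_text").unique().collect()["event_user_text"].to_list()
    )
    net_bytes = month.select(pl.col("revision_text_bytes_diff").sum()).collect().item()

    departures = previous_active - active      # the only cross-month state here
    previous_active = active

    print(part, len(active), len(departures), net_bytes)
```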

Python for the patrol pipeline

Two scripts handle patrol-specific data that comes from logging dumps rather than revision history:

  1. `scripts/fetch_patrol.py` — downloads and parses XML logging dumps, extracts patrol events and user rights changes
  2. `scripts/compute_patrol.py` — joins patrol logs with revision data to compute latency, coverage, and concentration metrics. Classifies each patrolled revision by author type (registered/anonymous/temporary/bot) and namespace
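
A simplified Python/Polars sketch of the kind of join and author-type classification performed in step 2 (column names are hypothetical):

```python
# Simplified sketch of joining patrol events to revisions and classifying
# each patrolled edit by author type. Column names are hypothetical.
from datetime import datetime
import polars as pl

revisions = pl.DataFrame({
    "rev_id": [101, 102],
    "created": [datetime(2024, 1, 1, 10), datetime(2024, 1, 1, 11)],
    "is_bot": [False, False],
    "is_anonymous": [True, False],
})
patrols = pl.DataFrame({
    "rev_id": [101],
    "patrolled": [datetime(2024, 1, 1, 12, 30)],
})

joined = revisions.join(patrols, on="rev_id", how="left").with_columns(
    (pl.col("patrolled") - pl.col("created")).dt.total_hours().alias("latency_h"),
    pl.when(pl.col("is_bot")).then(pl.lit("bot"))
      .when(pl.col("is_anonymous")).then(pl.lit("anonymous"))
      .otherwise(pl.lit("registered")).alias("author_type"),
)
print(joined)
```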

Observable Framework for the dashboard

The interactive dashboard is built with [Observable Framework](https://observablehq.com/framework/) (v1.13). Each page is a Markdown file with embedded JavaScript that renders charts using [Observable Plot](https://observablehq.com/plot/).

DuckDB as query layer

DuckDB serves two roles:

  1. Build-time (shell scripts): the `.json.sh` data loaders use the DuckDB CLI to aggregate Parquet files into pre-computed JSON defaults
  2. Client-side (browser): DuckDB-WASM runs SQL queries on Parquet files when users apply non-default filters, enabling interactive exploration without a backend server
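
The flavor of aggregation involved, sketched here through DuckDB's Python API for brevity (the actual build step uses the DuckDB CLI, and the browser runs equivalent SQL through DuckDB-WASM; file and column names are hypothetical):

```python
# Hypothetical aggregation over a merged metrics Parquet file.
import duckdb

rows = duckdb.sql("""
    SELECT year_month,
           SUM(net_bytes)            AS net_output,
           COUNT(DISTINCT user_text) AS active_editors
    FROM 'output/content_production.parquet'
    WHERE wiki = 'frwiki'
    GROUP BY year_month
    ORDER BY year_month
""").fetchall()
print(rows[:3])
```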

Storage layout

```
data/
  raw/<wiki>/               ← downloaded .tsv.bz2 dumps
  warehouse/<wiki>/         ← wide normalized Parquet (for future metrics)
    year=YYYY/
      year_month=YYYY-MM/
  parquet/<wiki>/           ← slim analytical Parquet (compute input)
    year=YYYY/
      year_month=YYYY-MM/
    _markers/               ← ingest completion markers
output/
  <wiki>/                   ← per-wiki metric Parquet files
  *.parquet                 ← merged cross-wiki files
site/
  src/
    *.md                    ← Observable pages
    components/             ← shared JS (filters, charts)
    data/
      *.parquet             ← symlinked or copied from output/
      defaults_*.json.sh    ← build-time data loaders
```

See also
