Community Notes–Style TSV Export: Full-Scan Analysis of Notes, Ratings & Status

Public Community Notes–style programs produce rich tabular exports: every note on a post, every rating of a note, status metadata for moderation and modeling, participant enrollment, and bat signal rows that describe when tweets become eligible for note requests in different product surfaces. This post analyzes a large local snapshot of the Community Notes download data (stored under the repository’s x/ directory)—on the order of tens of gigabytes across sharded TSV files—using DuckDB to run full scans for exact counts and aggregates, while keeping pandas only for small result tables when plotting.

The goal is not to identify individuals or quote user text at scale, but to describe structure and scale: how activity grew over time, how notes are labeled, how raters use helpfulness dimensions, how statuses cluster, and how contributors are distributed.

Dataset scale summary

Executive summary

Using one canonical notes file and one canonical status file (duplicate exports exist in the folder; see Methodology), the measured row counts are:

Table	Rows
Notes	2,494,377
Note status / history	2,690,287
Ratings (8 shards combined)	206,052,876
User enrollment	1,406,498
Bat signals (tweet-level)	394,052

Time span (epoch milliseconds in source data): notes run from 2021-01-28 through 2026-03-17; ratings from 2021-01-23 through 2026-03-18 (UTC, from export min/max timestamps). These are snapshot bounds, not a claim about program start dates elsewhere.

At the classification level, MISINFORMED_OR_POTENTIALLY_MISLEADING notes dominate the labeled set relative to NOT_MISLEADING in this export (see chart below)—about 78.5% (~1.96M) misleading vs 21.5% (~536K) not misleading. The overall ratings-to-notes ratio is ~82× (206M ratings for 2.5M notes). At the rating level, the helpfulnessLevel field carries most of the signal—HELPFUL and NOT_HELPFUL account for the vast majority of rows—while legacy agree / helpful binary columns are sparse in this extract (detailed in Ratings and helpfulness).

Methodology

Engine: DuckDB reads tab-separated files with read_csv_auto, delim='\t', ignore_errors=true for robustness on edge rows.
Ratings: All eight ratings-*.tsv shards are combined in SQL as a single logical table for aggregates.
No whole-file pandas loads: Aggregates are written to analysis/community_notes/out/*.csv; plotting reads only those small files.
Duplicates: Two pairs of note files share MD5 checksums (notes-00000.tsv with notes-00000 2.tsv, and notes-00000 3.tsv with notes-00000 4.tsv), but the two pairs differ from each other—they are different snapshots. Analysis uses notes-00000.tsv only. Both noteStatusHistory-00000.tsv files match; one is used.
Privacy: We report histograms and sums only; note text is not reproduced here. Participant identifiers are not listed.
Runtime: A full pass on this machine took about 5.4 minutes for all aggregates (see out/metadata.json); ratings shard scans dominate.

Community activity: growth and intensity

Notes over time

Monthly new-note volume rises sharply in the early years and remains high through the export window, with cumulative growth reaching millions of unique note rows in this snapshot.

Notes per month

Cumulative notes

Peak monthly volume is sustained through the export window, with growth in the early years leveling into a high ongoing rate.

Ratings over time

Rating volume is larger by orders of magnitude than note volume—consistent with many raters per note in aggregate.

Ratings per month

Cumulative ratings

Ratings per note (monthly ratio)

Dividing monthly rating rows by monthly note rows gives a rough intensity measure: how “thick” the rating stream is relative to new notes each month (not identical to “ratings per note” at the tweet level, but useful for comparing eras).

Ratings per note monthly ratio

Note content and labels

Classification

The export’s top-level classification field is dominated by MISINFORMED_OR_POTENTIALLY_MISLEADING versus NOT_MISLEADING rows in this file. The bar chart below is the source for the ~78% / ~21% split cited in the executive summary.

Notes by classification

Misleading / not-misleading sub-indicators

Separate boolean-style columns capture why a note was judged misleading or not. Summing those flags across all note rows shows which reasons appear most often in the structured fields (not mutually exclusive per row).

Misleading flag sums

Media and collaborative notes

Media and collaborative flags are a minority share of rows but materially nonzero—useful context when interpreting note mix.

Media and collaborative share

Summary length

Bucketed character counts of the summary field show where most notes sit in terms of length (without exposing text).

Summary length buckets

Ratings and helpfulness

Helpfulness level vs legacy binary columns

In this export, helpfulnessLevel is populated for almost all rating rows (HELPFUL, NOT_HELPFUL, SOMEWHAT_HELPFUL). By contrast, the older agree / disagree / helpful / notHelpful binary columns have very few positive entries—only ~22K each for agree and helpful out of 206M rating rows—so descriptive work should lean on helpfulnessLevel and the detailed sub-dimension flags.

Helpfulness level and binary columns

Form version and source bucket

Rating version strings shift over time as forms evolve; ratingSourceBucketed describes where the rating was submitted.

Ratings by version

Ratings by source

Sub-dimension totals

Finer “why helpful / why not” categories are aggregated with log1p scaling in the heatmaps so large totals remain comparable visually.

Helpfulness sub heatmaps

Joined view: helpfulness mix by note classification

Joining ratings to notes on noteId lets us compare how HELPFUL vs NOT_HELPFUL ratings distribute for notes tagged misleading versus not misleading at the classification level—again at aggregate scale only. Misleading notes receive both more ratings and a different mix: ~93M HELPFUL vs ~52M NOT_HELPFUL for misleading, vs ~21M HELPFUL vs ~14.5M NOT_HELPFUL for not-misleading.

Helpfulness mix by classification

Rating volume by classification

Most rating rows link to MISINFORMED_OR_POTENTIALLY_MISLEADING notes in this snapshot, consistent with both higher volume and more contested content attracting ratings.

Rating rows by classification

Lifecycle: status fields

Status tables provide current, locked, and core status dimensions plus who decided the status in modeled flows. Top categories are shown below (full distributions are in out/*.csv). The vast majority of status rows are NEEDS_MORE_RATINGS (~2.35M); CURRENTLY_RATED_HELPFUL and CURRENTLY_RATED_NOT_HELPFUL each sit in the hundreds of thousands.

Current status top

Locked status top

Core status top

Status decided by (top)

What the status dimensions mean

current: Reflects the display/outcome state of the note—e.g. whether it still needs more ratings or has been decided as currently rated helpful or not helpful.
locked and core: Separate dimensions (e.g. lock state and “core” modeling status). The charts above show the top categories; full value sets and counts are in out/status_by_locked.csv and out/status_by_core.csv.

Participant ecosystem

Enrollment rows describe state (e.g., onboarding vs earned-in), modeling population, and how often participants earn in multiple times.

Enrollment by state

Modeling population

Earn-out buckets

Contributor concentration

A binned histogram of how many notes each author has written shows heavy concentration among one-note contributors and a long tail of highly active authors—reported as bins, not identities.

Notes per author buckets

Bat signals and eligibility

Bat signal rows are tweet-level: several timestamp columns describe eligibility for note requests and API feed tiers, plus optional source links. Counts below are row-level, not unique-global tweet cardinality.

Bat signals eligibility

Limitations

Snapshot only: Files reflect an export at a point in time, not live X.
Schema evolution: Older ratings may use different form versions; compare with care.
Join semantics: Joins are on noteId; multiple notes per tweet and multiple ratings per note are expected.
Interpretation of sparse columns: Binary agree/helpful fields are not the primary signal in this extract—use helpfulnessLevel and sub-flags.
Runtime: Full pipeline on this snapshot is ~5.4 minutes (from metadata.json); ratings shards dominate.
Bat signals: Counts are row-level; no tweet-level deduplication is applied, so unique tweet cardinality is not stated.

Appendix: reproducibility

Data source: Community Notes download data (X)
Code: analysis/community_notes/pipeline.py, plot_figures.py, config.py
Commands: From analysis/community_notes/: run python pipeline.py to regenerate all aggregates, then python plot_figures.py to regenerate figures. For a dry run on fewer ratings shards, use python pipeline.py --ratings-limit-shards 1.
Aggregates: analysis/community_notes/out/metadata.json and companion CSVs
Figures: public/images/blog/x-community-notes-export/*.png (25 images)

If you expand this to other snapshots, re-run deduplication checks on any new notes-*.tsv pairs before combining totals.

Key fields reference

Values below are from this export’s aggregates; other snapshots may differ.

Classification (notes): MISINFORMED_OR_POTENTIALLY_MISLEADING, NOT_MISLEADING
helpfulnessLevel (ratings): HELPFUL, NOT_HELPFUL, SOMEWHAT_HELPFUL, and a small null share
Status dimensions:
- current: NEEDS_MORE_RATINGS, CURRENTLY_RATED_HELPFUL, CURRENTLY_RATED_NOT_HELPFUL
- locked: NEEDS_MORE_RATINGS, CURRENTLY_RATED_HELPFUL, CURRENTLY_RATED_NOT_HELPFUL, (null)
- core (currentCoreStatus): NEEDS_MORE_RATINGS, FIRM_REJECT, (null), CURRENTLY_RATED_HELPFUL, CURRENTLY_RATED_NOT_HELPFUL, NEEDS_YOUR_HELP

Full value sets and counts are in out/status_by_current.csv, out/status_by_locked.csv, and out/status_by_core.csv.

Inside a Multi‑Billion‑Row Community Notes–Style Export: Notes, Ratings, Status, and Eligibility