Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches

arXiv cs.CL Papers

Summary

This paper introduces a linked multimodal dataset of official speeches from the Russian government, including text, images, metadata, and topic annotations, designed to support social science research and LLM applications in political domains.

arXiv:2605.15886v1 Announce Type: new Abstract: This paper introduces a dataset of interlinked multimodal political communications from the Russian government, addressing persistent deficiencies in the availability of social text- and image-based data for authoritarian politics contexts. The dataset comprises two large corpora of official speeches delivered by senior actors within the Kremlin and the Russian Ministry of Foreign Affairs over multiple decades. For each speech, we provide Russian- and English-language texts, associated images and captions where available, and harmonized metadata including (e.g.) dates, speakers, (geo)locations, and official government content tags. Unique identifiers link images to speeches and align Russian and English versions of the same communication texts. We further augment these linked datasets with validated topical annotations for both speech texts and speech images, which are generated via transformer-based multimodal topic modeling and refined by a Russian politics expert. The resulting data resources support multimodal, multilingual, temporal, and/or spatial analyses of (authoritarian) political communication and offer a valuable testbed for social science research and large language model (LLM) applications in political domains.

Cached at: 05/18/26, 06:34 AM

# Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches
Source: [https://arxiv.org/html/2605.15886](https://arxiv.org/html/2605.15886)
Daria Blinova; Gayathri Emuru \(University of Delaware, Masters of Science in Data Science Program, Newark, DE, USA\); Rakesh Emuru \(University of Delaware, Masters of Science in Data Science Program, Newark, DE, USA\); Kushagradheer Shridheer Srivastava \(University of Delaware, Masters of Science in Data Science Program, Newark, DE, USA\); Mina Rulis \(University of Pennsylvania, Department of Political Science, Philadelphia, PA, USA\); Sunita Chandrasekaran \(University of Delaware, Department of Computer & Information Sciences, Newark, DE, USA\); Benjamin E\. Bagozzi \(University of Delaware, Department of Political Science & International Relations, Newark, DE, USA\)

Corresponding author: Benjamin E\. Bagozzi \(bagozzib@udel\.edu\)

###### Abstract

This paper introduces a dataset of interlinked multimodal political communications from the Russian government, addressing persistent deficiencies in the availability of social text\- and image\-based data for authoritarian politics contexts\. The dataset comprises two large corpora of official speeches delivered by senior actors within the Kremlin and the Russian Ministry of Foreign Affairs over multiple decades\. For each speech, we provide Russian\- and English\-language texts, associated images and captions where available, and harmonized metadata including \(e\.g\.\) dates, speakers, \(geo\)locations, and official government content tags\. Unique identifiers link images to speeches and align Russian and English versions of the same communication texts\. We further augment these linked datasets with validated topical annotations for both speech texts and speech images, which are generated via transformer\-based multimodal topic modeling and refined by a Russian politics expert\. The resulting data resources support multimodal, multilingual, temporal, and/or spatial analyses of \(authoritarian\) political communication and offer a valuable testbed for social science research and large language model \(LLM\) applications in political domains\.

## Background & Summary

Over the past decade, social scientists have increasingly turned to text\-as\-data and, more recently, image\-as\-data approaches to study political, social, and economic phenomena at scale\. Advances in natural language processing \(NLP\), computer vision, and representation learning have enabled researchers to analyze vast corpora of speeches, news articles, social media posts, and visual content that were previously inaccessible to systematic empirical analysis\. In political science, this shift has reshaped the study of elite behavior, public opinion, and international relations, with influential contributions appearing in Political Analysis, American Political Science Review, American Journal of Political Science, and the Annual Review of Political Science\[[25](https://arxiv.org/html/2605.15886#bib.bib1),[70](https://arxiv.org/html/2605.15886#bib.bib4),[42](https://arxiv.org/html/2605.15886#bib.bib7),[6](https://arxiv.org/html/2605.15886#bib.bib3),[63](https://arxiv.org/html/2605.15886#bib.bib77),[65](https://arxiv.org/html/2605.15886#bib.bib2)\]\. Parallel developments are evident across sociology\[[61](https://arxiv.org/html/2605.15886#bib.bib46),[13](https://arxiv.org/html/2605.15886#bib.bib45)\], communications\[[16](https://arxiv.org/html/2605.15886#bib.bib48),[8](https://arxiv.org/html/2605.15886#bib.bib47)\], management and organizational research\[[34](https://arxiv.org/html/2605.15886#bib.bib44),[59](https://arxiv.org/html/2605.15886#bib.bib38)\], and psychology\[[49](https://arxiv.org/html/2605.15886#bib.bib43),[9](https://arxiv.org/html/2605.15886#bib.bib42)\], underscoring trends towards disciplinary convergence around computational approaches to social inquiry\.
Concurrently, these methods have become central to computer science and data science research itself, where socially grounded text and image corpora are increasingly used to evaluate models, study bias, and develop multimodal learning techniques\[[37](https://arxiv.org/html/2605.15886#bib.bib40),[53](https://arxiv.org/html/2605.15886#bib.bib194),[72](https://arxiv.org/html/2605.15886#bib.bib41)\]\.

Despite this progress, scholars still lack comprehensive and systematically linked social text\- and image\-based datasets\. This gap is particularly acute for datasets that combine linked multimodal content—that is, text and images associated with the same political communication—and for datasets that include multilingual text, which have each been shown to be critical for understanding political messaging, framing, and signaling\[[32](https://arxiv.org/html/2605.15886#bib.bib16),[4](https://arxiv.org/html/2605.15886#bib.bib13),[50](https://arxiv.org/html/2605.15886#bib.bib14),[35](https://arxiv.org/html/2605.15886#bib.bib15)\]\. The absence of such data is especially consequential in authoritarian settings, where conventional sources of quantitative information—such as economic statistics, public opinion surveys, or administrative records—are often unavailable, selectively released, or strategically manipulated\[[29](https://arxiv.org/html/2605.15886#bib.bib22),[66](https://arxiv.org/html/2605.15886#bib.bib21),[58](https://arxiv.org/html/2605.15886#bib.bib11)\]\. Existing exceptions that rely on either text or images alone—such as analyses of authoritarian rhetoric in online media or speech or studies of state propaganda imagery—have nonetheless demonstrated substantial analytic value, illuminating patterns of elite signaling, policy priorities, and regime legitimation strategies\[[15](https://arxiv.org/html/2605.15886#bib.bib19),[57](https://arxiv.org/html/2605.15886#bib.bib12),[58](https://arxiv.org/html/2605.15886#bib.bib11),[41](https://arxiv.org/html/2605.15886#bib.bib39),[33](https://arxiv.org/html/2605.15886#bib.bib18),[74](https://arxiv.org/html/2605.15886#bib.bib17)\]\. In short, these latter studies suggest that richer, multimodal, and multilingual data could further deepen our understandings of authoritarian politics\.

Motivated by this potential, we collect and curate a novel set of interlinked multimodal datasets combining text, images, and metadata from two distinct collections of Russian government speeches\. Our overarching data resource includes linked Russian\- and English\-language versions of official speech texts, associated speech images, and contextual metadata for two sets of speeches: those delivered by high\-level political actors \(most commonly, the Russian president\) under the auspices of the Kremlin from 31 December 1999 to 20 September 2025, and those delivered by high\-level political actors \(most commonly, the Minister of Foreign Affairs of the Russian Federation\) under the auspices of the Russian Ministry of Foreign Affairs from 18 March 2004 to 7 October 2025\. Across these two corpora, the data encompass 15,610 total English\-language speeches and 19,396 total Russian\-language speeches, along with associated sets of 42,782 and 49,277 images, providing rich multimodal content over considerable time spans covering key periods in contemporary Russian domestic politics and international relations\.

In total, we release cleaned Russian\- and English\-language texts, all associated images, and harmonized metadata for both the Kremlin and Ministry of Foreign Affairs speech collections\. Each dataset includes unique identifiers linking images to specific speeches, as well as identifiers linking Russian and English versions of the same speech\. The Kremlin and the Russian Ministry of Foreign Affairs themselves produce and publish the separate Russian and English versions of the speeches we collect\. However, as discussed further below, these parallel English and Russian speech versions are not necessarily identical\. Our collected and linked data therefore uniquely allow future scholars to study variation in official Russian government decisions concerning which content to include or omit within each language version of a given speech, and the potential causes and effects of such decisions\. Where available, our extracted metadata for each speech corpus also include the date of each speech, official indexing tags assigned by the Russian government \(e\.g\., for theme, region, and speaker\), speech titles and summaries, image captions, the named speech location, speaker names, and our own extracted and validated speech geolocations, alongside additional \(meta\)data to aid researchers\.

Beyond these textual, image, and metadata attributes, we also enrich each corpus with our own substantively meaningful topic annotations that we develop via a human\-in\-the\-loop framework\. Using transformer\-based multimodal topic modeling \(specifically BERTopic\[[26](https://arxiv.org/html/2605.15886#bib.bib96)\]\), we estimate latent topics separately for each \(language\-specific\) text corpus and its associated image data\. A Russian politics subject\-matter expert then labels the core topics and groups them into higher\-level thematic categories\. We validate these topics through extensive comparisons with official Kremlin\-provided thematic labels \(where available\) and secondary qualitative reviews by political science subject\-matter experts\. Together, our topic annotations provide transparently extracted and near\-complete coverage of all speeches in our Kremlin and Ministry of Foreign Affairs corpora, in contrast to the Russian government’s own thematic topic labels, which cover only a limited portion of Kremlin\-specific speeches\.

Altogether, the resulting data offer a wide range of applications for social scientists, computer scientists, and data scientists alike\. For social scientists, the linked texts, images, and topic labels enable systematic analyses of Russian domestic and foreign policy priorities over time and potentially space, extending prior work that has relied almost exclusively on single\-language, text\-only corpora from a single ministry or executive actor\[Mölder2023,[12](https://arxiv.org/html/2605.15886#bib.bib79),[73](https://arxiv.org/html/2605.15886#bib.bib81)\]\. These data accordingly allow social science researchers to study the political origins and/or effects of \(i\) divergences in speech \(text and/or image\) content across Russia’s Kremlin and Ministry of Foreign Affairs and \(ii\) divergences in the English versus Russian language versions of each speech released by these respective political units\. These potential comparisons speak directly to a growing literature on authoritarian signaling and audience differentiation, including work on China showing how regimes tailor messages to foreign and domestic publics\[[69](https://arxiv.org/html/2605.15886#bib.bib83),[68](https://arxiv.org/html/2605.15886#bib.bib84),[19](https://arxiv.org/html/2605.15886#bib.bib25),[36](https://arxiv.org/html/2605.15886#bib.bib24),[40](https://arxiv.org/html/2605.15886#bib.bib23)\]\.

Likewise, many data science, AI, and computer science researchers rely on large test beds of annotated texts \(and images\) for developing or validating new machine learning or AI methods\. Our expert\-labeled and validated topic variables, and linked English\- and Russian\-language datasets more generally, provide ready\-to\-use inputs for such tasks, especially when these tasks intend to explore qualities of Russian\-language text and/or multilingual or multimodal content\. Finally, for government officials, as well as researchers interested in political forecasting, the text, image, and topic information in our data can be used to develop inputs for regression analyses, time\-series models, or forecasting and early\-warning systems related to conflict, foreign policy behavior, and state stability\[[44](https://arxiv.org/html/2605.15886#bib.bib85),[10](https://arxiv.org/html/2605.15886#bib.bib8),[18](https://arxiv.org/html/2605.15886#bib.bib86),[42](https://arxiv.org/html/2605.15886#bib.bib7)\]\.

In what follows, we first review the project scope and documents, followed by an overview of our webscraping, data cleaning, and measurement strategies for our texts, images, and metadata\. Next, we provide details on the structure of our dataset and data records\. Lastly, we discuss our validation exercises and conclude with the usage notes and details on data and code availability\.

## Methods

### Project Scope

In this section, we describe the sets of text and image corpora that we extract from the Kremlin \(kremlin\.ru\) and the Russian Ministry of Foreign Affairs \(mid\.ru\) official websites\. The Kremlin is the official representation of the Russian President, who is the executive head of the Russian state\. The Ministry of Foreign Affairs is a federal executive authority implementing foreign policy and operating under the jurisdiction of the Russian President\. Both of these Russian executive branch institutions serve complementary roles in shaping the direction of the Russian state’s internal and/or external affairs\. At the same time, they serve as a source of important political information, allowing outside observers to assess the dynamics of the state’s official rhetoric\. Since both the Kremlin and the Ministry of Foreign Affairs operate their own official websites, they separately archive speeches, press releases, interviews, and other relevant content involving texts and images\. Though these speeches and images are also at times disseminated in other mediums, such as through television media, these two governmental websites are unique in the extent to which they archive these data in a systematic manner over extended periods\. Our extraction and released materials focus on the textual content of each item and any associated still images; we do not extract, store, or analyze video content\. Below, we first describe the English and Russian Kremlin corpora \(and accompanying images\)\. We then turn to describe the English and Russian versions of the Ministry of Foreign Affairs corpora \(and its images\)\.

### Kremlin Texts and Images

The Kremlin text and image corpora are extracted from the official Kremlin website \(kremlin\.ru\)\. This website archives transcripts of all Kremlin speeches \(and their visuals\) given on different occasions from 31 December 1999 onward, which for the purposes of our data collection covers up to and including 20 September 2025\. The Kremlin website stores speech transcripts both in Russian and English, alongside images that reflect the speech setting corresponding to the speech content\. As discussed in more detail above and below, we extract these English and Russian speech texts and their visuals in separate datasets and provide additional metadata accompanying these extracted texts and images\.

Given the nature of the Kremlin and its official website, a majority of the Russian and English speeches on this website are given by the President of Russia\. During the date range of our dataset, Russia had only two presidents: Vladimir Putin \(2000\-2008; 2012\-Present\) and Dmitry Medvedev \(2008\-2012\)\. These actors represent a majority of the speakers recorded across the speeches associated with the Kremlin and its official website\. Yet in addition to presidential speeches, the Kremlin’s available transcripts also include a smaller share of speeches from domestic or international political representatives with whom these two presidents interacted\. In such cases, these transcripts typically include speech content from the Russian president as part of a collaboration \(e\.g\., a meeting or joint speech\) with such leaders\.

As noted previously, the occasions on which these public speeches are given vary, as does the format of the speech \(and accompanying image\) itself and its thematic focus\. Such variation spans interviews or official communication between the president and domestic or international leaders to nationwide announcements and transcriptions of international forum performances\. At the same time, the events where such speeches are given are equally diverse and include bilateral meetings, multilateral arrangements, domestic events, presidential greetings and salutations, addresses for national holidays, and others\.

Each archived speech on the Kremlin’s website includes a speech title, the speech text itself, a date \(down to calendar day\) and time of day, a location, and \(for more recent speeches\) a set of thematic tags assigned by the Kremlin itself\. In some cases, as noted earlier, speeches can also contain images with captions\. As discussed in more detail below, much of the above content is incorporated as metadata alongside the main text and image data for our Kremlin corpora\. Finally, we can note that while some prior research has analyzed the English versions of the Kremlin’s speech text transcripts\[[12](https://arxiv.org/html/2605.15886#bib.bib79),Mölder2023\], no work to our knowledge has extracted a comprehensive set of both Russian and English versions of these speech text transcripts, nor of the accompanying images\.

### The Ministry of Foreign Affairs of the Russian Federation Texts and Images

The Russian Ministry of Foreign Affairs \(hereafter abbreviated as MID, given the Russian name of the ministry, Ministerstvo Inostrannyh Del\) represents a distinct source of official Russian government rhetoric\. Relative to the Kremlin, the MID more narrowly administers Russian foreign policy priorities rather than both international and domestic Russian policy concerns\. The official website of the MID \(mid\.ru\), which we used to extract our MID corpora and visuals, collects various Ministry speeches and media made by representatives of the agency\. Our speeches and images of interest from this website primarily relate to the statements and speeches made by the Minister of Foreign Affairs, Sergey Lavrov, who assumed office in March 2004 and has served through the time of writing\. As in the case of the Kremlin website, the MID archives these texts in both Russian and English with corresponding metadata\. Since Lavrov’s speech records coincide with his tenure in office, the date range for our datasets runs from 18 March 2004 to 7 October 2025\.

Like the Kremlin corpora, the textual speeches \(and their associated images\) made by the Russian Foreign Minister, and at times by other parties he engages with during these speeches, are remarkably diverse\. The MID’s speeches encompass press conferences that the Minister has held both within and outside the country; interviews requested by mass media in connection with summits or bilateral meetings, including personal interviews such as with Tucker Carlson and other prominent commentators; and occasional meetings within Russia’s consulates and other diplomatic exchanges\.

Similar to the Kremlin speeches discussed further above, all MID speeches are systematically formatted and consistently archived in both Russian and English on the MID’s official website\. For each speech therein, this archived content includes a speech title, the speech text itself, any corresponding images, as well as information on the speech’s recorded date, time, and location\. That being said, and in comparison to the Kremlin speeches discussed earlier, this particular set of archived speeches does not contain any ministry\-assigned thematic tags\. Moreover, location information for the MID speeches is not always clearly stored as meta\-data and hence is not as easily extractible as is the case for our Kremlin speeches\. These caveats aside, the MID speech \(text and image\) data and available metadata altogether represent the main focus of our extraction efforts for this particular ministry as discussed further below\. To the best of our knowledge, no research has extracted or considered these particular speech data across both their English and Russian language text content, nor with regards to their associated images\.

### Webscraping

Our first objective was to webscrape the websites discussed above in order to collect all inputs needed for the construction of four parallel corpora of official speeches and their associated images, spanning the Kremlin \(Russian and English\) and the Russian Ministry of Foreign Affairs \(MID; Russian and English\)\. For each source\-language pair, this requires that we systematically collect the page markup \(HTML\) for every speech, extract core text fields \(identifier, URL, title, date, and main body\), enumerate and download all images associated with that page, and write an analysis\-ready CSV that references a per\-speech image directory\. We now describe these steps in further detail\. Throughout this discussion, we refer to the downloaded page content as “HTML files” and to the tabular outputs as “CSV files” to keep the terminology precise and consistent\.

From an implementation perspective, all webscraping is carried out in Python 3\[[51](https://arxiv.org/html/2605.15886#bib.bib217)\] using the requests\[[55](https://arxiv.org/html/2605.15886#bib.bib195)\] library for HTTP retrieval, BeautifulSoup\[[56](https://arxiv.org/html/2605.15886#bib.bib216)\] for HTML parsing, and pandas\[[39](https://arxiv.org/html/2605.15886#bib.bib201)\] and the Python standard library \(csv, pathlib, os\)\[[51](https://arxiv.org/html/2605.15886#bib.bib217)\] for input/output and file system management\. To keep the process modular, we implement separate webscraping scripts for each source \(Kremlin vs\. MID\), parameterized by language\. Each script follows the same two\-stage pipeline shown in Figure [1](https://arxiv.org/html/2605.15886#Sx2.F1)\. First, an index builder traverses the site’s public listings, records the canonical speech URLs and their numeric identifiers, and writes an index CSV\. We rely on this index rather than naive “next page” crawling because listings can reorder, contain gaps, or include unpublished identifiers, and numeric IDs are not guaranteed to be contiguous\. The index CSV therefore defines a stable target universe, enables resumable runs, and allows us to verify coverage\.

Second, a page fetcher\-parser consumes the index CSV, downloads each speech page, applies conservative selectors to recover title, date, full text, and some other metadata, and in the same pass discovers image sources exposed on the page or on linked first\-party photo subpages\. For each candidate image we request the largest available rendition \(falling back to smaller variants when necessary\) and record both the number of images advertised on the page and the number successfully saved for that speech in the CSV so that any mismatch is explicit and auditable\. Figure [1](https://arxiv.org/html/2605.15886#Sx2.F1) summarizes this two\-stage workflow from site listings through to speech\-level CSVs and image folders\.
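The largest-first fallback and the declared-versus-saved bookkeeping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `save_first_available` and `harvest_images`, the injected `fetch` callable, and the example file names are all hypothetical. The fetcher is injected so the logic can be exercised without network access; the count names mirror the columns documented later in the Kremlin schema.

```python
from typing import Callable, Optional


def save_first_available(
    renditions: list[str],
    fetch: Callable[[str], Optional[bytes]],
) -> Optional[bytes]:
    """Try renditions in the given (largest-first) order; return the first payload."""
    for url in renditions:
        payload = fetch(url)
        if payload:
            return payload
    return None


def harvest_images(
    images: list[list[str]],
    fetch: Callable[[str], Optional[bytes]],
) -> dict:
    """Collect payloads for one speech and record declared/saved/missing counts."""
    saved = []
    for renditions in images:
        payload = save_first_available(renditions, fetch)
        if payload is not None:
            saved.append(payload)
    declared = len(images)
    return {
        "payloads": saved,
        "declared_images_count": declared,
        "saved_images_count": len(saved),
        "missing_images_count": declared - len(saved),
    }


# Hypothetical fetcher: only these two renditions "exist" in this toy example.
_AVAILABLE = {"img1_small.jpg": b"a", "img2_large.jpg": b"b"}


def fake_fetch(url: str) -> Optional[bytes]:
    return _AVAILABLE.get(url)


result = harvest_images(
    [
        ["img1_large.jpg", "img1_small.jpg"],  # falls back to the smaller variant
        ["img2_large.jpg", "img2_small.jpg"],  # largest rendition succeeds
        ["img3_large.jpg"],                    # nothing downloadable
    ],
    fake_fetch,
)
```

Keeping both counts, rather than only the saved files, is what makes a shortfall visible after the fact.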

All crawlers are single\-threaded and use fixed headers, bounded retries, and randomized inter\-request delays to reduce load on the origin servers and to remain within expected norms of polite scraping\. Because the index already enumerates the target universe, the page stage can safely skip any speech that is already present in the corresponding site\-language CSV\. This minimizes redundant requests and allows clean resumption after interruptions\. For both sources \(Kremlin and MID\) we treat the numeric identifier embedded in each speech URL as the primary key within that site\. This identifier is shared across that site’s Russian and English mirrors, which allows us to link Russian and English versions of the same speech by a simple equality join on ID rather than using fuzzy matching on speech titles or dates\. Images are stored in per\-speech folders named by this identifier, and files within each folder follow a simple sequence\-based naming convention\. Given a single CSV row, the corresponding image paths are therefore direct and unambiguous\.
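A resumable page stage of this kind amounts to a set difference between the index and the speech-level CSV written so far. The sketch below is illustrative only: the helper name `pending_ids`, the in-memory example files, and the placeholder URLs are assumptions, and the `id` column name is borrowed from the Kremlin schema.

```python
import csv
import io


def pending_ids(index_csv, done_csv) -> list[str]:
    """Return index IDs not yet present in the speech-level CSV, in index order."""
    done = {row["id"] for row in csv.DictReader(done_csv)}
    seen = set()
    todo = []
    for row in csv.DictReader(index_csv):
        sid = row["id"]
        # Skip speeches already harvested and duplicate index entries.
        if sid not in done and sid not in seen:
            seen.add(sid)
            todo.append(sid)
    return todo


# Toy inputs standing in for the index CSV and a partially written speech CSV.
index_file = io.StringIO("id,url\n1185,u1\n1190,u2\n1190,u2\n1204,u3\n")
speech_file = io.StringIO("id,title\n1185,Example\n")
remaining = pending_ids(index_file, speech_file)
```

Because the index fixes the target universe up front, rerunning after an interruption only requests the IDs in `remaining`.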

We keep the text body as close to the source as possible\. The parser targets the main content region for each site and language \(Russian or English\) but otherwise avoids heavy boilerplate removal or aggressive normalization, preserving original punctuation, orthography, and capitalization in both Russian and English\. Where source pages contain empty bodies or unusually long speeches, we retain these edge cases in the CSV rather than filtering or truncating them, so that corpus coverage remains transparent\. For each site\-language pair, the final outputs consist of a single speech\-level CSV and an accompanying images root in which one subdirectory per speech identifier holds the associated image files\. Within each source \(Kremlin or MID\), Russian and English speech tables can be aligned exactly on ID, and the same identifier is used as the folder name for the corresponding per\-speech image directory\. Figure [2](https://arxiv.org/html/2605.15886#Sx2.F2) illustrates how the Russian\- and English\-language datasets \(texts and images\) are linked via this shared identifier scheme\.
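The equality join on ID and the derived image directory can be illustrated with the standard library alone. This is a hedged sketch: the column names `ID`, `title_ru`, and `title_en` and the `IMAGES_ROOT` placeholder follow the labels shown in Figure 2, while the function name `link_by_id` and the toy rows are hypothetical; the released CSV headers may differ.

```python
import csv
import io
from pathlib import Path

IMAGES_ROOT = Path("IMAGES_ROOT")  # placeholder for the images root directory


def link_by_id(ru_csv, en_csv):
    """Equality-join Russian and English rows on ID and attach the image folder."""
    ru = {row["ID"]: row for row in csv.DictReader(ru_csv)}
    en = {row["ID"]: row for row in csv.DictReader(en_csv)}
    linked = []
    for sid in sorted(ru.keys() & en.keys(), key=int):
        linked.append(
            {
                "ID": sid,
                "title_ru": ru[sid]["title_ru"],
                "title_en": en[sid]["title_en"],
                "image_dir": IMAGES_ROOT / sid,  # one folder per speech ID
            }
        )
    # Speeches published in Russian only have no English counterpart to join.
    ru_only = sorted(ru.keys() - en.keys(), key=int)
    return linked, ru_only


# Toy tables with made-up titles.
ru_file = io.StringIO("ID,title_ru\n1185,RU title A\n1190,RU title B\n")
en_file = io.StringIO("ID,title_en\n1185,EN title A\n")
linked, ru_only = link_by_id(ru_file, en_file)
```

Reporting the unmatched IDs separately is useful here because the corpora contain more Russian-language than English-language speeches.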

The above discussion outlines the shared, reproducible pipeline from indexing to page and image capture to unified outputs across all four corpora\. That being said, there are several unique steps associated with our Kremlin and MID webpages and corresponding speech and image content\. The following subsections describe these source\-specific details for the Kremlin and MID collections in turn\.

Figure 1: Two\-stage webscraping workflow\. For each source \(Kremlin, MID\) and language \(Russian, English\), an index builder first traverses the site listings and writes an index CSV of speech IDs and URLs\. A page fetcher\-parser then consumes this index, downloads each page, extracts structured text and metadata into a speech\-level CSV, and saves all associated images into per\-ID folders\. \[Diagram nodes: Site listings \(Kremlin / MID, RU & EN\); Index builder; Index CSV \(ID, URL\); Page fetcher & parser; Speech\-level CSV \(one row per speech\); Image folders \(one directory per ID\)\.\]

Figure 2: Cross\-lingual linkage within a source\. For each source \(Kremlin or MID\), Russian and English speech tables share the same numeric ID column, which serves as the primary key and is also used as the per\-speech image folder name\. Aligning Russian and English speeches, and linking them to their images, is therefore a simple equality join on ID, with no need for fuzzy matching on titles or dates\. \[Diagram nodes: Source\-EN CSV \(e\.g\. Kremlin\-EN\) with ID, date, title\_en, full\_text\_en, …; Source\-RU CSV \(e\.g\. Kremlin\-RU\) with ID, date, title\_ru, full\_text\_ru, …; Images root directory IMAGES\_ROOT/ID/ with sequence\-numbered files\. Example: for ID = 1185, EN and RU rows join on ID and images are stored in IMAGES\_ROOT/1185/\.\]

#### Webscraping the Kremlin Corpus

Complementing the shared pipeline outlined above, we use several Kremlin corpus\-specific steps to construct parallel corpora of Russian\- and English\-language presidential speeches for the Kremlin by webscraping the official Kremlin transcript archives\.

For Russian\-language content, we scraped the Kremlin transcript archive at kremlin\.ru/events/president/transcripts/; for English\-language content, we scraped the parallel archive at en\.kremlin\.ru/events/president/transcripts/\. In both cases, data collection followed the same two\-stage pipeline: \(i\) constructing a stable index of transcript identifiers and URLs, and \(ii\) harvesting speech\-level metadata, full text, and associated images for each indexed item\.

##### Index construction\.

For each language, we first crawl the Kremlin transcript listings and write an index CSV with the columns page number, id, and url\. Each index entry records the listing page number, a numeric transcript identifier, and the canonical absolute URL for the transcript\. The indexer uses a realistic browser user\-agent and Accept\-Language headers, enforces request timeouts, and implements conservative back\-off logic for HTTP 403/429/5xx error responses\. We retain only entries whose URLs match the expected transcript pattern, so that every downstream row in the final corpus can be keyed by a stable numeric identifier via the id column\.
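The filtering step can be sketched as a pattern match that both validates a URL and extracts its numeric identifier. Everything specific here is an assumption made for illustration: the regular expression presumes the identifier is the final path segment after the transcripts archive path, the column name `page_number` is a guess at how "page number" is spelled in the index, and the example IDs are made up. Link discovery from listing HTML is omitted.

```python
import csv
import io
import re

# Assumed URL shape: .../events/president/transcripts/<numeric id>
TRANSCRIPT_RE = re.compile(
    r"^https?://(?:en\.)?kremlin\.ru/events/president/transcripts/(\d+)/?$"
)


def build_index(listing_links):
    """Turn (page_number, url) pairs into de-duplicated index rows."""
    rows, seen = [], set()
    for page_number, url in listing_links:
        match = TRANSCRIPT_RE.match(url)
        if match is None:
            continue  # not a transcript URL; drop it
        sid = int(match.group(1))
        if sid in seen:
            continue  # listings can repeat entries across pages
        seen.add(sid)
        rows.append({"page_number": page_number, "id": sid, "url": url})
    return rows


def write_index(rows, handle):
    """Write index rows as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=["page_number", "id", "url"])
    writer.writeheader()
    writer.writerows(rows)


links = [
    (1, "http://kremlin.ru/events/president/transcripts/101"),
    (1, "http://kremlin.ru/events/president/transcripts/"),  # listing page itself
    (2, "http://en.kremlin.ru/events/president/transcripts/205"),
    (2, "http://kremlin.ru/events/president/transcripts/101"),  # duplicate
    (2, "http://kremlin.ru/events/president/news/300"),  # different section
]
rows = build_index(links)
buffer = io.StringIO()
write_index(rows, buffer)
```

In practice each language would get its own index file; the two hosts are mixed above only to exercise the pattern.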

##### HTTP session management and politeness\.

The transcript harvester reads the specific Kremlin \(English and Russian\) index file and iterates over the set of unique identifiers per language\. All HTTP requests are routed through a session wrapper that centralizes headers, timeouts, and pacing\. For each transcript we issue a lightweight HEAD request against the main transcript URL; if the server returns a 404 error, the identifier is marked as unavailable and skipped entirely\. For surviving IDs, GET requests are spaced by 9–10 seconds between transcript pages and photo gallery pages, with exponential backoff and \(for the Russian corpus\) optional proxy rotation when encountering temporary blocks\. We also check response headers and body size to ensure that only well\-formed HTML pages are processed\.
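The pacing and backoff behavior can be illustrated with a small wrapper in which the HTTP call and the sleep function are injected, so the control flow can be checked without touching the network. This is a sketch under assumptions, not the authors' session wrapper: the function name `polite_get`, the retry limit, and the 30-second base backoff are hypothetical, and treating 403/429/5xx as retryable is borrowed from the indexer description. Only the 9-10 second spacing and the 404 skip come from the text above.

```python
import random
from typing import Callable


def polite_get(
    url: str,
    fetch: Callable[[str], int],
    sleep: Callable[[float], None],
    max_retries: int = 3,        # assumed bound on retries
    base_backoff: float = 30.0,  # assumed base delay in seconds
) -> bool:
    """Return True on HTTP 200; pause 9-10 s before each request and back off
    exponentially after 403/429/5xx responses."""
    for attempt in range(max_retries + 1):
        sleep(random.uniform(9.0, 10.0))  # pacing between requests
        status = fetch(url)
        if status == 200:
            return True
        if status == 404:
            return False  # unavailable identifier: skip, do not retry
        if status in (403, 429) or 500 <= status < 600:
            if attempt < max_retries:
                sleep(base_backoff * 2 ** attempt)  # 30 s, 60 s, 120 s, ...
            continue
        return False  # any other status is treated as a hard failure
    return False


# Simulated server: two temporary blocks, then success.
responses = iter([429, 503, 200])
pauses = []
ok = polite_get(
    "https://example.org/page",  # placeholder URL
    lambda _url: next(responses),
    pauses.append,               # record pauses instead of sleeping
)
```

In a real run `fetch` would wrap a shared session's request call and `sleep` would be `time.sleep`; recording the pauses instead makes the schedule easy to inspect.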

##### Speech\-level metadata and text extraction\.

For each successfully retrieved transcript page, we parse a consistent set of speech\-level metadata and map these fields directly into the final Kremlin CSV schema\.

Table 1 documents the column names used in our processed Kremlin \(English and Russian\) metadata CSV files and their meanings\. List\-valued fields \(e\.g\., declared tags, speakers, image filepaths, image\-topic IDs\) are stored as serialized list strings \(e\.g\., \["…","…"\]\) and are blank when the information is unavailable\.

Language note: The *Kremlin English* metadata CSV does not contain all translation columns; fields ending in \_english appear only in the *Kremlin Russian* metadata CSV and store English translations of the corresponding Russian\-language variables\.

Probability storage note: In our processed files, curated\_topic\_probability is stored as a single scalar \(top\-1 confidence for the assigned text topic\), while curated\_image\_topic\_probabilities is a list with one scalar confidence value per image\.

Caption availability note: When images are present, image\-topic lists are aligned with stored\_image\_filepaths; image captions are extracted when available and may be missing for some images, resulting in occasional length mismatches between caption lists and the stored image list\.

Table 1: Kremlin speech\-level columns in the processed metadata CSV files\.

| Column | Description |
| --- | --- |
| id | Numeric transcript identifier associated with the speech page\. |
| url | Final resolved URL of the transcript page\. |
| title | Speech title/headline as displayed on the transcript page\. |
| title\_english | English translation of title \(Kremlin Russian CSV only\)\. |
| full\_text | Full extracted speech body text\. |
| full\_text\_english | English translation of full\_text \(Kremlin Russian CSV only\)\. |
| full\_text\_word\_count | Word count of the transcript text \(computed as the number of whitespace\-separated tokens from the extracted speech body\)\. |
| date | Human\-readable date string as displayed on the webpage\. |
| year | Calendar year extracted from the parsed date/time metadata when available\. |
| month | Month name \(stored as a full English month name, e\.g\., *January*\)\. |
| day | Day of month \(numeric\)\. |
| time | Clock time parsed from time metadata \(blank if not provided\)\. |
| location | Location string as provided on the transcript page \(raw/original form\)\. |
| location\_english | English translation of location \(Kremlin Russian CSV only\)\. |
| latitude | Latitude in decimal degrees obtained by geocoding location \(blank if unresolved\)\. |
| longitude | Longitude in decimal degrees obtained by geocoding location \(blank if unresolved\)\. |
| page\_summary | Short summary/lead text \(from meta description or page intro blocks when available\)\. |
| page\_summary\_english | English translation of page\_summary \(Kremlin Russian metadata CSV only\)\. |
| speakers | Extracted speaker name\(s\) associated with the transcript \(serialized list string\)\. |
| declared\_geography | Geography\-related tags declared on the page \(serialized list string\)\. |
| declared\_geography\_english | English translation of declared\_geography \(Kremlin Russian metadata CSV only\)\. |
| declared\_topics | Topic tags declared on the page \(serialized list string\)\. |
| declared\_topics\_english | English translation of declared\_topics \(Kremlin Russian metadata CSV only\)\. |
| declared\_persons | Person/entity tags declared on the page \(serialized list string\)\. |
| declared\_persons\_english | English translation of declared\_persons \(Kremlin Russian metadata CSV only\)\. |
| curated\_topic\_id | Final \(curated\) text\-topic identifier assigned to the transcript\. |
| curated\_text\_topic\_label | Human\-readable label for curated\_topic\_id\. |
| curated\_text\_topic\_group | Higher\-level group/category for the curated text topic\. |
| curated\_topic\_probability | Top\-1 text\-topic probability \(single scalar confidence value for the assigned curated\_topic\_id\)\. |
| stored\_image\_filepaths | Local filepaths of downloaded images linked to the transcript \(serialized list string\)\. |
| saved\_images\_count | Number of images successfully saved locally for the transcript\. |
| declared\_images\_count | Number of images declared on the transcript web page\. |
| missing\_images\_count | Number of missing images \(declared\_images\_count minus saved\_images\_count\)\. |
| image\_captions | Captions extracted for the transcript images in the original language when available \(serialized list string; may be missing for some images\)\. |
| image\_captions\_english | English translation of image\_captions \(Kremlin Russian metadata CSV only\)\. |
| curated\_image\_topic\_ids | Assigned image\-topic IDs \(serialized list; aligned with stored\_image\_filepaths when images are present\)\. |
| curated\_image\_topic\_labels | Human\-readable labels for the image topics \(serialized list\)\. |
| curated\_image\_group\_names | Higher\-level group names for the image topics \(serialized list\)\. |
| curated\_image\_topic\_probabilities | Per\-image top\-1 probabilities \(serialized list of floats; one probability score per image, aligned with stored\_image\_filepaths\)\. |

##### Image discovery, de\-duplication, and completeness\.

A core goal of the Kremlin corpus data collection stage is to pair each speech with the set of photographs displayed alongside it on the official site\. To achieve this we use a combination of HTML pattern matching, content\-based de\-duplication, and cross\-checking against the site’s own photo counters\.

For each transcript we inspect both the main transcript page and, when present, the corresponding photo gallery at the derived /photos URL\. On each page we search for images using several strategies: \(i\) dedicated Kremlin slideshow containers, \(ii\) standard <img\> tags \(including lazy\-loading attributes and srcset variants\), \(iii\) images embedded in <picture\> and <noscript\> blocks, \(iv\) direct links whose href targets an image file, \(v\) inline background images declared in style attributes, and \(vi\) hero images exposed via Twitter/OpenGraph meta tags\. For each candidate, we normalize its URL relative to the page base and construct a small set of plausible size variants \(favoring high\-resolution “big2x” or “big” versions, then smaller renditions, and finally the original URL as a fallback\)\. URLs that clearly point to thumbnails \(e\.g\., those containing /thumb or /preview\) are discarded unless they are the only available version of the image\.

We then download the binary content for each surviving candidate and compute a SHA\-256 hash\[[43](https://arxiv.org/html/2605.15886#bib.bib218)\]\. Within a transcript\-level image directory we treat matching hashes as duplicates and retain only a single copy\. This prevents multiple size variants or repeated gallery images from contaminating the corpus\. Successful downloads are saved under language\-specific image roots \(e\.g\., kremlin\_russian\_images/\) in subdirectories named by id\. Within each subdirectory, filenames follow a consistent pattern of the form <id\>\_<seq\>\.ext, where <id\> is the numeric transcript identifier associated with the speech page and seq is a sequential counter\. The final CSV column stored\_image\_filepaths stores a JSON\-encoded list of these local paths, while image\_captions stores the corresponding captions extracted from surrounding <figcaption\> elements, caption blocks, or, when necessary, image alt/title attributes\.
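
A minimal sketch of the content\-based de\-duplication step is shown below, using Python's standard `hashlib`. The function name `dedupe_and_name` is hypothetical, and the zero\-based start of the sequence counter is an assumption for the Kremlin corpus, which the text does not specify.

```python
import hashlib
from typing import Iterable, List, Tuple


def dedupe_and_name(
    transcript_id: int, downloads: Iterable[Tuple[bytes, str]]
) -> List[Tuple[str, bytes]]:
    """Keep one copy per SHA-256 digest and assign <id>_<seq>.<ext> names.

    `downloads` yields (content, extension) pairs in discovery order.
    """
    seen = set()
    kept: List[Tuple[str, bytes]] = []
    for content, ext in downloads:
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen:
            continue  # identical bytes already stored for this transcript
        seen.add(digest)
        # Sequence counter starts at 0 here (an assumption).
        kept.append((f"{transcript_id}_{len(kept)}.{ext}", content))
    return kept
```

Hashing the downloaded bytes rather than comparing URLs catches the case where the same photograph is served from several different addresses; size variants with different bytes are handled earlier, when a single preferred rendition is chosen per image.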

The Kremlin interface displays an explicit photo count for each speech \(e\.g\., in a photo tab labeled with a number\)\. Whenever such a counter is present, we parse it into the declared\_photos column\. For all Kremlin English and Russian speeches in our final corpus, declared\_photos is non\-empty, and missing\_photos\_count is identically zero, indicating that our harvested image counts exactly match the Kremlin interface wherever the site exposes a photo counter\.

##### Final coverage and descriptive statistics\.

After scraping, cleaning, and merging, the final Kremlin English file contains 10,553 rows \(one per speech\), and the final Kremlin Russian file contains 13,340 rows\. Both archives cover the same temporal span, from 31 December 1999 through 20 September 2025\. All rows in both files have parseable dates: in the Russian corpus, 13,340/13,340 rows have valid date values, and in the English corpus, 10,553/10,553 rows have valid date values as well\.

Text coverage is nearly complete\. In the Russian file only six rows have an empty full\_text \(and hence word\_count = 0\), corresponding to purely image\-based notices; the remaining 13,334 rows contain non\-empty text\. In the English file only four rows have empty full\_text\. Across non\-empty texts, Russian speeches have a maximum word\_count of 33,352, a mean of 1,904\.67 words, and a median of 763\. English speeches reach a maximum of 39,898 words, with a mean of 1,453\.19 and a median of 806 words\. In both languages, therefore, the corpus contains a mix of short statements and very long, policy\-heavy addresses\.

Image coverage is also substantial\. In the Russian file, the total number of saved images is 44,779, with a minimum of 0, a maximum of 104, a mean of 3\.36, and a median of 2 images per speech\. A total of 3,605 Russian speeches have no associated images \(images\_count = 0\), while 9,735 speeches have at least one local image\. In the English file, images\_count is non\-null for all 10,553 rows; the total number of saved images is 38,637, with a minimum of 0, a maximum of 104, a mean of 3\.66, and a median of 2 images per speech\. Here 3,438 speeches lack images and 7,115 speeches have one or more images\. In both languages, declared\_photos is non\-empty for every row and missing\_photos\_count is identically zero, confirming that the final image counts track the Kremlin’s declared photo numbers exactly\.

Kremlin transcript pages typically include an explicit event location field, which we extract and store in raw/original form as location\. In the processed Russian Kremlin file, 12,684 of 13,340 rows have a non\-empty location string \(656 rows are empty\)\. In the processed English Kremlin file, 9,830 of 10,553 rows have a non\-empty location string \(723 rows are empty\)\. For downstream use, we also derive latitude and longitude coordinates from available location strings using the same geolocation transformation pipeline applied across both the Kremlin and MID corpora; full details, validation, and edge cases are provided in Section [0\.2\.3](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS3) \(Location Extraction and Geocoding Section further below\)\.

Because the Kremlin reuses the same numeric identifier for parallel Russian\- and English\-language versions of a speech, we can align the two files exactly on id\. The Russian file contains 13,340 unique identifiers and the English file contains 10,553; 10,094 identifiers appear in both files\. This leaves 3,246 Russian\-only speeches and 459 English\-only speeches\. Put differently, 75\.7% of Russian IDs have a matching English row, while 95\.7% of English IDs have a matching Russian row\.
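
The overlap arithmetic above follows directly from set operations on the two id columns. The sketch below is illustrative only: `alignment_summary` is a hypothetical helper, and the usage example builds synthetic identifier ranges that merely reproduce the reported counts rather than using the real Kremlin identifiers.

```python
from typing import Dict, Set


def alignment_summary(ru_ids: Set[int], en_ids: Set[int]) -> Dict[str, float]:
    """Summarize how two id sets overlap."""
    shared = ru_ids & en_ids
    return {
        "shared": len(shared),
        "ru_only": len(ru_ids - en_ids),
        "en_only": len(en_ids - ru_ids),
        "ru_matched_pct": round(100 * len(shared) / len(ru_ids), 1),
        "en_matched_pct": round(100 * len(shared) / len(en_ids), 1),
    }


# Synthetic ids chosen only to match the reported sizes
# (13,340 Russian, 10,553 English, 10,094 shared).
ru = set(range(13_340))
en = set(range(3_246, 3_246 + 10_553))
summary = alignment_summary(ru, en)
```

With these synthetic sets, `summary` reports 10,094 shared identifiers, 3,246 Russian\-only, 459 English\-only, and matched shares of 75\.7% and 95\.7%, in line with the figures in the text.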

![Refer to caption](https://arxiv.org/html/2605.15886v1/kremlin_totals.png) \(a\) Total number of Kremlin speeches and harvested local images in the Russian\- and English\-language corpora\.
![Refer to caption](https://arxiv.org/html/2605.15886v1/kremlin_wordcount.png) \(b\) Distribution of Kremlin speech length \(word count\) for Russian\- and English\-language transcripts \(log\-scaled x\-axis\)\.
![Refer to caption](https://arxiv.org/html/2605.15886v1/kremlin_images_zero_vs_nonzero.png) \(c\) Share of Kremlin speeches with at least one local image versus no images, separately for the Russian\- and English\-language corpora\.
![Refer to caption](https://arxiv.org/html/2605.15886v1/kremlin_images_per_speech.png) \(d\) Distribution of images per Kremlin speech in the Russian\- and English\-language corpora \(image counts $\geq 25$ are grouped into a single 25\+ bin\)\.

Figure 3: Kremlin corpus: coverage and basic descriptive statistics for speeches and images across Russian\- and English\-language versions of the site\.

Figures [3\(a\)](https://arxiv.org/html/2605.15886#Sx2.F3.sf1)–[3\(d\)](https://arxiv.org/html/2605.15886#Sx2.F3.sf4) summarize the coverage and basic structure of our scraped Kremlin speech corpus\. Figure [3\(a\)](https://arxiv.org/html/2605.15886#Sx2.F3.sf1) reports the total number of speeches and harvested local images in each language\. The Russian\-language archive contains 13,340 speeches and 44,779 images, while the English\-language archive contains 10,553 speeches and 38,637 images\. These totals highlight both the overall size of the corpus and the fact that the Russian site is somewhat more complete than its English counterpart, especially for earlier years\.

Figure [3\(b\)](https://arxiv.org/html/2605.15886#Sx2.F3.sf2) plots the distribution of word\_count for Russian\- and English\-language transcripts\. We use a logarithmic scale on the x\-axis because speech lengths are highly skewed and span more than two orders of magnitude: transcripts range from very short announcements to nearly 40,000\-word events\. On a linear scale, the dense middle of the distribution would be compressed and the long right tail would be almost invisible\. The two language distributions overlap heavily, with most speeches falling between roughly $10^{2}$ and $10^{4}$ words, but the Russian distribution exhibits a heavier right tail\. This indicates a larger number of very long Russian transcripts, consistent with the Russian site hosting additional full\-length meetings, press conferences, and Q&A sessions that are only partially translated, or not translated at all, on the English site\.

Figure [3\(c\)](https://arxiv.org/html/2605.15886#Sx2.F3.sf3) focuses on image coverage\. Approximately 73% of Russian\-language speeches and 67% of English\-language speeches contain at least one locally stored image, with the remaining speeches consisting of text only\. This confirms that images are pervasive but not universal in the Kremlin’s online communications\. Importantly, for all pages where the source site reports an explicit image count, our stored image counts match exactly, indicating complete recovery of images for these image\-bearing pages in both languages\.

Finally, Figure [3\(d\)](https://arxiv.org/html/2605.15886#Sx2.F3.sf4) reports the full distribution of images\_count per speech \(for legibility, extremely image\-rich pages with $\geq 25$ images are pooled into a single 25\+ bin\)\. Most speeches in both languages contain between zero and four images, with a long but thin tail of speeches that include ten or more images \(for example, extended photo galleries associated with major public events\)\. The similarity of the Russian and English distributions suggests that, conditional on a speech having at least one image, the Kremlin tends to mirror the basic image layout across language versions\. Detailed variable descriptions and additional descriptive statistics are provided in the Data Records Section\.

#### Webscraping the MID\.RU Corpus

We also adapt the shared web\-scraping pipeline described above to construct parallel corpora of Russian\- and English\-language MID speeches from the official MID website\. For Russian MID speeches, we target the main ministerial speeches archive at mid\.ru/ru/press\_service/minister\_speeches/, and for English we target its English\-language counterpart at mid\.ru/en/press\_service/minister\_speeches/\. As with the Kremlin corpus, data collection proceeds in two stages: \(i\) building a stable index of transcript identifiers and URLs for each language, and \(ii\) harvesting speech\-level metadata, text, images, and locations for every indexed transcript\.

##### Index construction\.

For each language, we crawl the paginated listing of ministerial speeches by iterating over the ?PAGEN\_1=<page\> parameter\. On each listing page we identify links to individual speeches using language\-specific CSS selectors and a strict URL pattern that extracts the numeric identifier from paths of the form /press\_service/minister\_speeches/<id\>/ \(or the corresponding English\-language path with /en/\)\. For each match we record the listing page number, the numeric ID, and the canonical absolute URL in an index CSV with columns page\_num, id, and url\. IDs are de\-duplicated across pages, and the crawler stops after two consecutive listing pages contain no new IDs\. This ensures that we do not request empty or outdated archive pages\.
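
The stopping rule and ID extraction can be sketched as below. This is an illustration under stated assumptions rather than the authors' crawler: `build_index` and the injected `fetch_listing` callable are hypothetical, pagination is assumed to start at page 1, links are matched with a plain regular expression instead of CSS selectors, and the base URL passed in is a placeholder.

```python
import re
from typing import Callable, Dict, List

# Strict pattern for the numeric identifier in speech URLs.
SPEECH_ID = re.compile(r"/press_service/minister_speeches/(\d+)/")


def build_index(
    fetch_listing: Callable[[int], str], base: str, max_empty: int = 2
) -> List[Dict[str, object]]:
    """Crawl listing pages until `max_empty` consecutive pages add no new IDs."""
    rows: List[Dict[str, object]] = []
    seen = set()
    empty_streak = 0
    page = 1  # assumed first page number
    while empty_streak < max_empty:
        html = fetch_listing(page)
        # dict.fromkeys removes within-page repeats while keeping order.
        found = dict.fromkeys(SPEECH_ID.findall(html))
        new_ids = [i for i in found if i not in seen]
        empty_streak = 0 if new_ids else empty_streak + 1
        for speech_id in new_ids:
            seen.add(speech_id)
            rows.append({
                "page_num": page,
                "id": int(speech_id),
                "url": f"{base}/press_service/minister_speeches/{speech_id}/",
            })
        page += 1
    return rows
```

Each returned row maps onto the index CSV columns page\_num, id, and url named in the text.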

##### HTTP session management and politeness\.

The MID harvester reads these index files and iterates over the set of unique IDs in each language \(i\.e\., English and Russian\)\. All requests are issued through a shared requests \[[55](https://arxiv.org/html/2605.15886#bib.bib195)\] object that centralizes headers, timeouts, and pacing\. To avoid overloading the MID servers, we enforce a conservative delay of five seconds between *all* HTTP requests, including both transcript pages and image downloads, and apply standard timeouts and error handling\. Responses are verified for status codes and basic content sanity before further processing\. As with the Kremlin scraper, we treat any speech page that lacks a meaningful combination of title, text, or image\(s\) as an error and halt processing, so that every row in the final processed MID corpora corresponds to a page with substantive content\.
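
One simple way to enforce a fixed minimum gap between all requests is a small pacing helper that is called before every HTTP call. The class below is a hypothetical sketch, not the authors' implementation; the clock and sleep functions are injectable so the behavior can be checked without waiting.

```python
import time
from typing import Callable, Optional


class Pacer:
    """Enforce a minimum gap (in seconds) between consecutive requests."""

    def __init__(
        self,
        min_gap: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_gap = min_gap
        self.clock = clock
        self.sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Block until at least `min_gap` seconds have passed since the last call."""
        now = self.clock()
        if self._last is not None:
            remaining = self.min_gap - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self._last = now
```

Calling `wait()` immediately before each transcript or image request yields the "delay between all requests" behavior described above, regardless of how long parsing takes in between.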

##### Speech\-level metadata and text extraction\.

For each successfully retrieved MID page, we parse a consistent set of speech\-level metadata and map these fields directly into the final MID CSV schema\. As in the Kremlin data, list\-valued fields \(e\.g\., stored image paths, captions, topic IDs, and probability vectors\) are stored as serialized list strings \(e\.g\., \["…","…"\]\) and are left blank when the corresponding information is unavailable for a given document\.

Language note: The *MID English* CSV and *MID Russian* CSV share the same core schema\. The only translation\-specific column in the MID files is image\_captions\_english, which appears *only* in the *MID Russian* CSV and stores English translations of the corresponding Russian\-language image\_captions field\.

Probability storage note: In our processed files, curated\_topic\_probability is stored as a single scalar \(top\-1 confidence for the assigned text topic\), while curated\_image\_topic\_probabilities is a list with one scalar confidence value per image\.

Caption availability note: When images are present, image\-topic lists are aligned with stored\_image\_filepaths; image captions are extracted when available and may be missing for some images, resulting in occasional length mismatches between caption lists and the stored image list\.

Table 2: MID data dictionary \(speech\-level CSVs\)\.

| Column | Description |
| --- | --- |
| id | Numeric document identifier associated with the MID page\. |
| url | Final resolved URL of the MID document page\. |
| full\_text | Full extracted document body text \(English in MID EN; original Cyrillic in MID RU→EN\)\. |
| full\_text\_english | English translation of full\_text \(MID RU→EN CSV only; aligned row\-for\-row with the original Russian text\)\. |
| full\_text\_word\_count | Word count of the document body text \(computed as the number of whitespace\-separated tokens from the extracted document body\)\. |
| date | Human\-readable publication date string as displayed on the webpage\. |
| year | Calendar year extracted from the parsed date/time metadata when available\. |
| month | Month name \(stored as a full English month name, e\.g\., *January*\)\. |
| day | Day of month \(numeric\)\. |
| time | Clock time parsed from page metadata \(blank if not provided\)\. |
| location | Final location string for the document \(blank if no clear event location can be recovered\)\. |
| latitude | Latitude in decimal degrees obtained by geocoding location \(blank/NaN if unresolved\)\. |
| longitude | Longitude in decimal degrees obtained by geocoding location \(blank/NaN if unresolved\)\. |
| speakers | Extracted speaker name\(s\) associated with the document \(serialized list string\)\. |
| curated\_topic\_id | Final \(curated\) text\-topic identifier assigned to the document from the MID topic space \($K=32$; integer in $\{0,\dots,31\}$\)\. |
| curated\_text\_topic\_label | Human\-readable label for curated\_topic\_id\. |
| curated\_text\_topic\_group | Higher\-level group/category for the curated text topic\. |
| curated\_topic\_probability | Top\-1 text\-topic probability \(single scalar confidence value for the assigned curated\_topic\_id\)\. |
| stored\_image\_filepaths | Local filepaths of downloaded images linked to the document \(serialized list string; stored as relative paths with respect to the MID corpus image root directory in the data release\)\. |
| saved\_images\_count | Number of images successfully saved locally for the document\. |
| declared\_images\_count | Number of images declared on the MID webpage when available \(blank if not provided\)\. |
| missing\_images\_count | Number of missing images \(declared\_images\_count minus saved\_images\_count\)\. |
| image\_captions | Captions extracted for the document images in the original language when available \(serialized list string; one caption per image, aligned with stored\_image\_filepaths\)\. |
| image\_captions\_english | English translation of image\_captions \(MID RU→EN CSV only; serialized list aligned with stored\_image\_filepaths\)\. |
| curated\_image\_topic\_ids | Assigned image\-topic IDs \(serialized list; aligned with stored\_image\_filepaths when images are present\)\. |
| curated\_image\_topic\_labels | Human\-readable labels for the image topics \(serialized list; aligned with images\)\. |
| curated\_image\_group\_names | Higher\-level group names for the image topics \(serialized list; aligned with images\)\. |
| curated\_image\_topic\_probabilities | Per\-image top\-1 probabilities \(serialized list of floats; one probability score per image, aligned with stored\_image\_filepaths\)\. |

##### Image discovery and photo counters\.

The design of the MID’s ministerial speeches section differs from the Kremlin site in that images associated with a MID speech are concentrated in a dedicated photo\-album widget\. For each MID speech we therefore restrict image harvesting to the “Photo album” or “Additional materials” area, identified by the \#photo\-slider \.photo\-slider\_\_list container\. Within this slider we enumerate all <li\> elements, extract the associated <img\> tags, and resolve their image URLs relative to the page base\. For each candidate we record the best available source URL \(using src, data\-src, or the first entry in srcset\) and the corresponding caption, taken from the image’s alt attribute where available\.

Images are downloaded to language\-specific directories \(e\.g\., mid\_russian\_scraped\_images/ and mid\_english\_scraped\_images/\) in subfolders named by id\. Within each subfolder, filenames follow a simple pattern <id\>\_<seq\>\.<ext\>, where seq is a zero\-based counter and ext is the original file extension when present \(otherwise \.jpg\)\. The final processed MID CSVs store these paths as JSON\-encoded lists in the stored\_image\_filepaths column and the associated captions in image\_captions\. To avoid trivial duplicates we de\-duplicate images within each speech by the basename of their source URL\.
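
The basename\-based de\-duplication and filename pattern can be sketched with the standard library alone. `plan_mid_downloads` is a hypothetical helper; taking the basename from the URL path \(ignoring any query string\) is an assumption about how "basename of the source URL" is computed.

```python
import posixpath
from typing import List, Tuple
from urllib.parse import urlsplit


def plan_mid_downloads(speech_id: int, image_urls: List[str]) -> List[Tuple[str, str]]:
    """Return (source_url, local_filename) pairs, one per unique URL basename."""
    seen = set()
    plan: List[Tuple[str, str]] = []
    for url in image_urls:
        # Basename of the URL path; the query string is ignored (assumption).
        basename = posixpath.basename(urlsplit(url).path)
        if basename in seen:
            continue
        seen.add(basename)
        ext = posixpath.splitext(basename)[1] or ".jpg"  # default when no extension
        plan.append((url, f"{speech_id}_{len(plan)}{ext}"))  # zero-based counter
    return plan
```

Unlike the hash\-based approach used for the Kremlin corpus, this check needs no download before a duplicate is rejected, at the cost of missing identical images served under different filenames.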

As on the Kremlin site, mid\.ru exposes a per\-speech photo counter in the photo\-album interface\. We parse this counter into declared\_photos, falling back to the number of images in the slider list when the counter text is not present\. The realized number of downloaded images for each speech is recorded in images\_count, and we store the difference between the declared and realized counts in missing\_photos\_count\. In the final MID Russian and English processed files, declared\_photos is non\-empty for every row and missing\_photos\_count is identically zero, indicating that for all speeches the number of locally stored images exactly matches the mid\.ru photo counter\.

##### Location recovery and geocoding\.

Because MID\.RU does not consistently report venues in a dedicated structured field, we recover a usable location string using a conservative multi\-stage pipeline that preserves observed locations when present and backfills only when missing\. In brief, we \(i\) extract locations directly from MID page headers/titles when explicitly stated, \(ii\) apply lightweight rule\-based parsing and named\-entity recognition on the English representation of the record \(title and full text\) for remaining blanks, and \(iii\) run a final backfill pass using Anthropic Claude 3 Haiku \(claude\-3\-haiku\-20240307\) to return a single location phrase or UNKNOWN when no venue is stated\. We then geocode unique non\-empty locations using a two\-stage cascade \(Nominatim first, ArcGIS fallback\) implemented in geopy, storing latitude and longitude when resolution succeeds and leaving coordinates missing otherwise\. Full implementation details, edge cases, and additional validation checks are reported in Section [0\.2\.3](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS3) \(Location Extraction and Geocoding Section further below\)\.
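
The two\-stage geocoding cascade reduces to "try the primary service, fall back only on failure, and geocode each unique string once". The sketch below shows that control flow with the geocoders passed in as plain callables; `geocode_cascade` and the `Geocoder` type are hypothetical names. In a geopy\-based setup, the two callables could wrap the `geocode` methods of `geopy.geocoders.Nominatim` and `geopy.geocoders.ArcGIS`, but that wiring is an assumption and is not shown.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Coords = Tuple[float, float]  # (latitude, longitude)
Geocoder = Callable[[str], Optional[Coords]]


def geocode_cascade(
    locations: Iterable[str], primary: Geocoder, fallback: Geocoder
) -> Dict[str, Optional[Coords]]:
    """Geocode each unique non-empty location, trying `primary` before `fallback`."""
    results: Dict[str, Optional[Coords]] = {}
    for loc in locations:
        loc = loc.strip()
        if not loc or loc in results:
            continue  # skip blanks and strings already resolved
        coords = primary(loc)
        if coords is None:
            coords = fallback(loc)
        results[loc] = coords  # stays None when both services fail
    return results
```

Keying the results by location string means repeated venues cost a single lookup, and unresolved strings remain explicitly missing rather than being guessed.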

##### Final coverage and descriptive statistics\.

After scraping, cleaning, and merging, the final Russian MID corpus contains 6,056 speeches and the final English MID corpus contains 5,057 speeches\. Together these files span the period from 18 March 2004 through early October 2025: Russian\-language speeches run from 18 March 2004 to 9 October 2025, while English\-language speeches extend from 18 March 2004 to 7 October 2025\. All rows in both files have parseable dates\.

Text coverage in MID is high but not quite complete\. In the Russian MID file only seven rows have an empty full\_text \(and hence word\_count = 0\); the remaining 6,049 rows contain non\-empty text\. In the English MID file only one row has empty full\_text\. Among speeches with non\-empty text, Russian MID transcripts have a maximum word\_count of 16,834, a mean of 1,099\.41 words, and a median of 612\.5, while English MID transcripts reach a maximum of 19,895 words, with a mean of 1,392\.26 and a median of 786 words\. Thus, compared to the Kremlin corpora, MID speeches are somewhat shorter on average, and the English MID texts tend to be slightly longer than their Russian counterparts, potentially reflecting both translation effects and differences in how the Ministry edits its Russian and English releases\.

Image coverage is substantial but somewhat lighter than for the Kremlin corpora\. In the Russian MID file, images\_count is non\-null for all 6,056 rows, with a total of 4,498 saved images\. The minimum number of images per speech is 0 and the maximum is 19; the mean number of images per speech is 0\.74 and the median is 1\.0\. A total of 2,150 Russian MID speeches \(35\.5%\) have no associated images, while 3,906 speeches \(64\.5%\) have at least one local image\. In the English MID file, images\_count is again non\-null for all 5,057 rows, with 4,145 total images; here the minimum is 0, the maximum is 20, the mean is 0\.82, and the median is 1\.0 image per speech\. Among English speeches, 1,466 \(29\.0%\) lack images and 3,591 \(71\.0%\) include at least one local image\. As noted above, declared\_photos is non\-empty and missing\_photos\_count is zero for every MID speech, indicating that our harvested image counts exactly match the MID interface wherever mid\.ru exposes a photo counter\.

![Refer to caption](https://arxiv.org/html/2605.15886v1/mid_totals.png) \(a\) Total number of MID\.RU speeches and harvested local images in the Russian\- and English\-language corpora\.
![Refer to caption](https://arxiv.org/html/2605.15886v1/mid_wordcount.png) \(b\) Distribution of MID\.RU speech length \(word\_count\) for Russian\- and English\-language transcripts \(log\-scaled x\-axis\)\.
![Refer to caption](https://arxiv.org/html/2605.15886v1/mid_images_zero_vs_nonzero.png) \(c\) Share of MID\.RU speeches with at least one local image versus no images, separately for the Russian\- and English\-language corpora\.
![Refer to caption](https://arxiv.org/html/2605.15886v1/mid_images_per_speech.png) \(d\) Distribution of images per MID\.RU speech in the Russian\- and English\-language corpora \(image counts $\geq 10$ are grouped into a single 10\+ bin\)\.

Figure 4: MID\.RU corpus: coverage and basic descriptive statistics for speeches and images across Russian\- and English\-language versions of the site\.

Figures [4\(a\)](https://arxiv.org/html/2605.15886#Sx2.F4.sf1)–[4\(d\)](https://arxiv.org/html/2605.15886#Sx2.F4.sf4) summarize these coverage patterns for the MID corpora\. Figure [4\(a\)](https://arxiv.org/html/2605.15886#Sx2.F4.sf1) reports the total number of speeches and harvested local images in each language\. The Russian MID archive contains 6,056 speeches and 4,498 images, while the English archive contains 5,057 speeches and 4,145 images\. These totals highlight both the smaller scale of the MID corpora relative to the Kremlin corpora and the fact that, for MID, the number of English\-language images is quite close to the number of Russian\-language images despite the smaller number of English speeches\.

Figure [4\(b\)](https://arxiv.org/html/2605.15886#Sx2.F4.sf2) plots the distribution of word\_count for Russian\- and English\-language MID transcripts on a logarithmic scale\. As with the Kremlin histograms, the log scale makes the long right tail of very long speeches visible while preserving detail in the denser middle of the distribution\. Most MID speeches fall between roughly $10^{2}$ and $10^{4}$ words\. The English distribution is slightly shifted to the right relative to the Russian distribution, reflecting the higher mean and median word counts in English; nonetheless, the two language distributions overlap heavily across the main range of speech lengths\.

Figure [4\(c\)](https://arxiv.org/html/2605.15886#Sx2.F4.sf3) focuses on image coverage\. Roughly 64\.5% of Russian MID speeches and 71\.0% of English MID speeches contain at least one local image, with the remainder consisting of text\-only releases\. This pattern suggests that within the mid\.ru archives, images are common but not universal, and that the English pages are slightly more likely than the Russian pages to include an accompanying photo album\.

Finally, Figure [4\(d\)](https://arxiv.org/html/2605.15886#Sx2.F4.sf4) shows the full distribution of images\_count per speech, pooling all speeches with ten or more images into a single 10\+ bin for legibility\. In both languages most speeches contain either zero or one image, and only a small minority of speeches host more than three images\. Compared to the Kremlin distributions, the MID image counts are more tightly concentrated near zero and one, with very few large photo galleries\. Together, these figures illustrate that the MID corpora provide broad longitudinal coverage of ministerial speeches with moderate but systematic photographic documentation, and suggest that the basic structure of speech lengths and image counts is similar across Russian and English versions of the site\.

### Additional Variables

#### 0\.2\.1 Additional Extracted or Transformed Variables

Beyond the core scraped text and page metadata, our final speech\-level CSVs also include additional variables derived from the raw HTML and downstream processing pipelines \(translation, location recovery, geocoding, and topic modeling\)\. These variables are constructed in a harmonized way across all four corpora, with certain English\-parallel fields present only for the Russian→English translation files, as noted below\.

Text fields\. The title column stores the page headline in its original language \(English for the EN corpora; Russian for the RU corpora\)\. For the Russian corpora we additionally provide an Argos Translate \[[62](https://arxiv.org/html/2605.15886#bib.bib223)\] English rendering in title\_english\. The main body text is stored as full\_text in the original language for all corpora; for the Russian files we also provide full\_text\_english, generated via our chunked Argos Translate \[[62](https://arxiv.org/html/2605.15886#bib.bib223)\] pipeline \(long Russian texts are split into $\leq$ 4,000\-character segments, translated segment\-by\-segment with basic quality checks, and concatenated back into a single English string\)\. These translated fields are strictly row\-aligned with the original Russian: full\_text and full\_text\_english always refer to the same speech\-level document\.
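
The chunk\-translate\-concatenate pattern can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, splitting at the last space before the limit and re\-joining with single spaces are assumptions, the quality checks are omitted, and the translator is passed in as a generic callable rather than an actual Argos Translate call.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int = 4000) -> List[str]:
    """Split text into segments of at most `max_chars`, preferring space breaks."""
    chunks: List[str] = []
    while len(text) > max_chars:
        # Last space at or before the limit, so the segment fits.
        cut = text.rfind(" ", 0, max_chars + 1)
        if cut <= 0:
            cut = max_chars  # no usable space: hard split
        chunks.append(text[:cut])
        text = text[cut:].lstrip(" ")
    if text:
        chunks.append(text)
    return chunks


def translate_long(
    text: str, translate: Callable[[str], str], max_chars: int = 4000
) -> str:
    """Translate segment by segment and join the results into one string."""
    return " ".join(translate(chunk) for chunk in split_into_chunks(text, max_chars))
```

Because each row is translated independently and reassembled into one string, the output stays aligned one\-to\-one with the source row, which matches the row\-alignment property stated above.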

Dates, times, and derived calendar variables\. The original publication date and time are stored in date and time, parsed from the source site HTML\. From date we derive integer calendar variables year, month, and day \(uniformly defined across corpora as four\-digit year, month 1–12, day 1–31\)\. The time field preserves the reported clock time \(when present\) as a simple string\. These derived variables support time\-series and panel construction without requiring users to re\-parse the original date formats\.

Locations, English location labels, and geocoordinates\. Each CSV includes an event\-location string in location, stored as a cleaned human\-readable value \(light whitespace and punctuation normalization\)\. For the Kremlin corpora, location is scraped directly from the source site \(we do not infer missing locations\)\. For the Kremlin Russian→English corpus, we additionally provide an English\-rendered location string in location\_english while retaining the Russian original in location\. For the MID corpora, many records lack an explicit location on the source site; we therefore apply a conservative recovery procedure that fills location only when the title and/or body text provide clear evidence of a venue, otherwise leaving the field blank \(such that no location is invented without textual support\)\. Approximate coordinates are provided in latitude and longitude\. Full details of MID location recovery, Kremlin handling, and the geocoding cascade are provided in Section [0\.2\.3](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS3) \(Location Extraction and Geocoding Section further below\)\.

Speaker names\. We provide a speech\-level `speakers` field constructed via corpus\-specific extraction procedures tailored to how the Kremlin and MID sites encode speaker cues \(structured HTML labels versus transcript\-like prefixes\)\. We then normalize extracted labels into canonical person names for analysis\. Full extraction, normalization, and verification details are provided in Section [0\.2\.4](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS4)\.

Document length and media count variables\.We include simple length and count variables derived from the scraped page content:

- `full_text_word_count`: an integer count of whitespace\-delimited tokens in the cleaned `full_text` field, providing a crude but useful measure of speech length for filtering and modeling\.
- `saved_images_count`: the number of images successfully downloaded and stored for the speech\.
- `declared_images_count`: for the Kremlin corpora, the number of photos the source page claims to contain \(a site\-provided count rather than a crawler\-derived count\)\.
- `missing_images_count`: defined as `declared_images_count` minus `saved_images_count`\. This quantity is zero when all site\-declared images were successfully scraped and positive when some declared images could not be downloaded \(e\.g\., broken links or access restrictions\)\. It is left missing when no declared image count is available\.
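
The definition of `missing_images_count` can be written as a one\-line rule\. The sketch below is illustrative only; representing a missing declared count as `None` is an assumption about how the value is encoded after loading\.

```python
from typing import Optional


def missing_images_count(declared: Optional[int], saved: int) -> Optional[int]:
    """Return declared minus saved, or None when no declared count exists."""
    if declared is None:
        return None  # stays missing, as described for rows without a site count
    return declared - saved
```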

Stored image paths and captions\.For every speech we provide list\-valued columns summarizing its associated images:

- `stored_image_filepaths`: a serialized list of relative filepaths for each image successfully downloaded for that speech, linking the speech\-level CSV to the underlying image files used in downstream multimodal processing\.
- `image_captions`: a parallel list of captions aligned with `stored_image_filepaths`\. Captions are scraped from the HTML \(e\.g\., figure captions or alt text\) and lightly cleaned; when no caption is available, the corresponding entry is left empty\.
- `image_captions_english` \(Russian→English corpora only\): an English\-rendered caption list aligned to `image_captions`, produced via the same translation approach used for other Russian→English metadata fields\.

Page summaries \(Kremlin corpora\)\. For the Kremlin corpora we include a short page\-level summary field in `page_summary`\. For the Kremlin Russian→English corpus we additionally provide an English\-rendered version in `page_summary_english`\. These summaries are included as descriptive metadata to support browsing and quick inspection; they are neither required for nor included in topic estimation\.

Site\-declared tags \(Kremlin only\)\. In the two Kremlin CSVs we retain the structured content tags provided on the Kremlin’s official website\. The primary tag field is `declared_topics`, which records any site\-assigned topical labels for various thematic categories\. We also retain additional site\-declared metadata fields, including `declared_geography` and `declared_persons`\. For the Kremlin Russian→English corpus we additionally provide English\-rendered versions of these tag fields \(`declared_topics_english`, `declared_geography_english`, and `declared_persons_english`\)\. These site\-declared annotations are not used to fit our BERTopic \[[27](https://arxiv.org/html/2605.15886#bib.bib206)\] models, but they support comparisons between the source site’s tagging scheme and our unsupervised topics, and can be leveraged for validation or supervised extensions, including in our own validations further below\.

Final text and image topic variables \(curated\)\. Finally, we attach the curated text and image topic variables from Section [0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2) directly to each row\. For text, we store:

- `curated_topic_id`: the integer ID of the primary text topic for the speech, defined as $k^{\ast}(d) = \arg\max_{k} p(k \mid d)$ after reassigning any HDBSCAN \[[14](https://arxiv.org/html/2605.15886#bib.bib208)\] outliers to their nearest topic centroid\. In the Kremlin corpora this lies in $\{0, \dots, 88\}$; in the MID corpora it lies in $\{0, \dots, 31\}$\.
- `curated_text_topic_label`: the short human\-readable label assigned to that topic\.
- `curated_text_topic_group`: a broader group label capturing higher\-level domains\.
- `curated_topic_probability`: a serialized vector containing the full document–topic probability distribution $\{p(k \mid d)\}_{k=0}^{K-1}$ aligned to the topic IDs for that corpus\.

On the image side, each speech stores list\-valued image\-topic summaries aligned with `stored_image_filepaths`:

- `curated_image_topic_ids`: for each stored image, the ID of the topic most strongly associated with that image\.
- `curated_image_topic_labels`: the corresponding short topic labels in the same order as `curated_image_topic_ids`\.
- `curated_image_group_names`: the corresponding group labels \(again aligned in order\)\.
- `curated_image_topic_probabilities`: for each stored image, a serialized vector containing its full image–topic score/probability distribution aligned to the topic IDs for that corpus\.

Because image\-topic fields are defined at the level of individual images but stored as list\-valued columns at the speech level in our CSVs, they can be used as speech\-level summaries \(e\.g\., whether any attached image is associated with a topic\) or as a starting point for constructing an image\-level dataset by exploding the lists\.
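
Exploding the list\-valued columns into an image\-level table can be sketched as follows\. This is a minimal illustration under two assumptions that the text does not confirm: that the lists are serialized as Python/JSON\-style literals readable by `ast.literal_eval`, and that each row carries a `speech_id` key\. The helper names are hypothetical\.

```python
import ast
from typing import Dict, Iterator, List

# Speech-level columns that hold one entry per stored image.
LIST_COLUMNS = [
    "stored_image_filepaths",
    "image_captions",
    "curated_image_topic_ids",
    "curated_image_topic_labels",
]


def parse_list(cell: str) -> List[object]:
    """Parse one serialized list cell; empty cells become empty lists."""
    return list(ast.literal_eval(cell)) if cell else []


def explode_images(row: Dict[str, str]) -> Iterator[Dict[str, object]]:
    """Yield one record per stored image for a single speech-level row."""
    parsed = {col: parse_list(row.get(col, "")) for col in LIST_COLUMNS}
    n_images = len(parsed["stored_image_filepaths"])
    for col, values in parsed.items():
        if len(values) != n_images:
            raise ValueError(f"{col} is not aligned with stored_image_filepaths")
    for i in range(n_images):
        record: Dict[str, object] = {"speech_id": row["speech_id"]}
        record.update({col: parsed[col][i] for col in LIST_COLUMNS})
        yield record


def has_image_topic(row: Dict[str, str], topic_id: int) -> bool:
    """Speech-level summary: is any attached image assigned to topic_id?"""
    return topic_id in parse_list(row.get("curated_image_topic_ids", ""))
```

The alignment check matters because every downstream join relies on the lists sharing the order of `stored_image_filepaths`\.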

Taken together, these additional extracted and transformed variables—translated text and metadata fields, calendar variables, extracted speakers, cleaned locations and geocoordinates, length and media counts, stored image paths and captions, site\-declared tags, and curated text and image topic variables \(including full probability vectors\)—make the final CSV files immediately usable as analysis\-ready datasets\.

#### 0\.2\.2Speech and Image Topic Labels

This section discusses our topic modeling\-derived variables, which assign each speech and its associated images interpretable topic labels\. Figure [5](https://arxiv.org/html/2605.15886#Sx2.F5) provides a high\-level overview of the topic\-labeling pipeline used to produce these variables\.

\[Figure 5 diagram\. Text branch: text corpora \(Kremlin and MID speeches, EN and RU\); translation and preprocessing \(RU→EN, cleaning, stopwords\); sentence\-transformer text embeddings; BERTopic \[[27](https://arxiv.org/html/2605.15886#bib.bib206)\] topics per corpus\. Image branch: official images; image preprocessing \(load, resize, normalize\); CLIP \[[53](https://arxiv.org/html/2605.15886#bib.bib194)\] ViT\-B/32 embeddings; image–topic scoring by similarity of images to topic prompts\. Both branches feed diagnostics and labels \(coherence checks, human topic names\) and then the curated topic variables \(speech and image topic IDs, labels, and groups in CSV/HTML\)\.\]

Figure 5: High\-level overview of the topic modeling pipeline\. We embed Kremlin and MID speech texts \(EN and RU→EN\) with sentence\-transformer models and fit BERTopic \[[27](https://arxiv.org/html/2605.15886#bib.bib206)\] separately per corpus\. In parallel, associated images are embedded with CLIP \[[53](https://arxiv.org/html/2605.15886#bib.bib194)\] \(ViT\-B/32\) and scored against topic prompts to assign image\-topic labels\. Final curated topic IDs, labels, and groups are saved for speeches and images\.

While our Kremlin corpora included a set of Kremlin\-declared thematic \(i\.e\., topic\) tags, these were incomplete, accompanying only 62\.2% \(6563/10553\) and 62\.2% \(8298/13340\) of our English and Russian Kremlin speeches, respectively\. We also do not know how these declared topics were assigned by the Kremlin itself, and our MID corpora do not include any MID\-declared topic tags\. In order to consistently assign thematic labels, and hence topic variables, to *all* relevant speeches within our English and Russian Kremlin and MID corpora, we implement a text\+image topic extraction pipeline followed by human labeling\.

Our goal in this step is to turn each speech and its associated images into a set of interpretable, comparable topic labels that future researchers can use as covariates in downstream analyses\. We achieve this by estimating topic models on four corpora that jointly cover the Russian presidency and the Ministry of Foreign Affairs \(MID\): Kremlin English speeches \(Kremlin EN\), Kremlin Russian speeches translated into English \(Kremlin RU→EN\), MID English texts \(MID EN\), and MID Russian texts translated into English \(MID RU→EN\)\. For each institution, we construct a single shared *text\-based* topic space \(with $K=89$ topics for the Kremlin and $K=32$ for MID\) that we apply across languages \(native English and Russian→English\)\. We then use CLIP \(Contrastive Language–Image Pretraining; \[[53](https://arxiv.org/html/2605.15886#bib.bib194)\]\), a pretrained vision–language model that learns a shared embedding space for images and text, to map images into these same topic spaces, so that images can be associated with the learned topics without ever influencing the underlying text model\. Concretely, CLIP \[[53](https://arxiv.org/html/2605.15886#bib.bib194)\] produces a numeric vector \(embedding\) for each image and for each short text prompt; when an image and a prompt are semantically related, their embeddings are close \(high cosine similarity\)\. We therefore represent each topic with a short English prompt built from that topic’s key terms \(e\.g\., “official photo about Ukrainian affairs”\), encode both the prompts and images with CLIP \[[53](https://arxiv.org/html/2605.15886#bib.bib194)\], and assign images to topics by comparing similarities across all topic prompts\. This procedure lets us score and label images using the same topic inventory learned from text, while keeping the topic model itself purely text\-based\. The resulting topic IDs, labels, and overarching topic group labels are stored directly in our final CSV files and serve as additional variables in our merged dataset\. We next describe this approach in greater detail\.

Software stack and computing environment\. All text and image processing, embedding, and topic modeling is implemented in Python 3 \[[51](https://arxiv.org/html/2605.15886#bib.bib217)\] within Google Colab Pro\+ \[[24](https://arxiv.org/html/2605.15886#bib.bib222)\] notebooks\. We rely on pandas \[[39](https://arxiv.org/html/2605.15886#bib.bib201)\] and numpy \[[28](https://arxiv.org/html/2605.15886#bib.bib202)\] for data handling, PyTorch \[[48](https://arxiv.org/html/2605.15886#bib.bib203)\] and transformers \[[71](https://arxiv.org/html/2605.15886#bib.bib204)\] for neural models, sentence\-transformers \[[54](https://arxiv.org/html/2605.15886#bib.bib205)\] for sentence\-level embedding models, BERTopic \[[27](https://arxiv.org/html/2605.15886#bib.bib206)\] for topic modeling, umap\-learn \[[38](https://arxiv.org/html/2605.15886#bib.bib207)\] for dimensionality reduction, and hdbscan \[[14](https://arxiv.org/html/2605.15886#bib.bib208)\] for density\-based clustering\.

Although the final topic models are estimated on English\-language text \(including Russian→English translations for the Russian corpora\), we also conducted extensive exploratory experiments that attempted to model topics directly from Cyrillic Russian text\. These experiments did not yield sufficiently stable or interpretable solutions for inclusion in the final pipeline, but they informed our preprocessing choices for the Russian corpora\. For these Russian\-specific experiments and preprocessing steps, we use Stanza \[[52](https://arxiv.org/html/2605.15886#bib.bib209)\], spaCy \[[30](https://arxiv.org/html/2605.15886#bib.bib225)\] \(with a large Russian model\), and pymorphy3 \[[31](https://arxiv.org/html/2605.15886#bib.bib210)\] for morphological normalization\.

Image handling uses Pillow \[[17](https://arxiv.org/html/2605.15886#bib.bib211)\], and CLIP models \[[53](https://arxiv.org/html/2605.15886#bib.bib194)\] are accessed via sentence\-transformers \[[54](https://arxiv.org/html/2605.15886#bib.bib205)\]\. Embedding and topic\-model estimation runs are executed on NVIDIA A100 GPUs provided by Google Colab Pro\+ \[[24](https://arxiv.org/html/2605.15886#bib.bib222)\], while heavier Russian preprocessing and the direct\-Russian topic\-modeling experiments are CPU\-bound and executed on high\-RAM Colab Pro\+ instances \(on the order of tens of CPU\-hours per Russian corpus\)\. We use joblib \[[64](https://arxiv.org/html/2605.15886#bib.bib221)\] for parallelization and on\-disk Parquet \[[2](https://arxiv.org/html/2605.15886#bib.bib220)\] shards to stream documents and avoid exhausting RAM\.

Pipeline overview \(released vs\. diagnostic components\)\. We first describe the *released* topic\-labeling workflow, which defines the canonical topic IDs and curated topic variables distributed in the CSVs\. This released workflow operates in a shared *English semantic space*: it uses native\-English texts for the English corpora and machine\-translated English text fields for the Russian corpora, enabling direct cross\-language comparability within each institution\. We also implemented an earlier *native\-Russian \(Cyrillic\) topic\-modeling pipeline* that attempted to fit BERTopic \[[27](https://arxiv.org/html/2605.15886#bib.bib206)\] directly on the original Russian texts \(rather than on translated English\)\. This workflow required substantially heavier Russian\-specific preprocessing \(sentence segmentation, Cyrillic/noise gating, morphological normalization, and expanded Russian stopwords\) and extensive experimentation with multiple multilingual embedding models and tuning choices\. Despite these efforts, the resulting native\-Russian topic solutions were not sufficiently coherent and stable for inclusion as the released topic inventories\. We therefore adopt a translation\-based strategy for the final datasets: we translate the Russian corpora into English \(RU→EN\), apply the same English\-side preprocessing, and fit BERTopic \[[27](https://arxiv.org/html/2605.15886#bib.bib206)\] in English space for the Kremlin RU→EN and MID RU→EN corpora\. The released topic IDs and curated topic labels are thus derived from the English and translated\-English pipelines, while the native\-Russian pipeline is retained only as an exploratory workflow\.

Released \(English\-space\) workflow\. For each institution \(Kremlin and MID\), the released workflow proceeds in five steps: \(i\) text cleanup \(and RU→EN translation for Russian corpora\), \(ii\) English\-space document embeddings, \(iii\) BERTopic \[[27](https://arxiv.org/html/2605.15886#bib.bib206)\] estimation and stabilization of a shared topic inventory per institution, \(iv\) image–topic scoring using CLIP \[[53](https://arxiv.org/html/2605.15886#bib.bib194)\] against the fixed text\-topic inventory, and \(v\) human annotation of topic labels and group labels, followed by writing curated variables to the final CSVs\.

English\-space text construction: RU→EN translation and preprocessing\. For all four corpora, our goal is to construct a clean, comparable *English\-space* text input for downstream topic modeling\. We therefore \(i\) translate the Russian corpora into English to create parallel English text fields, and then \(ii\) apply the same English\-side preprocessing steps to *all* English\-space text, both native English and translated English\.

*Russian corpora \(Kremlin RU→EN, MID RU→EN\): translation\.* We construct parallel English fields \(`title_english`, `full_text_english`\) by translating each row’s original Cyrillic text using Argos Translate \[[62](https://arxiv.org/html/2605.15886#bib.bib223)\], applied consistently across both Russian corpora\. Translation is performed strictly row\-by\-row: `full_text` \(Cyrillic\) and `full_text_english` always refer to the same speech, and we do not reorder, merge, split, or otherwise alter document boundaries during translation\. IDs and URLs are preserved\.

*English\-space preprocessing \(applied uniformly to native and translated English\)\.* After translation, we treat `full_text` \(native English\) and `full_text_english` \(translated English\) as a single class of English\-space inputs and apply identical preprocessing steps\. We start from HTML\-stripped text and apply light normalization: Unicode whitespace normalization \(including replacement of non\-breaking spaces\), removal of residual scraping artifacts, and simple punctuation cleanup \(e\.g\., stripping redundant line breaks introduced by HTML extraction\)\. We do not apply stemming or lemmatization to English\-space text\. This is deliberate: the class\-based TF–IDF representation \(c\-TF–IDF\) of BERTopic \[[27](https://arxiv.org/html/2605.15886#bib.bib206)\] operates on surface forms, and preserving proper nouns improves interpretability for political actors, organizations, and place names\. Across both native\-English and translated\-English texts, we remove stopwords using a union of the default English stopword list of spaCy \[[30](https://arxiv.org/html/2605.15886#bib.bib225)\] and a small hand\-curated set of highly frequent domain terms \(e\.g\., “russia”, “russian federation”, “president”\) and year tokens \(e\.g\., “2008”\) that otherwise dominate c\-TF–IDF but add little topical information\. The exact stopword lists are provided in the replication code and configuration files for this project\.
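
The normalization and stopword steps can be sketched as below\. This is a toy illustration, not the project’s configuration: the stopword sets are tiny placeholders \(the real lists are in the replication materials\), the year pattern is an assumption, and multi\-word entries such as “russian federation” are not handled here\.

```python
import re

# Placeholder lists for illustration only; NOT the project's stopword lists.
BASE_STOPWORDS = {"the", "of", "and", "in", "a", "to"}
DOMAIN_STOPWORDS = {"russia", "president"}
STOPWORDS = BASE_STOPWORDS | DOMAIN_STOPWORDS

# Assumption: a year token is a four-digit number starting with 19 or 20.
YEAR_TOKEN = re.compile(r"^(19|20)\d{2}$")
WORD = re.compile(r"[^\W_]+(?:[-'][^\W_]+)*")


def normalize_whitespace(text: str) -> str:
    """Replace non-breaking spaces and collapse redundant blanks and breaks."""
    text = text.replace("\u00a0", " ")
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\s*\n\s*", "\n", text)
    return text.strip()


def remove_stopwords(text: str) -> str:
    """Drop stopwords and year tokens; keep surface forms (no stemming)."""
    tokens = WORD.findall(text)
    kept = [
        tok for tok in tokens
        if tok.lower() not in STOPWORDS and not YEAR_TOKEN.match(tok)
    ]
    return " ".join(kept)
```

Matching is case\-insensitive but the surviving tokens keep their original casing, consistent with preserving proper nouns\.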

English\-space text embeddings \(canonical\)\. For all *released* topic models \(including the translated Russian corpora\), we embed each speech using sentence\-transformers \[[54](https://arxiv.org/html/2605.15886#bib.bib205)\] with `all-mpnet-base-v2` \[[60](https://arxiv.org/html/2605.15886#bib.bib212)\]\. Because many speeches exceed transformer context limits, we encode each document using overlapping windows of up to 512 word\-piece tokens \(a sliding window with small overlap to reduce boundary artifacts\), producing one embedding per window\. We L2\-normalize the window embeddings, average them, and normalize the result to obtain a single 768\-dimensional L2\-normalized document vector per speech\. For shorter speeches, this reduces to a single forward pass\. These English\-space embeddings \(native English \+ translated English\) define the semantic space used for the released topic IDs\.
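
The windowing and pooling logic can be sketched without any model dependency\. The overlap of 64 tokens is an assumption \(the text says only “small overlap”\), and the final re\-normalization of the mean is our reading of “L2\-normalized document vector”\. Function names are hypothetical, and the encoder call that would produce `window_vectors` is left out\.

```python
import math
from typing import List, Sequence, Tuple


def window_spans(n_tokens: int, window: int = 512, overlap: int = 64) -> List[Tuple[int, int]]:
    """Return (start, end) token spans covering a document with overlap."""
    if n_tokens <= window:
        return [(0, n_tokens)]  # short documents need a single pass
    stride = window - overlap
    spans: List[Tuple[int, int]] = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return spans


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def pool_windows(window_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Normalize each window vector, average them, and normalize the mean."""
    unit = [l2_normalize(v) for v in window_vectors]
    dim = len(unit[0])
    mean = [sum(v[j] for v in unit) / len(unit) for j in range(dim)]
    return l2_normalize(mean)
```

Normalizing before averaging gives every window equal weight regardless of its raw embedding magnitude\.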

BERTopic modeling and shared topic inventories\. We use BERTopic \[[27](https://arxiv.org/html/2605.15886#bib.bib206)\] to obtain topic assignments and c\-TF–IDF topic representations from the English\-space embeddings\. BERTopic \[[27](https://arxiv.org/html/2605.15886#bib.bib206)\] combines dimensionality reduction \(we use umap\-learn \[[38](https://arxiv.org/html/2605.15886#bib.bib207)\]\) with HDBSCAN clustering \[[14](https://arxiv.org/html/2605.15886#bib.bib208)\] and c\-TF–IDF keyword extraction\. Images never influence clustering; the topic model is fit purely on text embeddings\.

We estimate four corpus\-level BERTopic \[[27](https://arxiv.org/html/2605.15886#bib.bib206)\] models \(Kremlin EN, Kremlin RU→EN, MID EN, MID RU→EN\)\. Rather than fitting a single pooled model per institution, we fit *one model per corpus*, and we treat the resulting topic IDs as *corpus\-specific* \(i\.e\., topic 12 in Kremlin EN is not assumed to be the same substantive topic as topic 12 in Kremlin RU→EN\)\. We nonetheless enforce institution\-level comparability in *granularity* by fixing the same target topic count within each institution \(Kremlin: $K=89$; MID: $K=32$\), so that the English and translated corpora within the same institution are summarized at a similar level of detail\.

![Refer to caption](https://arxiv.org/html/2605.15886v1/kremlin_topic_screen_plot.png)

Figure 6: $K$\-sweep scree plot for the Kremlin English corpus\. Lines show normalized topic\-quality metrics \(coherence $c_{\mathrm{npmi}}$, diversity, compactness, and separation\) and their weighted composite score; the selected solution is $K=89$ topics\.

![Refer to caption](https://arxiv.org/html/2605.15886v1/mid_topic_scree_plot.png)

Figure 7: $K$\-sweep scree plot for the MID English corpus\. Lines show normalized topic\-quality metrics \(coherence $c_{\mathrm{npmi}}$, diversity, compactness, and separation\) and their weighted composite score; the selected solution is $K=32$ topics\.

To select these target topic counts, we first performed model\-selection diagnostics on the native\-English corpora \(Kremlin EN and MID EN\)\. For each corpus, we began from a single high\-resolution BERTopic fit \(excluding topic $-1$ outliers throughout\), which yielded up to 185 substantive topics for Kremlin EN and up to 79 substantive topics for MID EN\. We then reduced each high\-resolution solution to a grid of candidate topic counts and evaluated each candidate using four complementary criteria: semantic coherence \($c_{\mathrm{npmi}}$\), keyword diversity, within\-topic compactness, and between\-topic separation\. In parallel, we monitored cluster diagnostics \(topic\-size distributions and outlier rates\) and centroid\-based redundancy checks \(pairwise cosine similarities between topic centroids to detect near\-duplicate topics\)\. We summarize the four primary criteria with a weighted composite score \(coherence 0\.40, diversity 0\.20, compactness 0\.25, separation 0\.15\) and corroborate the quantitative diagnostics with qualitative inspection of each topic’s *top 10* c\-TF–IDF keywords and representative speeches to avoid overly coarse topics \($K$ too small\) or fragmented/redundant topic sets \($K$ too large\)\. This procedure favored $K=89$ for Kremlin EN and $K=32$ for MID EN as stable, interpretable solutions that balance thematic granularity against noise and redundancy\. For consistency across each institution’s language variants, we then fixed these same target topic counts when estimating the corresponding RU→EN models \(Kremlin RU→EN and MID RU→EN\)\. As an additional face\-validity check, we verified that the resulting keyword profiles in the Kremlin EN and Kremlin RU→EN models were broadly similar at these target values, supporting the use of common topic granularity across languages even though topic IDs and human\-assigned labels remain corpus\-specific\. We provide the full diagnostics outputs \(the candidate\-$K$ metric table and scree\-style plots used for selection\) in this paper’s replication materials\.
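
The weighted composite can be sketched as follows\. Only the four weights come from the text; min\-max normalization across candidates and the convention that higher is better for every metric are assumptions, and the function names are hypothetical\.

```python
from typing import Dict, List

# Weights stated in the text; they sum to 1.0.
WEIGHTS = {"coherence": 0.40, "diversity": 0.20, "compactness": 0.25, "separation": 0.15}


def min_max(values: List[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def composite_scores(candidates: Dict[int, Dict[str, float]]) -> Dict[int, float]:
    """Map each candidate K to its weighted sum of normalized metrics."""
    ks = sorted(candidates)
    normalized = {
        metric: min_max([candidates[k][metric] for k in ks]) for metric in WEIGHTS
    }
    return {
        k: sum(WEIGHTS[m] * normalized[m][i] for m in WEIGHTS)
        for i, k in enumerate(ks)
    }
```

In use, the candidate with the highest composite would be a starting point for the qualitative keyword inspection described above, not a replacement for it\.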

For each corpus, BERTopic \[[27](https://arxiv.org/html/2605.15886#bib.bib206)\] produces a ranked keyword representation for every topic \(via c\-TF–IDF\) and a document–topic probability distribution\. We then perform corpus\-specific human annotation: for each topic in each corpus, we review its top 10 keywords and highest\-probability speeches \(and, where relevant, representative images from the associated HTML summaries\) and assign \(i\) a short topic label and \(ii\) a broader topic\-group label\. These labels are therefore defined *separately for each corpus* and are stored alongside the corpus\-specific topic IDs in our final full CSVs\.

Outlier handling and primary topic assignment\. BERTopic \[[27](https://arxiv.org/html/2605.15886#bib.bib206)\]/HDBSCAN may initially label some documents as outliers \(topic $-1$\)\. For production use we remove $-1$ labels by assigning each outlier to the nearest topic centroid \(via cosine distance in embedding space\)\. Drawing on applied topic model research in the social sciences \[[3](https://arxiv.org/html/2605.15886#bib.bib89),[70](https://arxiv.org/html/2605.15886#bib.bib4),[7](https://arxiv.org/html/2605.15886#bib.bib90)\], we then define the primary \(dominant\) speech topic as the maximum a posteriori topic,

$$k^{\ast}(d) = \arg\max_{k} p(k \mid d),$$

and store this as `curated_topic_id`\. Thus, in the final released CSVs, every speech has a valid `curated_topic_id` in $\{0, \dots, K-1\}$ with no remaining $-1$ labels\. Once this text\-topic inventory is fixed \(including each topic’s c\-TF–IDF *top 10* keywords\), we treat it as the canonical semantic reference for downstream image scoring and curation\.
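
The two operations in this paragraph, nearest\-centroid reassignment of outliers and the argmax rule, can be sketched separately\. How topic centroids are computed is not specified in the text, so they are taken as given inputs here; the function names are hypothetical\.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def reassign_outliers(
    labels: Sequence[int],
    embeddings: Sequence[Sequence[float]],
    centroids: Dict[int, Sequence[float]],
) -> List[int]:
    """Replace every -1 label with the topic whose centroid is most similar."""
    fixed: List[int] = []
    for label, vec in zip(labels, embeddings):
        if label == -1:
            # Highest cosine similarity is the smallest cosine distance.
            label = max(centroids, key=lambda k: cosine(vec, centroids[k]))
        fixed.append(label)
    return fixed


def primary_topic(probs: Sequence[float]) -> int:
    """Return k* = argmax_k p(k | d) for one document's probability vector."""
    return max(range(len(probs)), key=lambda k: probs[k])
```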

Image embeddings and image–topic scoring \(CLIP\)\. For each speech, we load all available images referenced in `stored_image_filepaths` using Pillow \[[17](https://arxiv.org/html/2605.15886#bib.bib211)\], apply the standard CLIP preprocessing, and encode each image using a CLIP ViT\-B/32 backbone \[[53](https://arxiv.org/html/2605.15886#bib.bib194)\] \(via sentence\-transformers \[[54](https://arxiv.org/html/2605.15886#bib.bib205)\]\)\. This yields an L2\-normalized image embedding for every stored image \(images are treated individually; multi\-image speeches retain multiple embeddings\)\.

Crucially, CLIP does *not* alter the fitted text topics\. Instead, after the text\-based topic inventory is fixed, we map images into that inventory using prompt\-based similarity scoring\. For each final topic $k$, we construct a short English prompt based on that topic’s top keywords \(e\.g\., “official photo about Ukrainian affairs”\) and encode the prompt with the CLIP text encoder to obtain a topic vector $v^{\text{topic}}_{k}$\. For each image $i$ with embedding $v^{\text{image}}_{i}$, we compute cosine similarities to all topic vectors and convert them into an image–topic probability distribution via a softmax:

$$p(k \mid \text{image } i) \;\propto\; \exp\bigl(\lambda \cdot \cos(v^{\text{image}}_{i}, v^{\text{topic}}_{k})\bigr),$$

where $\lambda > 0$ is a fixed temperature parameter \(held constant across a corpus in the replication code\)\. We assign each image a primary topic $k^{\ast}(i) = \arg\max_{k} p(k \mid i)$ and retain the full probability vectors for downstream robustness checks and qualitative inspection\.
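
The softmax scoring rule can be sketched directly from the formula\. The default `lam=10.0` is a placeholder, since the value of $\lambda$ is not reported here, and the CLIP encoders that would produce the vectors are left out\.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def image_topic_probs(
    image_vec: Sequence[float],
    topic_vecs: Sequence[Sequence[float]],
    lam: float = 10.0,  # hypothetical value; the text fixes lambda but does not state it
) -> List[float]:
    """Softmax over lam * cosine(image, topic) across all topic prompts."""
    scaled = [lam * cosine(image_vec, t) for t in topic_vecs]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def primary_image_topic(probs: Sequence[float]) -> int:
    """Return k*(i), the index of the highest-probability topic."""
    return max(range(len(probs)), key=lambda k: probs[k])
```

Because $\lambda$ multiplies the similarities, larger values concentrate probability on the best\-matching prompt while leaving the argmax unchanged\.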

Human annotation of topic labels and groups, and handling assignment uncertainty\.Once topic inventories and keyword lists are fixed, a Russian\-speaking coauthor conducts an iterative qualitative coding pass over all topics\. For each topic \(within each corpus\), the coder inspects the c\-TF–IDF keyword list, the highest\-probability speeches, and representative images \(ranked using image–topic probabilities\), and assigns \(i\) a short human\-readable topic label \(e\.g\., “Ukrainian affairs”\) and \(ii\) a broader group label capturing higher\-level domains \(e\.g\., “Post\-Soviet Relations”, “Military & Security”, “Domestic Politics”\)\. When upstream preprocessing refinements \(e\.g\., improved stopword lists\) change topic boundaries, labels are revisited to maintain internal consistency\.

Because topics are learned by clustering in a continuous embedding space and because c\-TF–IDF summarizes each topic using only its most distinctive surface keywords, labels should be interpreted as*summary descriptors*rather than exhaustive tags\. Substantively adjacent subthemes are often grouped together—especially in foreign\-policy discourse—and a topic label may emphasize the most salient recurring referent even when the cluster contains closely related variants\. For example, a topic whose top keywords include multiple East Asian referents \(e\.g\., “China”, “Japan”, “Beijing”, “Tokyo”\) may be labeled with the dominant descriptor \(e\.g\., “China”\) even though some speeches within that cluster primarily concern Japan\. This is an inherent trade\-off when we distribute a tractable topic representation over many thousands of speeches: in the released speech\-level tables we provide a single primary topic per speech for ease of use, while also providing the full item–topic probability information so users can model overlap and ambiguity when needed\.

Writing curated variables and exporting full probability tables\. For each speech, we store the dominant \(highest\-probability\) topic as `curated_topic_id` and join this ID to the human\-assigned topic label and group label, producing `curated_text_topic_label` and `curated_text_topic_group`\. For images, we store primary image\-topic assignments and summarize them back into the speech\-level CSV as list\-valued columns \(`curated_image_topic_ids`, `curated_image_topic_labels`, `curated_image_group_names`\), with one entry per stored image\.

In addition to these dominant\-topic variables, we export two auxiliary *long\-format* probability tables as supplemental files: \(i\) a document–topic table with one row per `speech_id` $\times$ `topic_id` pair containing $p(k \mid d)$ for all $k \in \{0, \dots, K-1\}$, and \(ii\) an image–topic table with one row per `image_id` $\times$ `topic_id` pair containing the corresponding image–topic score/probability for all topics\. These long\-format files allow users to \(i\) recover the full topic distribution for any speech or image, \(ii\) analyze mixtures rather than only the top topic \(e\.g\., retaining the top\-$m$ topics per speech\), \(iii\) apply custom thresholds or uncertainty\-aware models \(e\.g\., filtering on $p(k \mid d)$\), and \(iv\) conduct robustness checks beyond dominant\-topic labeling\.
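
Retaining the top\-$m$ topics per speech from the long\-format document–topic table can be sketched as follows\. Rows are represented as `(speech_id, topic_id, probability)` tuples; the name and position of the probability column are assumptions, as are the function name and defaults\.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int, float]  # (speech_id, topic_id, probability)


def top_m_topics(
    rows: Iterable[Row], m: int = 3, min_prob: float = 0.0
) -> Dict[str, List[Tuple[int, float]]]:
    """Keep, per speech, the m most probable topics with probability >= min_prob."""
    by_speech: Dict[str, List[Tuple[int, float]]] = defaultdict(list)
    for speech_id, topic_id, prob in rows:
        if prob >= min_prob:
            by_speech[speech_id].append((topic_id, prob))
    return {
        speech_id: sorted(pairs, key=lambda pair: pair[1], reverse=True)[:m]
        for speech_id, pairs in by_speech.items()
    }
```

Setting `m=1` and `min_prob=0.0` recovers the dominant\-topic assignment, which makes this a convenient check against `curated_topic_id`\.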

Diagnostic native\-Russian preprocessing and embedding pipeline\. Although it is not used for our final released topics, we invested substantial effort, in parallel to the released English\-space workflow, in a native\-Russian pipeline operating directly on the original Cyrillic `full_text`, with the goal of obtaining a fully Russian\-language topic solution of comparable quality\. We iterated through multiple increasingly stringent cleaning, normalization, and representation strategies and implemented a comparatively heavy preprocessing and embedding workflow designed specifically for Russian morphology and domain\-specific boilerplate\. Despite these efforts, and based upon a Russian\-speaking expert’s qualitative reviews, native\-Russian topic solutions remained systematically less stable and less coherent than the English\-space approach for our corpora\. Accordingly, we do not release native\-Russian clustering outputs as official topic IDs\. We nonetheless document the pipeline here to demonstrate the extent of our diagnostic work and to provide a reproducible baseline for future improvements\. In brief, the diagnostic pipeline: \(i\) applies sentence segmentation with Stanza \[[52](https://arxiv.org/html/2605.15886#bib.bib209)\] and tokenization/POS/NER with spaCy \[[30](https://arxiv.org/html/2605.15886#bib.bib225)\], \(ii\) implements “Cyrillic gating” to remove non\-Russian noise \(URLs, boilerplate, and lines dominated by Latin characters\), \(iii\) performs morphological normalization with pymorphy3 \[[31](https://arxiv.org/html/2605.15886#bib.bib210)\] \(lemmatizing content\-bearing tokens while preserving proper\-noun surface forms where lemmatization harms interpretability\), \(iv\) applies an expanded Russian stopword list \(including curated lists to remove frequent formalities and greeting formulas\), and \(v\) constructs a cleaned `model_text` field per speech for downstream Russian embeddings and keyword extraction\.
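
The “Cyrillic gating” step \(ii\) can be illustrated with a minimal line filter\. This is a sketch under assumptions, not the diagnostic pipeline itself: the 0\.5 share threshold, the URL pattern, and the choice to drop lines with no letters at all are hypothetical, and boilerplate removal is not covered\.

```python
import re

CYRILLIC = re.compile(r"[\u0400-\u04FF]")
LATIN = re.compile(r"[A-Za-z]")
URL = re.compile(r"https?://\S+|www\.\S+")


def cyrillic_gate(text: str, min_cyrillic_share: float = 0.5) -> str:
    """Strip URLs, then keep only lines whose letters are mostly Cyrillic."""
    kept = []
    for line in text.splitlines():
        line = URL.sub("", line).strip()
        n_cyrillic = len(CYRILLIC.findall(line))
        n_latin = len(LATIN.findall(line))
        n_letters = n_cyrillic + n_latin
        if n_letters == 0:
            continue  # assumption: lines without letters carry no usable text
        if n_cyrillic / n_letters >= min_cyrillic_share:
            kept.append(line)
    return "\n".join(kept)
```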

For Russian embeddings in this diagnostic workflow, we used an ensemble of multilingual embedding models \(`intfloat/multilingual-e5-large` \[[67](https://arxiv.org/html/2605.15886#bib.bib213)\], LaBSE \[[22](https://arxiv.org/html/2605.15886#bib.bib214)\], and `BAAI/bge-m3` \[[5](https://arxiv.org/html/2605.15886#bib.bib215)\]\) with the same 512\-token windowing and normalization strategy used for English\-space embeddings, and generated the top 10 keywords for each speech\. In practice, however, these native\-Russian experiments were not sufficiently stable or interpretable to be used in the released pipeline: topic keyword lists were often overly heterogeneous, topics did not align well with representative images, and small implementation/encoding artifacts could introduce mixed\-language noise \(e\.g\., English tokens appearing inside Cyrillic outputs\)\. Qualitative feedback on these native\-Russian outputs further emphasized that, without extensive per\-topic document reading, many topics could not be reliably interpreted from keywords and images alone, suggesting that the model was not producing a coherent, analyst\-usable topical structure for the MID RU corpus and that similar instability could appear in the Kremlin RU setting under alternative parameterizations\. Because these issues undermine replicability and substantively meaningful topic labeling, we do *not* use the native\-Russian pipeline for any released topic IDs, curated topic variables, or keyword summaries\.

We therefore use the translation\-based strategy discussed earlier for the final datasets: we translate the Russian corpora into English \(RU→EN\), apply the same English\-side preprocessing used for the native\-English corpora, and estimate the released BERTopic\[[27](https://arxiv.org/html/2605.15886#bib.bib206)\] models in a single English semantic space \(native English \+ translated English\)\. All released topic IDs and curated topic variables are derived exclusively from these English\-space models\.

#### 0\.2\.3Location Extraction and Geocoding

We construct a harmonized set of location variables for all four corpora, consisting of a cleaned event location string \(location\) and approximate geographic coordinates \(latitude, longitude\)\. These fields support spatial aggregation, mapping, and distance\-based analysis while preserving the original source information wherever possible\.

Kremlin corpora \(observed locations from the source site\)\. For the Kremlin English corpus, event locations are scraped directly from the Kremlin website and stored in location\. We do not infer or rewrite these strings beyond light whitespace and punctuation cleanup\. For the Kremlin Russian→English corpus, we retain the original Russian location text and additionally store an English\-rendered version in location\_english\. For geocoding and cross\-corpus comparability, coordinates for the Kremlin Russian corpus are derived using location\_english\.

MID corpora \(location recovery using title *and* full text, with an LLM backfill step\)\. For both the MID English corpus and the MID Russian→English translated corpus, location is the final named location column\. We preserve any non\-empty location values already present in the dataset and only attempt recovery when location is blank\.

When location is blank, we recover a location using a spaCy\[[30](https://arxiv.org/html/2605.15886#bib.bib225)\]\-based heuristic that uses both the document title and the full text\. Specifically, we apply a fixed priority order: \(i\) a rule\-based extraction from the tail of the title \(when the title ends with a location\-like segment\), \(ii\) named\-entity recognition \(NER\) on the title\. If no location is detected by any step, location remains blank\.

After this spaCy\[[30](https://arxiv.org/html/2605.15886#bib.bib225)\] pass, we apply a second “backfill” pass for any remaining blanks using Anthropic Claude 3 Haiku \(model identifier claude\-3\-haiku\-20240307\)\[[1](https://arxiv.org/html/2605.15886#bib.bib224)\] via the Messages API\. We prompt the model to extract *only* the location phrase from the record’s English title and full text, and to return UNKNOWN when no location is clearly stated\. We then write the returned location phrase into location for those rows \(keeping blanks when the model returns UNKNOWN\)\. Because the model may return either city\-only strings \(e\.g\., “Moscow”\) or composite strings \(e\.g\., “Moscow, Russia”\), we standardize all recovered values after extraction by retaining only the city component \(i\.e\., the substring before the first comma\) and applying light whitespace/punctuation cleanup\. As a result, the final MID location field is stored consistently in a city\-only format\.
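The city\-only standardization described above can be sketched in a few lines of Python\. This is a minimal illustration rather than the released code: the function name standardize\_location and the exact set of stripped characters are assumptions\.

```python
import string
from typing import Optional


def standardize_location(raw: Optional[str]) -> str:
    """Reduce a recovered location phrase to a city-only string.

    Returns "" (blank) for missing values and for the UNKNOWN sentinel.
    """
    if raw is None:
        return ""
    value = raw.strip()
    if value.upper() == "UNKNOWN":
        return ""
    city = value.split(",", 1)[0]          # keep the substring before the first comma
    city = " ".join(city.split())          # collapse internal whitespace
    return city.strip(string.punctuation + " ")  # trim stray edge punctuation
```

For example, “Moscow, Russia” and “Moscow” both map to “Moscow”, so the two forms the model may return collapse into a single value before geocoding\.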

The final MID CSVs do not retain intermediate extraction provenance fields \(e\.g\., location\_method, location\_confidence, or raw NER candidate columns\)\. Only the final location string and the geocoded coordinate columns \(latitude, longitude\) are included in the released dataset files\.

Geocoding pipeline \(all corpora\)\. We map each unique non\-empty location string to approximate coordinates using a two\-stage geocoding cascade implemented with geopy\[[23](https://arxiv.org/html/2605.15886#bib.bib183)\]\. We first query Nominatim \(OpenStreetMap\)\[[46](https://arxiv.org/html/2605.15886#bib.bib185),[47](https://arxiv.org/html/2605.15886#bib.bib184)\]; if the request fails or times out, we fall back to the ArcGIS geocoder\[[21](https://arxiv.org/html/2605.15886#bib.bib219)\]\. We do not apply explicit ambiguity filtering or country\-bias restrictions \(e\.g\., Russia\-only constraints\) and accept the first geocoder result returned\. Coordinates are stored as raw floating\-point values \(no rounding\), and any location that cannot be resolved by either service remains missing \(NaN\) in latitude/longitude\. We do not distribute a separate persistent geocode lookup table; only the final per\-row coordinate columns are included in the replication package\.
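The control flow of such a two\-stage cascade might look like the sketch below\. To keep the example runnable offline, the geocoders are passed in as plain callables; the names geocode\_cascade and geocode\_unique are hypothetical, and falling through to the next service on an empty result \(not only on an error\) is our assumption\.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Coords = Tuple[float, float]
Geocoder = Callable[[str], Optional[Coords]]

# With geopy, the callables could wrap real geocoders, e.g.:
#   from geopy.geocoders import Nominatim, ArcGIS
#   nominatim = Nominatim(user_agent="my-app")   # user_agent value is a placeholder
#   def via_nominatim(q):
#       hit = nominatim.geocode(q, timeout=10)   # returns a Location or None
#       return (hit.latitude, hit.longitude) if hit else None


def geocode_cascade(location: str, geocoders: Iterable[Geocoder]) -> Optional[Coords]:
    """Try each geocoder in order; return the first result, else None."""
    for geocode in geocoders:
        try:
            result = geocode(location)
        except Exception:
            continue  # request failed or timed out: fall back to the next service
        if result is not None:
            return result  # accept the first result returned, without filtering
    return None


def geocode_unique(locations: Iterable[Optional[str]],
                   geocoders: Iterable[Geocoder]) -> Dict[str, Optional[Coords]]:
    """Geocode each unique non-empty location string exactly once."""
    services = list(geocoders)
    lookup: Dict[str, Optional[Coords]] = {}
    for loc in locations:
        key = (loc or "").strip()
        if key and key not in lookup:
            lookup[key] = geocode_cascade(key, services)
    return lookup
```

Geocoding unique strings once and mapping the results back to rows keeps the number of remote requests proportional to the number of distinct locations rather than the number of speeches\.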

Coverage in the final datasets\. Table [3](https://arxiv.org/html/2605.15886#Sx2.T3) summarizes location and coordinate coverage in the final corpora\. “Loc” indicates non\-empty location; “Coords” indicates both latitude and longitude are non\-missing; and “Loc&Coords” indicates both conditions hold\. In these final data resources, location strings and coordinates are fully aligned: whenever location is present, both coordinates are also present, so Loc = Coords = Loc&Coords and Loc/noCoords = 0 across corpora\. Likewise, the set of unique locations equals the set of unique geocoded locations \(UniqueLoc = UniqueGeocoded\), indicating complete geocoding coverage for the observed location strings in the released datasets\.

Table 3: Location and geocoding coverage in the final corpora\. “Loc” indicates non\-empty location\. “Coords” indicates both latitude and longitude non\-missing\. “Loc&Coords” indicates both conditions hold\.

| Corpus | n docs | Loc | %Loc | Coords | %Coords | Loc&Coords | %Loc&Coords | Loc/noCoords | UniqueLoc | UniqueGeocoded |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Kremlin EN | 10553 | 9830 | 93\.1 | 9830 | 93\.1 | 9830 | 93\.1 | 0 | 843 | 843 |
| Kremlin RU→EN | 13340 | 12684 | 95\.1 | 12684 | 95\.1 | 12684 | 95\.1 | 0 | 859 | 859 |
| MID EN | 5057 | 4833 | 95\.6 | 4833 | 95\.6 | 4833 | 95\.6 | 0 | 323 | 323 |
| MID RU→EN | 6056 | 5742 | 94\.8 | 5742 | 94\.8 | 5742 | 94\.8 | 0 | 374 | 374 |

#### 0\.2\.4Speaker Name Extraction

We construct a speaker\_names variable for each speech and corpus that records the set of distinct speakers explicitly marked in the source materials\. Because the Kremlin and MID websites encode speaker cues differently \(i\.e\., as structured HTML labels versus transcript\-like textual prefixes\), we use two corpus\-specific extraction procedures designed to prioritize precision and to preserve original surface forms prior to downstream normalization\.

MID corpora\. For the MID corpora, we scrape each speech page and restrict extraction to the main article body \(the \.text\.article\-content container when available\)\. MID pages frequently format dialog\-style segments using bold or strong\-face speaker labels immediately followed by a colon \(e\.g\., С\.В\.Лавров:, Вопрос:, Question:\)\. We therefore collect all <b\> and <strong\> elements whose rendered text contains a colon, normalize whitespace and colon spacing, and de\-duplicate labels while preserving first\-seen order\. We then convert these markup labels into speaker candidates by keeping only the substring to the left of the first colon and writing the resulting list to speaker\_names\. For the Russian MID corpus, this step yields speaker labels in Cyrillic; we subsequently translate these extracted speaker\-name strings into English using an Argos Translate\[[62](https://arxiv.org/html/2605.15886#bib.bib223)\] Russian→English pipeline, producing an English\-rendered speaker list while retaining the original Russian forms for internal auditing\. This translation step is applied only to the extracted speaker\-name strings \(not to the full page content\) and is used solely to harmonize naming conventions across corpora\.
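The label\-collection step can be sketched with Python's standard\-library HTML parser\. This is an illustrative simplification under stated assumptions: the paper does not name its HTML library, the class and function names below are hypothetical, and the sketch scans the whole fragment instead of first restricting to the article\-content container\.

```python
from html.parser import HTMLParser
from typing import List


class _BoldLabelParser(HTMLParser):
    """Collect the rendered text of every <b> / <strong> element."""

    def __init__(self) -> None:
        super().__init__()
        self._depth = 0          # nesting depth inside bold/strong tags
        self._buffer: List[str] = []
        self.labels: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "strong"):
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in ("b", "strong") and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.labels.append("".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def extract_speaker_candidates(html: str) -> List[str]:
    """Return de-duplicated label prefixes (text left of the first colon)."""
    parser = _BoldLabelParser()
    parser.feed(html)
    parser.close()
    seen, candidates = set(), []
    for raw in parser.labels:
        text = " ".join(raw.split())           # normalize whitespace
        if ":" not in text:
            continue                           # bold text that is not a speaker label
        name = text.split(":", 1)[0].strip()
        if name and name not in seen:          # preserve first-seen order
            seen.add(name)
            candidates.append(name)
    return candidates
```

Note that role\-like labels such as “Вопрос” \(“Question”\) survive this step by design; they are removed later during normalization\.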

MID fallback for missing speaker labels\. A non\-trivial number of MID pages do not contain any bold/strong speaker labels \(and thus yield an empty speaker\_names list under the HTML\-based procedure above\), even though the underlying text clearly indicates the speech is delivered by Foreign Minister Sergey Lavrov\. To avoid systematically missing the principal speaker in these cases, we apply a conservative default rule: for any MID record whose extracted speaker\_names list is empty after de\-duplication and translation, we set speaker\_names to contain “Sergey Lavrov”\. As a result, in the final MID corpora an empty speaker list does not occur; any item with no extractable speaker markup is treated as a Lavrov\-delivered speech for purposes of the speaker\_names variable\.

Kremlin corpora\. For the Kremlin corpus, speaker cues are not consistently encoded in the page HTML as bold labels; instead, transcripts often contain speaker\-prefixed segments embedded directly in the speech text \(e\.g\., “V\. V\. Putin:”, “S\. V\. Lavrov:”, “Question:”\)\. We therefore apply a regex\-based extractor over the full speech transcript text to identify short line\- or sentence\-initial spans that immediately precede common dialog delimiters \(a colon or semicolon\)\. We clean each captured span by removing surrounding punctuation and normalizing whitespace, and we de\-duplicate candidates within each speech to form an initial list of speaker\-like labels\. Because this can still capture non\-speaker fragments \(e\.g\., generic headers or discourse markers\), we apply a second, high\-precision filter using named\-entity recognition: we run spaCy\[[30](https://arxiv.org/html/2605.15886#bib.bib225)\] with en\_core\_web\_lg over each unique candidate label and retain only those candidates that contain a PERSON entity\.
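A minimal sketch of this two\-stage procedure follows\. The regular expression is hypothetical \(the paper does not publish its pattern\) and, for simplicity, matches only line\-initial spans of at most 60 characters\. The PERSON check is injected as a callable so the example runs without a model download\.

```python
import re
import string
from typing import Callable, List

# Hypothetical pattern: a short line-initial span followed by ':' or ';'.
PREFIX_RE = re.compile(r"^[ \t]*([^\n:;]{2,60}?)[ \t]*[:;]", re.MULTILINE)


def extract_candidates(transcript: str) -> List[str]:
    """Stage 1: collect cleaned, de-duplicated speaker-like prefixes."""
    seen, candidates = set(), []
    for match in PREFIX_RE.finditer(transcript):
        label = " ".join(match.group(1).split())       # normalize whitespace
        label = label.strip(string.punctuation + " ")  # drop surrounding punctuation
        if label and label not in seen:
            seen.add(label)
            candidates.append(label)
    return candidates


def filter_person_labels(candidates: List[str],
                         is_person: Callable[[str], bool]) -> List[str]:
    """Stage 2: keep only candidates that the NER check accepts."""
    return [label for label in candidates if is_person(label)]


# With spaCy, is_person could be implemented as:
#   import spacy
#   nlp = spacy.load("en_core_web_lg")
#   is_person = lambda s: any(ent.label_ == "PERSON" for ent in nlp(s).ents)
```

Separating the permissive regex pass from the strict NER pass means that recall is set by the pattern and precision by the entity filter, so each can be tuned independently\.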

Downstream normalization and verification\. The extraction steps above intentionally preserve the source\-facing surface forms of labels, which may include initials, role\-like tokens \(e\.g\., “Question”\), and minor formatting variation across pages\. After extraction, we compute the distinct set of extracted label strings per speech and then use a large language model \(ChatGPT\[[45](https://arxiv.org/html/2605.15886#bib.bib226)\]\) to normalize these strings into canonical person names \(e\.g\., expanding initials where possible, removing non\-person labels, and consolidating variants that refer to the same individual\)\. Finally, we perform a manual verification pass to ensure that the resulting speaker\_names lists contain only valid person names and are consistent across the Kremlin and MID corpora\. The final distributed datasets retain only the cleaned speaker\_names field; intermediate artifacts such as raw HTML label lists, regex match traces, or NER diagnostics are not included in the released CSV resources\.

## Data Records

The final release consists of four speech\-level CSV files—one per corpus and language variant—together with corpus\-specific directory trees that store all scraped images referenced by those CSVs\. In all four cases, the*unit of observation*is a single speech\- or document\-level record as it appears on the Kremlin or MID website\. Each record may have zero or more associated images, which are distributed as separate raster files on disk and linked back to speeches via path\-level identifiers stored in the CSVs\.

Speech\-level CSV files\. The four primary data tables are:

- •kremlin\_english\.csv \(Kremlin EN\): 10,553 English\-language Kremlin speeches, spanning 1999–2025\.
- •kremlin\_russian\.csv \(Kremlin RU→EN\): 13,340 Russian\-language Kremlin speeches \(original in Cyrillic, with parallel English translations\), spanning 1999–2025\.
- •mid\_english\.csv \(MID EN\): 5,057 English\-language Ministry of Foreign Affairs documents, spanning 2004–2025\.
- •mid\_russian\.csv \(MID RU→EN\): 6,056 Russian\-language MID documents \(original in Cyrillic, with parallel English translations\), spanning 2004–2025\.

In each of these files, one row corresponds to one published speech, statement, interview, briefing, or similar document\. All four CSVs share a common core schema, with a small number of institution\- or language\-specific columns reflecting differences in the underlying websites \(see Section [0\.2\.1](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS1) for full documentation of the additional columns\)\.

Core schema components\.

- •*Identifiers and URLs\.* Each row has a unique integer id and a canonical url pointing to the original Kremlin or MID webpage\. The id serves as the primary key within each corpus and is used throughout the replication materials to join speech\-level records to auxiliary tables \(e\.g\., long\-format probability tables\) and to image inventories\.
- •*Titles and full text\.* The Kremlin EN and MID EN files contain title and full\_text fields in English\. The Kremlin RU→EN and MID RU→EN files contain the original Cyrillic title and full\_text together with Argos Translate\[[62](https://arxiv.org/html/2605.15886#bib.bib223)\]\-based translations title\_english and full\_text\_english that are aligned row\-for\-row with the original Russian fields \(Section [0\.2\.1](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS1)\)\. For the Kremlin corpora we also retain a short page\_summary when available on the site; for the Kremlin RU→EN corpus we additionally provide page\_summary\_english\.
- •*Dates and derived calendar variables\.* All four CSVs include a date field \(as scraped from the site\), an optional time field when present on the page, and derived calendar variables year, month, and day obtained by parsing date\. These derived fields introduce no new information beyond date, but provide convenient numeric variables for temporal aggregation and modeling\.
- •*Speakers\.* Each row includes a speakers field recording the speaker\(s\) associated with the document as a list\-valued string \(e\.g\., a single named speaker for most addresses, or multiple named speakers for dialog\-style transcripts\)\. Speaker strings are extracted from on\-page cues and structured text patterns on the source sites and are then normalized into canonical person names for analysis \(Section [0\.2\.4](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS4)\)\. We do not impute speakers when attribution cues are absent; instead, missingness reflects lack of reliable on\-page markup\.
- •*Locations and geocoordinates\.* Each row includes a location field recording the event location when it is available on the source site \(Kremlin\) or can be conservatively recovered from the title/body text \(MID\); otherwise it is left blank\. For the Kremlin RU→EN corpus, we additionally provide location\_english, an English rendering of the Russian location string for cross\-language comparability and geocoding\. We then geocode unique non\-blank locations and map the resulting coordinates back to all rows, yielding latitude and longitude as approximate point locations for speeches with resolvable locations\. Full corpus\-specific rules, geocoding procedures, and coverage summaries are provided in Section [0\.2\.3](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS3) \(Location Extraction and Geocoding\)\.
- •*Document length and media counts\.* Each row includes full\_text\_word\_count, the number of whitespace\-delimited tokens in full\_text\. On the media side, saved\_images\_count records the number of images successfully downloaded for that speech\. For the Kremlin corpora, declared\_images\_count records the number of photos the site indicates should accompany the speech when such metadata are available; missing\_images\_count captures the difference between declared and successfully downloaded images\.
- •*Site\-declared tags \(Kremlin only\)\.* The Kremlin corpora retain structured metadata tags provided by the source site, including declared\_topics, declared\_geography, and declared\_persons\. For the Kremlin RU→EN corpus, we additionally provide English\-rendered versions of these tag fields \(e\.g\., declared\_topics\_english\)\. These site\-declared tags are distributed as raw metadata and are not used to fit our BERTopic\[[27](https://arxiv.org/html/2605.15886#bib.bib206)\] models; they enable comparisons between the site’s own tagging scheme and our unsupervised topic labels\.
- •*Text topic variables\.* Each speech has a primary text topic assignment from the institution\-specific BERTopic\[[27](https://arxiv.org/html/2605.15886#bib.bib206)\] model \(Section [0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2)\)\. This assignment is stored as curated\_topic\_id \(Kremlin: \{0,…,88\}; MID: \{0,…,31\}\), together with curated\_text\_topic\_label \(a short topic label\) and curated\_text\_topic\_group \(a broader domain label\)\. We additionally store curated\_topic\_probability, a serialized vector containing the full document–topic probability distribution p\(k∣d\) aligned to the topic IDs for that corpus, allowing uncertainty\-aware and multi\-topic analyses beyond the dominant\-topic label\.
- •*Image topic variables\.* For each speech, we store image\-level topic summaries in list\-valued columns aligned with the stored image list: curated\_image\_topic\_ids, curated\_image\_topic\_labels, and curated\_image\_group\_names\. We also provide curated\_image\_topic\_probabilities, which stores \(for each image\) the full image–topic score/probability vector aligned to the topic IDs for that corpus\. This design lets users either analyze images at the speech level \(treating images as attributes of the speech row\) or construct an image\-level dataset by exploding the list\-valued columns\.

A small number of variables are specific to particular corpora \(e\.g\., page\_summary and site\-declared tags for the Kremlin; location\_english and \*\_english tag/synopsis fields for the Kremlin RU→EN corpus\)\. The complete, corpus\-specific column lists and construction rules are documented in Section [0\.2\.1](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS1) and in the accompanying data dictionary\.

Cross\-language linkage within institutions\. Within each executive\-branch institution \(Kremlin or MID\), the English and Russian corpora can be linked using id\. For the Kremlin corpora, the same id refers to the same underlying speech \(same source URL and publication date\) across the English and Russian→English files, enabling bilingual representations of a speech via a simple inner join on id\. An analogous structure holds for the MID corpora\. Users interested in a single language variant can treat each CSV as a stand\-alone data table; the core identifiers, dates, locations, and topic variables are self\-contained within each file\.
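As an illustration of the inner join on id, the following standard\-library sketch pairs rows from an English and a Russian file\. The function name and the output field names \(title\_en, title\_ru\) are hypothetical; in practice the rows would typically come from csv\.DictReader, in which case id values are strings in both files\.

```python
from typing import Dict, Iterable, List


def inner_join_on_id(en_rows: Iterable[Dict], ru_rows: Iterable[Dict]) -> List[Dict]:
    """Pair English and Russian records that share the same id."""
    ru_by_id = {row["id"]: row for row in ru_rows}
    joined = []
    for en in en_rows:
        ru = ru_by_id.get(en["id"])
        if ru is None:
            continue  # inner join: skip speeches without a counterpart
        joined.append({
            "id": en["id"],
            "title_en": en["title"],   # English-language file
            "title_ru": ru["title"],   # original Cyrillic title
        })
    return joined
```

Because the Russian corpora contain more rows than their English counterparts, an inner join keeps only the speeches available in both languages\.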

Image files and folder structure\.All images are distributed as separate raster files in corpus\-specific directory trees\. Each corpus has a dedicated root image directory containing the scraped images referenced by that corpus’ CSV\. Images are stored in standard web formats \(predominantly JPEG, with occasional PNG files\) and are preserved as downloaded from the original websites \(no redistribution\-time recompression or resizing is performed\)\. Any resizing or normalization required for modeling is applied on\-the\-fly within the embedding pipeline rather than baked into the distributed image files\.

Within the speech\-level CSVs, the stored\_image\_filepaths column provides the primary link from speeches to images\. Each cell contains a JSON\-style list of relative filepaths \(relative to the appropriate corpus\-level image root\), one per image associated with that speech\. For example:

\["images/kremlin\_en/000123\_01\.jpg", "images/kremlin\_en/000123\_02\.jpg"\]

When combined with the relevant corpus image root directory, these relative paths uniquely identify the corresponding image files on disk\.

This design allows users to:

- •move from speeches to images by reading stored\_image\_filepaths for a given id, and
- •move from images back to speeches by matching an image filepath against entries in stored\_image\_filepaths \(or by constructing an exploded “image index” table from this column, which we provide in the replication materials\)\.

Because image\-topic assignments, labels, group names, and probability vectors are stored directly in the speech\-level CSVs as list\-valued columns aligned with stored\_image\_filepaths, users can either \(i\) work entirely at the speech level, treating images as attributes of each row, or \(ii\) convert these list\-valued columns into a separate image\-level dataset with one row per image, keyed by id plus an image index \(or by an image\_id identifier when using the auxiliary image\-level tables in the replication package\)\.
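Option \(ii\) can be sketched as follows\. The example assumes that each list\-valued cell parses as valid JSON \(the paper describes the format only as “JSON\-style”, so a different parser may be needed in practice\); the function name explode\_images and the image\_index field are hypothetical\.

```python
import json
from typing import Dict, List

LIST_COLUMNS = [
    "stored_image_filepaths",
    "curated_image_topic_ids",
    "curated_image_topic_labels",
    "curated_image_group_names",
]


def explode_images(row: Dict) -> List[Dict]:
    """Turn one speech-level row into one record per image."""
    lists = {col: json.loads(row.get(col) or "[]") for col in LIST_COLUMNS}
    n_images = len(lists["stored_image_filepaths"])
    if any(len(values) != n_images for values in lists.values()):
        raise ValueError(f"misaligned image lists for id={row['id']}")
    return [
        {"id": row["id"], "image_index": i,
         **{col: lists[col][i] for col in LIST_COLUMNS}}
        for i in range(n_images)
    ]
```

The alignment check guards against silently pairing an image with the wrong topic if a cell was truncated or edited\.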

Auxiliary topic HTML files and probability tables\. In addition to the four main CSVs and the image directories, the replication package includes \(i\) a set of HTML summary files for each BERTopic\[[27](https://arxiv.org/html/2605.15886#bib.bib206)\] topic in each corpus and \(ii\) auxiliary tables containing the full topic probability distributions for both speeches and images \(Section [0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2)\)\. The HTML files are organized by institution and topic ID \(e\.g\., a Kremlin topic k ∈ \{0,…,88\} or a MID topic k ∈ \{0,…,31\}\) and provide a qualitative overview of each topic, including: the top\-ranked English and Russian keywords, the highest\-probability speeches, and a panel of representative images\.

The probability tables complement these qualitative summaries by providing the underlying model outputs in machine\-readable form\. Specifically, they store long\-format topic probability distributions, with one row per \(item, topic\) pair, keyed by the same identifiers used in the main CSVs \(speech id for document\-level tables, and image identifiers with parent speech id for image\-level tables\)\. These auxiliary probability tables are not required for standard replication using the dominant\-topic assignments stored in the curated\_\* variables, but they support multi\-topic analyses, uncertainty\-aware inference, robustness checks under alternative decision rules \(e\.g\., Top\-N or probability\-thresholding\), and downstream modeling strategies that exploit the full posterior\-like topic distribution rather than a single maximum\-probability label\.
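To illustrate the alternative decision rules mentioned above, the sketch below applies Top\-N and probability\-threshold selection to long\-format \(item, topic, probability\) rows\. The function name, argument names, and the tie\-breaking rule \(lower topic ID first\) are assumptions for illustration\.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def topics_by_rule(long_rows: Iterable[Tuple[int, int, float]],
                   top_n: Optional[int] = None,
                   min_prob: Optional[float] = None) -> Dict[int, List[int]]:
    """Select topics per item using a Top-N and/or threshold rule."""
    per_item = defaultdict(list)
    for item_id, topic_id, prob in long_rows:
        per_item[item_id].append((topic_id, prob))

    selected = {}
    for item_id, pairs in per_item.items():
        # Rank by descending probability; break ties by topic ID.
        ranked = sorted(pairs, key=lambda tp: (-tp[1], tp[0]))
        if min_prob is not None:
            ranked = [tp for tp in ranked if tp[1] >= min_prob]
        if top_n is not None:
            ranked = ranked[:top_n]
        selected[item_id] = [topic_id for topic_id, _ in ranked]
    return selected
```

Setting top\_n to 1 with no threshold recovers a dominant\-topic assignment, while a threshold alone yields a variable\-length multi\-topic label set per item\.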

Taken together, the four speech\-level CSVs, corpus\-specific image directories, auxiliary HTML topic summaries, and probability\-table CSVs form a coherent set of data records\. Speech\-level variables \(identifiers, dates, locations, topics\) live in the CSVs; image content lives in separate directories and is linked via relative paths; topic\-level summaries live in HTML files indexed by topic ID; and full probability distributions live in auxiliary tables keyed by the same id\-based joins\. All items are designed to be joined by simple keys \(id, relative image paths, and topic IDs\) and to support both institution\-specific and cross\-lingual analyses\.

### Data Details

Our final dataset consists of four curated corpora of public speeches and press materials from the Russian Presidency \(Kremlin\) and the Ministry of Foreign Affairs \(MID\)\. Together, these corpora contain 35,006 unique text items \(rows\), each corresponding to a speech, interview, briefing, or related communication event with associated metadata, topic labels, and image information\.

Two corpora come from the Presidential \(Kremlin\) website: an English\-language corpus \(kremlin\_en\) and a Russian\-language corpus paired with English translations \(kremlin\_ru\)\. The kremlin\_en corpus contains 10,553 entries, while kremlin\_ru contains 13,340 entries\. Both span from 31 December 1999 to 20 September 2025, covering 27 distinct calendar years and 310 year–month combinations, all of which have at least one recorded speech\. Averaged over this calendar span, the English Kremlin corpus includes approximately 390\.9 speeches per year \(34\.0 per month\), while the Russian–English Kremlin corpus includes approximately 494\.1 speeches per year \(43\.0 per month\)\.
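The calendar counts and averages above follow from simple arithmetic, which the short Python check below reproduces for the two Kremlin corpora\. The helper name month\_cells is ours and purely illustrative\.

```python
from datetime import date


def month_cells(start: date, end: date) -> int:
    """Number of year-month cells between two dates, inclusive."""
    return (end.year - start.year) * 12 + (end.month - start.month) + 1


start, end = date(1999, 12, 31), date(2025, 9, 20)
n_years = end.year - start.year + 1    # distinct calendar years, inclusive
n_months = month_cells(start, end)     # distinct year-month combinations

kremlin_en_per_year = round(10553 / n_years, 1)
kremlin_en_per_month = round(10553 / n_months, 1)
kremlin_ru_per_year = round(13340 / n_years, 1)
kremlin_ru_per_month = round(13340 / n_months, 1)
```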

Likewise, two corpora come from the MID website: an English\-language corpus \(mid\_en\) and a Russian\-language corpus paired with English translations \(mid\_ru\)\. The mid\_en corpus includes 5,057 speeches spanning from 18 March 2004 to 7 October 2025, covering 22 calendar years and 260 year–month cells, of which 256 contain at least one speech\. On average, this corpus includes roughly 229\.9 speeches per year \(19\.5 per month, or 19\.8 per observed year–month cell\)\. The mid\_ru corpus contains 6,056 entries with year and month metadata spanning from 18 March 2004 to 9 October 2025 \(22 calendar years and 264 year–month cells\)\. Averaged over the full calendar span, the Russian–English MID corpus includes approximately 275\.3 speeches per year \(22\.9 per month\)\.

Across all four corpora, speeches are reasonably long: mean speech length, measured by word\_count, is approximately 1,555 words\. Within the individual corpora, mean speech lengths range from 1,099\.4 words in mid\_ru \(median = 612\.5, maximum = 16,834\) to 1,904\.7 words in kremlin\_ru \(median = 763\.0, maximum = 33,352\)\. The English Kremlin corpus, kremlin\_en, has a mean word\_count of 1,453\.2 \(median = 806\.0, maximum = 39,898\), while the English MID corpus, mid\_en, has a mean of 1,392\.3 words \(median = 786\.0, maximum = 19,895\)\. These distributions indicate that the vast majority of items are full speeches or detailed statements rather than short headlines\.

Image content is pervasive but varies across corpora\. In kremlin\_en, the mean number of associated images \(images\_count\) is 3\.66 \(median = 2, maximum = 104\), and 67\.4% of speeches have at least one associated image\. In kremlin\_ru, the mean images\_count is 3\.36 \(median = 2, maximum = 104\), with 73\.0% of speeches linked to at least one image\. The MID corpora have fewer images on average but still exhibit substantial coverage: mid\_en has a mean of 0\.82 images per speech \(median = 1, maximum = 20\) and 71\.0% of entries with at least one image, while mid\_ru has a mean of 0\.74 images \(median = 1, maximum = 19\) and 64\.5% of entries with at least one image\. Aggregating across corpora, approximately 69\.6% of all 35,006 speeches have at least one associated image\.

Geolocation information is also near\-complete\. In kremlin\_en, 9,830 of 10,553 entries \(93\.2%\) contain non\-empty location strings, and 9,827 entries \(93\.1%\) have non\-missing latitude and longitude\. The Kremlin Russian corpus \(kremlin\_ru\) exhibits similar coverage, with 12,684 of 13,340 speeches \(95\.1%\) having both non\-empty location strings and valid geo\-coordinates\. MID corpora exhibit slightly higher geolocation completeness: mid\_en contains non\-empty location strings for 4,838 of 5,057 entries \(95\.7%\), with 4,837 entries \(95\.7%\) having non\-missing coordinates, and mid\_ru contains non\-empty locations and non\-missing coordinates for 5,742 of 6,056 entries \(94\.8%\)\. Taken together, 33,090 of 35,006 entries \(94\.5%\) in our combined dataset have non\-missing latitude and longitude\.

Table 4 summarizes the percentage of missing observations for key variables in each corpus\. Missingness in core analysis variables is modest and well\-characterized: identifiers and URLs, word\-count fields, date components \(date, year, month, day\), curated text\-topic variables, image\-count fields, and curated image\-topic fields exhibit complete coverage across the four corpora \(0% missingness throughout\)\. The raw text fields \(stored in separate companion files due to size\) are also nearly complete: full\_text is missing for only 0\.04% of kremlin\_en, 0\.04% of kremlin\_ru, 0\.02% of mid\_en, and 0\.12% of mid\_ru; and full\_text\_english is missing for 0\.04% of kremlin\_ru and 0\.12% of mid\_ru\. Location and geo\-coordinate fields are missing for only 4–7% of entries in each corpus, yielding complete latitude/longitude coordinates for approximately 94\.5% of all speeches\. The largest pockets of missingness occur in optional or source\-dependent fields: page\_summary is missing for roughly 42–44% of speeches in the Kremlin corpora, declared\_topics is missing for about 38% of Kremlin speeches, and the Kremlin declared\_geography and declared\_persons fields are missing for roughly two\-thirds of entries\. Image\-caption fields exhibit near\-complete coverage in the Kremlin corpora after accounting for speeches with zero images, but remain missing for a majority of MID speeches \(image\_captions missingness of 70\.79% in mid\_en and 64\.13% in mid\_ru\)\.

Table 4: Missingness \(%\) for all variables in each corpus\. Missingness is the share of blank/NA cells per column; values such as 0 or \[\] are treated as observed \(not missing\)\. For the large raw text fields \(full\_text and full\_text\_english\), percentages are computed from the companion text files\.

| Variable | kremlin\_en | kremlin\_ru | mid\_en | mid\_ru |
| --- | --- | --- | --- | --- |
| id | 0\.00 | 0\.00 | 0\.00 | 0\.00 |
| url | 0\.00 | 0\.00 | 0\.00 | 0\.00 |
| full\_text | 0\.04 | 0\.04 | 0\.02 | 0\.12 |
| full\_text\_english | – | 0\.04 | – | 0\.12 |
| full\_text\_word\_count | 0\.00 | 0\.00 | 0\.00 | 0\.00 |
| date | 0\.00 | 0\.00 | 0\.00 | 0\.00 |
| year | 0\.00 | 0\.00 | 0\.00 | 0\.00 |
| month | 0\.00 | 0\.00 | 0\.00 | 0\.00 |
| day | 0\.00 | 0\.00 | 0\.00 | 0\.00 |
| time | 0\.00 | 0\.00 | 0\.00 | 0\.00 |
| location | 6\.88 | 4\.92 | 4\.35 | 5\.18 |
| location\_english | – | 4\.92 | – | – |
| latitude | 6\.88 | 4\.92 | 4\.35 | 5\.18 |
| longitude | 6\.88 | 4\.92 | 4\.35 | 5\.18 |
| page\_summary | 42\.00 | 43\.73 | – | – |
| page\_summary\_english | – | 43\.73 | – | – |
| speakers | 0\.00 | 0\.00 | 0\.00 | 0\.00 |
| declared\_geography | 69\.93 | 67\.25 | – | – |
| declared\_geography\_english | – | 67\.25 | – | – |
| declared\_topics | 37\.81 | 37\.80 | – | – |
| declared\_topics\_english | – | 37\.80 | – | – |
| declared\_persons | 70\.73 | 66\.50 | – | – |
| declared\_persons\_english | – | 66\.50 | – | – |
| curated\_topic\_id | 0\.00 | 0\.00 | 0\.00 | 0\.00 |
| curated\_text\_topic\_label | 0\.00 | 0\.00 | 0\.00 | 0\.00 |
| curated\_text\_topic\_group | 0\.00 | 0\.00 | 0\.00 | 0\.00 |
| curated\_topic\_probability | 0\.00 | 0\.00 | 0\.00 | 0\.00 |
| stored\_image\_filepaths | 0\.00 | 0\.00 | 0\.00 | 0\.00 |
| saved\_images\_count | 0\.00 | 0\.00 | 0\.00 | 0\.00 |
| declared\_images\_count | 0\.00 | 0\.00 | 0\.00 | 0\.00 |
| missing\_images\_count | 0\.00 | 0\.00 | 0\.00 | 0\.00 |
| image\_captions | 0\.25 | 0\.02 | 70\.79 | 64\.13 |
| image\_captions\_english | – | 0\.04 | – | 64\.13 |
| curated\_image\_topic\_ids | 0\.00 | 0\.00 | 0\.00 | 0\.00 |
| curated\_image\_topic\_labels | 0\.00 | 0\.00 | 0\.00 | 0\.00 |
| curated\_image\_group\_names | 0\.00 | 0\.00 | 0\.00 | 0\.00 |
| curated\_image\_topic\_probabilities | 0\.00 | 0\.00 | 0\.00 | 0\.00 |
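The missingness rule in the caption of Table 4 \(blank/NA cells count as missing, while 0 and \[\] count as observed\) can be expressed as the following sketch\. The set of strings treated as NA and the function names are assumptions; the paper does not list its exact NA markers\.

```python
import math
from typing import Dict, Iterable, List

# Assumed NA markers; values such as "0" and "[]" are deliberately excluded.
NA_STRINGS = {"", "na", "nan", "null", "none"}


def is_missing(cell) -> bool:
    """True for None, float NaN, and blank/NA-like strings."""
    if cell is None:
        return True
    if isinstance(cell, float) and math.isnan(cell):
        return True
    return str(cell).strip().lower() in NA_STRINGS


def missingness_pct(rows: List[Dict], columns: Iterable[str]) -> Dict[str, float]:
    """Percentage of missing cells per column, rounded to two decimals."""
    n = len(rows)
    if n == 0:
        return {col: 0.0 for col in columns}
    return {
        col: round(100 * sum(is_missing(row.get(col)) for row in rows) / n, 2)
        for col in columns
    }
```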

## 1Technical Validation

Our technical validation efforts assess the accuracy of \(i\) our dataset’s automated location extraction and geocoding routines and \(ii\) the expert labels assigned to the BERTopic\[[27](https://arxiv.org/html/2605.15886#bib.bib206)\] results\.

### 1\.1Topic Validation

This subsection validates the relationship between our estimated topic structure and available human\-declared annotations\. We focus this validation exercise on speeches and associated topics for our Kremlin English and Kremlin Russian datasets, given the unique availability of Kremlin\-assigned thematic tags as ground truth within these corpora\. Recall that we separately extracted and included these Kremlin\-assigned tags as declared\_topics in our final Kremlin datasets\. The Kremlin’s geographic and person tags \(declared\_geography, declared\_persons\) are outside the scope of the present validation\.

The declared themes, which we treat as ground truth, are multi\-label: a document can contain multiple items in declared\_topics\. Our topic models, by contrast, produce a full probability distribution over K=89 learned topics for every document\. Our validation exercise therefore asks a concrete question: *when a document is annotated with one or more declared themes, does the model assign high probability mass to learned topics that correspond to those same themes?*

Because declared themes are strings \(no numeric IDs\), we require a mapping from learned topics to declared themes\. We use a label\-only mapping: we compare only our expert\-assigned textual BERTopic\[[27](https://arxiv.org/html/2605.15886#bib.bib206)\] labels and the Kremlin’s own textual declared\-theme labels and assign each learned topic to exactly one Kremlin\-based declared theme\. Each of these topic\-to\-theme assignments receives a qualitative confidence label \(high, medium, low\)\. The mapping is reviewed for reasonableness but not manually edited\. We then report validation results for three mapping subsets: All \(high\+medium\+low\), High\+Medium, and High only\.

Two complementary views are reported throughout:

- **All-themes view:** evaluates over *all* declared themes observed in the declared subset; themes not covered by the mapping are forced to receive a predicted probability of 0. This view reflects both predictive performance and mapping coverage.
- **Mapped-only view:** evaluates only on declared themes that are covered by our mapping subset; any document whose declared-theme list becomes empty after restricting to mapped themes is excluded. This view isolates performance *conditional on coverage*.
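The Mapped-only restriction can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `mapped_only_view` and the toy theme strings are hypothetical and are not taken from the replication code.

```python
def mapped_only_view(declared_by_doc, mapped_themes):
    """Restrict each document's declared themes to those covered by the mapping.

    Documents whose declared-theme set becomes empty are dropped, mirroring
    the Mapped-only view. Names and inputs are illustrative assumptions.
    """
    mapped = set(mapped_themes)
    restricted = {}
    for doc_id, themes in declared_by_doc.items():
        kept = set(themes) & mapped
        if kept:  # exclude documents left without any mapped theme
            restricted[doc_id] = kept
    return restricted


# Toy input with made-up theme labels.
toy_declared = {"d1": {"economy", "culture"}, "d2": {"sport"}, "d3": {"energy"}}
toy_view = mapped_only_view(toy_declared, mapped_themes={"economy", "energy"})
```

In this toy case, document `d2` is excluded because its only declared theme is not covered by the mapping.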

Because of the multi-label and multi-class nature of our relevant inputs, we favor accuracy metrics suited to these non-binary outcome sets, such as subset accuracy, Hamming loss, and ranking loss [[20](https://arxiv.org/html/2605.15886#bib.bib6)]. Across our ensuing validation tables, higher subset accuracy and F1 indicate better alignment; lower Hamming loss and ranking loss indicate better alignment. In practice, subset accuracy is a strict metric (it requires an exact match to the full set of declared themes for a document), so it is expected to be noticeably lower than single-label accuracy in classification settings such as our own, especially given (i) many possible labels, (ii) strong label imbalance, and (iii) multi-thematic documents.
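To make these metric definitions concrete, the following minimal sketch computes subset accuracy, Hamming loss, and micro F1 on a toy example using their standard multi-label definitions. The function names and toy labels are hypothetical; this is not the evaluation code used to produce the tables below.

```python
def subset_accuracy(true_sets, pred_sets):
    """Share of documents whose predicted label set matches the declared set exactly."""
    hits = sum(1 for t, p in zip(true_sets, pred_sets) if set(t) == set(p))
    return hits / len(true_sets)


def hamming_loss(true_sets, pred_sets, n_labels):
    """Share of (document, label) cells on which prediction and truth disagree."""
    wrong = sum(len(set(t) ^ set(p)) for t, p in zip(true_sets, pred_sets))
    return wrong / (len(true_sets) * n_labels)


def micro_f1(true_sets, pred_sets):
    """F1 computed from true/false positives and negatives pooled over all labels."""
    tp = sum(len(set(t) & set(p)) for t, p in zip(true_sets, pred_sets))
    fp = sum(len(set(p) - set(t)) for t, p in zip(true_sets, pred_sets))
    fn = sum(len(set(t) - set(p)) for t, p in zip(true_sets, pred_sets))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data: three documents, a label space of five made-up themes.
toy_true = [{"economy", "energy"}, {"defence"}, {"culture"}]
toy_pred = [{"economy", "energy"}, {"defence"}, {"sport"}]

acc = subset_accuracy(toy_true, toy_pred)
ham = hamming_loss(toy_true, toy_pred, n_labels=5)
f1 = micro_f1(toy_true, toy_pred)
```

A single wrong label in the third document costs a full miss under subset accuracy but only two of fifteen cells under Hamming loss, which illustrates why the exact-set metric is the stricter one.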

#### 1.1.1 Kremlin English: data integrity and evaluation subset

Kremlin English contains 10,553 documents in total. Declared themes are present only for a subset: 6,563 documents have a non-empty `declared_topics` field. Because `declared_topics` provides the ground-truth labels used in the validation, all Kremlin English validation results below are computed on these 6,563 documents.

Our raw topic probabilities are stored in a long-format table of (`id`, `topic_id`, `probability_score`) rows, where $\texttt{topic\_id}\in\{0,\dots,88\}$ (89 topics). We performed strict quality-control checks before running validation:

- **Row-count and coverage:** the probability table contains exactly $10{,}553\times 89=939{,}217$ rows, confirming complete topic coverage for all documents.
- **Per-document completeness:** every document has exactly 89 topic probabilities with topic IDs 0–88 (no missing topics and no duplicates).
- **Probability sanity:** all probabilities fall in $[0,1]$ and per-document probability sums are essentially 1.0 (no renormalization required).
- **ID alignment on evaluation subset:** all 6,563 documents with declared themes appear in the topic-probability table (no declared-theme IDs missing from probabilities).
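The first three checks in the list above can be sketched as follows for a long-format table held in memory as (`id`, `topic_id`, `probability_score`) tuples. The function name, the tolerance `tol`, and the toy rows are assumptions introduced for illustration; the released pipeline may implement these checks differently.

```python
import math
from collections import defaultdict


def check_probability_table(rows, n_topics=89, tol=1e-6):
    """Run row-count, completeness, range, and sum checks on long-format rows.

    `rows` is a list of (id, topic_id, probability_score) tuples. The tolerance
    `tol` for "essentially 1.0" is an illustrative choice, not from the paper.
    """
    by_doc = defaultdict(dict)
    duplicates = 0
    for doc_id, topic_id, prob in rows:
        if topic_id in by_doc[doc_id]:
            duplicates += 1
        by_doc[doc_id][topic_id] = prob

    expected = set(range(n_topics))
    return {
        "row_count_ok": len(rows) == len(by_doc) * n_topics,
        "complete": duplicates == 0
        and all(set(t) == expected for t in by_doc.values()),
        "in_range": all(0.0 <= p <= 1.0 for t in by_doc.values() for p in t.values()),
        "sums_ok": all(
            abs(math.fsum(t.values()) - 1.0) <= tol for t in by_doc.values()
        ),
    }


# Toy table: two documents, three topics each.
toy_rows = [
    ("a", 0, 0.7), ("a", 1, 0.2), ("a", 2, 0.1),
    ("b", 0, 0.1), ("b", 1, 0.1), ("b", 2, 0.8),
]
toy_report = check_probability_table(toy_rows, n_topics=3)
```

The ID-alignment check then reduces to confirming that the set of document IDs with declared themes is a subset of the IDs present in the probability table.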

Across the 6,563 documents with `declared_topics`, we observe 88 unique declared theme labels.

#### 1.1.2 Kremlin English: raw-topic (multi-label) validation results

For each document $d$, let $p_{d,k}$ be the probability that $d$ belongs to learned topic $k\in\{0,\dots,88\}$. The topic-to-theme mapping assigns each learned topic $k$ to one declared theme $m(k)$. Under any mapping subset, we convert a document’s 89-topic probability vector into scores over declared themes by summing probabilities for all topics mapped to the same theme:

$$\hat{p}_{d,t}\;=\;\sum_{k:\,m(k)=t}p_{d,k}. \tag{1}$$

To produce binary predictions for multi-label metrics without choosing an arbitrary threshold, we use a Top-$N$ rule: if document $d$ has $N_{d}$ declared themes, we predict the $N_{d}$ themes with the largest $\hat{p}_{d,t}$ values.
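As a worked illustration of Eq. (1) and the Top-$N$ rule, the sketch below aggregates one document’s topic probabilities into theme scores and then selects as many themes as the document declares. All names, the toy mapping, and the alphabetical tie-break between equal theme scores are assumptions for illustration only.

```python
from collections import defaultdict


def theme_scores(topic_probs, topic_to_theme):
    """Eq. (1): sum p_{d,k} over all topics k with m(k) = t.

    Topics absent from `topic_to_theme` (unmapped under the chosen mapping
    subset) contribute to no theme.
    """
    scores = defaultdict(float)
    for topic_id, prob in topic_probs.items():
        theme = topic_to_theme.get(topic_id)
        if theme is not None:
            scores[theme] += prob
    return dict(scores)


def top_n_themes(scores, n):
    """Predict the n highest-scoring themes (ties broken alphabetically, an assumption)."""
    ranked = sorted(scores, key=lambda theme: (-scores[theme], theme))
    return set(ranked[:n])


# Toy document with four topics; topic 3 is unmapped in this hypothetical mapping.
toy_probs = {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125}
toy_mapping = {0: "economy", 1: "energy", 2: "economy"}
toy_scores = theme_scores(toy_probs, toy_mapping)

toy_declared = {"economy", "culture"}  # N_d = 2
toy_predicted = top_n_themes(toy_scores, n=len(toy_declared))
```

Here the declared theme "culture" has no mapped topic, so it can never be predicted; this is the coverage penalty that the All-themes view retains and the Mapped-only view removes.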

Table [5](https://arxiv.org/html/2605.15886#S1.T5) reports multi-label alignment between these predicted theme-scores and the declared-theme lists. The All-themes rows include *all* 88 observed declared themes, which means that reduced mapping coverage (especially in High-only) directly lowers performance by forcing unmapped themes to have predicted score 0. The Mapped-only rows remove this coverage penalty by restricting evaluation to mapped themes (and excluding documents that become label-empty), giving a clearer view of performance when the mapping applies.

Table 5: Kremlin English Raw Topic Validation (multi-label). The All-themes view evaluates over 88 observed declared themes; the Mapped-only view evaluates only themes covered by the mapping subset (coverage-aware).

| View | Mapping | $n$ docs | Subset Acc. | Hamming Loss | Ranking Loss | F1 micro | F1 macro |
|---|---|---|---|---|---|---|---|
| All-themes (88) | All | 6563 | 0.3181 | 0.02225 | 0.2559 | 0.3940 | 0.1897 |
| All-themes (88) | High+Medium | 6563 | 0.3143 | 0.02236 | 0.2610 | 0.3910 | 0.1901 |
| All-themes (88) | High only | 6563 | 0.2770 | 0.02425 | 0.3540 | 0.3394 | 0.1747 |
| Mapped-only | All | 6172 | 0.3983 | 0.03585 | 0.1876 | 0.4547 | 0.3861 |
| Mapped-only | High+Medium | 6172 | 0.3945 | 0.03607 | 0.1931 | 0.4514 | 0.3870 |
| Mapped-only | High only | 5887 | 0.3786 | 0.04504 | 0.2233 | 0.4337 | 0.4595 |

Several patterns are noticeable in Table [5](https://arxiv.org/html/2605.15886#S1.T5). First, the Mapped-only view yields substantially higher subset accuracy than the All-themes view, showing that mapping coverage is a major driver of the strict exact-set metric. Second, restricting from All to High+Medium produces only small changes, suggesting that the low-confidence assignments are not strongly determining outcomes. Third, High-only reduces coverage and increases ranking loss, which indicates that the probability mass in the model distribution is spread across topics that often map to themes outside the high-confidence subset; this is expected when using a conservative mapping. More broadly, note that macro F1 is systematically lower than micro F1 in the All-themes view; this is consistent with a long-tailed theme distribution where rare declared themes contribute equally to macro averaging and are therefore more difficult to recover under any fixed mapping and Top-$N$ decision rule [[20](https://arxiv.org/html/2605.15886#bib.bib6)].

#### 1.1.3 Kremlin English: dominant-topic (top-1) validation results

A complementary check asks whether the model’s single most probable topic “points” to a plausible declared theme. For each document $d$, let $k^{\ast}(d)=\arg\max_{k}p_{d,k}$ (ties broken deterministically by smallest topic ID). The dominant-topic prediction is then $\hat{t}(d)=m(k^{\ast}(d))$, and we count a Top-1 hit when $\hat{t}(d)$ appears in the document’s `declared_topics` list. We also report micro/macro F1 when treating the prediction as single-label and the ground truth as multi-label (standard practice for evaluating top-1 theme identification against multi-label annotations). “Mapped docs” counts documents whose dominant topic is covered by the mapping subset.
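The dominant-topic rule and the two hit rates can be sketched as follows. The tie-break by smallest topic ID follows the text; the function names, the toy mapping, and the toy documents are hypothetical.

```python
def dominant_topic(topic_probs):
    """argmax_k p_{d,k}, with ties broken by the smallest topic ID."""
    return min(topic_probs, key=lambda k: (-topic_probs[k], k))


def top1_hit_rates(docs, topic_to_theme):
    """Compute Top-1 hit rates over all declared docs and over mapped docs only.

    `docs` is a list of (topic_probs, declared_themes) pairs. A document is
    "mapped" when its dominant topic appears in `topic_to_theme`.
    """
    n_mapped = 0
    hits = 0
    for topic_probs, declared in docs:
        theme = topic_to_theme.get(dominant_topic(topic_probs))
        if theme is None:
            continue  # dominant topic not covered by this mapping subset
        n_mapped += 1
        hits += theme in declared
    return {
        "n_declared": len(docs),
        "n_mapped": n_mapped,
        "hit_all": hits / len(docs),
        "hit_mapped": hits / n_mapped if n_mapped else float("nan"),
    }


# Toy data: topic 2 is unmapped in this hypothetical mapping.
toy_mapping = {0: "economy", 1: "energy"}
toy_docs = [
    ({0: 0.6, 1: 0.4}, {"economy"}),            # hit
    ({0: 0.5, 1: 0.5}, {"energy"}),             # tie -> topic 0 -> miss
    ({0: 0.1, 2: 0.9}, {"culture"}),            # dominant topic unmapped
    ({0: 0.2, 1: 0.8}, {"energy", "economy"}),  # hit
]
toy_rates = top1_hit_rates(toy_docs, toy_mapping)
```

By construction, the hit rate over all declared documents equals the mapped-only hit rate multiplied by the share of documents that are mapped.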

Table 6: Kremlin English Dominant Topic Validation (top-1). “Mapped docs” counts documents whose dominant topic is covered by the mapping subset.

| Mapping | $n$ declared docs | $n$ mapped docs | Top-1 hit (all) | Top-1 hit (mapped) | F1 micro | F1 macro |
|---|---|---|---|---|---|---|
| All | 6563 | 6563 | 0.5947 | 0.5947 | 0.4548 | 0.2119 |
| High+Medium | 6563 | 6453 | 0.5787 | 0.5886 | 0.4489 | 0.2120 |
| High only | 6563 | 3864 | 0.3399 | 0.5774 | 0.4373 | 0.2122 |

The dominant\-topic table highlights the same coverage trade\-off: Top\-1 hit computed over*all*declared documents declines as fewer dominant topics are considered “mapped\.” However, when we condition on mapped documents \(Top\-1 hit \(mapped\)\), the hit rate remains similar across mapping variants, indicating that the dominant\-topic signal is comparatively stable once the mapping applies\.

**Kremlin English take-home implications.** Taken together, Tables [5](https://arxiv.org/html/2605.15886#S1.T5)–[6](https://arxiv.org/html/2605.15886#S1.T6) suggest meaningful alignment between learned topics and Kremlin-declared themes, while also clarifying the limits of exact-set recovery in this multi-label setting. In particular, the dominant-topic hit rate near 0.59 indicates that the model’s single most probable topic often corresponds to at least one declared theme, which supports the use of our curated labels as coarse summaries. At the same time, the strict subset accuracy in the raw-topic validation is lower, even in the Mapped-only view. This is expected given (i) multi-thematic documents, (ii) a large theme space, and (iii) the conservative, label-only mapping that assigns each learned topic to a single declared theme without manual optimization. These magnitudes are consistent with the general pattern emphasized in prior political text-as-data work on multi-label prediction: strict exact-match metrics tend to be modest relative to overlap-based measures (e.g., micro F1) when labels are numerous and imbalanced [[20](https://arxiv.org/html/2605.15886#bib.bib6)].

#### 1.1.4 Kremlin Russian: data integrity and evaluation subset

Kremlin Russian contains 13,340 documents in total. Declared themes are present only for a subset: 8,298 documents have a non-empty declared-themes field (we use the English-declared-theme strings, `declared_topics_english`, for consistency with the validation mapping). Because declared themes provide the ground-truth labels used in the validation, all Kremlin Russian validation results below are computed on these 8,298 documents.

Our raw topic probabilities are stored in a long-format table of (`id`, `topic_id`, `probability_score`) rows, where $\texttt{topic\_id}\in\{0,\dots,88\}$ (89 topics). We performed strict quality-control checks before running validation:

- **Row-count and coverage:** the probability table contains exactly $13{,}340\times 89=1{,}187{,}260$ rows, confirming complete topic coverage for all documents.
- **Per-document completeness:** every document has exactly 89 topic probabilities with topic IDs 0–88 (no missing topics and no duplicates).
- **Probability sanity:** all probabilities fall in $[0,1]$ and per-document probability sums are essentially 1.0 (mean $=1.000000001$, min $=0.9999998799$, max $=1.000000119$; no renormalization required).
- **ID alignment on evaluation subset:** all 8,298 documents with declared themes appear in the topic-probability table (no declared-theme IDs missing from probabilities). The remaining 5,042 documents have probabilities but no declared themes and are therefore excluded from validation.

Across the 8,298 documents with declared themes, we observe 89 unique declared theme labels.

#### 1.1.5 Kremlin Russian: mapping coverage and what it implies

For Kremlin Russian, mapping coverage is more limited than in Kremlin English\. Under the All variant, only 48 of the 89 observed declared themes are covered by at least one learned topic; High\+Medium covers 47; and High\-only covers 35\. This matters for interpretation: in the All\-themes view, unmapped themes necessarily receive predicted probability 0, so performance reflects both \(i\) how well probabilities concentrate on mapped themes that match declared labels and \(ii\) how much of the declared\-theme space is covered by the mapping\.

#### 1.1.6 Kremlin Russian: raw-topic (multi-label) validation results

We apply exactly the same raw-topic validation procedure used for Kremlin English. Specifically, we sum topic probabilities into declared-theme scores via Eq. [1](https://arxiv.org/html/2605.15886#S1.E1) and apply the Top-$N$ binarization rule (predict the same number of themes as declared for each document). Table [7](https://arxiv.org/html/2605.15886#S1.T7) reports both All-themes and Mapped-only views.

Table 7: Kremlin Russian Raw Topic Validation (multi-label). The All-themes view evaluates over 89 observed declared themes; the Mapped-only view evaluates only themes covered by the mapping subset (coverage-aware).

| View | Mapping | $n$ docs | Subset Acc. | Hamming Loss | Ranking Loss | F1 micro | F1 macro |
|---|---|---|---|---|---|---|---|
| All-themes (89) | All | 8298 | 0.2115 | 0.02788 | 0.2587 | 0.2831 | 0.1816 |
| All-themes (89) | High+Medium | 8298 | 0.2116 | 0.02845 | 0.2646 | 0.2683 | 0.1789 |
| All-themes (89) | High only | 8298 | 0.1309 | 0.03078 | 0.3710 | 0.2085 | 0.1535 |
| Mapped-only | All | 7941 | 0.2507 | 0.04417 | 0.2088 | 0.3141 | 0.3328 |
| Mapped-only | High+Medium | 7939 | 0.2517 | 0.04565 | 0.2202 | 0.3033 | 0.3371 |
| Mapped-only | High only | 7524 | 0.1802 | 0.06086 | 0.2993 | 0.2585 | 0.3927 |

These results again show the coverage trade-off clearly. The All-themes view is lower because nearly half of the declared themes are unmapped (only 48 of 89 are covered, even in the All variant). When we restrict to Mapped-only themes, subset accuracy increases because evaluation excludes unmapped labels (and excludes documents that lose all labels). As in Kremlin English, All vs. High+Medium is similar, indicating that excluding low-confidence mappings does not substantially change outcomes. The High-only variant reduces coverage most severely and increases ranking loss, indicating that much of the model’s probability mass often lies on topics whose mappings were excluded under this strict subset; this is expected when restricting to only the most conservative assignments. As in the English corpus, differences between micro and macro F1 should be interpreted in light of imbalanced theme frequencies and the strictness of exact-set matching under many possible labels [[20](https://arxiv.org/html/2605.15886#bib.bib6)].

#### 1.1.7 Kremlin Russian: dominant-topic (top-1) validation results

We also report the dominant\-topic validation for Kremlin Russian using the same rule as in Kremlin English: for each document, we select the most probable learned topic, map it to a declared theme \(if mapped\), and count a Top\-1 hit when that mapped theme appears in the document’s declared\-theme list\. “Mapped docs” counts documents whose dominant topic is covered by the mapping subset\.

Table 8: Kremlin Russian Dominant Topic Validation (top-1). “Mapped docs” counts documents whose dominant topic is covered by the mapping subset.

| Mapping | $n$ declared docs | $n$ mapped docs | Top-1 hit (all) | Top-1 hit (mapped) | F1 micro | F1 macro |
|---|---|---|---|---|---|---|
| All | 8298 | 8298 | 0.3835 | 0.3835 | 0.2809 | 0.1975 |
| High+Medium | 8298 | 8186 | 0.3780 | 0.3832 | 0.2803 | 0.1988 |
| High only | 8298 | 2881 | 0.2370 | 0.6827 | 0.5003 | 0.2290 |

This table makes the coverage/precision trade-off especially transparent. When computed over all declared documents, Top-1 hit declines as mapping becomes more conservative because fewer dominant topics are treated as mapped. However, when we condition on mapped docs, the Top-1 hit rate rises sharply in the High-only setting (0.6827), indicating that the dominant-topic mappings that survive the strictest confidence filter are substantially more reliable when they apply. The jump in micro F1 under High-only reflects the same phenomenon: it is computed on a much smaller mapped subset (2,881 documents), but within that subset the dominant-topic prediction aligns with declared labels much more often.

**Overall implications for topic-model accuracy and use.** Across both corpora, three general conclusions follow. First, *coverage* is a first-order constraint: metrics computed in the All-themes view conflate model alignment with the extent to which learned topics can be meaningfully mapped into the declared taxonomy, whereas the Mapped-only view isolates performance conditional on that mapping applying. Second, dominant-topic hit rates indicate that the single most probable learned topic often corresponds to at least one declared theme (especially in the high-confidence mapped subset), which supports using our curated topic labels as high-level summaries and for coarse stratification. Third, stricter set-based agreement measures (subset accuracy) are expected to be lower in this setting, because they require exact recovery of multi-label theme sets for long, multi-thematic texts and because our mapping is intentionally conservative (label-only, one declared theme per learned topic, and not tuned to optimize predictive metrics). This pattern (moderate overlap-based agreement alongside lower exact-set agreement) is consistent with the behavior of multi-label evaluation metrics in political text-as-data settings with many labels and strong imbalance [[20](https://arxiv.org/html/2605.15886#bib.bib6)]. For users, the practical implication is that our topic annotations are most reliable for descriptive organization and broad categorization, while applications requiring high topical purity should incorporate probability thresholds, multi-topic representations, and robustness checks using the full topic probability distributions provided in the supplemental materials.

#### 1.1.8 Geolocation validation

To validate the accuracy of our automated geocoding procedure (Section [0.2.3](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS3)), we compare machine-assigned coordinates against an independent manual reference based on GeoNames. We draw a random sample of 1,000 speeches from the final processed data, stratified evenly across the four corpora ($n=250$ each): Kremlin EN, Kremlin RU$\rightarrow$EN, MID EN, and MID RU$\rightarrow$EN. For each sampled speech we retain the speech identifier (`id`), the source URL (`url`), the extracted location strings (including both English and Russian forms when available), and the automated coordinates (`machine_latitude`, `machine_longitude`). Using only the extracted location strings, we then manually look up and assign GeoNames-based coordinates for the same extracted locations, recording `manual_latitude` and `manual_longitude` (and, when available, the corresponding GeoNames name and identifier).

For every sampled row with both coordinate sources present, we compute the great-circle (Haversine) distance in kilometers between the automated point $(\phi_{m},\lambda_{m})$ and the manual point $(\phi_{h},\lambda_{h})$ and store it as `distance_km`. We evaluate agreement under several practical tolerance thresholds by defining

$$\mathrm{match}_{\tau}=\mathbb{I}\{\texttt{distance\_km}\leq\tau\},\qquad\tau\in\{5,10,50,100\}\ \text{km},$$

and we report match rates *among rows with both machine and manual coordinates present* (“Both coords”).
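The distance and threshold indicators can be sketched with the standard Haversine formula. The Earth radius constant (the mean radius of about 6371 km) and the function names are assumptions; the paper does not state which radius its pipeline uses.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against tiny floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def match_flags(distance_km, thresholds=(5, 10, 50, 100)):
    """Indicator match_tau = 1{distance_km <= tau} for each tolerance tau (in km)."""
    return {tau: distance_km <= tau for tau in thresholds}


# Example: approximate city-centre coordinates for Moscow and St. Petersburg.
moscow_spb_km = haversine_km(55.7558, 37.6173, 59.9343, 30.3351)
flags = match_flags(7.5)
```

Match rates per corpus are then the mean of each indicator over rows where both coordinate pairs are present.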

Table [9](https://arxiv.org/html/2605.15886#S1.T9) summarizes coverage and agreement by corpus. Coordinate availability is near-complete: 99.2% of the Kremlin RU$\rightarrow$EN sample (248/250) and 99.6% of the Kremlin EN sample (249/250) contain both machine and manual coordinates, and both MID samples contain complete coverage (250/250). For the Kremlin corpora, median distances are well below 1 km (0.50 km for Kremlin RU$\rightarrow$EN; 0.67 km for Kremlin EN), indicating close alignment between the automated and manual coordinates for the typical case; match rates are approximately 72% within 5 km and 93% to 96% within 100 km. Mean distances are substantially larger than medians, reflecting a small number of large-error outliers that heavily influence the mean. For the MID corpora, median distances are larger (14.08 km in both MID samples), consistent with more frequent ambiguity in MID location strings (e.g., locations described indirectly in text, multi-venue itineraries, or references to broader regions rather than a single city). Agreement remains high at broader tolerances: 83.6% (MID RU$\rightarrow$EN) and 98.8% (MID EN) fall within 50 km, and 84.8% (MID RU$\rightarrow$EN) and 98.8% (MID EN) fall within 100 km. Row-level distances and threshold-based match indicators are included in the released geolocation validation files in the replication materials.

Table 9: Geolocation validation summary ($n=250$ per corpus). “Both coords” indicates rows with non-missing machine and manual latitude/longitude. Match rates are computed among rows with both coordinates present. Distances are great-circle (Haversine) distances in kilometers.

| Corpus | $n$ docs | Both coords | % Both | Median km | Mean km | % ≤ 5 | % ≤ 10 | % ≤ 50 | % ≤ 100 |
|---|---|---|---|---|---|---|---|---|---|
| Kremlin RU$\rightarrow$EN | 250 | 248 | 99.2 | 0.50 | 219.84 | 71.77 | 74.60 | 89.52 | 93.15 |
| Kremlin EN | 250 | 249 | 99.6 | 0.67 | 205.98 | 72.29 | 73.90 | 91.97 | 95.98 |
| MID RU$\rightarrow$EN | 250 | 250 | 100.0 | 14.08 | 883.00 | 34.40 | 36.80 | 83.60 | 84.80 |
| MID EN | 250 | 250 | 100.0 | 14.08 | 69.42 | 38.40 | 45.20 | 98.80 | 98.80 |

## Usage Notes

Users interested in comparing English- and Russian-language speeches within one or both of our executive branch institutional corpora (i.e., Kremlin or MID) should make use of the standardized identifier linkages provided under each dataset’s `id` variable. These identifiers allow users to reliably align original Russian texts with their corresponding English-language versions when conducting cross-lingual or comparative analyses within each institutional corpus, though not across our Kremlin and MID datasets. When merging Russian and English entries for the MID or Kremlin datasets, we recommend relying on these standardized IDs rather than titles or dates alone, as the latter may vary slightly across language versions or publication formats. Recall too that, for both the MID and the Kremlin, each language-specific dataset contains a small number of speeches that have no counterpart in the other language’s dataset.
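A minimal sketch of such an ID-based alignment is given below. It assumes that a Russian record and its English counterpart carry the same `id` value within an institutional corpus; users should confirm the exact linkage scheme against the dataset documentation. The function name and toy rows are hypothetical.

```python
def align_by_id(russian_rows, english_rows):
    """Pair Russian and English records that share an `id`, and list unmatched IDs.

    Assumes (for illustration) that linked records share an identical `id`
    value within one institutional corpus.
    """
    english_by_id = {row["id"]: row for row in english_rows}
    russian_ids = {row["id"] for row in russian_rows}

    pairs = [
        (row, english_by_id[row["id"]])
        for row in russian_rows
        if row["id"] in english_by_id
    ]
    russian_only = [row["id"] for row in russian_rows if row["id"] not in english_by_id]
    english_only = [row["id"] for row in english_rows if row["id"] not in russian_ids]
    return pairs, russian_only, english_only


# Toy records with made-up identifiers and titles.
toy_ru = [{"id": "k1", "title": "Заседание"}, {"id": "k2", "title": "Встреча"}]
toy_en = [{"id": "k1", "title": "Meeting"}, {"id": "k3", "title": "Press statement"}]
toy_pairs, toy_ru_only, toy_en_only = align_by_id(toy_ru, toy_en)
```

Reporting the unmatched IDs alongside the pairs makes the small number of single-language speeches explicit rather than silently dropping them.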

Images are distributed separately from the speech-level CSVs as four corpus-specific ZIP archives: `kremlin_english_images.zip`, `kremlin_russian_images.zip`, `mid_english_images.zip`, and `mid_russian_images.zip`. After unzipping, each archive expands to a folder containing the image files referenced by that corpus. The speech-level CSVs do *not* embed image binaries; instead, they link each speech to its images via `stored_image_filepaths`, a list-valued column containing relative paths (relative to the corresponding corpus image root). To retrieve images for a given speech, users can select the row by `id`, read its `stored_image_filepaths` list, and join each relative path to the unzipped corpus image directory to obtain a valid on-disk filepath for each image. Conversely, users can link an image back to its parent speech by matching its relative filepath to entries in `stored_image_filepaths` (or by exploding `stored_image_filepaths` into an image-level index keyed by `id`).
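The two lookups described above can be sketched as follows. The sketch assumes that each `stored_image_filepaths` cell is serialized as a Python- or JSON-style list literal (for example `['a.jpg', 'b.jpg']`); the actual serialization should be checked in the released CSVs. The function names, the image root, and the toy paths are hypothetical.

```python
import ast
from pathlib import Path


def parse_list_cell(cell):
    """Parse a list-valued CSV cell, assuming a Python/JSON-style list literal."""
    if cell is None or not cell.strip():
        return []
    value = ast.literal_eval(cell)
    return list(value) if isinstance(value, (list, tuple)) else [str(value)]


def image_paths_for_speech(row, image_root):
    """Join each relative image path to the unzipped corpus image directory."""
    return [
        Path(image_root) / rel
        for rel in parse_list_cell(row.get("stored_image_filepaths"))
    ]


def image_index(rows):
    """Explode the list column into a {relative_path: speech id} lookup."""
    return {
        rel: row["id"]
        for row in rows
        for rel in parse_list_cell(row.get("stored_image_filepaths"))
    }


# Toy rows with made-up identifiers and relative paths.
toy_rows = [
    {"id": "s1", "stored_image_filepaths": "['2020/a.jpg', '2020/b.jpg']"},
    {"id": "s2", "stored_image_filepaths": ""},
]
toy_paths = image_paths_for_speech(toy_rows[0], "kremlin_english_images")
toy_index = image_index(toy_rows)
```

Using `ast.literal_eval` rather than `eval` keeps the parsing restricted to literals, which is the safer choice when reading cells from files.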

Users making use of our BERTopic [[27](https://arxiv.org/html/2605.15886#bib.bib206)]-based labels and group assignments for texts and images (the `curated_*` variables in each dataset) should be mindful that these assignments reflect the dominant (highest-probability) topic associated with each document or image. A dominant topic assignment does not guarantee that a majority of the speech text or image content pertains to that topic, particularly for long or multi-thematic speeches. Researchers who require higher topical purity may therefore wish to filter documents based on the dominant-topic probability (e.g., retaining only speeches whose maximum topic probability exceeds a chosen threshold). Alternatively, researchers may prefer to work directly with the full topic probability distributions we provide in a supplemental archive, `text_and_image_topic_probability_files.zip`, which contains long-format document–topic and image–topic probability tables exported from the modeling pipeline (one row per [item, topic] pair), keyed by the same identifiers used in the main CSVs (speech `id` for document-level tables, and an image identifier together with the parent speech `id` for image-level tables); these tables support multi-topic analyses, uncertainty-aware inference, and robustness checks without requiring users to rely solely on the single dominant-topic assignment stored in the `curated_*` fields.
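A threshold filter of this kind might be sketched as follows for a document-level long-format table. The column names `id`, `topic_id`, and `probability_score` follow the table layout described in the validation section; whether every released table uses exactly these headers is an assumption, and the function names, toy rows, and threshold value are hypothetical.

```python
import csv
import io


def dominant_probabilities(long_rows):
    """Return {id: (dominant topic_id, its probability)} from long-format rows.

    Ties in probability are broken by the smallest topic ID.
    """
    best = {}
    for row in long_rows:
        doc_id = row["id"]
        topic_id = int(row["topic_id"])
        prob = float(row["probability_score"])
        current = best.get(doc_id)
        if current is None or (prob, -topic_id) > (current[1], -current[0]):
            best[doc_id] = (topic_id, prob)
    return best


def keep_high_purity(best, threshold):
    """Keep documents whose maximum topic probability exceeds the threshold."""
    return {doc_id: tp for doc_id, tp in best.items() if tp[1] > threshold}


# Toy long-format table held in memory instead of a file on disk.
toy_csv = io.StringIO(
    "id,topic_id,probability_score\n"
    "s1,0,0.9\n"
    "s1,1,0.1\n"
    "s2,0,0.4\n"
    "s2,1,0.6\n"
)
toy_best = dominant_probabilities(csv.DictReader(toy_csv))
toy_kept = keep_high_purity(toy_best, threshold=0.75)
```

The threshold itself is a research-design choice; reporting results under several thresholds is one way to check robustness.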

**Supplemental topic probability tables.** As noted in the paragraph above, we additionally provide a compressed archive, `text_and_image_topic_probability_files.zip`, containing long-format document–topic and image–topic probability tables exported from the modeling pipeline. Each table contains one row per (item, topic) pair and is keyed by the same identifiers used in the main CSVs: speech `id` for document-level tables, and (parent speech `id` plus an image identifier) for image-level tables.

*Document-level tables (speech $\times$ topic):*

- `kremlin_english_text_topic_probs.csv`
- `kremlin_russian_text_topic_probs.csv`
- `mid_english_text_topic_probs.csv`
- `mid_russian_text_topic_probs.csv`

*Image-level tables (image $\times$ topic):*

- `kremlin_english_image_topic_probs.csv`
- `kremlin_russian_image_topic_probs.csv`
- `mid_english_image_topic_probs.csv`
- `mid_russian_image_topic_probs.csv`

For analyses centered on Russian\-language texts, users should rely on the original Russian speech titles and full texts included in the Russian\-language corpora rather than the automated English translations we provide alongside them in these same Russian\-language CSV files\. Furthermore, note that English\-focused analyses based on our own machine translations of Russian speeches, speech titles, and related metadata within these Russian\-language datasets are analytically distinct from analyses using the Russian government’s officially released English\-language versions included in the English corpora\. The latter may reflect not only translation but also selective editorial or framing decisions regarding what content is translated and how\. Consequently, the appropriate choice of English\-language text should be guided by a user’s specific research question\.

Researchers working with the Russian-language corpora should also be attentive to character encoding when reading files, as Cyrillic text may not render correctly under default settings. In Python 3 [[51](https://arxiv.org/html/2605.15886#bib.bib217)], we recommend reading files explicitly as UTF-8 (e.g., `pandas.read_csv(..., encoding="utf-8")` and, when needed, `encoding_errors="strict"` to surface decoding problems rather than silently replacing characters); if decoding fails, users should avoid ad-hoc spreadsheet “repairs” and instead re-export or re-save the source file in UTF-8 and re-read it. In R, we recommend UTF-8-capable import routines such as `readr::read_csv(..., locale = readr::locale(encoding = "UTF-8"))` or `data.table::fread(..., encoding = "UTF-8")`, and users should verify correct rendering by inspecting a few known Cyrillic tokens after import.
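For users who prefer the Python standard library over pandas, a comparable strict UTF-8 read might look like the sketch below. It also raises the `csv` module’s per-field size limit, because the default limit (131,072 characters) can be exceeded by very long text fields. The helper name, the limit value, and the toy `title` column are assumptions for illustration.

```python
import csv
import tempfile
from pathlib import Path


def read_utf8_csv(path, field_limit=10_000_000):
    """Read a CSV strictly as UTF-8, raising on undecodable bytes.

    `field_limit` lifts the csv module's default per-field cap so that very
    long text fields are not rejected; the value is an illustrative choice.
    """
    csv.field_size_limit(field_limit)
    with open(path, newline="", encoding="utf-8", errors="strict") as handle:
        return list(csv.DictReader(handle))


# Round-trip check on a temporary file containing a known Cyrillic token.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.csv"
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["id", "title"])
        writer.writeheader()
        writer.writerow({"id": "s1", "title": "Москва"})
    demo_rows = read_utf8_csv(demo_path)
```

Inspecting a known token after import, as in the round trip above, is a quick way to confirm that Cyrillic text survived the read.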

Finally, opening the CSV files directly in spreadsheet software such as Microsoft Excel or Google Sheets is discouraged\. Both tools impose per\-cell character limits \(Excel supports up to 32,767 characters per cell and Google Sheets effectively caps cells at 50,000 characters\), and some speeches in our datasets contain text fields exceeding these thresholds; as a result, long texts may be truncated or omitted without obvious warning\. These limitations are language\-agnostic and can therefore affect both the Russian\- and English\-language CSVs whenever speech texts are very long\. In addition, spreadsheet programs may apply automatic type inference and may mishandle Unicode during import/export workflows, increasing the risk of silent character corruption in Cyrillic fields if files are opened and re\-saved\. We therefore recommend accessing and processing the released CSV files using programmatic tools \(e\.g\., Python or R\) rather than spreadsheet editors, and preserving UTF\-8 encoding throughout the workflow\.

## Code availability

All code used for web scraping, data cleaning, translation, variable construction, topic modeling, and validation is publicly available in a GitHub repository (https://github.com/bagozzib/Russian-Speech-Text-and-Images). The repository provides a structured replication workflow, configuration files, and documentation describing how to reproduce the four final CSV archives (`kremlin_english.csv`, `kremlin_russian.csv`, `mid_english.csv`, `mid_russian.csv`), the four corresponding image archives, and the auxiliary long-format topic–probability tables and HTML topic summaries.

All analysis scripts were developed and executed in Python 3 (any modern 3.x release should be sufficient; we recommend Python ≥ 3.10 for best compatibility with current NLP and scientific-computing packages) and/or R (any modern 4.x release). To facilitate reproducibility, the GitHub repository mentioned above provides explicit environment specifications (e.g., `requirements.txt` or equivalent) that pin package versions for the primary pipelines.

## Data Availability

All data described in this paper are publicly available via Harvard Dataverse [[11](https://arxiv.org/html/2605.15886#bib.bib5)]. The release is organized into four speech-level CSV tables (one per corpus), four corresponding image archives, and auxiliary materials. Each component is described below.

**CSV archives.** Four analysis-ready, speech-level CSV tables (one per source-language corpus) that contain the full set of scraped metadata and the final modeling outputs used in the paper. The following CSV files can be found within `kremlin_mid_en_ru_final_csvs.zip` on the abovementioned Dataverse page:

- `kremlin_english.csv`: Kremlin corpus in English (original English pages).
- `kremlin_russian.csv`: Kremlin corpus in Russian, with the translated English text fields (e.g., `full_text_english`) used for topic modeling while preserving the original Russian content.
- `mid_english.csv`: MID.ru corpus in English (original English pages).
- `mid_russian.csv`: MID.ru corpus in Russian, with translated English text fields used for topic modeling while preserving the original Russian content.

**Image archives.** All scraped images referenced by the CSVs are distributed as separate zipped files, one per corpus:

- `kremlin_english_images.zip`
- `kremlin_russian_images.zip`
- `mid_english_images.zip`
- `mid_russian_images.zip`

Within each corpus-specific archive, image files are stored in standard web formats (predominantly `.jpg`, with occasional `.png`). The `stored_image_filepaths` column in each CSV provides the authoritative linkage from speech records to image files: each cell contains a list of image paths that are *relative to the corresponding corpus image root directory*. To locate an image, extract the relevant corpus archive and concatenate the corpus image root directory with the relative path listed in `stored_image_filepaths`. Image captions, when available, are provided in the parallel list-valued `image_captions` field.

**Auxiliary materials distributed with the data deposit.** Alongside the four CSVs and four image archives, the Dataverse deposit includes a compact set of auxiliary resources to support inspection and reuse. The following file and folder names are each found within the `kremlin_mid_en_ru_auxiliary_files.zip` file on the Harvard Dataverse page mentioned above:

- `topic_summaries_html.zip`: HTML topic-summary files (one per learned topic per corpus) for qualitative inspection of keywords, representative speeches, and representative images.
- `text_and_image_topic_probability_files.zip`: long-format document–topic and image–topic probability tables exported from the modeling pipeline. The contents include:
  - `text_probability_files`
    - `kremlin_english_text_topic_probs.csv`
    - `kremlin_russian_text_topic_probs.csv`
    - `mid_english_text_topic_probs.csv`
    - `mid_russian_text_topic_probs.csv`
  - `image_probability_files`
    - `kremlin_english_image_topic_probs.csv`
    - `kremlin_russian_image_topic_probs.csv`
    - `mid_english_image_topic_probs.csv`
    - `mid_russian_image_topic_probs.csv`

Together, these materials provide \(i\) analysis\-ready speech\-level tables, \(ii\) the complete set of linked image files, and \(iii\) optional topic\-level and probability\-level outputs that support robustness checks, alternative topic assignment strategies, and qualitative validation\. These materials are also included on the project’s Harvard Dataverse dataset page\[[11](https://arxiv.org/html/2605.15886#bib.bib5)\]\.

## References

- \[1\]AnthropicClaude 3 haiku model documentation \(claude\-3\-haiku\-20240307\)\.Note:[https://docs\.anthropic\.com/](https://docs.anthropic.com/)Accessed 2026\-01\-04Cited by:[§0\.2\.3](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS3.p5.1)\.
- \[2\]Apache Parquet ContributorsApache parquet: columnar storage format\.Note:[https://parquet\.apache\.org/](https://parquet.apache.org/)Accessed 2026\-01\-04Cited by:[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p6.1)\.
- \[3\]B\. E\. Bagozzi\(2015\)The multifaceted nature of global climate change negotiations\.The Review of International Organizations10\(4\),pp\. 439–464\.Cited by:[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p17.2)\.
- \[4\]T\. Baltrušaitis, C\. Ahuja, and L\. Morency\(2019\)Multimodal machine learning: a survey and taxonomy\.IEEE Transactions on Pattern Analysis and Machine Intelligence41\(2\),pp\. 423–443\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p2.1)\.
- \[5\]Beijing Academy of Artificial Intelligence \(BAAI\) and ContributorsBGE\-M3: multilingual, multi\-granularity text embeddings \(software/model\)\.Note:[https://github\.com/FlagOpen/FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding)Accessed 2026\-01\-04Cited by:[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p25.1)\.
- \[6\]K\. Benoit, K\. Munger, and A\. Spirling\(2019\)Measuring and explaining political sophistication through textual complexity\.American Journal of Political Science63\(2\),pp\. 491–508\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p1.1)\.
- \[7\]D\. Berliner, B\. E\. Bagozzi, B\. Palmer\-Rubin, and A\. Erlich\(2021\)The political logic of government disclosure: evidence from information requests in mexico\.The Journal of Politics83\(1\),pp\. 229–245\.Cited by:[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p17.2)\.
- \[8\]L\. Birkenmaier, C\. M\. Lechner, and C\. Wagner\(2024\)The search for solid ground in text as data: a systematic review of validation practices and practical recommendations for validation\.Communication methods and measures18\(3\),pp\. 249–277\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p1.1)\.
- \[9\]A\. Bittermann and A\. Fischer\(2024\)Natural language processing in psychology\.Zeitschrift für Psychologie232\(3\),pp\. 143–146\.External Links:[Document](https://dx.doi.org/10.1027/2151-2604/a000568)Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p1.1)\.
- \[10\]R\. A\. Blair and N\. Sambanis\(2020\)Forecasting civil wars: theory and structure in an age of “big data” and machine learning\.Journal of Conflict Resolution64\(10\),pp\. 1885–1915\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p7.1)\.
- \[11\]D\. Blinova, G\. Emuru, R\. Emuru, K\. S\. Srivastava, M\. Rulis, S\. Chandrasekaran, and B\. Bagozzi\(2026\)Linked Multi\-Model Data on Russian Domestic and Foreign Policy Speeches\.Harvard Dataverse\.External Links:[Document](https://dx.doi.org/10.7910/DVN/SGI0VK),[Link](https://doi.org/10.7910/DVN/SGI0VK)Cited by:[Data Availability](https://arxiv.org/html/2605.15886#Sx6.p1.1),[Data Availability](https://arxiv.org/html/2605.15886#Sx6.p9.1)\.
- \[12\]D\. Blinova\(2025\)Priming with fear: putin’s manipulation of domestic public support\.Russian Politics10\(1\),pp\. 121–164\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p6.1),[§0\.1](https://arxiv.org/html/2605.15886#Sx2.SS1.p4.1)\.
- \[13\]B\. Bonikowski and L\. K\. Nelson\(2022\)From ends to means: the promise of computational text analysis for theoretically driven sociological research\.Sociological Methods & Research51\(4\),pp\. 1469–1483\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p1.1)\.
- \[14\]R\. J\. G\. B\. Campello, D\. Moulavi, and J\. Sander\(2013\)Density\-based clustering based on hierarchical density estimates\.InAdvances in Knowledge Discovery and Data Mining \(PAKDD\),Cited by:[1st item](https://arxiv.org/html/2605.15886#Sx2.I3.i1.p1.3),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p13.1),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p4.1)\.
- \[15\]J\. Carroll\(2017\)Image and imitation the visual rhetoric of pro\-russian propaganda\.Ideology and Politics Journal2\(8\),pp\. 36–79\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p2.1)\.
- \[16\]A\. Casas and N\. W\. Williams\(2022\)Introduction to the special issue on images as data\.Computational Communication Research4\(1\)\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p1.1)\.
- \[17\]A\. Clark and P\. ContributorsPillow: the friendly PIL fork \(software\)\.Note:[https://python\-pillow\.org/](https://python-pillow.org/)Accessed 2026\-01\-04Cited by:[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p18.1),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p6.1)\.
- \[18\]V\. D’Orazio\(2020\)Conflict forecasting and prediction\.InOxford Research Encyclopedia of International Studies,Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p7.1)\.
- \[19\]Y\. Dai and L\. R\. Luqiu\(2022\)Wolf warriors and diplomacy in the new era\.China Review22\(2\),pp\. 253–283\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p6.1)\.
- \[20\]A\. Erlich, S\. G\. Dantas, B\. E\. Bagozzi, D\. Berliner, and B\. Palmer\-Rubin\(2022\)Multi\-label prediction for political text\-as\-data\.Political Analysis30\(4\),pp\. 463–480\.External Links:[Document](https://dx.doi.org/10.1017/pan.2021.15)Cited by:[§1\.1\.2](https://arxiv.org/html/2605.15886#S1.SS1.SSS2.p3.1),[§1\.1\.3](https://arxiv.org/html/2605.15886#S1.SS1.SSS3.p3.1),[§1\.1\.6](https://arxiv.org/html/2605.15886#S1.SS1.SSS6.p2.1),[§1\.1\.7](https://arxiv.org/html/2605.15886#S1.SS1.SSS7.p3.1),[§1\.1](https://arxiv.org/html/2605.15886#S1.SS1.p5.1)\.
- \[21\]EsriArcGIS world geocoding service documentation\.Note:[https://developers\.arcgis\.com/rest/geocode/api\-reference/overview\-world\-geocoding\-service\.htm](https://developers.arcgis.com/rest/geocode/api-reference/overview-world-geocoding-service.htm)Accessed 2026\-01\-04Cited by:[§0\.2\.3](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS3.p7.1)\.
- \[22\]F\. Feng, Y\. Yang, D\. Cer, N\. Arivazhagan, and W\. Wang\(2020\)Language\-agnostic BERT sentence embedding\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics \(ACL\),Cited by:[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p25.1)\.
- \[23\]geopy contributors\(2024\)Geopy: geocoding library for python\.Note:[https://geopy\.readthedocs\.io/](https://geopy.readthedocs.io/)Accessed 2025\-09\-09Cited by:[§0\.2\.3](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS3.p7.1)\.
- \[24\]GoogleGoogle colaboratory documentation\.Note:[https://colab\.research\.google\.com/](https://colab.research.google.com/)Accessed 2026\-01\-04Cited by:[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p4.1),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p6.1)\.
- \[25\]J\. Grimmer and B\. M\. Stewart\(2013\)Text as data: the promise and pitfalls of automatic content analysis methods for political texts\.Political Analysis21\(3\),pp\. 267–297\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p1.1)\.
- \[26\]M\. Grootendorst\(2022\)BERTopic: neural topic modeling with a class\-based tf\-idf procedure\.arXiv preprint arXiv:2203\.05794\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p5.1)\.
- \[27\]M\. Grootendorst\(2022\)BERTopic: neural topic modeling with a class\-based TF\-IDF procedure\.Note:arXiv:2203\.05794Cited by:[§1\.1](https://arxiv.org/html/2605.15886#S1.SS1.p3.1),[§1](https://arxiv.org/html/2605.15886#S1.p1.1),[Figure 5](https://arxiv.org/html/2605.15886#Sx2.F5),[Figure 5](https://arxiv.org/html/2605.15886#Sx2.F5.pic1.7.7.7.1.1.2.1),[§0\.2\.1](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS1.p9.1),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p11.1),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p13.1),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p14.5),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p16.1),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p17.2),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p26.1),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p4.1),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p7.3),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p8.1),[7th item](https://arxiv.org/html/2605.15886#Sx3.I6.i7.p1.1),[8th item](https://arxiv.org/html/2605.15886#Sx3.I6.i8.p1.3),[Data Records](https://arxiv.org/html/2605.15886#Sx3.p12.2),[Usage Notes](https://arxiv.org/html/2605.15886#Sx4.p3.1)\.
- \[28\]C\. R\. Harris, K\. J\. Millman, S\. J\. van der Walt, R\. Gommers, P\. Virtanen, D\. Cournapeau, E\. Wieser, J\. Taylor, S\. Berg, N\. J\. Smith, R\. Kern, M\. Picus, S\. Hoyer, M\. H\. van Kerkwijk, M\. Brett, A\. Haldane, J\. F\. del Río, M\. Wiebe, P\. Peterson, P\. Gérard\-Marchant, K\. Sheppard, T\. Reddy, W\. Weckesser, H\. Abbasi, C\. Gohlke, and T\. E\. Oliphant\(2020\)Array programming with NumPy\.Nature585\(7825\),pp\. 357–362\.External Links:[Document](https://dx.doi.org/10.1038/s41586-020-2649-2)Cited by:[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p4.1)\.
- \[29\]J\. R\. Hollyer, B\. P\. Rosendorff, and J\. R\. Vreeland\(2011\)Democracy and transparency\.The Journal of Politics73\(4\),pp\. 1191–1205\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p2.1)\.
- \[30\]M\. Honnibal, I\. Montani, S\. Van Landeghem, and A\. Boyd\(2020\)SpaCy: industrial\-strength natural language processing in python\.Zenodo\.External Links:[Document](https://dx.doi.org/10.5281/zenodo.1212303),[Link](https://doi.org/10.5281/zenodo.1212303)Cited by:[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p11.1),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p24.1),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p5.1),[§0\.2\.3](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS3.p4.1),[§0\.2\.3](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS3.p5.1),[§0\.2\.4](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS4.p4.1)\.
- \[31\]M\. Korobov and p\. ContributorsPymorphy3: russian morphological analyzer \(software\)\.Note:[https://github\.com/no\-plagiarism/pymorphy3](https://github.com/no-plagiarism/pymorphy3)Accessed 2026\-01\-04Cited by:[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p24.1),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p5.1)\.
- \[32\]G\. Kress and T\. van Leeuwen\(2001\)Multimodal discourse: the modes and media of contemporary communication\.Arnold,London\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p2.1)\.
- \[33\]L\. La Lova\(2025\)Text\-as\-data methods to study mass\-media manipulations in autocracies\.Communist and Post\-Communist Studies,pp\. 1–17\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p2.1)\.
- \[34\]H\. Li and N\. Zhang\(2024\)Computer vision models for image analysis in advertising research\.Journal of Advertising53\(5\),pp\. 771–790\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p1.1)\.
- \[35\]D\. Liu and L\. Shao\(2024\)Nationalist propaganda and support for war in an authoritarian context: evidence from china\.Journal of Peace Research61\(6\),pp\. 985–1001\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p2.1)\.
- \[36\]M\. Liu, J\. Yan, and G\. Yao\(2023\)Themes and ideologies in china’s diplomatic discourse\-a corpus\-assisted discourse analysis in china’s official speeches\.Frontiers in Psychology14,pp\. 1278240\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p6.1)\.
- \[37\]D\. Mahajan, R\. Girshick, V\. Ramanathan, K\. He, M\. Paluri, Y\. Li, A\. Bharambe, and L\. van der Maaten\(2018\-09\)Exploring the limits of weakly supervised pretraining\.InProceedings of the European Conference on Computer Vision \(ECCV\),Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p1.1)\.
- \[38\]L\. McInnes, J\. Healy, and J\. Melville\(2018\)UMAP: uniform manifold approximation and projection for dimension reduction\.Note:arXiv:1802\.03426Cited by:[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p13.1),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p4.1)\.
- \[39\]W\. McKinney\(2010\)Data structures for statistical computing in python\.InProceedings of the 9th Python in Science Conference \(SciPy 2010\),S\. van der Walt and J\. Millman \(Eds\.\),pp\. 56–61\.Cited by:[Webscraping](https://arxiv.org/html/2605.15886#Sx2.SSx2.p2.1),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p4.1)\.
- \[40\]M\. Mochtak and R\. Q\. Turcsanyi\(2021\)Studying chinese foreign policy narratives: introducing the ministry of foreign affairs press conferences corpus\.Journal of Chinese Political Science26\(4\),pp\. 743–761\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p6.1)\.
- \[41\]M\. Mochtak\(2025\)Chasing the authoritarian spectre: detecting authoritarian discourse with large language models\.European Journal of Political Research\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p2.1)\.
- \[42\]H\. Mueller and C\. Rauh\(2018\)Reading between the lines: prediction of political violence using newspaper text\.American Political Science Review112\(2\),pp\. 358–375\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p1.1),[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p7.1)\.
- \[43\]National Institute of Standards and Technology\(2015\)FIPS pub 180\-4: secure hash standard \(shs\)\.Note:[https://csrc\.nist\.gov/publications/detail/fips/180/4/final](https://csrc.nist.gov/publications/detail/fips/180/4/final)Accessed 2026\-01\-04Cited by:[Image discovery, de\-duplication, and completeness\.](https://arxiv.org/html/2605.15886#Sx2.SSx2.SSSx1.Px4.p3.1)\.
- \[44\]S\. P\. O’Brien\(2002\)Anticipating the good, the bad, and the ugly: an early warning approach to conflict and instability analysis\.Journal of conflict resolution46\(6\),pp\. 791–811\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p7.1)\.
- \[45\]OpenAI\(2022\-11\)Introducing chatgpt\.Note:[https://openai\.com/index/chatgpt/](https://openai.com/index/chatgpt/)Accessed: 2026\-01\-05Cited by:[§0\.2\.4](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS4.p5.1)\.
- \[46\]OpenStreetMap contributors\(2024\)Nominatim: openstreetmap geocoding\.Note:[https://nominatim\.org/](https://nominatim.org/)Accessed 2025\-09\-09Cited by:[§0\.2\.3](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS3.p7.1)\.
- \[47\]OpenStreetMap contributors\(2024\)OpenStreetMap\.Note:[https://www\.openstreetmap\.org](https://www.openstreetmap.org/)Data and services used via Nominatim; Accessed 2025\-09\-09Cited by:[§0\.2\.3](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS3.p7.1)\.
- \[48\]A\. Paszke, S\. Gross, F\. Massa, A\. Lerer, J\. Bradbury, G\. Chanan, T\. Killeen, Z\. Lin, N\. Gimelshein, L\. Antiga, A\. Desmaison, A\. Kopf, E\. Yang, Z\. DeVito, M\. Raison, A\. Tejani, S\. Chilamkurthy, B\. Steiner, L\. Fang, J\. Bai, and S\. Chintala\(2019\)PyTorch: an imperative style, high\-performance deep learning library\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p4.1)\.
- \[49\]A\. Paulus, M\. Rohr, R\. Dotsch, and D\. Wentura\(2016\)Positive feeling, negative meaning: visualizing the mental representations of in\-group and out\-group smiles\.PloS one11\(3\),pp\. e0151230\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p1.1)\.
- \[50\]A\. J\. Pineda\(2023\)Capturing political communication online using image and text data: a deep learning approach\.Ph\.D\. Thesis,The University of Michigan,Ann Arbor, MI\.Note:Doctoral dissertation in Political Science and Scientific ComputingExternal Links:[Link](https://hdl.handle.net/2027.42/176652),[Document](https://dx.doi.org/10.7302/7501)Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p2.1)\.
- \[51\]Python Software FoundationPython 3 documentation\.Note:[https://docs\.python\.org/3/](https://docs.python.org/3/)Accessed 2026\-01\-04Cited by:[Webscraping](https://arxiv.org/html/2605.15886#Sx2.SSx2.p2.1),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p4.1),[Usage Notes](https://arxiv.org/html/2605.15886#Sx4.p8.1)\.
- \[52\]P\. Qi, Y\. Zhang, Y\. Zhang, J\. Bolton, and C\. D\. Manning\(2020\)Stanza: a Python natural language processing toolkit for many human languages\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics \(ACL\),Cited by:[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p24.1),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p5.1)\.
- \[53\]A\. Radford, J\. W\. Kim, C\. Hallacy, A\. Ramesh, G\. Goh, S\. Agarwal, G\. Sastry, A\. Askell, P\. Mishkin, J\. Clark, G\. Krueger, and I\. Sutskever\(2021\)Learning transferable visual models from natural language supervision\.InProceedings of the 38th International Conference on Machine Learning,M\. Meila and T\. Zhang \(Eds\.\),Proceedings of Machine Learning Research, Vol\.139,pp\. 8748–8763\.External Links:[Link](http://proceedings.mlr.press/v139/radford21a.html)Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p1.1),[Figure 5](https://arxiv.org/html/2605.15886#Sx2.F5),[Figure 5](https://arxiv.org/html/2605.15886#Sx2.F5.pic1.6.6.6.1.1.2.1),[Figure 5](https://arxiv.org/html/2605.15886#Sx2.F5.pic1.8.8.8.1.1.2.1),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p18.1),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p3.5),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p6.1),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p8.1)\.
- \[54\]N\. Reimers and I\. Gurevych\(2019\)Sentence\-BERT: sentence embeddings using siamese BERT\-networks\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),Cited by:[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p12.1),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p18.1),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p4.1),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p6.1)\.
- \[55\]K\. Reitz and R\. contributors\(2025\)Requests: http for humans\.Note:[https://pypi\.org/project/requests/](https://pypi.org/project/requests/)Python package\. Version 2\.32\.5 \(released Aug 18, 2025\)\. Accessed Jan 4, 2026Cited by:[HTTP session management and politeness\.](https://arxiv.org/html/2605.15886#Sx2.SSx2.SSSx2.Px2.p1.1),[Webscraping](https://arxiv.org/html/2605.15886#Sx2.SSx2.p2.1)\.
- \[56\]L\. Richardson and B\. S\. ContributorsBeautiful soup documentation \(software\)\.Note:[https://www\.crummy\.com/software/BeautifulSoup/bs4/doc/](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)Accessed 2026\-01\-04Cited by:[Webscraping](https://arxiv.org/html/2605.15886#Sx2.SSx2.p2.1)\.
- \[57\]M\. E\. Roberts\(2018\)Censored: distraction and diversion inside china’s great firewall\.Princeton University Press\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p2.1)\.
- \[58\]A\. Rozenas and D\. Stukal\(2019\)How autocrats manipulate economic news: evidence from russia’s state\-controlled television\.The Journal of Politics81\(3\),pp\. 982–996\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p2.1)\.
- \[59\]A\. Shahgholian, E\. Odacioglu, L\. Zhang, and R\. Allmendinger\(2023\)Big textual data research for operations management: topic modeling with grounded theory\.International Journal of Operations and Production Management\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p1.1)\.
- \[60\]K\. Song, X\. Tan, T\. Qin, J\. Lu, and T\. Liu\(2020\)MPNet: masked and permuted pre\-training for language understanding\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p12.1)\.
- \[61\]Z\. C\. Steinert\-Threlkeld\(2019\)The future of event data is images\.Sociological Methodology49\(1\),pp\. 68–75\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p1.1)\.
- \[62\]A\. O\. Tech and ContributorsArgos translate \(software\)\.Note:[https://github\.com/argosopentech/argos\-translate](https://github.com/argosopentech/argos-translate)Accessed 2026\-01\-04Cited by:[§0\.2\.1](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS1.p2.1),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p10.2),[§0\.2\.4](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS4.p2.1),[2nd item](https://arxiv.org/html/2605.15886#Sx3.I6.i2.p1.3)\.
- \[63\]M\. Torres\(2024\)A framework for the unsupervised and semi\-supervised analysis of visual frames\.Political Analysis32\(2\),pp\. 199–220\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p1.1)\.
- \[64\]G\. Varoquaux and j\. ContributorsJoblib: computing with python functions \(software\)\.Note:[https://joblib\.readthedocs\.io/](https://joblib.readthedocs.io/)Accessed 2026\-01\-04Cited by:[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p6.1)\.
- \[65\]A\. Vishwanath\(2025\)Race, legislative speech, and symbolic representation in congress\.American Journal of Political Science69\(2\),pp\. 578–593\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p1.1)\.
- \[66\]J\. L\. Wallace\(2016\)Juking the stats? authoritarian information problems in china\.British Journal of Political Science46\(1\),pp\. 11–29\.External Links:[Document](https://dx.doi.org/10.1017/S0007123414000106)Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p2.1)\.
- \[67\]L\. Wang et al\.\(2022\)Text embeddings by weakly\-supervised contrastive pre\-training\.Note:arXiv:2212\.03533Cited by:[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p25.1)\.
- \[68\]J\. C\. Weiss and A\. Dafoe\(2019\-09\)Authoritarian audiences, rhetoric, and propaganda in international crises: evidence from china\.International Studies Quarterly63\(4\),pp\. 963–973\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p6.1)\.
- \[69\]J\. C\. Weiss\(2013\)Authoritarian signaling, mass audiences, and nationalist protest in china\.International Organization67\(1\),pp\. 1–35\.External Links:[Document](https://dx.doi.org/10.1017/S0020818312000380)Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p6.1)\.
- \[70\]J\. Wilkerson and A\. Casas\(2017\)Large\-scale computerized text analysis in political science: opportunities and challenges\.Annual Review of Political Science20\(1\),pp\. 529–544\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p1.1),[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p17.2)\.
- \[71\]T\. Wolf, L\. Debut, V\. Sanh, J\. Chaumond, C\. Delangue, A\. Moi, P\. Cistac, T\. Rault, R\. Louf, M\. Funtowicz, J\. Davison, S\. Shleifer, P\. von Platen, C\. Ma, Y\. Jernite, J\. Plu, C\. Xu, T\. Le Scao, S\. Gugger, M\. Drame, Q\. Lhoest, and A\. M\. Rush\(2020\)Transformers: state\-of\-the\-art natural language processing\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations,Cited by:[§0\.2\.2](https://arxiv.org/html/2605.15886#Sx2.SSx3.SSS2.p4.1)\.
- \[72\]H\. Xu, Q\. Ye, M\. Yan, Y\. Shi, J\. Ye, Y\. Xu, C\. Li, B\. Bi, Q\. Qian, W\. Wang,et al\.\(2023\)Mplug\-2: a modularized multi\-modal foundation model across text, image and video\.InInternational Conference on Machine Learning,pp\. 38728–38748\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p1.1)\.
- \[73\]M\. Yavuz\(2025\)Crises and ideological change in authoritarian regimes: evidence from the july 2016 coup attempt in turkey\.Comparative Political Studies,pp\. 00104140251369324\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p6.1)\.
- \[74\]W\. Zhong, B\. Chen, F\. Liang, and M\. M\. Zhang\(2025\)Picturing protest: visual framing in authoritarian media on twitter\.Digital Journalism0\(0\),pp\. 1–22\.Cited by:[Background & Summary](https://arxiv.org/html/2605.15886#Sx1.p2.1)\.

## Acknowledgments

This work was supported in part by the National Science Foundation under Award No\. 2417814, SCIPE: Building a Computational and Data\-Intensive Research Workforce & Network in the Mid\-Atlantic Region \(Strengthening the Cyberinfrastructure Professionals Ecosystem\)\.

## Author contributions statement

B\.B\. and D\.B\. conceived the project\. D\.B\., G\.E\., K\.S\., and R\.E\. implemented components of the study’s webscraping tasks\. G\.E\., K\.S\., and R\.E\. implemented components of the study’s topic modeling tasks and dataset extensions\. B\.B\., D\.B\., and M\.R\. handled components of topic labeling and validation\. B\.B\., D\.B\., R\.E\., and S\.C\. helped to oversee project tasks, training, and coordination\. All authors wrote and reviewed the manuscript\.

## Competing interests

The authors declare no competing interests\.
