It is no surprise to anyone reading this data story that climate change has been a source of polarizing discussion for the last few years. Natural events (floods, heatwaves and so on) make it more and more apparent that something is happening, and our societies are reacting: decision makers take measures to mitigate it, scientists are interviewed, politicians give speeches, and all of this reaches the public through the media. The media's influence can be studied by analysing who and what they quote in their articles. The Quotebank dataset offers this possibility, since climate change related quotes can be extracted from its 178M quotes. The goal of this study is to better understand how the climate change debate has been evolving, by analysing its speakers and how a few events reflect the development of the debate. This in turn makes it possible to explore directions that could improve it.

From 116 million quotes to “a few” hundred thousand

Quotebank is a dataset of 178 million unique, speaker-attributed quotations that were extracted from 196 million English news articles crawled from over 377 thousand web domains between August 2008 and April 2020. Here, we focus on the years 2015 to 2020, which account for 116M quotes. The quotations were extracted and attributed using Quobert, a distantly and minimally supervised end-to-end, language-agnostic framework for quotation attribution.

We use a small lexicon of expressions linked to climate change (single words on their own would lead to unreliable results). The expressions are the following: ‘climate change’, ‘climate emergency’, ‘renewable energy’, ‘climate crisis’, ‘greenhouse effect’, ‘renewable energies’ and ‘global warming’. We wanted very general expressions, so as not to bias our later analysis. For example, we left COP21 and COP26 out of the lexicon, since event-specific terms could influence the nature of the quotes we retrieve. We could have used more sophisticated methods to filter the dataset, but in the end a keyword analysis with the most logical, common-sense expressions is usually enough.
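The keyword filter described above can be sketched in a few lines of Python. This is only an illustration, not the exact code behind this data story: the function name `is_climate_quote` is hypothetical, and matching case-insensitively on whole expressions is an assumption on our side.

```python
import re

# Lexicon of climate-related expressions, as listed in the text.
LEXICON = [
    "climate change",
    "climate emergency",
    "renewable energy",
    "climate crisis",
    "greenhouse effect",
    "renewable energies",
    "global warming",
]

# One pattern for the whole lexicon. Word boundaries and case-insensitive
# matching are assumptions made for this sketch.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(expr) for expr in LEXICON) + r")\b",
    re.IGNORECASE,
)


def is_climate_quote(quotation: str) -> bool:
    """Return True if the quotation contains at least one lexicon expression."""
    return PATTERN.search(quotation) is not None
```

Matching full expressions rather than single words is what keeps a quote such as "the climate of the debate has changed" out of the filtered set.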

After some investigation of the Quotebank dataset, we noticed that data are missing for the year 2016: on some days, the total number of quotes is very low. This matters because we study the frequency (i.e. the number of climate related quotes divided by the total number of quotes). If a day has only 300 quotes in total, one or two of which are climate quotes, its frequency is very high compared to the mean of other days. To prevent this harmful influence, we decided to set a threshold on the minimum number of total quotes: we keep the data of a given day only if the total number of quotes for that day is greater than the threshold. In the next figure, this effect is visually noticeable when the threshold value is 5000 total quotes per day.
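To make the thresholding concrete, here is a minimal sketch of the daily frequency computation. It assumes the quotes have already been reduced to `(day, is_climate)` pairs; the function name `daily_climate_frequency` and the parameter `min_total` are hypothetical names chosen for this example.

```python
from collections import Counter
from datetime import date


def daily_climate_frequency(quotes, min_total=5000):
    """Compute the daily share of climate quotes, dropping sparse days.

    `quotes` is an iterable of (day, is_climate) pairs (an assumed layout).
    A day is kept only if its total number of quotes is strictly greater
    than `min_total`.
    """
    total = Counter()
    climate = Counter()
    for day, is_climate in quotes:
        total[day] += 1
        if is_climate:
            climate[day] += 1
    return {
        day: climate[day] / n
        for day, n in total.items()
        if n > min_total
    }


# Toy data: a sparse day (300 quotes, 2 about climate) and a normal day
# (10000 quotes, 20 about climate).
sparse_day = date(2016, 1, 1)
normal_day = date(2016, 1, 2)
toy_quotes = (
    [(sparse_day, True)] * 2 + [(sparse_day, False)] * 298
    + [(normal_day, True)] * 20 + [(normal_day, False)] * 9980
)
frequencies = daily_climate_frequency(toy_quotes, min_total=5000)
```

In this toy data the sparse day would have a frequency of 2/300, more than three times that of the normal day (20/10000), even though it contains far fewer climate quotes. With the threshold at 5000 it is simply dropped.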