Finnish news media and generative AI bots research

In a very short time, generative AI has emerged as a major global trend, affecting the working and everyday lives of almost everyone. Few sectors are being changed by generative AI as much as journalistic news media.

On 8-9 February 2024, I examined how 100 Finnish newspapers and other news media websites are responding to the use of their original content for training generative AI. The sample covered all well-known national news media and a large number of local news services that run websites.

The purpose of the study was to understand the extent to which news media and different media groups in Finland prevent their content from being used to train generative AI services. Before the results, let's first look at why this is important.

Large language models, search bots and AI training

AI enables more efficient ways to explore different topics and create new content. At the same time, generative AI blurs the boundaries between creators and re-users of original content.

Many generative AI solutions such as ChatGPT are based on large language models. Simply put, these language models are trained on large amounts of human-generated content and learn the contexts in which words and information are used. When an answer is generated, few people know what data source it is based on. The AI simply makes a best guess about how different words relate to one another and then tries to answer user questions based on its training data.

Although generative AI developers do not give a precise description of the sources of their training data, it is known that many use website search bots. Like Google's search engine, these search bots, or crawler robots, scan through websites on the internet and attempt to machine-interpret their content. Only the owner of a search bot ultimately knows how it is being used. For example, OpenAI, the owner of ChatGPT, has not to date provided precise and detailed information on how its bots work.

Generative AI and search bots are in themselves neither ethically good nor bad tools. Much depends on how they are used. In December 2023, the New York Times announced that it is suing OpenAI for copyright infringement in the AI training process. While OpenAI itself does not seek to infringe publishers' copyrights, many content developers have been able to use the New York Times' extensive content via ChatGPT to generate their own similar content.

The use of search bots can be restricted. Website owners can block a search bot from accessing content on their website in a simple way by adding a "Disallow" rule for that bot to the robots.txt file of the website. The practical challenge is that each search bot must be blocked separately. In addition to GPTBot and ChatGPT-user, the bots associated with ChatGPT, several other similar AI-training search bots already exist. A website administrator should be well aware of which bots are visiting their site.
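As a minimal sketch of how such blocking behaves, the Python snippet below parses a made-up robots.txt file with the standard library's urllib.robotparser module and checks which bots may fetch a page. The file contents and the example.com address are assumptions for illustration only; the bot names are the ones mentioned in this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: two AI-training bots are blocked one by one,
# while every other crawler is still allowed.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/news/article"  # placeholder address
    for bot in ("GPTBot", "CCBot", "Google-Extended", "Googlebot"):
        status = "allowed" if can_fetch(SAMPLE_ROBOTS_TXT, bot, url) else "blocked"
        print(f"{bot}: {status}")
```

In this sample, Google-Extended stays allowed because no rule names it, which illustrates why every bot has to be listed separately.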

The question is how Finnish newspapers and news media have approached AI search bots, and how many of them prevent bots from training AI on their content.

How Finnish news media blocked AI search bots in February 2024

The research was conducted by Lari Numminen, the author of this article and a marketing consultant interested in AI. On February 8 and 9, 2024, I studied how 100 Finnish newspapers and other news media services block AI-training search bots in the robots.txt files of their websites. The source data can be found in Google Spreadsheet format here.

Methodology of the study

In practice, I looked through the robots.txt files of the websites of the most prominent Finnish newspapers and news media services, one by one, and noted which AI-training search bots are blocked with the "Disallow" directive.

I only examined search bots that I know to have an impact on training large language models. I therefore did not count rules for Googlebot or other known crawling robots that have no direct connection to AI training.
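The manual check described above could be automated roughly as follows. This is only a sketch under stated assumptions, not the procedure used in the study: the function name explicitly_blocked_bots is hypothetical, the parsing is deliberately simple, and a bot counts as blocked only when a group that names it contains the rule "Disallow: /". Bot names are matched case-insensitively.

```python
# Bot names taken from this article; extend the tuple as needed.
AI_BOTS = ("CCBot", "Google-Extended", "GPTBot", "ChatGPT-user")


def explicitly_blocked_bots(robots_txt: str, bots=AI_BOTS) -> list[str]:
    """List the bots that are named in a group containing 'Disallow: /'."""
    wanted = {bot.lower(): bot for bot in bots}
    blocked = set()
    group_agents = []   # user agents named in the current group
    in_rules = False    # True once the current group has a rule line

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                group_agents, in_rules = [], False
            group_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                blocked.update(a for a in group_agents if a in wanted)

    return [wanted[name] for name in sorted(blocked)]


if __name__ == "__main__":
    sample = """\
User-agent: GPTBot
User-agent: ChatGPT-User
Disallow: /

User-agent: CCBot
Disallow: /private/
"""
    print(explicitly_blocked_bots(sample))
```

A wildcard group ("User-agent: *") is intentionally ignored here, because it says nothing about whether a publisher has taken a position on AI training bots in particular.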

Definitions:

A newspaper (sanomalehti) in this case is a website that regularly publishes national or local news articles.
Other news media (muu uutismedia) in this case is any other major news media source that has a significant amount of news and written content available for reading via websites.
Group (yhtymä) means the owner or publisher of a media service. The study shows significant differences in how different media groups approach the issue of blocking generative AI.
Language (kieli) is the main publishing language of the news media, i.e. either Finnish or Swedish.
I did not include sites with largely user-published content (e.g. vauva.fi, or suomi24.fi), nor did I include news media whose content could not be easily read without logging into the websites.
If there are any questions or corrections to the statistics, they can be sent to lari@generatemore.ai.

Main conclusions of the study

58% of all Finnish news media services block at least one search bot that trains generative AI from accessing their website content.
64% of online newspapers block AI-training search bots, a slightly higher figure than for the wider national news media.
CCBot, Google-Extended, GPTBot and ChatGPT-user are the most frequently blocked AI training bots in Finland.
35% of news media sites in Finland also block AI training search bots used by Facebook and Amazon.
Among the large media groups, Keskisuomalainen and Sanoma block the most bots from accessing their content.
So far, Yle does not seem to provide any guidance at all for AI-training search bots on yle.fi.
Swedish-language news media block AI-training search bots far less often than Finnish-language sites. I found blocks on only 37.5% of Swedish-language news sites and newspaper websites in Finland.
None of the sites blocked the search bot of Anthropic, the company that recently received significant funding from Amazon.
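Percentages like the ones above can be derived from per-site records along the following lines. The rows in this sketch are invented placeholders, not the study's data, and the function names are hypothetical; a site counts as blocking if it blocks at least one AI-training bot.

```python
from collections import defaultdict

# Placeholder rows, NOT the study's data: (site, language, blocked bots).
ROWS = [
    ("site-a.example", "Finnish", ["GPTBot", "CCBot"]),
    ("site-b.example", "Finnish", []),
    ("site-c.example", "Swedish", ["GPTBot"]),
    ("site-d.example", "Swedish", []),
]


def share_blocking(rows) -> float:
    """Percentage of sites that block at least one AI-training bot."""
    if not rows:
        return 0.0
    blocking = sum(1 for _site, _lang, bots in rows if bots)
    return 100 * blocking / len(rows)


def share_by_language(rows) -> dict[str, float]:
    """Same percentage, computed separately for each publishing language."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)
    return {lang: share_blocking(group) for lang, group in groups.items()}


if __name__ == "__main__":
    print(f"All sites: {share_blocking(ROWS):.1f}%")
    for lang, share in share_by_language(ROWS).items():
        print(f"{lang}: {share:.1f}%")
```

The same grouping could be applied to the media group column to compare how individual publishers behave.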

Researcher's observations on the study

This timely study showed that many Finnish media groups and news media are already aware that generative AI is being trained on their content. On the other hand, for many media groups, tracking and blocking individual search bots may seem challenging as more and more AI solutions come to market.

For news publishers, blocking search bots can be a clear and safe solution. Until we know more about how different search bots use journalistic content, it is safer to block popular bots from accessing the most valuable information. It remains to be seen how much OpenAI, for example, is willing to pay publishers to train its AI.

On the other hand, generative AI can also provide new creative opportunities for news content creators. Media groups and publishers can take advantage of the growing popularity of generative AI tools and create new content for them. For example, Google has started to experiment with the impact of generative AI on search results in the form of the Search Generative Experience. AI could become a way for news media to engage with new paying customers.

Ultimately, I think it would be good if the training of large language models and the source data for generative AI were discussed more openly in Finland and around the world. A mere ban on GPTBot does not support the interests of a small nation and language group. To keep up with development, we need solutions of our own. I find interesting, for example, the recently published Poro language model family by Silo.ai. The study found no sign that Poro is specifically blocked on Finnish news sites, but there is no indication that search bots were used in its development either.

Updates to the results and usage rights

I will endeavour to update the results of the study regularly and will update this article when there are significant changes to the results. The results of the study may be cited and the source data may be freely used, as long as a reference to the original study is included.

If you find an error or change in the results of the study, please send the information to lari@generatemore.ai!