Figure: Screenshot of the Datasets section of the catalogue

In this post we present the stable version of the MediaFutures data catalogue, the result of iterative improvements made throughout the project, based on the needs identified among the teams and on the feedback collected in interviews.

The catalogue provides a collection of resources organised in three main categories: datasets, tools, and self-training resources.

The Datasets section provides a collection of open datasets available online.

As shown in the screenshot above, the Datasets section includes three subsections:

  • General datasets: datasets on potentially any topic (created for MediaFutures’ second call, whose scope was general). This collection mainly includes general dataset repositories, which users can explore and search to find datasets on a specific topic of interest.

  • Fake news datasets: fake news datasets widely used in research. The selection focuses on established datasets curated by researchers and already used in research, each accompanied by the corresponding academic paper and, where available, open source code.

  • Covid-19 related datasets: datasets related to Covid-19 (created for MediaFutures’ first open call, as this was the main topic of the call). The collection includes Covid-19 datasets from a variety of sources, including governmental and medical sources, social media and news, as well as data on moods and mobility.

For each dataset, the catalogue provides a short description, the provider of the data, and a link. 

 

The Tools section of the catalogue includes a collection of tools to retrieve, manipulate, process and visualise data, as well as to generate creative work and to assess bias and discrimination in data and algorithms.

Figure: Screenshot of the Tools section of the catalogue

As shown in the figure above, the section includes 7 subsections, each covering a subcategory of tools with a different use and scope:

    • Data Collection: tools to retrieve data from online platforms. These tools require some programming skills. The category includes tools to scrape data from the web, APIs of popular platforms such as Reddit or Twitter that give access to their data, and tools or libraries that make these APIs easier to use.

    • Data Science: tools to process, model and visualise data. The category includes a collection of popular data science libraries in Python, the most popular programming language for data science, covering different aspects such as statistical analysis, machine learning, natural language processing and data visualisation. Using these tools requires being able to write code in Python.

    • Digital Methods: tools to manipulate and visualise data through a visual interface (without programming). These tools are also suitable for people with little or no programming skills, and allow operations of varying complexity to be performed on data through a visual interface.

    • Digital Art: digital tools to help artists and creatives develop their projects. This category provides a collection of tools that lie at the intersection of art and technology, with the main aim of making it easier to create art through technology. It includes tools that provide simplified interfaces for writing code or for generating artistic outcomes. The tools are based on different programming languages, such as Python, Java, C++ or JavaScript, and are suitable for people with limited programming skills, or willing to learn, who are driven by a creative purpose.

    • Generative AI: frameworks and libraries to deploy generative AI models. The section includes open models, tools and frameworks for working with generative AI, ranging from large language models (LLMs) and chatbots to image processing and generation. It has a special focus on open source models and tools, and on resources that help users work with generative models on limited hardware.

    • Algorithmic Fairness: tools for checking bias and discrimination in datasets and code. The collection includes tools that provide interfaces of varying sophistication for assessing imbalances and different kinds of bias in a given dataset, or in the outcomes of a piece of software. Some of the tools also propose solutions to correct the bias detected.

    • eXplainable AI: libraries and frameworks for explaining AI models. The collection includes a selection of tools for inspecting artificial intelligence models to understand how they work, and to interpret and explain their behaviour and outcomes. The libraries presented cover the main state-of-the-art approaches and support multimodal data.

For each tool, the catalogue provides a title, a subtitle and a short description, together with the link. 

The Self-training section provides a collection of resources to improve one’s knowledge and skills on issues related to data, art and innovation. It includes different kinds of resources, such as manuals and guides on relevant topics, self-evaluation questionnaires, canvases, videos and seminars, slides, toolboxes, and other materials freely accessible online.

Figure: Screenshot of the Self-training resources section of the catalogue

The self-training resources are grouped into 5 categories: 

  • Responsible Innovation: a diverse set of resources to improve one’s knowledge of issues related to data, ethics and responsible innovation; the topics range from data sharing and scalability to privacy, intellectual property and GDPR.
  • Data Skills: a collection of resources for improving one’s skills in processing and managing data, with online seminars and material covering topics such as machine learning, natural language processing, web scraping and data maturity.
  • Data Access: a collection of resources to help individuals and organisations make more informed decisions about accessing, using and sharing data, covering data value, trustworthiness and revenue models.
  • Storytelling with data: online material to achieve a more effective use of data and data visualisation for conveying messages and storytelling.
  • Art-science collaboration: a collection of learnings from transdisciplinary collaborations combining scientific research and technological innovation with artistic practices; the materials include keynotes, discussions, roundtables and interviews with speakers and participants who have relevant experience in this field.

Beyond these three main categories (tools, datasets and self-training resources), the catalogue includes a section collecting the assets used by the teams for their projects. 

The front page of the catalogue offers access to user journeys, which are based on different user profiles and designed to make it easier for users to find resources relevant to their specific needs.

Through the Collection button, it is also possible to browse the whole collection as an alphabetical list of all the resources in the catalogue, and to search across them.

The catalogue was developed during the MediaFutures project with the primary aim of providing useful resources for the teams to carry out their projects. At the same time, we kept in mind other potential users of our platform, so that the catalogue may be useful beyond the scope of the MediaFutures calls, as a resource available to any internet user.