Wikipedia Academy: Research and Free Knowledge.
June 29 - July 1 2012, Berlin.


Revision as of 23:21, 28 June 2012 by Daniel Mietchen


All three workshops take place concurrently on 29 June 2012, 09:00 – 12:00, at the conference venue.

WORKSHOP: Wikipedia data analysis for researchers

by Felipe Ortega

Max. number of participants: 15 - This workshop is fully booked.

Goals and content

Since 2003, many research areas have seen steady growth in the number of scientific studies and publications that use Wikipedia data, either as a case study or as their main focus of interest. The most recent survey of academic studies on Wikipedia[1] identified 39 theses and more than 2,100 research articles. However, despite the increasing popularity of Wikipedia as a versatile information source for research, many tools and methods for Wikipedia data retrieval and analysis still go unnoticed by many scholars and scientists. The main goal of this tutorial is therefore to offer a comprehensive introduction to the state of the art in Wikipedia data analysis for researchers, with a special emphasis on effective methodologies for working with Wikipedia data. The tutorial focuses on the practical aspects of Wikipedia data analysis, providing numerous real examples from existing studies and solid grounds for conducting your own analyses.

Learning outcomes

  • Understand the main areas already explored in Wikipedia research and the pending research questions, along with the most promising lines for future work on this topic.
  • Identify the available Wikipedia data sources, where to find data for your study, and the pros and cons of each alternative source for data retrieval.
  • Learn effective strategies for Wikipedia data storage and management using affordable equipment, without resorting to expensive parallel architectures or complicated procedures for distributed computing.
  • Analyse Wikipedia data using R and RStudio. Identify suitable standard and contributed libraries in R to facilitate the analysis.
  • Adapt existing tools for Wikipedia data analysis to automate this process in your own studies, making them less error-prone.
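As a rough illustration of the kind of automated analysis these outcomes refer to, the following minimal Python sketch counts revisions per contributor in a tiny XML fragment. The fragment is a simplified, hypothetical stand-in for a Wikipedia XML dump (real dumps use an XML namespace and carry many more fields), and the function name revisions_per_contributor is an assumption made for this example, not a tool presented in the workshop.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Simplified, hypothetical dump fragment (real dumps are namespaced and far larger).
SAMPLE_DUMP = """<mediawiki>
  <page>
    <title>Berlin</title>
    <revision>
      <timestamp>2012-06-01T10:00:00Z</timestamp>
      <contributor><username>Alice</username></contributor>
    </revision>
    <revision>
      <timestamp>2012-06-02T11:30:00Z</timestamp>
      <contributor><username>Bob</username></contributor>
    </revision>
    <revision>
      <timestamp>2012-06-03T09:15:00Z</timestamp>
      <contributor><username>Alice</username></contributor>
    </revision>
  </page>
  <page>
    <title>Wikipedia</title>
    <revision>
      <timestamp>2012-06-04T08:00:00Z</timestamp>
      <contributor><ip>192.0.2.1</ip></contributor>
    </revision>
  </page>
</mediawiki>"""


def revisions_per_contributor(xml_text):
    """Count revisions per contributor name (or IP address for anonymous edits)."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for revision in root.iter("revision"):
        name = revision.findtext("contributor/username")
        if name is None:
            # Anonymous edits are recorded by IP address in this simplified format.
            name = revision.findtext("contributor/ip", default="(unknown)")
        counts[name] += 1
    return counts


if __name__ == "__main__":
    for contributor, edits in revisions_per_contributor(SAMPLE_DUMP).most_common():
        print(contributor, edits)
```

For full-size dumps, a streaming parser such as xml.etree.ElementTree.iterparse would be a more memory-friendly choice than loading the whole file at once.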

Intended audience and background

This tutorial offers an accessible introduction to the methods and practice of Wikipedia data analysis for researchers from a wide variety of areas. It assumes that attendees have a modest background in statistics (an undergraduate introductory course covering exploratory data analysis, basic inference and hypothesis testing will suffice). Very basic programming competency (in any programming language) will help in following the practical examples, but it is not mandatory. We will use RStudio, which offers an easy-to-use, friendly interface for researchers having their first contact with R, as well as convenient shortcuts and added features for more experienced R users. R is used today in a wide variety of disciplines outside Computer Science and Engineering, and it has proven to be a valuable tool for researchers without extensive training in programming.


This is a half-day tutorial, with an estimated total duration of 3 hours, including a 10-min break in between the two main sections.

Section 1
Preparing for Wikipedia data analysis (80 mins).
1.1 Understanding Wikipedia data sources (30 mins).
1.2 Data preparation and storage (25 mins).
1.3 Available tools for Wikipedia data analysis (25 mins).
Section 2
Conducting Wikipedia data analysis (90 mins).
2.1 Caveats and known issues (15 mins).
2.2 Real examples and case studies (75 mins).
2.2.1. Pages and content.
2.2.2. Community and editing process.
2.2.3. Traffic and usage.

Preparation and references

To benefit fully from this tutorial, please install the following tools and programming languages, all of which are open source and multi-platform:

  • R statistical environment[2] (see "How can R be installed?") and RStudio[3].
  • Additional R packages[4]: RMySQL, DAAG, car, Hmisc, statnet.
  • MySQL database[5].
  • Python programming language[6].
  • Matplotlib[7] and scikit-learn[8].

WORKSHOP: Toolserver

by Marlen Caemmerer and Christian Thiele

Workshop content

The tutorial “Toolserver” will introduce the Wikimedia Toolserver and the ways it can be used to work with the Wikipedia databases.

The Wikimedia Toolserver is a cluster of servers operated by Wikimedia Deutschland e. V. It offers a replicated database of the Wikimedia projects. The Toolserver provides hosting and combined access for tools written by different people for the Wikimedia projects; it can also be used for scientific data analysis of projects such as Wikipedia, Wikisource, Wikimedia Commons, and others.

Participants will get an introduction to the working environment, including how to get an account and how to communicate with the Toolserver community. We will show some of the tools running on the Toolserver to demonstrate the possibilities, and give a short introduction to writing code that runs in the Toolserver environment. We will also explain the database structure of MediaWiki (the server software that runs Wikipedia) and what data is available.
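To give a flavour of the kind of query one might run against a replicated database, here is a minimal Python sketch that counts edits per article. It uses an in-memory SQLite database as a stand-in for the replicated MySQL databases, and its two tables are only loosely modelled on MediaWiki's page and revision tables: the real schema has many more columns and different types, and the sample rows are invented for illustration. The helper names build_demo_db and edits_per_article are likewise assumptions.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a much-simplified page/revision layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (
            page_id INTEGER PRIMARY KEY,
            page_namespace INTEGER,
            page_title TEXT
        );
        CREATE TABLE revision (
            rev_id INTEGER PRIMARY KEY,
            rev_page INTEGER,
            rev_user_text TEXT,
            rev_timestamp TEXT
        );
        """
    )
    # Invented sample rows; namespace 0 holds articles, namespace 1 talk pages.
    conn.executemany(
        "INSERT INTO page VALUES (?, ?, ?)",
        [(1, 0, "Berlin"), (2, 0, "Wikipedia"), (3, 1, "Berlin")],
    )
    conn.executemany(
        "INSERT INTO revision VALUES (?, ?, ?, ?)",
        [
            (10, 1, "Alice", "20120601100000"),
            (11, 1, "Bob", "20120602113000"),
            (12, 1, "Alice", "20120603091500"),
            (13, 2, "Carol", "20120604080000"),
            (14, 3, "Bob", "20120605120000"),
            (15, 3, "Alice", "20120606130000"),
        ],
    )
    return conn


def edits_per_article(conn):
    """Return (title, edit count) pairs for article-namespace pages, busiest first."""
    query = """
        SELECT p.page_title, COUNT(r.rev_id) AS edits
        FROM page AS p
        JOIN revision AS r ON r.rev_page = p.page_id
        WHERE p.page_namespace = 0
        GROUP BY p.page_title
        ORDER BY edits DESC, p.page_title
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for title, edits in edits_per_article(build_demo_db()):
        print(title, edits)
```

Against a real replica, the same SQL idea would be sent through a MySQL client library rather than sqlite3, and queries over large wikis would need care with indexes and result sizes.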

The workshop is intended for people who want to create tools for the Wikipedia community or to do data analysis on the replicated databases. Participants should have some basic programming skills; some knowledge of databases (especially SQL) is helpful.


Participants should bring a laptop with a web browser and an SSH client (for Windows: PuTTY).


by Daniel Mietchen

Goals and content

GLAM:Wiki collaborations often start on the basis of personal interest and good will. To establish a basis for long-term partnerships, however, quantitative measures are needed that capture the impact of the collaboration on parameters of interest to either or both partners, such as traffic statistics or article quality.

A number of tools have been written or adapted for GLAM:Wiki purposes, yet they are still not widely known. This workshop is meant as a hands-on introduction to the tools currently in use in GLAM:Wiki contexts.

Learning outcomes

  • A basic understanding of the kinds of data that are available on the wiki end
  • Practical experience with some basic options to visualize these data and to customize their visualization
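As a small, hedged illustration of working with traffic data of this kind, the Python sketch below sums hypothetical daily page-view counts per article and derives a daily mean. The records, article titles and the function name summarize_views are invented for this example; the GLAM:Wiki tools covered in the workshop obtain and present such figures in their own ways.

```python
from collections import defaultdict

# Hypothetical (date, article title, views) records, invented for illustration.
DAILY_VIEWS = [
    ("2012-06-01", "Article_A", 120),
    ("2012-06-02", "Article_A", 130),
    ("2012-06-03", "Article_A", 100),
    ("2012-06-01", "Article_B", 40),
    ("2012-06-02", "Article_B", 60),
]


def summarize_views(records):
    """Return total views and mean views per observed day for each article."""
    totals = defaultdict(int)
    days = defaultdict(set)
    for day, title, views in records:
        totals[title] += views
        days[title].add(day)
    return {
        title: {
            "total": totals[title],
            "daily_mean": totals[title] / len(days[title]),
        }
        for title in totals
    }


if __name__ == "__main__":
    for title, stats in sorted(summarize_views(DAILY_VIEWS).items()):
        print(title, stats["total"], round(stats["daily_mean"], 1))
```

A summary like this could then be handed to a plotting library to compare articles before and after a collaboration starts.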

Intended audience and background

  • This tutorial is aimed at
    1. GLAM professionals interested in tracking the reuse of their materials on Wikimedia projects
    2. Wikimedians with an interest in tracking the use of materials from specific sources, particularly GLAMs.


This is a half-day tutorial with an estimated total duration of 3 hours including a 10-min break in between the two main sections.

Section 1
Use cases for GLAM:Wiki tools (75 mins).
Section 2
Hands-on session (90 mins).

Preparation and notes

Participants are requested to select a set of articles, files or categories on a Wikimedia project of their choice and to prepare some questions they would like to see addressed about the usage of these resources. Notes will be taken collaboratively via .