Wikipedia User Data Extraction and Preprocessing

Using the Wikipedia API to extract Wikipedia user data and preprocess it

Ruthu S Sanketh
4 min read · Jan 14, 2022

Contents

  1. Introduction
  2. Code Walkthrough
  3. Further Reading
Image by ipopba on Unsplash

Introduction

Wikipedia is one of the most frequently visited websites in the world and a key source of information for a significant proportion of people, so it is important to understand the composition of the community that produces its content. Registered Wikipedia editors with publicly accessible profile pages are referred to as users. These profiles include both free text that editors write about themselves, which can be used as input features, and userboxes displaying personality traits, which can be used as labels; together they can be fed into an algorithm or network to predict attrition rate, topical and social biases, inclusiveness, and discrimination. While this clearly has the potential to produce unintended side effects, such as compromising editor privacy, it also makes it possible to take actions that improve the legitimacy and neutrality of the world's largest online encyclopedia.

Wikipedia user data can be extracted using the Wikipedia API, which gives developers code-level access to the entire Wikipedia reference. The goal of the API is to provide direct, high-level access to the data contained in the MediaWiki databases. The API can be queried to return data in HTML or JSON format, which can then be parsed and preprocessed into any desired format.
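For illustration, here is a minimal sketch of such a query using the requests library, fetching two public attributes for a single username ("Jimbo Wales" is used purely as an example):

import requests

# Ask the MediaWiki API for public metadata about one user;
# the username here is purely illustrative
response = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "format": "json",           # request JSON rather than HTML/XML
        "list": "users",
        "ususers": "Jimbo Wales",
        "usprop": "editcount|gender"
    }
)
print(response.json())
# e.g. {'query': {'users': [{'userid': ..., 'name': 'Jimbo Wales',
#                            'editcount': ..., 'gender': ...}]}}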

Image by putilich on Unsplash

Code Walkthrough

First we import the required libraries.

import requests       # for querying the MediaWiki API directly
import wikipedia      # for fetching the 'About' text from user pages
import wikipediaapi   # alternative wrapper for the same API; not used below
import pandas as pd   # for dataframes and Excel I/O

Now we take the list of users whose data is required and load it into a dataframe. You can get this dataset from Wikipedia itself, or you can access a sample dataset here. We create an empty list for each type of data that we want to extract, such as gender, edit count, etc.

# Load the list of usernames from an Excel sheet
df = pd.read_excel(r"C:\<data_path>\user_names.xlsx")
print(df.shape)
print(df.head())

# Empty lists to hold each attribute returned by the API
userid = []
editcount = []
gender = []
blockinfo = []
emailable = []

We format the usernames into a single string, with usernames separated by a pipe, which can be fed directly into the API's query parameter. Since the API accepts at most 50 users per request, we query it in batches of 49, appending the results for each required attribute to the previously created lists.

S = requests.Session()
URL = "https://en.wikipedia.org/w/api.php"

for i in range(1, 76):
    # Take the next batch of 49 usernames (the API accepts at most 50
    # names per request) and join them with pipes; joining avoids a
    # trailing pipe, which the API would read as an empty username
    a = df["Usernames"][(i - 1) * 49 : i * 49]
    U = "|".join(str(b) for b in a.values)

    PARAMS = {
        "action": "query",
        "format": "json",
        "list": "users",
        "ususers": U,
        "usprop": "blockinfo|editcount|emailable|gender"
    }
    R = S.get(url=URL, params=PARAMS)
    DATA = R.json()
    USERS = DATA["query"]["users"]

    for u in USERS:
        # Record each attribute if the API returned it, else a placeholder;
        # 'emailable' and 'blockid' only appear when they apply, so their
        # presence is encoded as a binary flag
        userid.append(u.get("userid", "unknown"))
        editcount.append(u.get("editcount", "unknown"))
        gender.append(u.get("gender", "unknown"))
        emailable.append(1 if "emailable" in u else 0)
        blockinfo.append(1 if "blockid" in u else 0)

    # Write this batch of results back into the dataframe
    for j in range((i - 1) * 49, i * 49):
        df.loc[j, "User ID"] = userid[j]
        df.loc[j, "Edit Count"] = editcount[j]
        df.loc[j, "Gender"] = gender[j]
        df.loc[j, "Blockinfo"] = int(blockinfo[j])
        df.loc[j, "Emailable"] = int(emailable[j])

This block of code may have to be run more than once, depending on the number of usernames. With a set of 3,700 usernames, for example, the data can be collected in two passes: the loop above, with i in range(1, 76), covers the slices [(i-1)*49 : i*49] and hence the first 3,675 names, and a second pass over the slice [3675:3700] picks up the remaining 25 (a sketch of this remainder pass follows). Once the API has been queried, we add a code block to collect the 'About' section of each user profile, if it exists, and save the preprocessed data to an Excel sheet for further usage.
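A minimal sketch of that remainder pass, assuming the same session, URL, dataframe, and attribute lists as above (the bounds 3675 and 3700 follow from the 3,700-username example):

# Remainder pass: the 25 usernames not covered by the 75 batches of 49
a = df["Usernames"][3675:3700]
U = "|".join(str(b) for b in a.values)

PARAMS = {
    "action": "query",
    "format": "json",
    "list": "users",
    "ususers": U,
    "usprop": "blockinfo|editcount|emailable|gender"
}
DATA = S.get(url=URL, params=PARAMS).json()

for u in DATA["query"]["users"]:
    userid.append(u.get("userid", "unknown"))
    editcount.append(u.get("editcount", "unknown"))
    gender.append(u.get("gender", "unknown"))
    emailable.append(1 if "emailable" in u else 0)
    blockinfo.append(1 if "blockid" in u else 0)

for j in range(3675, 3700):
    df.loc[j, "User ID"] = userid[j]
    df.loc[j, "Edit Count"] = editcount[j]
    df.loc[j, "Gender"] = gender[j]
    df.loc[j, "Blockinfo"] = int(blockinfo[j])
    df.loc[j, "Emailable"] = int(emailable[j])

The 'About' collection and export step then looks like this: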

for i in range(len(df)):
    a = df["Usernames"][i]
    try:
        # Fetch the lead section of the user page as the 'About' text
        df.loc[i, "About"] = wikipedia.summary("User:" + str(a))
    except Exception:
        df.loc[i, "About"] = "unknown"

# Save the preprocessed data for further usage
df.to_excel(r'<data_path>\final.xlsx', index=False, engine='xlsxwriter')
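A note on the try/except: wikipedia.summary raises wikipedia.exceptions.PageError when no page with the given title exists and wikipedia.exceptions.DisambiguationError when the title is ambiguous, so catching these (or, as above, any Exception) and recording 'unknown' keeps the loop running across all profiles.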

The entire code used and explained in this article can be found here. The final preprocessed dataset can be found here.

Further Reading

  1. Bruckner, Lemmerich, Strohmaier, Inferring Sociodemographic Attributes of Wikipedia Editors: State-of-the-art and Implications for Editor Privacy, April 2021, NY, USA
  2. Jaidka et al., WikiTalkEdit: A Dataset for modeling Editors' behaviors on Wikipedia
  3. Wikipedia, The Free Encyclopedia
