Archival Data for Page Protection: Another Missing Dimension of Wikipedia Research
This dataset contains data and software for the following paper:
Hill, Benjamin Mako and Shaw, Aaron. (2015) “Page Protection: Another Missing Dimension of Wikipedia Research.” In Proceedings of the 11th International Symposium on Open Collaboration (OpenSym 2015). ACM Press. doi: 10.1145/2788993.2789846
This is an archival version of the data and software released with the paper. All of these data were (and, at the time of writing, continue to be) hosted at: https://communitydata.cc/wiki-proetection/
Page protection is a feature of MediaWiki software that allows administrators to restrict contributions to particular pages. For example, a page can be “protected” so that only administrators or logged-in editors with a history of good editing can edit, move, or create it.
Protection might involve “full protection” where a page can only be edited by administrators (i.e., “sysops”) or “semi-protection” where a page can only be edited by accounts with a history of good edits (i.e., “autoconfirmed” users).
Although largely hidden, page protection profoundly shapes activity on the site. For example, page protection is an important tool used to manage access and participation in situations where vandalism or interpersonal conflict can threaten to undermine content quality. While protection affects only a small portion of pages in English Wikipedia, many of the most highly viewed pages are protected. For example, the “Main Page” in English Wikipedia has been protected since February, 2006 and all Featured Articles are protected at the time they appear on the site’s main page. Millions of viewers may never edit Wikipedia because they never see an edit button.
Despite its widespread and influential nature, very little quantitative research on Wikipedia has taken page protection into account systematically. This page contains software and data to help Wikipedia researchers do exactly this in their work.
Because a page's protection status can change over time, the snapshots of page protection data stored by Wikimedia and published by the Wikimedia Foundation as dumps are incomplete. As a result, taking protection into account involves looking at several different sources of data.
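As a rough illustration of why a single snapshot is not enough, the sketch below replays a page's protection events in time order to recover its edit-protection level at an arbitrary moment. The event structure, the action labels ("protect", "unprotect"), and the function name are assumptions made for this example; they are not the schema of the released data or of the MediaWiki logs, and the sketch ignores expiring protections and move or create protection.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ProtectionEvent:
    timestamp: datetime
    action: str           # hypothetical labels: "protect" or "unprotect"
    level: Optional[str]  # "sysop" (full), "autoconfirmed" (semi), or None


def edit_protection_at(events, when):
    """Return the edit-protection level in force at `when`, or None."""
    level = None
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.timestamp > when:
            break
        # A "protect" event sets the level; anything else clears it.
        level = event.level if event.action == "protect" else None
    return level


# Made-up history for a single page, for illustration only.
events = [
    ProtectionEvent(datetime(2010, 2, 1), "protect", "sysop"),
    ProtectionEvent(datetime(2011, 5, 1), "unprotect", None),
    ProtectionEvent(datetime(2012, 1, 1), "protect", "autoconfirmed"),
]

print(edit_protection_at(events, datetime(2010, 6, 1)))
print(edit_protection_at(events, datetime(2011, 6, 1)))
```

A snapshot taken in 2013 would report only semi-protection for this made-up page and would miss the earlier period of full protection.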
Much more detail can be found in our paper Page Protection: Another Missing Dimension of Wikipedia Research. If you use this software or these data, we would appreciate it if you cite the paper.
Archival Data for Consider the Redirect: A Missing Dimension of Wikipedia Research
This dataset contains data and software for the following paper:
Hill, Benjamin Mako and Shaw, Aaron. (2014) "Consider the Redirect: A Missing Dimension of Wikipedia Research." In Proceedings of the 10th International Symposium on Open Collaboration (OpenSym 2014). ACM Press. doi: 10.1145/2641580.2641616
This is an archival version of the data and software released with the paper. All of these data were originally (and, at the time of writing, continue to be) hosted at: https://communitydata.cc/wiki-redirects/
In wikis, redirects are special pages that silently take readers from the page they are visiting to another page in the wiki. In the English Wikipedia, redirects make up more than half of all article pages. Different Wikipedia data sources handle redirects differently. For example, the MediaWiki API will automatically "follow" redirects but the XML database dumps treat redirects like normal articles. In both cases, redirects are often invisible to researchers.
Because redirects constitute a majority of all pages and see a large portion of all traffic, Wikipedia researchers need to take redirects into account or their findings may be incomplete or incorrect. For example, the histogram on this page shows the distribution of edits across pages in Wikipedia for every page, and for non-redirects only. Because redirects are almost never edited, the distributions are very different. Similarly, because redirects are viewed but almost never edited, any study of views over articles should also take redirects into account.
Because redirects can change over time, the snapshots of redirects stored by Wikimedia and published by the Wikimedia Foundation are incomplete. Taking redirects into account fully involves looking at the content of every single revision of every article to determine both when and where pages redirect. Much more detail can be found in Consider the Redirect: A Missing Dimension of Wikipedia Research, a short paper that we have written to accompany this dataset and these tools. If you use this software or these data, we would appreciate it if you cite the paper.
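The revision-by-revision approach described above can be sketched as follows. This is a minimal illustration and not the released software: the function names and the (timestamp, wikitext) input format are assumptions, and the pattern recognizes only the English "#REDIRECT [[Target]]" keyword, not the localized aliases used on other language editions.

```python
import re

# Matches "#REDIRECT [[Target]]" at the start of a revision's text,
# capturing the target title up to any "|", "#", or "]".
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)


def redirect_target(wikitext):
    """Return the redirect target of one revision, or None."""
    match = REDIRECT_RE.match(wikitext)
    return match.group(1).strip() if match else None


def redirect_spells(revisions):
    """Turn (timestamp, wikitext) pairs, sorted by time, into
    (start, end, target) spells; end is None if still redirecting."""
    spells = []
    current = None  # (start timestamp, target) of the open spell
    for timestamp, text in revisions:
        target = redirect_target(text)
        if current is not None and target != current[1]:
            spells.append((current[0], timestamp, current[1]))
            current = None
        if target is not None and current is None:
            current = (timestamp, target)
    if current is not None:
        spells.append((current[0], None, current[1]))
    return spells


# Made-up revision history for a single page.
history = [
    ("2010-01-01", "Some article text"),
    ("2011-01-01", "#REDIRECT [[Target A]]"),
    ("2012-01-01", "#redirect [[Target B]]"),
    ("2013-01-01", "Article text again"),
]
print(redirect_spells(history))
```

In this made-up history the page redirects to two different targets and ends as a normal article, so a snapshot taken after the last revision would show no redirect at all.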
This dataset was previously hosted at this now obsolete URL: http://networkcollectiv.es/wiki-redirects/
Archival dataset: A longitudinal dataset of five years of public activity in the Scratch online community
Scratch is a programming environment and an online community where young people can create, share, learn, and communicate. In collaboration with the Scratch Team at MIT, we created a longitudinal dataset of public activity in the Scratch online community during its first five years (2007-2012). The dataset comprises 32 tables with information on more than 1 million Scratch users, nearly 2 million Scratch projects, more than 10 million comments, more than 30 million visits to Scratch projects, and more. To help researchers understand this dataset, and to establish the validity of the data, we also include the source code of every version of the software that operated the website, as well as the software used to generate this dataset. We believe this is the largest and most comprehensive downloadable dataset of youth programming artifacts and communication.
All data tables included in this dataset are access restricted. Individuals should request access to these data by filling out a form and agreeing to the Scratch Research Data Sharing Agreement. The text of the agreement and instructions for requesting access are in the file named scratch-data-agreement-form.pdf in this repository and available at the following URL: https://dataverse.harvard.edu/file.xhtml?fileId=3102931
Documentation for this dataset can be found at https://communitydata.cc/scratch-data/
Replication Data for Are anonymity-seekers just like everybody else? An analysis of contributions to Wikipedia from Tor
The dataset comprises revisions made by Tor users to various language versions of Wikipedia from October 2007 to February 2018. It also contains three sets of time-matched random samples of revisions made to the English Wikipedia by IP editors, first-time registered editors, and registered editors.
Access to this dataset is currently restricted. Individuals should request access to these data by emailing Dr. Benjamin Mako Hill at [email protected]
Replication Data for: The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users
This dataset contains the data and code necessary to replicate work in
the following paper:
Narayan, Sneha, Jake Orlowitz, Jonathan Morgan, Benjamin Mako Hill,
and Aaron Shaw. 2017. “The Wikipedia Adventure: Field Evaluation of
an Interactive Tutorial for New Users.” In Proceedings of the 20th
ACM Conference on Computer-Supported Cooperative Work & Social
Computing (CSCW '17). New York, New York: ACM Press.
http://dx.doi.org/10.1145/2998181.2998307
The published paper contains two studies. Study 1 is a descriptive
analysis of a survey of Wikipedia editors who played a gamified
tutorial. Study 2 is a field experiment that evaluated the same
tutorial. These data are the data used in the field experiment
described in Study 2.
Description of Files
This dataset contains the following files beyond this README:
twa.RData — An RData file that includes all variables used in Study
2.
twa_analysis.R — A GNU R script that includes all the code used to
generate the tables and plots related to Study 2 in the paper.
The RData file contains one variable (d) which is an R dataframe
(i.e., table) that includes the following columns:
userid (integer): The unique numerical ID representing each user
in our sample. These are 8-digit integers and describe public
accounts on Wikipedia.
sample.date (date string): The day the user was recruited to the
study. Dates are formatted in “YYYY-MM-DD” format. In the case of
invitees, it is the date their invitation was sent. For users in the
control group, this is the date that they would have been invited
to the study.
edits.all (integer): The total number of edits made by the user on
Wikipedia in the 180 days after they joined the study. Edits to
users' user pages, user talk pages, and subpages are ignored.
edits.ns0 (integer): The total number of edits made by user to
article pages on Wikipedia in the 180 days after they joined the
study.
edits.talk (integer): The total number of edits made by user to talk
pages on Wikipedia in the 180 days after they joined the
study. Edits to a user's user page, user talk page and subpages are
ignored.
treat (logical): TRUE if the user was invited, FALSE if the user was
in control group.
play (logical): TRUE if the user played the game. FALSE if the user
did not. All users in control are listed as FALSE because any user
who had not been invited to the game but played was removed.
twa.level (integer): Takes a value of 0 if the user has not played
the game. Ranges from 1 to 7 for those who did, indicating the
highest level they reached in the game.
quality.score (float): This is the average word persistence (over a
6 revision window) over all edits made by this userid.
Our measure of word persistence (persistent word revision per word)
is a measure of edit quality developed by Halfaker et al. that
tracks how long words in an edit persist after subsequent revisions
are made to the wiki-page. For more information on how word
persistence is calculated, see the following paper:
Halfaker, Aaron, Aniket Kittur, Robert Kraut, and John
Riedl. 2009. “A Jury of Your Peers: Quality, Experience and
Ownership in Wikipedia.” In Proceedings of the 5th International
Symposium on Wikis and Open Collaboration (OpenSym '09),
1–10. New York, New York: ACM
Press. doi:10.1145/1641309.1641332.
Or this page: https://meta.wikimedia.org/wiki/Research:Content_persistence
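The column descriptions above imply a few consistency rules that can be checked once the table is loaded: control users never have play set to TRUE, twa.level is 0 for users who did not play, and it ranges from 1 to 7 for those who did. The sketch below expresses those checks in Python. The Participant class, the field name twa_level (standing in for twa.level), and the sample user IDs are hypothetical and for illustration only; the released analysis code is the R script twa_analysis.R.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Participant:
    userid: int
    treat: bool     # TRUE if invited, FALSE if in the control group
    play: bool      # TRUE if the user played the game
    twa_level: int  # stands in for the twa.level column


def consistency_problems(p):
    """Return a list of rule violations for one row (empty if none)."""
    problems = []
    if p.play and not p.treat:
        problems.append("control user marked as having played")
    if not p.play and p.twa_level != 0:
        problems.append("non-player with nonzero twa.level")
    if p.play and not 1 <= p.twa_level <= 7:
        problems.append("player with twa.level outside 1..7")
    return problems


# Made-up rows; these user IDs do not come from the dataset.
rows = [
    Participant(userid=10000001, treat=True, play=True, twa_level=4),
    Participant(userid=10000002, treat=False, play=False, twa_level=0),
    Participant(userid=10000003, treat=False, play=True, twa_level=2),
]
for row in rows:
    print(row.userid, consistency_problems(row))
```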
How we created twa.RData
The file twa.RData combines datasets drawn from three places:
A dataset created by Wikimedia Foundation staff that tracked the
details of the experiment and how far people got in the game.
The variables userid, sample.date, treat, play, and twa.level were
all generated in a dataset created by WMF staff when The Wikipedia
Adventure was deployed. All users in the sample created their
accounts within 2 days before the date they were entered into the
study. None of them had received a Teahouse invitation, a Level 4
user warning, or been blocked from editing at the time that they
entered the study. Additionally, all users made at least one edit
after the day they were invited. Users were sorted randomly into
treatment and control groups, based on which they either received
or did not receive an invite to play The Wikipedia Adventure.
Edit and text persistence data drawn from public XML dumps created
on May 21st, 2015.
We used publicly available XML dumps to generate the outcome
variables, namely edits.all, edits.ns0, edits.talk and
quality.score. We first extracted all edits made by users in our
sample during the six-month period after they joined the study,
excluding edits made to user pages or user talk pages. We
parsed the XML dumps using the Python based wikiq and
MediaWikiUtilities software online at:
http://projects.mako.cc/source/?p=mediawiki_dump_tools
https://github.com/mediawiki-utilities/python-mediawiki-utilities
We obtained the XML dumps from: https://dumps.wikimedia.org/enwiki/
A list of edits made by users in our study that were subsequently
deleted, created on August 3rd, 2015.
The WMF staff created a dataset that listed all the edits made by
users in our study that were deleted before August 3rd, 2015. We
made the decision to include these edits in our counts, so as to
measure the total level of participation undertaken by each
editor. If a user in our study made article or talk page edits that
were subsequently deleted, we would use the deleted edit logs to
identify them, and increment the variables edits.all, edits.ns0,
and edits.talk as appropriate. We decided that all edits drawn from
the deleted edit logs would be defined to have an edit persistence
score of 0, since they were deleted from Wikipedia.
We “manually” merged these datasets together.
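The way the outcome variables combine live and deleted edits can be sketched as below. This is a simplified illustration under stated assumptions, not the code used to build twa.RData: the tuple formats and the function name are hypothetical, the 180-day window is assumed to start on the sample date itself, "talk pages" is assumed to mean namespace 1, and the quality score is assumed to be a plain mean over all counted edits. Namespaces 0, 1, 2, and 3 are MediaWiki's article, Talk, User, and User talk namespaces.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=180)
ARTICLE_NS, TALK_NS, USER_NS, USER_TALK_NS = 0, 1, 2, 3


def summarize(sample_date, live_edits, deleted_edits):
    """live_edits: (edit_date, namespace, persistence) tuples.
    deleted_edits: (edit_date, namespace) tuples; each is counted
    with a persistence score of 0."""
    edits = list(live_edits) + [(d, ns, 0.0) for d, ns in deleted_edits]
    kept = [
        (d, ns, score)
        for d, ns, score in edits
        if sample_date <= d < sample_date + WINDOW
        and ns not in (USER_NS, USER_TALK_NS)
    ]
    scores = [score for _, _, score in kept]
    return {
        "edits.all": len(kept),
        "edits.ns0": sum(1 for _, ns, _ in kept if ns == ARTICLE_NS),
        "edits.talk": sum(1 for _, ns, _ in kept if ns == TALK_NS),
        "quality.score": sum(scores) / len(scores) if scores else None,
    }


# Made-up edits for one user, for illustration only.
live = [
    (date(2014, 1, 10), ARTICLE_NS, 2.0),
    (date(2014, 2, 1), TALK_NS, 1.0),
    (date(2014, 3, 1), USER_NS, 5.0),     # ignored: user page
    (date(2014, 12, 1), ARTICLE_NS, 3.0), # ignored: outside the window
]
deleted = [(date(2014, 1, 15), ARTICLE_NS)]
print(summarize(date(2014, 1, 1), live, deleted))
```

Counting the deleted edit raises the edit totals while pulling the average persistence down, which matches the decision described above to measure total participation and to score deleted edits as 0.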
Contact Us
For more details about the dataset, please see our paper.
If you notice any bugs or issues with these data or code, please
contact Sneha Narayan (snehanarayan@u.northwestern.edu) or the
other authors of this paper.
