Wikipedia hoaxes

The master webpage for this project is hosted at the University of Maryland:
http://cs.umd.edu/~srijan/hoax/

Dataset information

Wikipedia has over 35,000,000 articles in over 290 languages. However, not all the articles are genuine. Hoax articles are purely fabricated articles that were created to mislead people.

In the paper cited below, we study all actual and wrongly suspected hoaxes ever identified in the English version of Wikipedia. Most of them have been permanently deleted from Wikipedia's version history, so we had access to them only under a non-disclosure agreement. We are therefore unable to publish the full dataset used in the paper. Instead, we publish a smaller dataset of hoaxes that are also publicly available (on websites such as Speedy Deletion Wiki or Deletionpedia), alongside an equally sized set of non-hoaxes.

This dataset contains a set of 64 hoax articles that are publicly available, and have the following properties:

It also contains a set of 64 non-hoax articles that share the same three properties as the hoax articles. In addition, these non-hoax articles were selected such that (i) for each hoax, there is a non-hoax article created on the same day, and (ii) the two sets have similar appearance features (see Section 6 of the paper).

Sources (citations)

Files

The dataset contains four folders:

File               Description                               Size
wiki-hoaxes.zip    Content of hoax and non-hoax articles     1.0 MB
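
After downloading the archive, a quick way to check its contents is to count the files under each top-level folder. The following is a minimal sketch, not part of the official dataset tooling: the function name count_entries_by_top_folder is hypothetical, and it assumes only that wiki-hoaxes.zip is a standard zip archive saved locally. It makes no assumption about folder names or file formats inside the archive.

```python
import zipfile
from collections import Counter


def count_entries_by_top_folder(zip_path):
    """Return {top-level name: number of files beneath it} for a zip archive.

    zip_path may be a filesystem path or a binary file-like object.
    A file stored at the archive root is counted under its own name.
    """
    counts = Counter()
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries; only files are counted
            top_level = info.filename.split("/", 1)[0]
            counts[top_level] += 1
    return dict(counts)
```

Calling count_entries_by_top_folder("wiki-hoaxes.zip") returns a dictionary mapping each top-level folder to its file count, which can be compared against the article counts stated above.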