ParaNames: A Massively Multilingual Entity Name Corpus
This recently published preprint describes work in progress on ParaNames, a multilingual parallel name resource consisting of names for approximately 14 million entities. The included names span over 400 languages, and almost all entities are mapped to standardized entity types (PER/LOC/ORG). Using Wikidata as a source, the authors create the largest resource of this type to date. They also describe how they filter and standardize the data to improve its quality.
The dataset uses only the “label” property in Wikidata to identify names for entities. One potential limitation of this approach is that an entity can have only a single label per Wikimedia language code, even though there may be multiple valid transliterations of the entity's name for that language code. Wikidata offers a possible remedy: the “also-known-as” (AKA) property, which for many entities contains useful examples of real-world names used to refer to them and can include alternative transliterations. AKAs are not limited to name variants, however. In the case of Donald Trump, the AKAs contain other variations of his name (Donald John Trump, Donald J. Trump, etc.), but also pseudonyms that he has used that do not correspond to his actual name (John Barron, John Miller, David Dennison, etc.).
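The difference between labels and AKAs can be made concrete with a minimal sketch. It assumes the shape of Wikidata's entity JSON, in which "labels" holds one entry per language code and "aliases" holds a list per language code. The record below is hand-written for illustration, and the helper collect_names is hypothetical; it is not part of ParaNames or of any Wikidata tooling.

```python
from collections import defaultdict

# Hand-written fragment in the shape of Wikidata's entity JSON.
# The values are illustrative, not a dump of the real record.
entity = {
    "labels": {
        "en": {"language": "en", "value": "Donald Trump"},
        "de": {"language": "de", "value": "Donald Trump"},
    },
    "aliases": {
        "en": [
            {"language": "en", "value": "Donald John Trump"},
            {"language": "en", "value": "Donald J. Trump"},
            {"language": "en", "value": "John Barron"},
        ],
    },
}


def collect_names(entity, include_aliases=False):
    """Return {language code: [names]} for one entity (hypothetical helper).

    With include_aliases=False only the single label per language is kept,
    which mirrors the label-only approach described above.
    """
    names = defaultdict(list)
    # Exactly one label per language code.
    for lang, label in entity.get("labels", {}).items():
        names[lang].append(label["value"])
    if include_aliases:
        # Zero or more aliases per language code, appended after the label.
        for lang, aliases in entity.get("aliases", {}).items():
            for alias in aliases:
                if alias["value"] not in names[lang]:
                    names[lang].append(alias["value"])
    return dict(names)


label_only = collect_names(entity)
with_aliases = collect_names(entity, include_aliases=True)
```

Here label_only holds one name per language, while with_aliases["en"] also picks up the pseudonym "John Barron". Any use of AKAs as alternative transliterations would therefore likely need an additional filtering step.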
ParaNames is useful for multilingual language processing, both in defining tasks for name translation/transliteration and as supplementary data for tasks such as named entity recognition and linking. The creators demonstrate an application of ParaNames by training a multilingual model for canonical name translation to and from English.
The resource is released at url under a Creative Commons license (CC BY 4.0).