For further technical information check my GitHub repository.
This project aims to test the accuracy of patterns commonly cited in educational literature and then independently find the most accurate patterns. This process requires an initial collection of German nouns and their corresponding gender.
The most accessible source of nouns is the German Wiktionary, which consists of 197,663 articles in 200 languages, of which 90,738 are German (see: Wiktionary Sprachenübersicht).
An excellent alternative source for German nouns is the cooperative online translation platform, dict.cc. In addition to an extensive database that includes over 850,000 German-English translations, dict.cc has made their database accessible for download (see: Request the translation database of dict.cc). Once the database is downloaded it can be easily accessed with dict2, a dictionary viewing application for GNU/Linux with GTK+.
Alternatively, a 1975 study by Gehard Angst developed a list of 2,162 simple German nouns.1
The majority of the information contained within each Wiktionary article is easily accessible through the MediaWiki API used by Wiktionary. Genutiv uses python-wikitools to interact with that API. These tools allow Genutiv to quickly return a list of pages within each category. For example, the following commands issued into the interpreter will return a list containing unicode strings representing the title of each member of the Substantiv (Deutsch) category.
>>> import wikitools
>>> url = wikitools.wiki.Wiki('http://de.wiktionary.org/w/api.php')
>>> substantiv = wikitools.category.Category(url, title='Substantiv (Deutsch)')
>>> nouns = substantiv.getAllMembers(titleonly=True)
>>> len(nouns)
36228
The Substantiv (Deutsch) category consists of 36,217 pages
that each contain at least one German noun, its gender, a description
and further grammatical information. The discrepancy with the total in
the previous example results from the fact that the wikitools
getAllMembers
method includes parent and child categories
by default when returning members. Unfortunately, wikitools
does not provide access to the cmtype
parameter of the
list=categorymembers
module, which can be assigned to
page
in order to ignore those subcategories and files.
Unfortunately, the Substantiv (Deutsch) category contains proper nouns, regional and antiquated nouns, and words borrowed from other languages.2 To exclude these types of nouns Genutiv takes advantage of preexisting categories to filter out nouns from the list containing the 36,217 pages in Substantiv (Deutsch). Removing the nouns within the following categories results in a list containing 28,991 nouns, which is nearly 20% smaller than the original.3
Category | Pages | Parents |
---|---|---|
Eigenname (Deutsch) | 288 | Substantiv (Deutsch) |
Nachname (Deutsch) | 472 | Substantiv (Deutsch) |
Vorname (Deutsch) | 1175 | Substantiv (Deutsch) |
Toponym (Deutsch) | 3133 | Substantiv (Deutsch) |
Substantive (Althochdeutsch) | 2 | Substantiv (Deutsch) |
Substantive (Mittelhochdeutsch) | 0 | Substantiv (Deutsch) |
Substantive (Plattdeutsch) | 63 | Substantiv (Deutsch) |
Fremdwort | 2093 | Deutsch |
Unfortunately, the gender of each noun is buried in wiki markup,
which the MediaWiki software converts to HTML. Initially, Genutiv used
urllib2, Beautiful Soup and regular expressions to obtain the gender of
each noun. The script downloaded the source code of the article
corresponding to each noun and searched each line for an
<em>
tag with a title
attribute and then
processed the value (i.e. Maskulinum
,
Femininum
, Neutrum
). The trial run of this
script removed 149 pages from the dictionary, which contained at least
one number, dash or semi-colon. Of the remaining collection, 359 nouns
did not match a gender and similarly not passed on.
A newer version of Genutiv took advantage of the prop
module within the MediaWiki API, which can retrieve a series of
properties for each page, including a list of templates. Genutiv would
check the list of templates returned for each page and assign the gender
based on whether it found a masculine, feminine or neuter template
(i.e. {{ "{{m"}}}}
,
{{ "{{f"}}}}
or {{ "{{n"}}}}
).
The following test function accomplishes this task:
def gender(self, nouns):
"""Use MediaWiki API prop module to find relevant gender templates and
assign gender to corresponding value"""
for noun in nouns:
= wikitools.page.Page(self.site, title=noun)
page = page.getTemplates()
templates
for template in templates: #TODO(PM) Account for{{ "{{mf"}}}}
if template == u'Vorlage:f':
= "Femininum"
nouns[page.title] break
elif template == u'Vorlage:m':
= "Maskulinum"
nouns[page.title] break
elif template == u'Vorlage:n':
= "Neutrum"
nouns[page.title] break
return nouns
Unfortunately, this process leads to a large number of incorrect
assignments. The order of the templates in the list generated by
wiki-tools does not correspond to their location on the page. Instead,
they are sorted alphabetically with case sensitivity. Since the
aforementioned gender templates often appear more than once on a page it
can be difficult to determine which template accurately reflects the
gender of the noun. For example, if the page corresponding to a
masculine noun contains a {{ "{{f"}}}}
, the noun will be
assigned a feminine gender. This process is further complicated by the
fact that it is nearly impossible to determine the genders of nouns with
more than one definition (e.g. Golf). A quick comparison
of the prop
method with a list generated on September 23,
2011 revealed 26,706 nouns were assigned the same gender, but 5,141
nouns were assigned a different gender. Last, but certainly not least,
there may not be a noticeable performance difference between the
crawling method and the prop
method.
Although Wiktionary and dict.cc provide verbose sources of information they also distort the assessment of a pattern’s accuracy with the inclusion of a large number of foreign proper nouns and compound nouns.
German compound nouns always take the gender and plural ending of the
last noun in the compound. For example, the German feminine nouns Galgenfrist and Frist can both be counted as
exceptions to the ending -ist
,
which typically denotes a masculine noun. However, when a student is
learning the gender of these nouns she only needs to learn the gender of
Frist, since Galgenfrist shares the same gender.
Therefore, counting these nouns as two exceptions distorts the accuracy
of a pattern to predict the gender of a noun.
Additionally, there is a small number of German nouns that have dual gender. These nouns are spelled identically, but have different genders and definitions (e.g. Band).
These distortions are apparent in the atypical distribution of gender within the Wiktionary collection (see: Gender distribution). The first version of Genutiv returned 34,719 nouns with the following distribution: 12,087 (35%) masculine, 12,948 (37%) feminine and 9,684 (28%) neuter. This represents a significantly more even distribution than is commonly accepted. Notably, there are more feminine nouns than masculine, which is contrary to previous studies.
Another notable, but less prevalent, distortion is the result of user
errors. For example, the first analysis of Wiktionary nouns revealed
that 99.1% of the 213 nouns ending in -tät
are feminine nouns, with two
exceptions: Sozialität
and Sprachloyalität.
On Wiktionary those nouns were listed as masculine and neuter.4 Both of these errors existed since
the articles were created December 30, 2006 and March 2, 2008
respectively. Once these corrections are made the ending
-tät
becomes 100% accurate. This process presents a
surprising use of Genutiv, in that it becomes a tool to check
the gender of nouns listed on Wiktionary. Additionally, it should be
noted, that Sprachloyalität is a compound noun that ends in the
feminine noun Loyalität,
which should have been filtered from the source database anyway.
Many morphological patterns are only accurate when combined with semantics (see: Morphological patterns).