Genutiv project

September 02, 2015

For further technical information check my GitHub repository.

This project aims to test the accuracy of patterns commonly cited in educational literature and then independently find the most accurate patterns. This process requires an initial collection of German nouns and their corresponding gender.

Data sources

The most accessible source of nouns is the German Wiktionary, which consists of 197,663 articles in 200 languages, of which 90,738 are German (see: Wiktionary Sprachenübersicht).

An excellent alternative source for German nouns is the cooperative online translation platform, dict.cc. In addition to an extensive database that includes over 850,000 German-English translations, dict.cc has made their database accessible for download (see: Request the translation database of dict.cc). Once the database is downloaded it can be easily accessed with dict2, a dictionary viewing application for GNU/Linux with GTK+.

Alternatively, a 1975 study by Gerhard Augst developed a list of 2,162 simple German nouns.¹

Wiktionary

The majority of the information contained within each Wiktionary article is easily accessible through the MediaWiki API used by Wiktionary. Genutiv uses python-wikitools to interact with that API. These tools allow Genutiv to quickly return a list of pages within each category. For example, the following commands, entered in the interpreter, return a list of unicode strings, one for the title of each member of the Substantiv (Deutsch) category.

>>> import wikitools
>>> url = wikitools.wiki.Wiki('http://de.wiktionary.org/w/api.php')
>>> substantiv = wikitools.category.Category(url, title='Substantiv (Deutsch)')
>>> nouns = substantiv.getAllMembers(titleonly=True)
>>> len(nouns)
36228

Category	Pages	Parents
Eigenname (Deutsch)	288	Substantiv (Deutsch)
Nachname (Deutsch)	472	Substantiv (Deutsch)
Vorname (Deutsch)	1175	Substantiv (Deutsch)
Toponym (Deutsch)	3133	Substantiv (Deutsch)
Substantive (Althochdeutsch)	2	Substantiv (Deutsch)
Substantive (Mittelhochdeutsch)	0	Substantiv (Deutsch)
Substantive (Plattdeutsch)	63	Substantiv (Deutsch)
Fremdwort	2093	Deutsch
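If the categories in the table above are the ones whose members were excluded from the collection, as the notes on noun selection suggest, the filtering step could be sketched roughly as follows. This is only an illustration: `exclude_categories` and the sample data are hypothetical, and the member lists are assumed to have been fetched beforehand in the same way as the noun list.

```python
def exclude_categories(nouns, excluded):
    """Return the nouns that belong to none of the excluded categories.

    nouns    -- list of page titles from the main noun category
    excluded -- dict mapping a category name to the list of its page titles
    """
    unwanted = set()
    for members in excluded.values():
        unwanted.update(members)
    # Preserve the original order of the noun list.
    return [noun for noun in nouns if noun not in unwanted]


# Hypothetical sample data standing in for the API results.
nouns = [u"Frist", u"Berlin", u"Anna", u"Band"]
excluded = {
    u"Toponym (Deutsch)": [u"Berlin"],
    u"Vorname (Deutsch)": [u"Anna"],
}
filtered = exclude_categories(nouns, excluded)
```

Collecting the unwanted titles in a set keeps each membership check cheap, which matters once the noun list holds tens of thousands of entries.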

Gender

Unfortunately, the gender of each noun is buried in wiki markup, which the MediaWiki software converts to HTML. Initially, Genutiv used urllib2, Beautiful Soup and regular expressions to obtain the gender of each noun. The script downloaded the source code of the article corresponding to each noun, searched each line for an <em> tag with a title attribute and then processed the value (i.e. Maskulinum, Femininum, Neutrum). The trial run of this script removed from the dictionary 149 pages that contained at least one number, dash or semicolon. Of the remaining collection, 359 nouns did not match a gender and were similarly not passed on.
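The crawling approach described above can be approximated with the standard library alone. The sketch below is not the original script: it swaps Beautiful Soup for a single regular expression, and the names `gender_from_html` and `has_excluded_character`, as well as the exact shape of the `<em>` tag, are assumptions made for illustration.

```python
import re

GENDERS = ("Maskulinum", "Femininum", "Neutrum")

# Assumed markup: an <em> tag whose title attribute names the gender.
EM_TITLE = re.compile(r'<em[^>]*\btitle="([^"]+)"')


def gender_from_html(html):
    """Return the first gender named in an <em title="..."> tag, or None."""
    for line in html.splitlines():
        match = EM_TITLE.search(line)
        if match and match.group(1) in GENDERS:
            return match.group(1)
    return None


def has_excluded_character(title):
    """Return True if the title contains a digit, dash or semicolon."""
    return re.search(r"[0-9;-]", title) is not None


# Hypothetical HTML fragment, not copied from a real article.
sample = '<p>Frist, <em title="Femininum">f</em></p>'
found = gender_from_html(sample)
```

A noun for which `gender_from_html` returns `None` would correspond to one of the nouns that did not match a gender and was not passed on.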

A newer version of Genutiv took advantage of the prop module within the MediaWiki API, which can retrieve a series of properties for each page, including a list of templates. Genutiv would check the list of templates returned for each page and assign the gender based on whether it found a masculine, feminine or neuter template (i.e. {{m}}, {{f}} or {{n}}). The following test function accomplishes this task:

def gender(self, nouns):
    """Use the MediaWiki API prop module to find the relevant gender
    templates and assign the matching gender to each noun."""

    for noun in nouns:
        page = wikitools.page.Page(self.site, title=noun)
        templates = page.getTemplates()

        for template in templates:  # TODO(PM) Account for {{mf}}
            if template == u'Vorlage:f':
                nouns[page.title] = "Femininum"
                break
            elif template == u'Vorlage:m':
                nouns[page.title] = "Maskulinum"
                break
            elif template == u'Vorlage:n':
                nouns[page.title] = "Neutrum"
                break

    return nouns

Unfortunately, this process leads to a large number of incorrect assignments. The order of the templates in the list generated by wikitools does not correspond to their location on the page. Instead, they are sorted alphabetically with case sensitivity. Since the aforementioned gender templates often appear more than once on a page, it can be difficult to determine which template accurately reflects the gender of the noun. For example, if the page corresponding to a masculine noun also contains an {{f}} template, the noun will be assigned a feminine gender, because Vorlage:f sorts before Vorlage:m and the function stops at the first match. This process is further complicated by the fact that it is nearly impossible to determine the genders of nouns with more than one definition (e.g. Golf). A quick comparison of the prop method with a list generated on September 23, 2011 revealed that 26,706 nouns were assigned the same gender, but 5,141 nouns (roughly 16% of those compared) were assigned a different gender. Last, but certainly not least, there may not be a noticeable performance difference between the crawling method and the prop method.
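The ordering problem can be reproduced without any network access. The sketch below mirrors the first-match logic of the test function on a made-up template list; `first_gender` and the list contents are hypothetical, and Python's `sorted` stands in for the case-sensitive alphabetical order in which the templates are returned.

```python
GENDER_TEMPLATES = {
    u"Vorlage:f": "Femininum",
    u"Vorlage:m": "Maskulinum",
    u"Vorlage:n": "Neutrum",
}


def first_gender(templates):
    """Return the gender of the first gender template in the list, or None."""
    for template in templates:
        if template in GENDER_TEMPLATES:
            return GENDER_TEMPLATES[template]
    return None


# Hypothetical page of a masculine noun that also uses {{f}} further down.
on_page = [u"Vorlage:Wortart", u"Vorlage:m", u"Vorlage:f"]

# Case-sensitive alphabetical order: uppercase first, then f before m.
as_returned = sorted(on_page)

in_page_order = first_gender(on_page)
in_sorted_order = first_gender(as_returned)
```

Read in page order, the noun comes out as masculine. Read in sorted order, the same page comes out as feminine, which is the kind of incorrect assignment described above.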

Source distortions

Although Wiktionary and dict.cc provide extensive sources of information, they also distort the assessment of a pattern’s accuracy by including a large number of foreign proper nouns and compound nouns.

German compound nouns always take the gender and plural ending of the last noun in the compound. For example, the German feminine nouns Galgenfrist and Frist can both be counted as exceptions to the ending -ist, which typically denotes a masculine noun. However, when a student is learning the gender of these nouns she only needs to learn the gender of Frist, since Galgenfrist shares the same gender. Therefore, counting these nouns as two exceptions distorts the measured accuracy of a pattern in predicting the gender of a noun.
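One naive way to reduce this distortion is to drop every noun that ends in another noun from the same collection, so that Galgenfrist is removed while Frist stays. The sketch below is an assumption about how such a filter might look, not part of Genutiv. The function name and the `min_head` parameter are hypothetical, and a pure suffix test will wrongly remove simple nouns that happen to end in a shorter noun.

```python
def strip_compounds(nouns, min_head=3):
    """Drop nouns that end in another, shorter noun from the same list.

    min_head is the shortest suffix treated as a possible head noun; it
    limits false matches against very short nouns.
    """
    heads = set(noun.lower() for noun in nouns)
    kept = []
    for noun in nouns:
        lower = noun.lower()
        # Test every proper suffix, e.g. "galgenfrist" -> "frist".
        last_start = len(lower) - min_head
        is_compound = any(lower[i:] in heads for i in range(1, last_start + 1))
        if not is_compound:
            kept.append(noun)
    return kept


simple = strip_compounds(["Frist", "Galgenfrist", "Jurist"])
```

With this sample, only Frist and Jurist remain, so the ending -ist is checked against one feminine exception instead of two.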

Additionally, there is a small number of German nouns that have dual gender. These nouns are spelled identically, but have different genders and definitions (e.g. Band).

These distortions are apparent in the atypical distribution of gender within the Wiktionary collection (see: Gender distribution). The first version of Genutiv returned 34,719 nouns with the following distribution: 12,087 (35%) masculine, 12,948 (37%) feminine and 9,684 (28%) neuter. This represents a significantly more even distribution than is commonly accepted. Notably, there are more feminine nouns than masculine, which is contrary to previous studies.
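The percentages above follow directly from the three counts. A short check, using only the figures quoted in the text:

```python
counts = {"Maskulinum": 12087, "Femininum": 12948, "Neutrum": 9684}

total = sum(counts.values())

# Share of each gender as a whole-number percentage of the collection.
shares = {}
for gender, count in counts.items():
    shares[gender] = round(100.0 * count / total)
```

The counts add up to the 34,719 nouns reported, and the rounded shares are 35, 37 and 28 percent.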

Another notable, but less prevalent, distortion is the result of user errors. For example, the first analysis of Wiktionary nouns revealed that 99.1% of the 213 nouns ending in -tät are feminine, with two exceptions: Sozialität and Sprachloyalität. On Wiktionary those nouns were listed as masculine and neuter.⁴ Both of these errors had existed since the articles were created, on December 30, 2006 and March 2, 2008 respectively. Once these corrections are made, the ending -tät becomes 100% accurate. This process presents a surprising use of Genutiv, in that it becomes a tool to check the gender of nouns listed on Wiktionary. Additionally, it should be noted that Sprachloyalität is a compound noun that ends in the feminine noun Loyalität, and so should have been filtered from the source database anyway.
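The check described above amounts to measuring how often a suffix predicts the expected gender and listing the nouns that disagree. The sketch below assumes a dictionary mapping each noun to its gender; `suffix_report` and the sample entries are hypothetical and far smaller than the real collection.

```python
# -*- coding: utf-8 -*-


def suffix_report(genders, suffix, expected):
    """Return (accuracy, exceptions) for nouns ending in the given suffix.

    accuracy is the fraction of matching nouns with the expected gender,
    or None if no noun ends in the suffix.
    """
    matches = [noun for noun in genders if noun.endswith(suffix)]
    exceptions = sorted(n for n in matches if genders[n] != expected)
    if not matches:
        return None, exceptions
    accuracy = (len(matches) - len(exceptions)) / float(len(matches))
    return accuracy, exceptions


# Hypothetical sample; the gender of Sozialität reproduces the listing error.
sample = {
    u"Universität": "Femininum",
    u"Qualität": "Femininum",
    u"Sozialität": "Maskulinum",
    u"Frist": "Femininum",
}
accuracy, exceptions = suffix_report(sample, u"tät", "Femininum")

# The figure quoted in the text: 211 of 213 nouns, as a percentage.
reported = round(100.0 * 211 / 213, 1)
```

Any noun that appears in `exceptions` for a pattern that is otherwise nearly perfect is a candidate for a listing error and worth checking by hand.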

Parsing patterns

Many morphological patterns are only accurate when combined with semantics (see: Morphological patterns).

¹ Gerhard Augst, Untersuchungen zum Morpheminventar der deutschen Gegenwartssprache (Tübingen: Narr, 1975).
² For further information about why certain nouns were excluded, see: Noun selection.
³ Alternatively, one could check the categories assigned to every page in the Substantiv (Deutsch) category and then remove undesirable nouns. Although this process should produce the same list of nouns, the process is significantly longer.
⁴ See the Sozialität article from April 12, 2011 and the Sprachloyalität article from April 12, 2011.