Summary: Automatically Categorizing Written Texts by Author Gender
Anat Rachel Shimoni
Dept. of Computer Science, Bar-Ilan University
Ramat Gan 52900, Israel
Dept. of Computer Science, Jerusalem College of Technology
21 Havaad Haleumi St. Jerusalem 91102, Israel
The problem of automatically determining the gender of a document's author would appear to be a more subtle
problem than those of categorization by topic or authorship attribution. Nevertheless, it is shown that automated
text categorization techniques can exploit combinations of simple lexical and syntactic features to infer the gender
of the author of an unseen formal written document with approximately 80% accuracy. The same techniques can
be used to determine if a document is fiction or non-fiction with approximately 98% accuracy.
1.1 Text Categorization
The last ten years have seen an explosion of research in automated text categorization (Sebastiani 2002).
In the text categorization problem, we are given a set of two or more categories and examples of texts