This project was conducted with excellent collaborator and friend Thayer Alshaabi as a course final project for Principles of Complex Systems. See the full project write-up here.
This project relies heavily on web scraping and basic NLP tools to build a dataset from Wikipedia. Specifically, we scraped information about notable people within fields of study, along with their ages. At the highest and loosest level, we define “Categories” as broad areas that typically correspond to academic disciplines. That said, some Categories, such as Mathematics, Chemistry, and Physics, lend themselves more readily to our proposed analysis than more nebulous or highly specific topics such as History or Geodesy. Given a Category X, we define a “Branch” as any subtopic of X that could reasonably appear in a list titled “Branches of X” or “Fields of study of X” and, importantly, that has its own Wikipedia page. For each Category considered, we sought to collect the following information from both its Wikipedia page and those of its Branches.