In my first semester of postgraduate study, my friends and I completed a data visualisation project. The project leader at the National Library of Scotland provided us with the original data: Scottish examination papers from 1888 to 1963. The dataset contains photographs of the papers together with uncorrected OCR text totalling 2,849,689 words. The data holder (National Library of Scotland) wanted us to tell the story behind the data in a more vivid and engaging way.
In the group, I was responsible for analysing how the exam subjects changed over time. During the data analysis phase, I used the exam time as the anchor point for locating the exam subject. To improve extraction accuracy, I tried two methods: assigning scores to keywords and applying a threshold to pick out the lines containing the exam time, and extracting those lines with regular expressions. The first method (keyword scoring) had two main difficulties. Firstly, because the OCR text is uncorrected, some keywords are misspelled and therefore never receive a score. Secondly, the accuracy depends heavily on the choice of threshold. Given more time, the data could be annotated and machine learning methods used to find the optimal threshold, which would improve the accuracy of subject extraction.
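To make the two approaches more concrete, here is a minimal Python sketch. It is not the code from the project: the keyword weights, the threshold, the fuzzy-match cutoff, the regular expression and the sample lines are all assumptions chosen for illustration. The idea that the subject sits on the nearest non-empty line above the time line is also only an assumed page layout. Fuzzy matching with `difflib` is one possible way to soften the misspelling problem described above.

```python
import re
from difflib import SequenceMatcher

# Hypothetical keyword weights: words that tend to appear on an exam-time line.
KEYWORD_WEIGHTS = {
    "monday": 2.0, "tuesday": 2.0, "wednesday": 2.0,
    "thursday": 2.0, "friday": 2.0, "saturday": 2.0,
    "a.m.": 1.5, "p.m.": 1.5, "noon": 1.0,
}


def score_line(line, cutoff=0.8):
    """Sum the weights of keywords that a token matches approximately.

    The similarity cutoff (an assumed value) lets OCR misspellings such as
    "Tuesdav" still count as "tuesday".
    """
    score = 0.0
    for token in line.lower().split():
        token = token.strip(",;:()-")
        for keyword, weight in KEYWORD_WEIGHTS.items():
            if SequenceMatcher(None, token, keyword).ratio() >= cutoff:
                score += weight
                break  # count each token at most once
    return score


def candidate_subjects(lines, threshold=3.0):
    """Return the non-empty line preceding each detected time line.

    Assumes (for illustration only) that the subject is printed directly
    above the exam time.
    """
    subjects = []
    previous = None
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if score_line(text) >= threshold and previous is not None:
            subjects.append(previous)
        previous = text
    return subjects


# Method 2: a strict regular expression for the same kind of line.
TIME_LINE_RE = re.compile(
    r"(?P<day>monday|tuesday|wednesday|thursday|friday|saturday)\W+"
    r".*?(?P<start>\d{1,2}(?:[.:]\d{2})?)\s*(?P<period>a\.?m\.?|p\.?m\.?)",
    re.IGNORECASE,
)

# Made-up sample lines, not taken from the real papers.
CLEAN_LINE = "Tuesday, 19th June - 10 A.M. to 12.30 P.M."
NOISY_LINE = "Tuesdav, 19th June - 10 A.M. to 12.30 P.M."
SAMPLE_PAGE = [
    "LEAVING CERTIFICATE",
    "",
    "MATHEMATICS",
    NOISY_LINE,
    "1. Solve the equation ...",
]

if __name__ == "__main__":
    print(score_line(CLEAN_LINE), score_line(NOISY_LINE))
    print(TIME_LINE_RE.search(CLEAN_LINE), TIME_LINE_RE.search(NOISY_LINE))
    print(candidate_subjects(SAMPLE_PAGE))
```

In this sketch the regular expression is precise but fails on the misspelled line, while the scored version still detects it at the cost of having to pick a threshold by hand, which mirrors the trade-off described above.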
The change in the number of subjects examined each year is shown in the graph below. It is clear that more and more subjects were included in the examinations as time went on. Around 1950 in particular, the number of subjects increased significantly.
(How the number of subjects changes over the years)
In terms of presentation, we decided to present the data as a serious comic, a style that appeals to a wider audience without losing the seriousness of the data. The entire comic is shown on a web page, for which we also designed a number of dynamic elements to make the page more interactive.
To explore the changes in subjects in more depth, I selected four representative years for further analysis. In 1888, a student's school bag held textbooks for only 12 subjects, covering three areas: languages, maths and sciences.
After 1888, the number of subjects gradually increased. By 1921, a total of 16 courses were included. More and more courses were added in the fields of languages, maths and sciences. Gaelic, for example, was a new course that had not been offered in 1888.
Around 1950, the number of courses increased dramatically: it doubled in less than 30 years. In addition to the original three areas, new areas such as music and liberal arts were added. Meanwhile, the number of science-related subjects also grew, with additions such as zoology and chemistry. A modern curriculum design was emerging!
The curriculum then began to expand into new areas. Courses such as Dress and Design and Home Management were added to the mix. By 1963, the exams covered 38 subjects!
In addition to my research on subjects, my teammates explored topics such as gender bias. You can access our page through this link: Scottish School Exam Papers