Tsinghua University
Department of Computer Science and Technology

A Dataset for Exploring Users' Attentional Reading Behavior in Text Summarization Tasks

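The dataset contains attention data from 50 individuals as they read 100 articles (drawn from 10 popular categories) and wrote a summary for each article. It was collected with several high-precision eye-tracking devices, each producing 100 gaze points per second. Collected in a controlled environment, the dataset contains 157 million attention data points in total. It provides not only the basic gaze behavior of different people as they read articles and write summaries, but also the relationship between different behavior patterns and the summaries those people produced.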

Keywords

Attention behavior; text summarization; personalized text summarization; text reading; eye tracker.

Main Content

Dataset Collection Procedure
We manually collected 100 samples across 10 categories from the public news websites Netease News and Tencent News for gaze collection. Each sample consists of an article and its title; the title serves as a reference when the user writes the summary.

Length distribution of articles in different categories
The articles used in the study were collected manually from public news websites. There are 100 articles in total, spread evenly over 10 popular categories (10 articles per category). When selecting samples, we deliberately avoided articles whose first sentence already summarizes the entire content. Articles that were too short or too long were also excluded. The average article length is around 502 Chinese characters, and the average title length is around 22 Chinese characters. The longest article has 842 Chinese characters, and the shortest has 99.

All users' familiarity score distribution of articles in different categories
The familiarity score ranges from 1 to 5, where 1 means very unfamiliar, and 5 means very familiar.

The average gaze time on each Chinese character of each participant
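As a rough illustration of how a per-character gaze time could be derived, the sketch below converts a count of gaze points into seconds using the 100 points-per-second sampling rate stated above, then divides by the article length. The function name and the example numbers are hypothetical, and whether the authors computed the figure exactly this way is an assumption.

```python
SAMPLING_RATE_HZ = 100  # stated in the dataset description: 100 gaze points per second


def average_gaze_time_per_char(num_gaze_points, num_chars, rate_hz=SAMPLING_RATE_HZ):
    """Return the mean gaze time, in seconds, per character of an article.

    num_gaze_points: gaze samples recorded while one participant read one article
    num_chars: length of the article in Chinese characters
    """
    if num_chars <= 0:
        raise ValueError("num_chars must be positive")
    if rate_hz <= 0:
        raise ValueError("rate_hz must be positive")
    total_seconds = num_gaze_points / rate_hz  # each sample covers 1/rate_hz seconds
    return total_seconds / num_chars


if __name__ == "__main__":
    # Hypothetical session: 5020 samples recorded on a 502-character article.
    print(average_gaze_time_per_char(5020, 502))
```

Averaging this value over all articles read by one participant would give a single per-participant number.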

The summary similarity distributions in different categories
The summaries differ substantially from one another: many of the similarities fall below the empirical value of 0.8. The distributions also vary across categories, with summaries in the cultural category showing the lowest similarity.
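The page does not say which similarity measure was used. As a minimal sketch, the code below assumes a Jaccard similarity over character bigrams, which works on Chinese text without word segmentation, and computes the pairwise similarities among the summaries written for one article. All function names here are hypothetical; only the 0.8 threshold comes from the text.

```python
from itertools import combinations


def char_bigrams(text):
    """Return the set of overlapping two-character substrings of text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard_similarity(a, b):
    """Jaccard similarity of the character-bigram sets of two summaries."""
    x, y = char_bigrams(a), char_bigrams(b)
    if not x and not y:
        return 1.0  # two summaries with no bigrams are treated as identical
    return len(x & y) / len(x | y)


def pairwise_similarities(summaries):
    """Similarity for every unordered pair of summaries of the same article."""
    return [jaccard_similarity(a, b) for a, b in combinations(summaries, 2)]


def fraction_below(similarities, threshold=0.8):
    """Share of pairs whose similarity is below the threshold."""
    if not similarities:
        return 0.0
    return sum(s < threshold for s in similarities) / len(similarities)
```

Collecting `pairwise_similarities` for every article in a category and plotting the results would give one distribution per category.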

To compare how gaze distributions differ between people during reading, we show the collected gaze behavior as heat maps. Brighter regions of a heat map indicate areas that the participant spent more time reading. Comparing the groups of heat maps in the figures suggests the following two assumptions:
1. When reading and summarizing text, each person has their own stable reading patterns and preferences.
2. When reading and summarizing text, reading patterns and preferences differ between people.
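The heat maps described above can be approximated by binning gaze points into a grid and accumulating dwell time per cell. The sketch below assumes each gaze point is an (x, y) screen coordinate in pixels and that each sample represents 1/100 of a second. The function name, the cell size, and the coordinate format are hypothetical, since the page does not describe the data layout.

```python
def gaze_heatmap(points, width, height, cell=50, rate_hz=100):
    """Accumulate dwell time (seconds) per grid cell from raw gaze points.

    points: iterable of (x, y) pixel coordinates, one per gaze sample
    width, height: size of the reading area in pixels
    cell: side length of one square grid cell in pixels (assumed value)
    rate_hz: sampling rate; each sample adds 1/rate_hz seconds
    """
    cols = -(-width // cell)   # ceiling division
    rows = -(-height // cell)
    grid = [[0.0] * cols for _ in range(rows)]
    dt = 1.0 / rate_hz
    for x, y in points:
        # Ignore samples that fall outside the reading area.
        if 0 <= x < width and 0 <= y < height:
            grid[int(y // cell)][int(x // cell)] += dt
    return grid


if __name__ == "__main__":
    samples = [(10, 10), (20, 20), (60, 10), (500, 500)]
    for row in gaze_heatmap(samples, width=100, height=100):
        print(row)
```

Cells with larger values correspond to the brighter regions of a rendered heat map. Comparing grids from the same participant across articles, and from different participants on the same article, is one way to probe the two assumptions.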
