Evolution of Web search engine interfaces through SERP screenshots and HTML complete pages for 20 years

This dataset was extracted for a study on the evolution of Web search engine interfaces since their appearance. The well-known list of “10 blue links” has evolved into richer interfaces, often personalized to the search query, the user, and other aspects. We used the most searched queries by year to extract a representative sample of search engine results pages (SERPs) from the Internet Archive. The Internet Archive has been keeping snapshots and the respective HTML versions of webpages over time, and its collection contains more than 50 billion webpages. We used Python and Selenium WebDriver, for browser automation, to visit each capture online, check whether the capture is valid, save the HTML version, and generate a full-page screenshot.
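The extraction steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' script: the function names, the Google query URL pattern, the choice of headless Firefox, and the "non-empty page" validity check are all assumptions made for the example. Only the Wayback Machine address pattern (`https://web.archive.org/web/<timestamp>/<url>`) and the Selenium calls shown are standard.

```python
from pathlib import Path
from urllib.parse import quote_plus

WAYBACK_PREFIX = "https://web.archive.org/web"


def google_serp_url(query: str) -> str:
    """Build an illustrative Google results URL (assumed pattern)."""
    return "https://www.google.com/search?q=" + quote_plus(query)


def wayback_url(timestamp: str, original_url: str) -> str:
    """Address of one archived capture, given a 14-digit timestamp."""
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


def save_capture(name: str, url: str, out_dir: Path) -> None:
    """Visit a capture, then save its HTML and a full-page screenshot.

    Hypothetical helper: requires Selenium 4 and Firefox with geckodriver.
    """
    # Imported here so the URL helpers above work without Selenium installed.
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        html = driver.page_source
        # Placeholder validity check (assumption): the study's actual
        # criteria for a valid capture are not described here.
        if not html.strip():
            raise ValueError(f"empty capture: {url}")
        out_dir.mkdir(parents=True, exist_ok=True)
        (out_dir / f"{name}.html").write_text(html, encoding="utf-8")
        # Full-page screenshots are a Firefox-specific WebDriver method.
        driver.save_full_page_screenshot(str(out_dir / f"{name}.png"))
    finally:
        driver.quit()
```

For instance, `wayback_url("20070330145203", google_serp_url("web search"))` yields the address of a single archived results page, which `save_capture` could then visit.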

The dataset contains all the extracted captures. Each capture is represented by a screenshot, an HTML file, and a folder with the page's supporting files. File names concatenate the initial of the search engine (e.g., G for Google) with the capture's timestamp. If a timestamp is repeated, the file name ends with a sequential integer suffix "-N". For example, "G20070330145203-1" identifies a second Google capture taken on March 30, 2007, at 14:52:03. The first is identified by "G20070330145203".
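As an illustration of this naming scheme, the following sketch parses a capture name into its parts. The names `parse_capture_name` and `Capture` are hypothetical, and the initial "B" for Bing is an assumption (the description only gives "G" as an example).

```python
import re
from datetime import datetime
from typing import NamedTuple

# Engine initial, 14-digit timestamp (YYYYMMDDhhmmss), optional "-N" suffix.
CAPTURE_NAME = re.compile(r"^(?P<engine>[A-Z])(?P<ts>\d{14})(?:-(?P<seq>\d+))?$")

# "G" comes from the dataset description; "B" for Bing is assumed.
ENGINES = {"G": "Google", "B": "Bing"}


class Capture(NamedTuple):
    engine: str
    timestamp: datetime
    sequence: int  # 0 for the first capture at a timestamp, 1 for the second, ...


def parse_capture_name(name: str) -> Capture:
    """Split a capture name (without file extension) into its components."""
    match = CAPTURE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a capture name: {name!r}")
    engine = ENGINES.get(match["engine"], match["engine"])
    timestamp = datetime.strptime(match["ts"], "%Y%m%d%H%M%S")
    sequence = int(match["seq"] or 0)
    return Capture(engine, timestamp, sequence)


first = parse_capture_name("G20070330145203")
second = parse_capture_name("G20070330145203-1")
```

Here `first` and `second` share the same engine and timestamp and differ only in `sequence` (0 and 1), which makes it straightforward to group repeated captures.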

Using this dataset, we analyzed how SERPs evolved in terms of content, layout, design (e.g., color scheme, text styling, graphics), navigation, and file size. We recorded when SERP features first appeared and analyzed the design patterns involved in each SERP component. We found that the number of elements in SERPs has been rising over the years, demanding a larger interface area and larger files. This systematic analysis portrays evolution trends in search engine user interfaces and, more generally, web design. We expect this work will trigger other, more specific studies that can take advantage of the dataset we provide here.

This graphic represents the diversity of captures by year and search engine (Google and Bing).

Additional Information

Field Value
Source Internet Archive Wayback Machine (browser extraction of HTML versions) (publicly accessible)
Author Bruno Edgar Oliveira, Carla Teixeira Lopes
Last Updated August 28, 2023, 08:28 (UTC)
Created July 26, 2021, 13:04 (UTC)
Citation OLIVEIRA, B.E., Lopes C. T. Evolution of Web search engine interfaces through SERP screenshots and HTML complete pages for 20 years [dataset]. 26 July 2021. INESC TEC research data repository. DOI: https://doi.org/10.25747/991g-f765
DOI https://doi.org/10.25747/991g-f765
dc.Contributor Carla Teixeira Lopes
dc.Coverage.Spatial Interfaces from Internet Archive related to world wide captures (no country-restriction)
dc.Coverage.Temporal 2000 to 2020
dc.Date 30/06/2021
dc.Format HTML (complete page); *.PNG
dc.Format.Extent 9.59 GB
dc.Language EN
dc.Publisher FEUP
dc.Relation Oliveira, Bruno. Master Thesis "Web Search Engines - a study on the evolution of user interfaces". FEUP. 2021; https://bedgarone.github.io/serpevolution/
dc.Type HTML version (complete page) and PNG screenshot for each SERP capture
ddi.Software Any browser and image viewer (no dependencies)
ddi.TypeInstrument Python and Selenium Webdriver (automated browser)