Was this the real Web? Quantitative overview of the Polish ccTLD Internet Archive data (1996–2001)

Marcin Wilkowski


This article attempts to build a quantitative panorama of the Polish country code top-level domain (ccTLD) for the years 1996–2001, based on data generously provided by the Internet Archive. The analysis of over 72 million captures aims to show that these resources have limited potential for reconstructing the early Polish Web. The availability of historical Web resources, and of tools for exploring them easily, in no way determines their value and usefulness in research, even when no alternative sources are available.

Czy to był prawdziwy Web? Ilościowy przegląd polskiej domeny krajowej w zbiorach Internet Archive (1996–2001)

Artykuł przedstawia ilościowy opis zasobów polskiej domeny krajowej (country code top-level domain, ccTLD) z lat 1996–2001, dostępnych w zbiorach Wayback Machine, archiwum Webu prowadzonym przez Internet Archive. Celem analizy ponad 72 mln archiwizacji (captures) jest wykazanie, że zasoby te mają ograniczony potencjał w rekonstruowaniu polskiego wczesnego Webu. Dostępność historycznych zasobów WWW i narzędzi do ich łatwej eksploracji w żaden sposób nie przesądza o ich potencjalnej wartości i przydatności w badaniach, nawet jeśli nie mamy dostępu do alternatywnych źródeł.

Keywords: Internet Archive, Polish Web, historical Web resources, Polish country code domain
